Missing the 404: link integrity on the World Wide Web


Helen Ashman (University of Nottingham) and Hugh Davis (University of Southampton)


Helen Ashman (University of Nottingham)
Hugh Davis (Multicosm Ltd and University of Southampton)
Jim Whitehead (University of California, Irvine)
Steve Caughey (University of Newcastle-Upon-Tyne)

The management of electronic document collections relies partly on the persistent naming of all addressable objects in the collection. Very often we find that objects are moved or deleted, resulting in the well-known "Error 404 – file not found". This happens frequently in the specification of hypertext links, and is called the "dangling link problem".

Given the importance of the Web, the commercial interests in the Web, and the amount of technical expertise that has been channeled into the Web, it might be considered astonishing that this problem has not been solved. Perhaps this is not a technical problem? There is no guaranteed integrity of links from a catalogue in a library to the books on the shelf, but rather we rely upon responsible social behavior and we are prepared to tolerate some deficiencies in the system. Maybe this has been the case with the Web? Or maybe broken links are just the Web's way of "forgetting" as objects cease to be of interest?

But is this situation acceptable in the future? As sites constantly evolve and re-organize, enormous numbers of links begin to dangle, as authors either ignore or are unable to follow the social convention of announcing what has happened to a page. The problem is not trivial: in the March 1997 Scientific American, Brewster Kahle claims that the average lifetime of a URL is just 44 days. The panel will address this issue.

Technical responses to the dangling link problem may be divided into three categories: detection of broken links, correction of broken links, and avoidance of broken links. The panel will address all three, considering the pros and cons of both current and possible future applications of these approaches.

Detection: Many tools exist for enumerating all the links within some domain or file system and checking that the resource at each destination URI is still present. These tools may exist in the form of spiders, or be tied into document management systems of some kind.

Where the source documents for such links are under our ownership we may then, of course, choose to correct or delete the broken links.
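At its core, detection amounts to enumerating links and probing each destination. A minimal sketch in Python illustrates the idea; the function names and the dependency-injected probe are our own illustration, not drawn from any of the tools alluded to above:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def find_dangling(uris, fetch_status):
    """Return the URIs whose destinations report 404/410 or cannot be reached.

    fetch_status is injected so the same checker can be driven by a spider,
    by a document management system, or by a test harness.
    """
    dangling = []
    for uri in uris:
        try:
            if fetch_status(uri) in (404, 410):
                dangling.append(uri)
        except (HTTPError, URLError, OSError):
            dangling.append(uri)  # unreachable host: treat as dangling
    return dangling

def http_status(uri):
    """Probe a destination with a HEAD request and return its status code."""
    try:
        return urlopen(Request(uri, method="HEAD"), timeout=5).status
    except HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions
```

A spider would feed `find_dangling` the links it harvests, with `http_status` as the probe; a HEAD request suffices because only the status code, not the content, is of interest.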

Correction: When we move or delete a document, we have three ways of behaving responsibly: we can leave some form of forward reference to the new document, we can inform anything that points at us of the change, or we can update some name server through which our document is accessed. But only the owner(s) of a document can change the references themselves.
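The first of these behaviours, a forward reference, is what HTTP's "301 Moved Permanently" response conveys. A toy sketch of the server-side bookkeeping, with hypothetical paths, shows a table of moves and a resolver that follows chains of forwards (bounded, to guard against cycles):

```python
def resolve(path, moved, limit=10):
    """Follow forward references in `moved` (old path -> new path)
    until a stable path is found, giving up after `limit` hops so a
    cycle of forwards cannot loop forever."""
    hops = 0
    while path in moved and hops < limit:
        path = moved[path]
        hops += 1
    return path

# Hypothetical table of moves a server might maintain:
MOVED = {"/old/report.html": "/archive/1997/report.html"}
```

A request for an entry in the table would be answered with a 301 pointing at `resolve(path, MOVED)`; requests for paths not in the table pass through unchanged.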

Avoidance: The hypertext community has long addressed the problems of link integrity, and has produced a number of link server systems and "hyperbase" systems which store links and content in a tightly coupled scheme, which ensures integrity. The question is, to what extent are these architectures scalable? The document management community has produced a number of tools for publishing to an intranet, which guarantee integrity, at least at the time of publication. In the XLL proposals, the WWW community has been considering new models for hypertext links on the Web. Perhaps these are the solution?
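One way to see why the tightly coupled approach guarantees integrity: when links are held in an external linkbase rather than embedded in documents, deleting a document can atomically remove every link touching it. The class below is a toy sketch of that invariant; its names are illustrative and are not drawn from any of the systems mentioned above:

```python
class Linkbase:
    """Toy link server: links live outside documents, so deleting a
    document removes every link that touches it in a single step,
    and no dangling link can ever be observed."""

    def __init__(self):
        self.links = set()  # (source_doc, dest_doc) pairs

    def add_link(self, source, dest):
        self.links.add((source, dest))

    def delete_document(self, doc):
        # Integrity by construction: no link may outlive its endpoints.
        self.links = {(s, d) for (s, d) in self.links
                      if s != doc and d != doc}

    def links_from(self, source):
        return sorted(d for (s, d) in self.links if s == source)
```

The open question the panel raises is whether this single coordinating store, natural on an intranet, can be made to scale to something the size of the Web.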

Other technical approaches to the link integrity problem include using long-lived names, such as Uniform Resource Names (URNs), and the development of an Internet-scale notification message system. URNs prevent link integrity problems from arising, since the resource being linked to has a long-lived name. A notification service aids detection of link integrity problems by sending out notification messages when a resource has been deleted.
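The indirection a URN provides can be pictured as a resolver table: referring pages cite the stable name, and a move becomes a single update at the resolver rather than an edit to every referrer. A sketch, with made-up names and scheme purely for illustration:

```python
class UrnResolver:
    """Toy resolver mapping long-lived names to current locations."""

    def __init__(self):
        self.table = {}

    def register(self, urn, url):
        """Record (or update) the current location of a named resource."""
        self.table[urn] = url

    def resolve(self, urn):
        return self.table.get(urn)  # None if the name is unknown
```

When a document moves, its owner makes one `register` call; every link that cites the URN remains valid, because links never mention the location at all.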


The hypertext community and the Web community have developed different but complementary approaches to the problem of ensuring link integrity. One of the primary purposes of this panel is to bring together link integrity experts from both fields so that we can learn from each other's experiences and perhaps together develop new and useful approaches to link management.

This panel will be of interest to server managers, Web masters and technical decision makers who are concerned by the problem of dangling links in the Web. Inevitably, the solutions to these problems have considerable technical content.


Helen Ashman, The University of Nottingham, U.K.
Helen has been part of the hypertext community since the Australian Department of Defence posted her to the University of Southampton in 1992. More recently she has been involved in the Web community, adapting her doctoral research to the Web and serving as joint programme chair of WWW7 and of the regional Ausweb conference series.


Hugh Davis, Multicosm Ltd and the University of Southampton, U.K.
Hugh Davis is a lecturer at the University of Southampton and was a founder member of the Multimedia Research Group and the Microcosm open hypermedia project. He is also a director of Multicosm Ltd, which produces knowledge management tools, in particular the link service systems Microcosm and Webcosm. He was general co-chair of the ACM Hypertext '97 Conference.

Jim Whitehead, University of California, Irvine, U.S.A.
Jim Whitehead is the founder and chair of the IETF WEBDAV (Distributed Authoring and Versioning) working group, which is developing extensions to HTTP for remote authoring of Web content. Jim is also a Ph.D. student at the University of California, Irvine, where he is a member of the Software research group, from which he received his M.S. in Information and Computer Science in 1994. He has published in both the WWW and hypertext research communities.

Steve Caughey, University of Newcastle-Upon-Tyne, U.K.
Steve Caughey is a research associate within the Arjuna (distributed, fault-tolerance) group at the University of Newcastle-Upon-Tyne. He worked for a number of telecommunications companies before returning to research in 1989. His interests include distributed garbage collection and referential integrity, and he has been a regular contributor to the WWW conference series on these and other topics.