Web cache focusing devices

P.P. Sember, T.R. Mueller, N. Baker,
M.C. Flower, B. Raskutti and W. Wen

Telstra Research Laboratories,
Box 249 Rosebank MDC, Clayton Victoria 3168, Australia

p.sember@trl.telstra.com.au, t.mueller@trl.telstra.com.au,
n.baker@trl.telstra.com.au, m.flower@trl.telstra.com.au,
b.raskutti@trl.telstra.com.au and w.wen@trl.telstra.com.au

Caching is vital in the World Wide Web for reducing network bandwidth consumption and providing acceptable quality of service to users. This paper introduces the concept of Cache Focusing Devices for improving caching performance. The approach is novel in that it attempts to improve performance by influencing the behaviour of users, as opposed to concentrating on architectural aspects such as cache dimensioning and purging strategies. Fundamental to this concept is exposing the cache contents to users, contrary to the current industry trend of cache transparency. Tackling caching performance from this perspective is believed to be crucial, since the browsing profile of users intrinsically limits performance improvements.

Caching; Cache directory; Cache transparency; Prefetching

1. Introduction

The explosive growth of the Internet, largely driven by the World Wide Web, has led to congested servers and networks. Unacceptable access delays and unavailability of Web servers regularly confront users. Network administrators constantly face demand to upgrade the capacity of their communication networks. A partial solution to these problems has been the wide deployment of caching software in Web browsers and proxy servers.

Numerous techniques have attempted to improve performance, for example, parent and peer caching [1], cache farms [2], enhanced purging strategies [3], prefetching [4] and geographic push caching [5]. This paper introduces a new technique, known as cache focusing devices, for improving caching performance. This technique attempts to tackle the problem at its source, Web user behaviour, by making the cache contents visible and accessible to users.

2. Cache focusing devices

To date, most proxy server technology has taken a somewhat passive approach to caching, whereby the cache is practically invisible to users. In fact, this is an intended feature of automatic proxy server configuration and the Hypertext Transfer Protocol (HTTP). The effectiveness of such caching depends on the coincidence that multiple users have an interest in the same documents. The application of Zipf's Law to Web traffic means that reasonable hit rates are practicable, since a small proportion of popular sites accounts for the majority of requests [6]. However, the browsing patterns of the users sharing the cache ultimately limit this performance.
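As a rough illustration of the Zipf's Law argument above, the following sketch computes the share of requests captured by the most popular documents, assuming that the request frequency of the document of popularity rank r is proportional to 1/r^s. The function name, the catalogue size and the exponent s = 1 are assumptions chosen for illustration; they are not measurements taken from [6].

```python
def zipf_request_share(num_documents, top_k, exponent=1.0):
    """Fraction of all requests that go to the top_k most popular documents,
    assuming the popularity of rank r is proportional to 1 / r**exponent."""
    if not 0 <= top_k <= num_documents:
        raise ValueError("top_k must lie between 0 and num_documents")
    # Unnormalised Zipf weight for each rank, most popular first.
    weights = [1.0 / rank ** exponent for rank in range(1, num_documents + 1)]
    return sum(weights[:top_k]) / sum(weights)


if __name__ == "__main__":
    # Hypothetical catalogue of 100,000 documents; consider the top 1%.
    share = zipf_request_share(num_documents=100_000, top_k=1_000)
    print(f"Top 1% of documents receive {share:.0%} of requests")
```

Under these assumptions, the most popular 1% of documents attracts roughly 60% of requests. The remaining requests are spread over a long tail, which is the portion that user browsing patterns place beyond the reach of a passive cache.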

The approach proposed here is to make the cache visible, to maximise the reuse of each document retrieved from the global Web. By providing tools (cache focusing devices) which enhance existing caching software, the attention of the users can be centred on (or diverted through) the contents of the cache. By comparison, existing resource discovery tools attempt to cover the entire global Internet. Cache focusing devices can utilise access histories to provide the user with highly relevant content, and have the potential to be automatically personalised. A key advantage for users is that they can discover content that is stored locally and is therefore faster to retrieve. They can also effectively share information with those of similar interests. Two possible cache focusing devices are the cache directory and a listing of popular categories, sites and pages.

2.1. Cache directory

One strategy for improving cache performance is to make the cache visible to users through the development of a cache directory, similar in features to Yahoo [7]. Existing caching systems provide little clue to the user that a resource they request was previously of interest to another user. In fact, improved latency is the only evidence. There is no explicit indication that the resource is part of a larger collection on a particular subject and there is no facility to learn about such collections. The cache directory attempts to rectify this deficiency. Its two components are a search engine and a category hierarchy browser.

A prototype of a cache search engine has been developed for our local cache. This employs a Self Generating Neural Network (SGNN), which is a hierarchical clustering method based on a hybrid of heuristic and neural network technologies [8]. The SGNN technology offers fast searching, high scalability via distribution (of primary importance considering trends towards large caches [9]) and a hierarchical index that can be used to form a category hierarchy. This search engine emulates common search engine features.

The key to the category hierarchy browser is automatic categorisation of all cached documents into a hierarchical index. The technique employed by commercial directories is to manually group documents into predefined categories, for example, "Entertainment" or "Business". However, this is not appropriate for a cache, since the natural groupings are not necessarily stable. Moreover, because the cache is dynamic, new documents entering it must be promptly indexed and assigned to categories.
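To make the idea of automatic categorisation concrete, the sketch below groups documents into a hierarchy using a generic greedy agglomerative clustering over bag-of-words vectors. This is a stand-in chosen for brevity, not the SGNN method of [8]; the function names and the representation of the cache as a mapping from URL to text are assumptions. It also rebuilds the hierarchy in one batch and compares every pair of clusters, so it does not address the prompt indexing or scalability requirements discussed above.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a document (lower-cased words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_hierarchy(documents):
    """Greedy agglomerative clustering of a {url: text} mapping.

    Returns a nested tuple tree whose leaves are URLs. Each step merges the
    two clusters whose summed term vectors are most similar.
    """
    if not documents:
        raise ValueError("at least one document is required")
    clusters = [(url, term_vector(text)) for url, text in documents.items()]
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cosine(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

Each internal node of the returned tree corresponds to a candidate category; labelling those nodes, for instance with their most frequent terms, would be a further step not shown here.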

2.2. Popular categories, sites and pages

Extensions to existing caching systems could notify users about the resources that are popular, in a manner analogous to the ratings of television programmes. Users are presented with the top ten most popular categories (linked to the category hierarchy of the directory), sites (host names) and single pages. This interface enables unrelated groups of users to share resources effectively without much additional effort. From a user's perspective, this information should be relatively reliable and relevant, because it is based on popularity. A prototype system has been developed for our local cache.
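The popularity listing described above can be sketched as a simple tally over the proxy's request history. The function below is a minimal illustration, not the prototype itself: the input is assumed to be a plain sequence of requested URLs, and `category_of` is a hypothetical mapping from URL to category label that would, in practice, come from the cache directory.

```python
from collections import Counter
from urllib.parse import urlsplit


def popularity_report(requested_urls, category_of=None, top_n=10):
    """Return the top_n most requested pages, sites and categories.

    requested_urls: iterable of URL strings, one per proxy request.
    category_of: optional mapping from URL to a category label (hypothetical).
    """
    pages, sites, categories = Counter(), Counter(), Counter()
    for url in requested_urls:
        pages[url] += 1
        host = urlsplit(url).hostname  # site is identified by host name
        if host:
            sites[host] += 1
        if category_of and url in category_of:
            categories[category_of[url]] += 1
    return {
        "pages": pages.most_common(top_n),
        "sites": sites.most_common(top_n),
        "categories": categories.most_common(top_n),
    }
```

Each entry in the report is a (name, request count) pair, ordered from most to least requested. Counting requests rather than distinct users is a design choice made here for simplicity; a deployment might prefer to count distinct users so that one heavy user does not dominate the listing.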

Initially, such a system may be effective at providing an interesting environment for resource discovery. However, basing the information solely on popularity leads to a loss of regular appeal if the most popular resources are constant, for example, core documents in a technical field. Cache prefetching is a possible solution to this limitation. This technique involves automatically discovering new resources for each category and retrieving them into the cache before any user requests them. The advertisement of these new documents is crucial to secure any real benefit, as is the precision of the resource discovery. Due to the inaccuracy of existing search engines, the verification of results is essential, for example, by comparing their summaries with the existing resources.
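One way the verification step might look is sketched below: each candidate returned by a search engine is kept only if its summary overlaps sufficiently with the summary of at least one resource already cached in the category. The use of Jaccard similarity over word sets, the threshold of 0.2 and all function names are assumptions made for illustration; the paper does not prescribe a particular comparison.

```python
import re


def word_set(text):
    """Set of lower-cased words appearing in a summary."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verify_candidates(candidate_summaries, category_summaries, threshold=0.2):
    """Return candidate URLs whose summary resembles a cached summary.

    candidate_summaries: {url: summary} for newly discovered resources.
    category_summaries: summaries of resources already cached in the category.
    """
    cached = [word_set(summary) for summary in category_summaries]
    accepted = []
    for url, summary in candidate_summaries.items():
        words = word_set(summary)
        best = max((jaccard(words, c) for c in cached), default=0.0)
        if best >= threshold:
            accepted.append(url)
    return accepted
```

Only the accepted URLs would then be prefetched and advertised. A higher threshold trades recall for precision, which matters here because irrelevant prefetched documents consume cache space without attracting requests.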

2.3. Additional opportunities

There are numerous other opportunities that could extend the ideas proposed in this paper, utilising the cached content, access history and presentation in different ways. These applications could support the growing industry trend of providing personalised services, community-oriented facilities and automated delivery of information.

Many Web sites offer personalised e-mail services for obtaining reports about relevant changes. Since the cached content is a dynamic store of documents, a similar notification system is applicable. Further, since the proxy server maintains detailed information about user browsing behaviour, it would be possible to develop an automatic profiling and recommendation system. Push technologies might provide an alternative to e-mail.

Advanced graphical interfaces and visualisation technologies are applicable to the problem of viewing the cache index. Individual users or groups could be provided with customised views of the cache based on their access profiles or other submitted criteria, for example, school assignment topics.

3. Conclusion

This paper has introduced the concept of Cache Focusing Devices for improving the performance of proxy server caching. The concept is novel in that it attempts to address the problem by modifying user behaviour, as opposed to concentrating on architectural aspects such as cache dimensioning and purging strategies. In addition, it exposes the cache to users, which is contrary to current industry trends. Tackling caching performance from the perspective of modifying user behaviour is crucial, because any improvement achieved through architectural refinement is ultimately limited by the access profiles of the users.

Numerous technical issues arise when considering this concept of a visible cache, such as the integration of the system with existing Web directory services and assurance that the presented content remains cached. Other critical and sensitive issues to be addressed include privacy, copyright, censorship, security and billing.


The permission of the Director, Telstra Research Laboratories, to publish this material is hereby acknowledged.


[1] A. Chankhunthod et al., A hierarchical Internet object cache, in: Proceedings of USENIX 1996 Annual Technical Conference, San Diego, California, January 1996, http://catarina.usc.edu/danzig/cache.ps

[2] Cisco Cache Engine, Cisco Systems, 1997, http://www.cisco.com/warp/public/751/cache/cds_ov.htm

[3] S. Williams et al., Removal policies in network caches for World Wide Web documents, in: Proceedings of ACM Sigcomm96, Stanford University, California, August 1996.

[4] M. Nabeshima, The Japan cache project: an experiment on domain cache, in: Proceedings of the 6th International World Wide Web Conference, Santa Clara, California, April 1997, http://www6.nttlabs.com/HyperNews/get/PAPER21.html

[5] J. Gwertzman and M. Seltzer, The case for geographical push-caching, in: VINO: The 1994 Fall Harvest, Technical Report TR-34-94, Center for Research in Computing Technology, Harvard University, December 1994.

[6] Cache object popularity, NLANR, http://ircache.nlanr.net/Cache/Statistics/Popularity-Index

[7] Yahoo!, http://www.yahoo.com.au

[8] W. Wen, A. Jennings and H. Liu, Learning a neural tree, in: Proceedings of IJCNN'92: International Joint Conference on Neural Networks, Beijing, November 1992, pp. 751–756.

[9] Mirror image Web caching solution, Mirror Image Internet, 1997, http://www.mirror-image.com/cachesol.htm