A Smart Internet Caching System

Gihan V. Dias <gihan@cse.mrt.ac.lk>
Graham Cope <gacope@cse.mrt.ac.lk>
Ravi Wijayaratne <raviwija@cse.mrt.ac.lk>
Dept. of Computer Science and Engineering
University of Moratuwa, Sri Lanka.

Abstract

We describe a "smart" caching system which combines the strengths of mirrors and caches to reduce the response time of Internet applications.

Many services on the Internet, such as the File Transfer Protocol (FTP), Usenet and the Web, can be modeled as a set of files, stored on a set of servers, which are accessed from a set of clients. Techniques used to reduce access time, and the load on servers and the network, is to store copies of files on other servers, called mirrors or caches. However, both of these methods have their weaknesses.

We have designed an intelligent caching system, which models the access patterns of files (such as Web pages), and attempts to fetch them before they are requested by a client. It comprises a caching gateway which is split between the two ends of a low-speed link. File access patterns are modeled probabilistically as well as heuristically.

1. Introduction

Many services on the Internet, such as FTP, Usenet and the Web, can be modeled as a set of files, stored on a set of servers, which are accessed from a set of clients. One of the major problems facing the Internet today is the long response times frequently encountered by users. This is due to:

the load on the servers containing the documents; and
the load on the communication lines through which the documents are transmitted [1].

The problem is especially pronounced for sites which are connected to the rest of the Internet by a relatively low-speed link [2]. Many users, such as clients of geographically isolated Internet service providers, small organizations, and dial-up users, have significant limitations on bandwidth. Although the availability of high-speed links is increasing, the increased traffic volumes generated by new applications results in these links remaining congested.

Our system attempts to reduce the overall latency of file access for users accessing the Internet over a low-speed, heavily loaded link. The latency of such a system is composed of three components:

The latency between the client and the local end of the link.
The latency between the two ends of the link, due to low link speed and link congestion.
The latency between the server and the remote end of the link, due to the response time of the server, and congestion elsewhere on the Internet.

A major feature of our system is the use of a split gateway with caches on both ends of the slow link. This allows us to provide services which are not feasible with only a single cache. The techniques employed are:

Caching of recently accessed files.
Pre-fetching of files which may be accessed shortly. (The selection of files to pre-fetch is based on the currently referenced file. We use both a general model of file access patterns, and the history of access patterns of individual files).
Deferment of loading large and/or low priority files.
Automatic updating of frequently used pages during off-peak times.
File compression and differencing. (If an updated file is similar to its previous version, only the differences between the files are transmitted).

2. Background and previous work

Two techniques generally used to implement proxies are mirroring, and caching.

A mirror is a server which stores all or part of the files from one or more other servers. Clients connect to the mirror and obtain files from it instead of the primary site. Mirroring is popular for FTP servers, and most major FTP servers have a series of mirror sites in various geographic locations.

Mirroring suffers from a number of drawbacks, namely:

Files are mirrored only at periodic intervals. Therefore, some files in the mirror may be out of date.
Many mirrored files will rarely, or never, be accessed. Therefore, bandwidth used for copying files from the primary to the mirror servers, and storage space on the mirrors, are wasted.
Users may not be aware of the best mirror site to use, and may access inappropriate sites, nullifying any bandwidth savings.
A mirror can only store a small fraction of the files on the Internet. Therefore users cannot depend on the mirror for access to all the files they need.

To be effective, a mirror should contain a good proportion of the frequently accessed files, be connected to the clients by high-speed connections, and be well publicized throughout its intended audience. One site which has done this fairly successfully is the Hensa Unix site in the U.K. [3].

Caching is widely used to improve response times in the World Wide Web (WWW). A cache is usually configured as a proxy and all file accesses are channeled through it. The cache stores local copies of files which have been accessed recently. Therefore, the access times of frequently accessed files will decrease, as they are served locally.

The CERN HTTPD server [4] was one of the first to introduce a proxy cache. The Harvest[5] caching system introduced a set of hierarchical caches, with requests being forwarded from the local cache to other caches. A cache can also be run on a Web server, which then acts as an accelerator, speeding up the responses from that server. An extensive network of hierarchical caches can be expected to reduce response times and server loads dramatically. The Netscape HTTPD server also incorporates a cache.

3. System Model

3.1. The Model

The system views the Internet as a set of clients which are connected to the rest of the world via a low-speed link. Servers containing files for applications such as WWW are located on the far side of the link. All accesses to the servers (for these applications) are channeled through a gateway. The gateway comprises two parts, the local gateway on the client side of the link, and the remote gateway on the server side of the link, as shown in Figure 1.

Figure 1: The system model

3.2. Assumptions

The client side has a single gateway to the rest of the Internet through which all HTTP requests are processed.
The local gateway is connected via a low bandwidth link (e.g. 64 kbits/sec) to a remote gateway.
There is only one local gateway connected to the remote gateway.
Any non-WWW traffic on the link is ignored.
In implementing a system, we have no control over any software or systems past the remote gateway. In particular, we cannot modify WWW server software.

3.3. Performance-enhancing measures

A number of techniques are used to improve performance.

A conventional cache

The cache on the local gateway stores recently accessed files. Each cached file is assigned a time-to-live, (TTL). If a requested file is in the cache, and its TTL has not expired, it is retrieved from the cache. If it has expired, the server is checked (via the remote gateway) to see if it has been modified.

File pre-fetching

Users do not generally access files at random. Although the access pattern of a user is not deterministic, we can obtain a good idea of the files likely to be next accessed based on the currently accessed file. For example, for a Web page, any in-line images will be accessed immediately with a high probability if the load in-line images option is set in the browser. Links in the document, e.g., the next page link also have a fairly high access probability.

We call this set of files with a high access probability, the neighborhood of a file. The neighborhood is computed using:

the previous access patterns for the specific document (e.g. of the previous 100 users who looked at page xxx, 50 then requested yyy and 40 requested zzz, therefore it is a good assumption that this user will want yyy or zzz),
known characteristics of HTML (e.g., in-line images), and
general behavior/access patterns (e.g. most users request one of the documents returned by a search).

Ideally, neighborhood information for each document should be maintained with the document itself. However, it is not feasible for us to modify WWW server software. Therefore this information must be maintained within our system, either at the remote or local gateways

Deferred file loading

When a user selects a large file during a busy time, then the priority of the file is reduced, and deferred to a less busy time such as at night.

Automatic updating

The system intelligently tracks when files expire and become updated (e.g. newspapers are updated once per day, and share prices change faster). Frequently read files are updated automatically when traffic levels are low.

File compression and differencing

Standard compression algorithms are used between the two gateways [6].

Additionally a differencing technique is used when loading updated documents. Many documents, when updated, are not revised completely. This is exploited and only the differences are transmitted.

3.4. Types of document requests

In order to accommodate the above performance-enhancing measures, three type of requests for documents are made by the local gateway:

User requests: an actual request for a WWW document initiated from a user.
Predicted request: a request made for a document, for which the user has not made a request, but for which they may access in the near future. This would have a lower priority than a user request.
Update requests: for frequently accessed pages. These requests have low priority.

4. System design

Figure 2 shows the top-level data flow diagram, [7], for the caching system.

The local gateway receives requests from the client and passes them to the remote gateway as needed. The remote gateway issues requests to servers, and manages the data flow over the link. Both the gateways run algorithms for pre-fetching files. The local gateway requests files based on the previous access patterns of the current document, while the remote gateway requests files based on the contents of the file and general access patterns for that type of file.

In addition to the functions shown in Figure 2, it is also necessary to consider an abort request. The system in the diagram above can be readily extended to show this, but it has been omitted in order not to distract from the key points.

Figure 2: Level 1 DFD for the Caching System

4.1. Algorithms for the local gateway

The prime responsibilities of the local gateway are to:

Record all user requests, and maintain a history information of previous requests.
Maintain (i.e. update and purge) a cache of recently loaded documents.
Decide which documents are required for the predicted and update requests.
Transmit predicted and update requests for documents to the remote gateway.
Receive and decompress documents.
Forward documents to users.

4.1.1. Main processes

Record request

Receive user request, and store in log (user's IP address, URL, date, and time).

Instruct Send Document process to prepare to send a document to the user(s) from local cache.

Request document

A requested document can be in one of three states:

No document or incomplete document present (request document, at high priority. If a request has already been made, then increase the priority).
Complete, up to date, document is already in the cache (do nothing).
Expired document present (request if document modified).

Modify the priority of predicted requests made on the basis of the old neighborhood (Rationale: The neighborhood has changed following a new request).

Determine the neighborhood for the new request (from the history information), and issue requests to load documents within that neighborhood.

Check whether any documents have expired, and if so, ask for updates.

Send document

Receive instructions from Record Request to send document.

Check whether any part of a valid (i.e. up-to-date) copy of the document is in the cache. If so, transmit to all users who requested it.

Issue messages to the user informing of the deferment of large documents.

Purge cache

Remove pages on the basis of last requested time and expiry date.

4.1.2. Request document process

The request document process is the most complex within the local gateway. Further details of its operation are shown in Figure 3, and key processes are described below.

Figure 3: DFD for Local Gateway

Determine/change neighborhood

Requests are assigned a priority, between the values 0 (lowest priority) and N (highest priority), where N is a system parameter. An actual user request will be assigned top priority N. For each document the history log will store a parameter which shows how often document i was requested following that document.

The function for assigning priorities to a pre-empted request for document i should:

return an integer value between 0 and N-1,
be monotonically increasing with respect to . (The higher the proportion of previous requests made for that page, then the more likely that the next request will be for that page).
be monotonically increasing with respect to (the more requests made, then the greater our knowledge of history, and hence the greater confidence which we have in the prediction).

Maintain history

A required administrative task is to periodically reset counters for the history information. They are updated by halving and rounding down, and removing any redundant entries.

Send update requests

If a document is frequently read, and changes periodically (e.g. a newspaper changes daily) then it is a suitable candidate for being updated. Examining the history file and access log will determine how often the document changes, when it changes, and how often it is read. From these statistics the system estimates whether it is worth updating a document.

4.2. Algorithms for the remote gateway

The prime responsibilities for the remote gateway are to:

Receive requests from the local gateway.
Request and load documents from WWW servers or other caches (on the basis of the priority of the received requests).
Use heuristics to predict which other documents might soon be required on the basis of the contents of requested documents (Note: the local gateway uses specific history information).
Compress and transmit documents to the local gateway (on the basis of their priority). Sometimes these documents will not have yet been requested, in which case a push or forced send is required.

Figure 4 shows the main processes within the remote gateway.

4.2.2. Overall priority

Suppose that there is more than one request for a document. The overall priority of that request is calculated. Suppose that request i for document has priority (between 0 and N). Then the overall priority for that document (for deciding whether to loaded and forward it) is defined to be:

overall_priority = N * {1 - (1 - pi / N)}

This formula has the desirable properties that:

if and only if there is a request at top priority N, then the overall priority is N,
the overall priority is greater than or equal to each individual priority level.

Figure 4: DFD for the Remote Gateway

5. System protocols

5.1. Overview of the protocols

The protocols used to transfer messages between the components of the system are shown in Figure 5.

Figure 5: Overview of the protocols

Client HTTP requests[8] are received by the proxy at the local gateway over connection 1. The filtering process filters out the requests which cannot be handled by the gateway (non-relevant requests). These are forwarded directly to the server over connection 3. Relevant requests are dispatched to the remote gateway using the gateway protocol.

Requests are processed by the Relay Process at the remote gateway. If the requested object is not in the remote cache it will set up an HTTP connection to the server to fetch the document (Connection 4). Once the document is available, the Record Document process will read it and the Send Document process will transfer it to the Local Gateway.

TCP connections 1, 3, and 4 are formed on request. TCP connection 2 is permanently formed.

5.2. The gateway to gateway relay protocol (GGRP)

The function of the GGRP [9] is to transfer requests from the local to the remote gateways and data and status information in the opposite direction. The protocol is modeled as a finite state machine.

5.2.1. The message structure and the operation of GGRP

The messages are divided into blocks of maximum size 256 bytes. Messages are multiplexed on the channel on the basis of the priority of the request (which may change dynamically). This process also gives priority to small objects over large objects.

As a reply is transferred in blocks, the reply message contains information to reassemble the full object at the destination. Flags identify the first and last blocks of a file. All blocks contain request ID and the byte offset from the beginning of the file to reconstruct the object at the local gateway.

5.2.2. The interface to the protocol

The protocol provides the following interface to the higher-level processes shown in Figure 2:

At the local gateway:

Send_request: Add a URL request to the local request queue.

Receive_data: At least the first part of a file has been received by the local gateway.

Change_priority: The priority of a previous request is changed. The cancellation of a previous request is achieved by a message with the lowest priority.

Receive_status: Indicates an error or other condition at the remote gateway.

At the remote gateway:

Receive_request: Indicates a file is requested by the local gateway.

Send_data: Initiates the transfer of a document to the local gateway.

Receive_priority_change: Indicates a new priority for a previous request.

Send_status: Notifies the local gateway of errors, etc.

6. Dimensioning and performance issues

6.1. Pre-fetching

A major feature of our system not present in conventional caching systems is the pre-fetching of files in anticipation of user requests. This will cause an increase in the load on the link for transmitting possibly un-needed files, which may seem unwarranted if the link is already heavily loaded. However, it provides performance improvements for two reasons.

Pre-fetch requests are at a lower priority than user requests. Therefore, they never cause delays of files fetched for user requests. We observe that unless a link is heavily congested, there are intervals during which it is idle. Pre-fetched files can be transmitted during these intervals.
Pre-fetched files are cached on the remote gateway. Therefore, when they are requested by a user, they can be transmitted immediately, without the server access delay. This significantly speeds access to slow servers, especially for in-line images.

6.2. Link overhead

The communication between the gateways, especially the messages regarding changes in the neighborhood, causes a load on the link. However, since most of this traffic is from the local to the remote gateway, and most of the data flows in the opposite direction, this does not have a significant effect on performance. Additionally, requests are aggregated to reduce the load on the link.

6.3. Cache and history size

Ideally, the local cache should store all files ever accessed by clients, and the remote cache should store these plus all pre-fetched files. In practice, the cache may store only several days worth of requested files, and only a few minutes worth of pre-fetched files.

The history list, however, needs to have many more entries; we suggest at least six months worth. This allows us to predict access patterns even for files which have been purged from the cache. Keeping history information for 1 million URLs (about six months worth) would only require on the order of 100MB of disc space, which can be considered reasonable.

7. Conclusions and future work

7.1. Conclusions

The current prototype does not yet enable us to draw hard conclusions about the methods described above. Detailed analysis and experimentation is required in order to determine the relative strengths and weaknesses of each of the techniques outlined in this paper. Having demonstrated their feasibility, a product could be developed for wider implementation.

There are, however, a number of other ways in which the system could be extended, and these are described below.

7.2. Topological extensions to the system

The most natural extensions to make to the system are to:

connect one remote gateway to multiple local gateways,
connect one local gateway to multiple remote gateways,
chain systems together in a hierarchy or a network.

During the design phase attempts have been made to ensure that the system is extendible to the above scenarios.

7.3. Optimizing system parameters

Extensive experimentation and analysis is required in order to set system parameters and optimize algorithms. Examples include:

the function for determining the priority of a predicted request.
the algorithm for determining lowest_p, the priority for requests at the remote gateway.

7.4. Protocol and functional improvements

Introduce a protocol to exchange history and user behavior among caches and to and from servers.

The protocol should be extended to handle applications beyond WWW such as News and FTP.

Current compression algorithms mainly concentrate on text. Further algorithms are required for high-resolution images, audio and real-time data.

The system should keep track of individual user's behavior. Examples include,

the proportion of in-line images which each user loads could be included in the prediction algorithm,
pre-fetching of documents under the user's control (at low priority),
not loading, or loading at only low priority, parts of documents which are not yet required.

References

Tim Berners-Lee, "Propagation, Replication and Caching on the Web", http://www.w3.org/pub/WWW/Propagation/activity.html
M.S.D. Fernando et. al., "Multimedia Message Distribution in a Constrained Environment", Proc. INET 95.
"The U.K. National Web Cache at HENSA Unix", http://www.hensa.ac.uk/wwwcache/
"W3C httpd", http://www.w3.org/pub/WWW/Daemon/
Anawat Chankhunthod et. al., "A Hierarchical Object Cache", ftp://ftp.cs.colorado.edu/pub/-techreports/schwartz/HarvestCache.ps.Z
Mark Nelson, The Data Compression Book, M&T Books, 1991.
James Rumbaugh et. al., Object Modelling and Design, Prentice Hall, 1991.
W3 Consortium, "Hypertext Transfer Protocol", http://www.w3.org/pub/WWW/Protocols
Gihan Dias, Graham Cope & Ravi Wijayaratne, "The Design of a Smart Caching System", Technical Report, Dept. of Computer Science of Engineering, University of Moratuwa, 1996.

Authors

Graham Cope Studied mathematics at Cambridge, England. He then went on to do an M.Sc. in information technology at Imperial College London, which he received in 1987. Between 1987 and 1994 he worked at BT's research laboratories in the Performance Engineering Division devising and modelling a number of network design, modelling and optimization tools and techniques. Since 1994, he has been a lecturer at the University of Moratuwa, Sri Lanka, specializing in computer networks and software engineering.

Gihan Dias graduated from the University of Moratuwa in 1985 and received an MS and Ph.D. from the University of California. He is currently senior lecturer at the University of Moratuwa, and the director of the Lanka Academic and Research Network. He is also a director of Information Laboratories (Pvt) Ltd. His research interests are in distributed computing systems and computer networks.

Ravi Wijayaratne graduated from the University of Moratuwa in 1995 and worked for a year at Lanka Internet Services as a systems engineer before joining the university staff as a lecturer. His research interests are computer networks and distributed computing.