Suyoung Yoon <syyoon@kt.co.kr>
Eunsook Jin <ejin@kt.co.kr>
Jungmin Seo <jmseo@kt.co.kr>
Ju-Won Song <jwsong@kt.co.kr>
Multimedia Technology Research Lab., Korea Telecom
Republic of Korea
Therefore, various research efforts have been made to further increase the hit ratio. One of them is anticipating future document requests and prefetching those documents into a local cache before the actual client requests arrive. In this paper, we propose a new prefetching approach which combines the server's knowledge of brand-new (recently created or modified) documents with client behavior. Previous prefetching methods need to examine and analyze the Web server's log file to predict future client requests. This may degrade performance because log files generally contain an enormous number of records. Instead, our brand-new approach does not use the access log at all, but exploits the tendency of Web clients to request up-to-date documents.
This paper is organized as follows. In Section 2, we review related work on Web prefetching. A detailed description of our prefetching method is given in Section 3. Section 4 shows experimental results of our prefetching approach, followed in Section 5 by a discussion of combining our scheme with other methods. The paper concludes with a summary of results in Section 6.
Web document prefetching is a rather new research area in which work is currently very active [2,3,6,8,10,11,12]. Bestavros implemented a system in which the server responds to a client request with, in addition to the requested resource, a list of related resources as observed by the server [2,3]. Related resources are determined from resource dependencies based on previous resource requests to the server. Wachsberg et al. proposed a prefetching scheme based on past user behavior [12]. They developed a client-based approach whereby each proxy keeps a list of documents needed by its clients and decides which of them to prefetch. Unlike other prefetching approaches, theirs does not utilize server-side information.
Hine et al. investigated ways of exploiting knowledge about client behavior to improve Web performance [6]. The knowledge they used includes dependencies between resources and general knowledge of each client's behavior. They explored two directions for making use of this information. The first combines knowledge of client behavior with resource dependencies to reduce latency by prefetching selected resources. The second uses knowledge of dependencies to influence cache replacement decisions.
Markatos et al. proposed the Top 10 approach, which combines the server's active knowledge of its most popular documents with client access profiles [8]. In this mechanism, when a client requests a document, the server sends both the requested document and the most popular documents, selected according to the client access profiles. In contrast to previously proposed, more sophisticated prefetching algorithms, their approach uses a simple and easy-to-calculate metric.
While most previously developed prefetching methods need to examine server logs to anticipate clients' future requests, our brand-new approach does not have to access server logs at all. A detailed description of our proposed prefetching method is given in the next section.
We believe that the above classification can easily be applied to Internet users. When a user of type 1 visits Internet game sites, he (or she) tries to download the most popular programs. Type 1 users are usually beginners in that area, and the Web sites help them find programs by listing games along with their popularity. The Top 10 approach [8] is appropriate for type 1 clients since it prepares popular documents before the actual client requests.
Suppose someone visits the 'infoseek' search engine to get information about 'push' technology. If he (or she) enters the word 'push' as the keyword and clicks the 'search' button, the search engine shows the items related to the keyword. Each listed item has a brief description and a link from which detailed information can be obtained. The order of the links is determined by the popularity and quality of the information, and the client usually clicks the first few items. Bestavros [2,3] and Hine [6] exploited this tendency of type 2 clients; that is, type 2 clients usually try to find documents related to a subject.
Finally, let's assume someone visits the 'ESPN' Web site to see today's NBA scores, or the 'dilbert.com' site to see today's cartoon. Similarly, when someone visits the 'Amazon' bookstore to buy books, he usually clicks the links highlighted by 'New' icons. We derive the idea of prefetching brand-new documents from these kinds of Web clients: we believe that frequent users are more likely to want up-to-date information from the Web.
The key issue in prefetching is anticipating future user needs. This is difficult, since no program can predict the future. Most existing solutions predict future user requests from the previous access patterns of Web clients by analyzing the access log of the Web server. Instead, our approach does not use the access log but exploits the tendency of Web clients to request recent documents. As in other prefetching methods, we assume the use of proxies that manage the cache and aggregate requests to the server. Web clients who frequently use the Internet generally try to get up-to-date information from the Web. Therefore, in our method, the Web server keeps the set of brand-new documents, S_N, which have been modified or created recently. On receiving a request from a client through a proxy, the server sends both the requested document and S_N to the proxy. The proxy delivers the requested document to the client and stores S_N in the cache using its own caching policy. The size and contents of the documents to be prefetched are determined by cooperation between the client and the server.
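For illustration, the following is a minimal sketch, in Python, of the server-side behavior just described. The class name, the fixed freshness window, and the respond interface are our own assumptions, not part of any implementation described in this paper.

    import time

    # Minimal sketch of the server side: the server tracks document
    # creation/modification times and ships the brand-new set S_N
    # along with every response.
    class BrandNewServer:
        def __init__(self, freshness_window=24 * 3600):
            self.freshness_window = freshness_window  # seconds a document counts as "brand-new"
            self.mtimes = {}                          # URL -> last created/modified timestamp

        def touch(self, url):
            # Record that a document has just been created or modified.
            self.mtimes[url] = time.time()

        def brand_new_set(self):
            # S_N: documents created or modified within the freshness window.
            now = time.time()
            return {u for u, t in self.mtimes.items()
                    if now - t <= self.freshness_window}

        def respond(self, requested_url):
            # Reply with the requested document together with S_N.
            return "<body of %s>" % requested_url, self.brand_new_set()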
Figure 1 shows the basic architecture of our brand-new prefetching approach. Two entities are added to the typical Web client/server environment: the client agent, on the client side, receives brand-new documents from Web servers and caches them; the brand-new agent, on the server side, is responsible for collecting the brand-new documents. The circled numbers in the figure represent steps of information or request flow. When a Web client requests a URL, the Web browser makes the HTTP request to the Web server via the client agent (step 1). The client agent intercepts the request and examines whether it can be served from its cache of prefetched documents. If the client agent finds the document in its cache, it serves the request from the cache (step 2). Otherwise, it forwards the request to the Web server as an ordinary HTTP request (step 3). After receiving the HTTP request, the Web server sends the brand-new documents prepared by the brand-new agent, as well as the requested document, to the client agent (steps 4 and 5).
When the client agent receives the results from the Web server, it serves the HTTP request initiated by the client (step 6) and also saves the brand-new documents from the server into its local cache (step 7). The cache can be operated under any of the existing cache replacement policies; in our experiments we use LRU, the most widely used policy.
Figure 1: Basic flow of brand-new prefetching.
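A corresponding sketch of the client agent (steps 1 through 7), with the LRU cache built on Python's OrderedDict, is shown below; class and method names are illustrative only.

    from collections import OrderedDict

    # Illustrative client agent with an LRU prefetch cache.
    class ClientAgent:
        def __init__(self, server, capacity=100):
            self.server = server               # e.g. the BrandNewServer sketched above
            self.cache = OrderedDict()         # URL -> document body, kept in LRU order
            self.capacity = capacity

        def request(self, url):
            if url in self.cache:              # step 2: hit in the prefetch cache
                self.cache.move_to_end(url)    # refresh the LRU position
                return self.cache[url]
            # Steps 3-5: forward the request; the server replies with the document and S_N.
            doc, brand_new = self.server.respond(url)
            for bn_url in brand_new:           # step 7: store S_N under the LRU policy
                self._insert(bn_url, "<prefetched body of %s>" % bn_url)
            return doc                         # step 6: serve the original request

        def _insert(self, url, body):
            self.cache[url] = body
            self.cache.move_to_end(url)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False) # evict the least recently used entry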
The performance of the basic implementation can be enhanced by adopting Web caching policies like the piggyback caching algorithm [7]. Using the piggyback approach, when a client makes an HTTP request, the client agent piggybacks the list of its cache contents onto the request. In response, the Web server does not send brand-new documents that appear in the piggybacked list, which reduces network traffic.
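A sketch of this piggyback variant, as a hypothetical extension of the classes above: the server simply subtracts the piggybacked cache list from S_N before replying.

    # Hypothetical piggyback variant: the client agent sends its cache contents,
    # and the server omits brand-new documents the agent already holds.
    def respond_with_piggyback(server, requested_url, piggybacked_urls):
        doc, brand_new = server.respond(requested_url)
        return doc, brand_new - set(piggybacked_urls)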
Another method of collecting brand-new documents is to create a special directory for them. When Web server applications add (or modify) documents, they put the documents under this directory. The brand-new agent then does not have to search the whole directory tree managed by the Web server; instead, the Web server considers all documents under this directory to be 'brand-new'.
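Under this special-directory scheme, the brand-new agent reduces to a directory listing; a sketch using only the standard library follows (the directory path is an assumption for illustration).

    import os

    # Sketch: every file under a dedicated directory is 'brand-new' by
    # convention, so no scan of the whole document tree is needed.
    def collect_brand_new(brand_new_dir="/var/www/brand-new"):
        docs = []
        for root, _dirs, files in os.walk(brand_new_dir):
            for name in files:
                docs.append(os.path.join(root, name))
        return docs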
With a slight modification of the client agent in the basic architecture, our prefetching method can easily be extended to a multi-level proxying structure. On each level, a client agent Ci cooperates with its proxy to prefetch documents from the next-level client agent Cj, where Cj acts as a server to Ci and as a client to the next-level proxy.
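In terms of the sketch above, this extension amounts to giving the client agent the same respond interface as the server, so that an agent Ci can treat the next-level agent Cj as its server. Offering the whole cache downward as the brand-new candidates, as below, is only one naive choice.

    # Sketch: a client agent that also acts as a server for a lower-level agent.
    class ProxyClientAgent(ClientAgent):
        def respond(self, requested_url):
            doc = self.request(requested_url)   # serve from own cache or from upstream
            return doc, set(self.cache)         # offer cached brand-new documents downward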
In the push environment, most documents moving through the network have been recently created or modified. Therefore, our brand-new prefetching method should perform even better in this environment.
KTRDG (Korea Telecom Research and Development Group) (rcunix.kotel.co.kr): traces from KTRDG's homepage
We know that it is important to use traces from a variety of sources, since different server traces reflect access patterns from different client bases. However, there is a limitation in evaluating the performance of our prefetching approach by trace-driven simulation: we need additional information, the set of brand-new documents, and those documents cannot be obtained from trace logs. To get the set of brand-new documents, we must access all directories that the Web server manages. We used one week of trace logs containing about 500,000 requests, preprocessed to remove requests to "cgi" programs.
We use the hit ratio and the traffic increase as performance measures in our experimental evaluation. The hit ratio is the ratio of requests served from prefetched documents to total requests; the higher this ratio, the lower the client latency and the server load. Prefetching approaches inherently increase traffic through unsuccessfully prefetched documents: since no program can predict the future, some prefetched documents will never actually be requested by clients and thus should not have been prefetched in the first place.
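Concretely, given the raw counts from a simulation run, the two measures could be computed as follows. The variable names are ours, and the exact definition of traffic increase used in our experiments may differ in detail.

    # Sketch of the two performance measures.
    def hit_ratio(prefetch_hits, total_requests):
        # Fraction of requests served from prefetched documents.
        return prefetch_hits / total_requests

    def traffic_increase(prefetched_bytes, demand_bytes):
        # Extra traffic caused by prefetching, relative to demand-only traffic.
        return prefetched_bytes / demand_bytes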
Figure 3 shows the traffic increase as a function of the cache size. The traffic increase is mainly caused by unsuccessfully prefetched documents and by the piggyback overhead of delivering the cache information to the server.
Figure 2: Hit ratio without proxy. Figure 3: Network traffic.
Figures 4 and 5 plot hit ratios and traffic increase for proxies as a function of the cache size. We see that the hit ratio of prefetching with a proxy rises to about 40%, compared to Figure 2, while the traffic increase changes only slightly.
Figure 4: Hit ratio using proxy. Figure 5: Network traffic using proxy.
S_N = {set of the first N_B elements from the brand-new prefetching documents}
      UNION {set of the first N_T elements from the Top 10 prefetching documents}

where
    N_B = N_P * H_B / (H_B + H_T)
    N_T = N_P * H_T / (H_B + H_T)
    N_P = the total number of documents to be prefetched
    H_B = the hit ratio of the brand-new prefetching approach
    H_T = the hit ratio of the Top 10 approach
N_P can be determined by analyzing client behavior: for frequent clients its value will be large, while for occasional clients it will be small. H_B and H_T, which control the prefetching mix, can also be defined as system parameters of the Web server, set according to the tendency of the clients who access it. If most clients are of type 3, H_B will be relatively large.
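The split of the prefetch budget follows directly from the formula above; a small worked sketch is given below (the rounding choice is ours).

    # Sketch: splitting the prefetch budget N_P between the two schemes
    # in proportion to their observed hit ratios.
    def split_budget(n_p, h_b, h_t):
        n_b = round(n_p * h_b / (h_b + h_t))
        return n_b, n_p - n_b

    # Example: N_P = 10, H_B = 0.4, H_T = 0.2  ->  N_B = 7, N_T = 3
    print(split_budget(10, 0.4, 0.2))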
To support prefetching in our mechanism, the HTTP protocol between Web clients and Web servers needs to be extended to allow servers to deliver brand-new documents, in addition to the requested documents, to the client agent.
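One plausible realization of such an extension (the header name below is our assumption, neither a standard nor part of this paper) is for the server to advertise S_N in an extra response header, letting the client agent then fetch the listed documents.

    # Hypothetical protocol extension: advertise S_N via a non-standard header.
    def add_brand_new_header(response_headers, brand_new_urls):
        response_headers["X-Brand-New"] = ", ".join(sorted(brand_new_urls))
        return response_headers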
Our simulation results show that a substantial reduction in the latency perceived by clients can be achieved at the cost of a comparable increase in network traffic. Since the retrieval time of a document includes a substantial startup latency, prefetching is more effective at reducing access time than simply increasing bandwidth.
The brand-new prefetching we propose is especially appropriate for Web servers such as news sites like CNN and electronic shopping malls like Amazon, where documents are frequently created or modified and Web clients tend to want up-to-date information. In addition, our prefetching method should perform even better in the push environment, where clients usually want the latest information.
We leave experiments on combining our method with existing prefetching methods as future work.