Suyoung Yoon <syyoon@kt.co.kr>
Eunsook Jin <ejin@kt.co.kr>
Jungmin Seo <jmseo@kt.co.kr>
Ju-Won Song <jwsong@kt.co.kr>
Multimedia Technology Research Lab., Korea Telecom
Republic of Korea
Therefore, various research efforts have been made to further increase the hit ratio. One of them is anticipating future document requests and prefetching those documents into a local cache before the actual client requests arrive. In this paper, we propose a new prefetching approach which combines the server's knowledge of brand-new (recently created or modified) documents with client behavior. Previous prefetching methods need to examine and analyze the Web server's log file to predict future client requests. This may degrade performance because log files generally contain an enormous number of records. Instead, our brand-new approach does not use the access log at all, but exploits the tendency of Web clients to request up-to-date documents.
This paper is organized as follows. In Section 2, we review related work on Web prefetching. A detailed description of our prefetching method is given in Section 3. Section 4 shows experimental results of our prefetching approach, followed in Section 5 by a discussion of combining our scheme with other methods. The paper concludes with a summary of results in Section 6.
Web document prefetching is a rather new research area in which work is currently very active [2,3,6,8,10,11,12]. Bestavros implemented a system in which the server responds to a client request with, in addition to the requested resource, a list of related resources as observed by the server [2,3]. Related resources are determined from resource dependencies based on previous resource requests to the server. Wachsberg et al. proposed a prefetching scheme based on past user behavior [12]. They developed a client-based approach whereby each proxy keeps a list of documents needed by its clients and decides which of them to prefetch. Unlike other prefetching approaches, theirs does not utilize server-side information.
Hine et al. investigated ways of exploiting knowledge about client behavior to improve Web performance [6]. The knowledge they used includes dependencies between resources and general knowledge of each client's behavior. They explored two directions for making use of this information. The first combines knowledge of client behavior with resource dependencies to reduce latency by prefetching selected resources. The second uses knowledge of dependencies to influence cache replacement decisions.
Markatos et al. proposed the Top 10 approach, which combines the server's active knowledge of its most popular documents with client access profiles [8]. In this mechanism, when a client requests a document, the server sends both the requested document and the most popular documents, selected according to the client access profiles. In contrast to previously proposed, more sophisticated prefetching algorithms, their approach uses a simple and easy-to-calculate metric.
While most previously developed prefetching methods need to examine server logs to anticipate clients' future requests, our brand-new approach does not have to access server logs at all. A detailed description of our proposed prefetching method is given in the next section.
We believe that the above classification can easily be applied to Internet users. When a user of type 1 visits Internet game sites, he (or she) tries to download the most popular programs. Type 1 users are usually beginners in that area, and the Web sites help them find programs by listing games along with their popularity. The Top 10 approach [8] is appropriate for type 1 clients since it prepares popular documents before the actual client requests.
Suppose someone visits the 'infoseek' search engine to get information about 'push' technology. If he (or she) enters the word 'push' as the keyword and clicks the 'search' button, the search engine shows the items related to the keyword. Each listed item has a brief description and a link from which detailed information can be obtained. The order of the links is determined by the popularity and quality of the information, and the client usually clicks the first few items. Bestavros [2,3] and Hine [6] exploited this tendency of type 2 clients; that is, type 2 clients usually try to find documents related to a subject.
Finally, let's assume someone visits the 'ESPN' Web site to see today's NBA scores, or the 'dilbert.com' site to see today's cartoon. Similarly, when someone visits the 'Amazon' bookstore to buy books, he usually clicks the links highlighted by 'New' icons. We derive the idea of prefetching brand-new documents from these kinds of Web clients: we believe that frequent users are more likely to want up-to-date information from the Web.
The key issue in prefetching is anticipating future user needs. This is difficult, since no program can predict the future. Most existing solutions predict future user requests from the previous access patterns of Web clients by analyzing the access log of the Web server. Instead, our approach does not use the access log but exploits the tendency of Web clients to request recent documents. As in other prefetching methods, we assume the use of proxies that manage the cache and aggregate requests to the server. Web clients who frequently use the Internet generally try to get up-to-date information from the Web. Therefore, in our method, the Web server keeps the set of brand-new documents, S_N, which have been modified or created recently. On receiving a request from a client through a proxy, the server sends both the requested document and S_N to the proxy. The proxy delivers the requested document to the client and stores S_N in the cache using its own caching policy. The size and contents of the documents to be prefetched are determined by cooperation between the client and the server.
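For illustration, the following is a minimal sketch, in Python, of the server-side behavior just described. The class name, the fixed freshness window, and the respond interface are our own assumptions, not part of any implementation described in this paper.

    import time

    # Minimal sketch of the server side: the server tracks document
    # creation/modification times and ships the brand-new set S_N
    # along with every response.
    class BrandNewServer:
        def __init__(self, freshness_window=24 * 3600):
            self.freshness_window = freshness_window  # seconds a document counts as "brand-new"
            self.mtimes = {}                          # URL -> last created/modified timestamp

        def touch(self, url):
            # Record that a document has just been created or modified.
            self.mtimes[url] = time.time()

        def brand_new_set(self):
            # S_N: documents created or modified within the freshness window.
            now = time.time()
            return {u for u, t in self.mtimes.items()
                    if now - t <= self.freshness_window}

        def respond(self, requested_url):
            # Reply with the requested document together with S_N.
            return "<body of %s>" % requested_url, self.brand_new_set()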
Figure 1 shows the basic architecture of our brand-new prefetching approach. Two entities are added to the typical Web client/server environment: the client agent, on the client side, receives brand-new documents from Web servers and caches them; the brand-new agent, on the server side, is responsible for collecting the brand-new documents. The circled numbers in the figure represent steps of information or request flow. When a Web client requests a URL, the Web browser makes the HTTP request to the Web server via the client agent (step 1). The client agent intercepts the request and examines whether it can be served from its cache of prefetched documents. If the client agent finds the document in its cache, it serves the request from the cache (step 2). Otherwise, it forwards the request to the Web server as an ordinary HTTP request (step 3). After receiving the HTTP request, the Web server sends the brand-new documents prepared by the brand-new agent, as well as the requested document, to the client agent (steps 4 and 5).
When the client agent receives the results from the Web server, it serves the HTTP request initiated by the client (step 6) and also saves the brand-new documents from the server into its local cache (step 7). The cache can be operated under any of the existing cache replacement policies; in our experiments we use LRU, the most widely used policy.
Figure 1: Basic flow of brand-new prefetching.
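A corresponding sketch of the client agent (steps 1 through 7), with the LRU cache built on Python's OrderedDict, is shown below; class and method names are illustrative only.

    from collections import OrderedDict

    # Illustrative client agent with an LRU prefetch cache.
    class ClientAgent:
        def __init__(self, server, capacity=100):
            self.server = server               # e.g. the BrandNewServer sketched above
            self.cache = OrderedDict()         # URL -> document body, kept in LRU order
            self.capacity = capacity

        def request(self, url):
            if url in self.cache:              # step 2: hit in the prefetch cache
                self.cache.move_to_end(url)    # refresh the LRU position
                return self.cache[url]
            # Steps 3-5: forward the request; the server replies with the document and S_N.
            doc, brand_new = self.server.respond(url)
            for bn_url in brand_new:           # step 7: store S_N under the LRU policy
                self._insert(bn_url, "<prefetched body of %s>" % bn_url)
            return doc                         # step 6: serve the original request

        def _insert(self, url, body):
            self.cache[url] = body
            self.cache.move_to_end(url)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False) # evict the least recently used entry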
The performance of the basic implementation can be enhanced by adopting Web caching policies like the piggyback caching algorithm [7]. Using the piggyback approach, when a client makes an HTTP request, the client agent piggybacks the list of its cache contents onto the request. In response, the Web server does not send brand-new documents that appear in the piggybacked list, which reduces network traffic.
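A sketch of this piggyback variant, as a hypothetical extension of the classes above: the server simply subtracts the piggybacked cache list from S_N before replying.

    # Hypothetical piggyback variant: the client agent sends its cache contents,
    # and the server omits brand-new documents the agent already holds.
    def respond_with_piggyback(server, requested_url, piggybacked_urls):
        doc, brand_new = server.respond(requested_url)
        return doc, brand_new - set(piggybacked_urls)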
Another method of collecting brand-new documents is to create a special directory for them. When Web server applications add (or modify) documents, they put the documents under this directory. The brand-new agent then does not have to search the whole directory tree managed by the Web server; instead, the Web server considers all documents under this directory to be 'brand-new'.
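Under this special-directory scheme, the brand-new agent reduces to a directory listing; a sketch using only the standard library follows (the directory path is an assumption for illustration).

    import os

    # Sketch: every file under a dedicated directory is 'brand-new' by
    # convention, so no scan of the whole document tree is needed.
    def collect_brand_new(brand_new_dir="/var/www/brand-new"):
        docs = []
        for root, _dirs, files in os.walk(brand_new_dir):
            for name in files:
                docs.append(os.path.join(root, name))
        return docs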
With a slight modification of the client agent in the basic architecture, our prefetching method can easily be extended to a multi-level proxying structure. On each level, a client agent Ci cooperates with its proxy to prefetch documents from the next-level client agent Cj, where Cj acts as a server to Ci and as a client to the next-level proxy.
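In terms of the sketch above, this extension amounts to giving the client agent the same respond interface as the server, so that an agent Ci can treat the next-level agent Cj as its server. Offering the whole cache downward as the brand-new candidates, as below, is only one naive choice.

    # Sketch: a client agent that also acts as a server for a lower-level agent.
    class ProxyClientAgent(ClientAgent):
        def respond(self, requested_url):
            doc = self.request(requested_url)   # serve from own cache or from upstream
            return doc, set(self.cache)         # offer cached brand-new documents downward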
In the push environment, most documents moving through the network have been recently created or modified. Therefore, our brand-new prefetching method should perform even better in this environment.
KTRDG (Korea Telecom Research and Development Group) (rcunix.kotel.co.kr): traces from KTRDG's homepage
We know that it is important to use traces from a variety of sources, since different server traces reflect access patterns from different client bases. However, there is a limitation in evaluating the performance of our prefetching approach by trace-driven simulation: we need additional information, the set of brand-new documents, and those documents cannot be obtained from trace logs. To get the set of brand-new documents, we must access all directories that the Web server manages. We used one week of trace logs containing about 500,000 requests, preprocessed to remove requests to "cgi" programs.
We use the hit ratio and the traffic increase as performance measures in our experimental evaluation. The hit ratio is the ratio of requests served from prefetched documents to total requests; the higher this ratio, the lower the client latency and the server load. Prefetching approaches inherently increase traffic through unsuccessfully prefetched documents: since no program can predict the future, some prefetched documents will never actually be requested by clients and thus should not have been prefetched in the first place.
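Concretely, given the raw counts from a simulation run, the two measures could be computed as follows. The variable names are ours, and the exact definition of traffic increase used in our experiments may differ in detail.

    # Sketch of the two performance measures.
    def hit_ratio(prefetch_hits, total_requests):
        # Fraction of requests served from prefetched documents.
        return prefetch_hits / total_requests

    def traffic_increase(prefetched_bytes, demand_bytes):
        # Extra traffic caused by prefetching, relative to demand-only traffic.
        return prefetched_bytes / demand_bytes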
Figure 3 shows the traffic increase as a function of the cache size. The traffic increase is mainly caused by unsuccessfully prefetched documents and by the piggyback overhead of delivering the cache information to the server.
Figure 2: Hit ratio without proxy. Figure 3: Network traffic.
Figures 4 and 5 plot hit ratios and traffic increase for proxies as a function of the cache size. We see that the hit ratio of prefetching with a proxy rises to about 40%, compared to Figure 2, while the traffic increase changes only slightly.
Figure 4: Hit ratio using proxy. Figure 5: Network traffic using proxy.
S_N = {set of the first N_B elements from the brand-new prefetching documents}
      UNION {set of the first N_T elements from the Top 10 prefetching documents}

where
    N_B = N_P * H_B / (H_B + H_T)
    N_T = N_P * H_T / (H_B + H_T)
    N_P = the total number of documents to be prefetched
    H_B = the hit ratio of the brand-new prefetching approach
    H_T = the hit ratio of the Top 10 approach
N_P can be determined by analyzing client behavior: for frequent clients its value will be large, while for occasional clients it will be small. H_B and H_T, which control the prefetching mix, can also be defined as system parameters of the Web server, set according to the tendency of the clients who access it. If most clients are of type 3, H_B will be relatively large.
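The split of the prefetch budget follows directly from the formula above; a small worked sketch is given below (the rounding choice is ours).

    # Sketch: splitting the prefetch budget N_P between the two schemes
    # in proportion to their observed hit ratios.
    def split_budget(n_p, h_b, h_t):
        n_b = round(n_p * h_b / (h_b + h_t))
        return n_b, n_p - n_b

    # Example: N_P = 10, H_B = 0.4, H_T = 0.2  ->  N_B = 7, N_T = 3
    print(split_budget(10, 0.4, 0.2))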
To support prefetching in our mechanism, the HTTP protocol between Web clients and Web servers needs to be extended to allow servers to deliver brand-new documents, in addition to the requested documents, to the client agent.
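One plausible realization of such an extension (the header name below is our assumption, neither a standard nor part of this paper) is for the server to advertise S_N in an extra response header, letting the client agent then fetch the listed documents.

    # Hypothetical protocol extension: advertise S_N via a non-standard header.
    def add_brand_new_header(response_headers, brand_new_urls):
        response_headers["X-Brand-New"] = ", ".join(sorted(brand_new_urls))
        return response_headers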
Our simulation results show that a substantial reduction in the latency perceived by clients can be achieved at the cost of a comparable increase in network traffic. Since the retrieval time of a document includes a substantial startup latency, prefetching is more effective at reducing access time than simply increasing bandwidth.
The brand-new prefetching we propose is especially appropriate for Web servers such as news sites like CNN and electronic shopping malls like Amazon, where documents are frequently created or modified and Web clients tend to want up-to-date information. In addition, our prefetching method should perform even better in the push environment, where clients usually want the latest information.
We leave experiments on combining our method with existing prefetching methods as future work.