Brian D. DAVISON <firstname.lastname@example.org>
Rutgers, The State University of New Jersey
Replay or simulation using trace logs is a common method of evaluating proxy caches. Unfortunately, publicly available trace logs usually have serious deficiencies because of what is omitted from them. In this paper, we enumerate a number of these deficiencies and examine the benefits of more complete logs. In addition, we investigate some additional insights available from logs containing full content, and how they might apply to content-based prefetching proxies.
Almost all Web system performance evaluation concentrates on the use of Hypertext Transfer Protocol (HTTP) traffic logs. This paper argues, however, that most HTTP traffic logs are flawed. These deficiencies include inaccuracies but are mostly errors of omission and should be taken into account when considering logs for performance evaluation or workload characterization. The significance of this paper is twofold: (1) it raises issues of HTTP log utility for performance evaluation and workload characterization, and (2) it demonstrates the kind of information available from analyses of full-content logs (that is, those that contain both headers and bodies of HTTP requests and responses), especially for content-based prefetching caches.
We use both proxy and origin server logs in our analysis. To demonstrate the impact of the flaws we find, we use full-content proxy trace logs captured locally. Inaccuracies that we examine include logs that are out of date, that include out-of-order requests, and that do not reflect actual page modification times. In addition, we point out that logs are often captured for client populations other than a desired target client population, and that difficulties exist in client workload analysis because of a lack of a consistent user-to-client Internet Protocol mapping. Omissions that we cite include lack of HTTP headers such as referrer and browser tags, Pragma: no-cache instructions, cookies, and full uniform resource locators (URLs) (including query parameters).
When full-content logs are available for analysis, the request traffic can be more fully characterized. Noted characterizations include analyses based on the contents of the referrer field, the potential for content-based prefetching, the number of links per page requested, and truly cachable responses.
In our concluding remarks, we recommend that additional studies with larger full-content traces be performed, and that proxies should pass as much header information as provided by the client so that smart upstream proxies or servers can use that information to improve performance.
In dynamic systems such as the Internet, it is common practice to periodically record samples of the system's activity. Those samples are then used to characterize the activity in the system and to evaluate new mechanisms for use in the system. We are not the first to note difficulties with standard log files. For example, Slettjord  notes:
We would also like to know how many cachable documents that are non-cachable due to last-modified or expire headers that are set in such a way that an object expires before it is served from the origin server. This would give us an rough idea of how widespread cache-busting techniques are. But this information is not currently available from the log-file.
On the World Wide Web, logs of HTTP traffic are recorded continuously as a function of most origin Web servers as well as intermediate proxies. The primary function of these logs is to chronicle the operation of these systems; however, as samples of HTTP activity, logs generated by these systems (and others) are also used for characterization, evaluation and reporting. Occasionally, researchers capture HTTP traffic via other means, such as from augmented client browsers [8, 20, 32] or packet sniffing [12, 14, 17, 31].
Analysis of Web server logs for the purposes of reporting traffic patterns for advertising or customer analysis is common. In addition, proxy cache logs are sometimes analyzed for the calculation of popular Web sites. Neither purpose pertains to the topic of this paper, and neither is mentioned further.
Instead, the rest of this paper will focus on the use of HTTP traffic logs from various sources for characterizing Web traffic patterns and trends [12, 23, 28, 29]; for building analytical models of the same that can be used to generate artificial logs with the same patterns; and for testing, evaluating, and tuning systems such as proxy caches, switches, and Web servers [1, 2, 3, 11, 19].
Table 1: The set of trace logs analyzed in this paper and a few of their characteristics
Table 1 lists the various logs used in this paper. Note that each needed varying amounts of cleanup, depending on the software that generated it and the purpose for which the log is being used.
In addition, log O2 used the common practice of generating a separate file with URLs and their referrers but did not use a timestamp. Thus, matching the original request with each referrer is a nontrivial and error-prone process. Web site logs, on the other hand, do provide referrer tags.
Most performance evaluation has been based on HTTP 1.0 traffic logs. The new features of HTTP 1.1  are novel enough to significantly change traffic patterns. Cáceres et al.  propose techniques to convert HTTP 1.0 logs into semisynthetic HTTP 1.1 logs.
Cáceres et al.  demonstrate that low-level details, including the presence of cookies in HTTP headers, are significant factors in the performance of caching systems. Finally, Krishnamurthy and Rexford  discuss robust mechanisms for cleaning HTTP logs.
HTTP logs provide snapshots of the use of Web resources at a particular time. Unfortunately, because the average lifetime of a Web page is short (less than 2 months [16, 21, 24, 33]), any captured log loses its value quickly as more references within it become no longer valid, either by changing content or by becoming inaccessible. For example, when replaying the request trace P1 in late August 1998 (requests only 1-4 months old), approximately 10% of the requests resulted in a 400-class error. Likewise, Buff et al.  found that about 25% of the references in a week-old trace from a major Internet service provider were outdated (and thus removed them for the purposes of using the trace to build a proxy cache model).
Most logs need some kind of cleanup before analysis can be performed . However, it is possible for an otherwise clean log to reflect an inaccurate request ordering, as shown in Figure 1. Note that the request for the main page follows requests for three images on that main page. This is because Squid records the timestamp for the completion of each request. Fortunately, this log includes processing time, so the request time can be calculated and a correct ordering can be generated. This is important for modeling in general, and prefetching systems in particular, because temporal and spatial patterns may not be preserved under a request completion ordering. Note also that the current practice of logging requests with a timestamp down to the millisecond is not always sufficient to distinguish between requests as processor and network speeds increase.
Figure 1: This excerpt from proxy log P1 generated by Squid 1.1 records the timestamp, elapsed-time, client, code/status, bytes, method, URL, client-username, peerstatus/peerhost and objecttype for each request. It is also an example of how requests can be logged in an order inappropriate for replaying in later experiments.
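The reordering described above can be sketched as follows. This is a minimal illustration, not the procedure used for log P1: it assumes whitespace-separated fields in the order listed in the Figure 1 caption, a completion timestamp in seconds, and an elapsed time in milliseconds. The sample lines and the names `parse_line` and `reorder_by_start` are invented for the example.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    start: float  # estimated request start time, in seconds
    end: float    # logged completion timestamp, in seconds
    url: str


def parse_line(line: str) -> Entry:
    """Parse one log line; assumes the field order given in Figure 1."""
    fields = line.split()
    end = float(fields[0])       # completion timestamp (seconds)
    elapsed_ms = int(fields[1])  # processing time (assumed milliseconds)
    url = fields[6]
    return Entry(start=end - elapsed_ms / 1000.0, end=end, url=url)


def reorder_by_start(lines: List[str]) -> List[Entry]:
    """Sort entries by estimated start time instead of completion time."""
    entries = [parse_line(line) for line in lines if line.strip()]
    return sorted(entries, key=lambda e: e.start)


# Hypothetical excerpt: the main page completes last but was requested first.
SAMPLE = [
    "904000001.200 150 10.0.0.1 TCP_MISS/200 2048 GET http://example.org/a.gif - DIRECT/example.org image/gif",
    "904000001.350 200 10.0.0.1 TCP_MISS/200 1024 GET http://example.org/b.gif - DIRECT/example.org image/gif",
    "904000001.500 900 10.0.0.1 TCP_MISS/200 8192 GET http://example.org/ - DIRECT/example.org text/html",
]

if __name__ == "__main__":
    for entry in reorder_by_start(SAMPLE):
        print(f"{entry.start:.3f} {entry.url}")
```

Sorting on the estimated start time moves the main page ahead of its embedded images. Requests whose start times coincide at millisecond resolution keep their logged order, which may still be wrong.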
Finally, proxy cache trace logs are inaccurate when the proxies return stale objects because they may not have the same characteristics as current objects. In fact, since proxy logs don't reflect actual page content change times, prefetching simulations that use trace logs cannot know when they have prefetched and are storing what will later be used as a stale object. Even when run conservatively, caches sometimes return stale data, much to the consternation of users trying to use their browser to automatically open, via File Transfer Protocol, the latest draft of a colleague's paper. To determine exactly how much stale data a proxy cache is generating, one might request the object from the origin and compare it to what the cache returns .
When using logs for evaluation, it is important to consider the client population and the workload it generates . The patterns of usage and performance seen by a first-level workgroup or enterprise proxy cache may differ considerably from those of a high-level cache that serves as a parent only to other caches and not to clients directly. Likewise, the logs of a university httpd server with mostly static pages reflect browsing patterns among clients that are different from those of highly dynamic (and thus less cachable) pages of a commercial site. In addition, cultural issues may affect the workload generated (e.g., users in Brazil access different kinds of sites than users in the U.S. ).
Finally, when analyzing logs to build user models, one must carefully consider which logs to use, because in many logs, a single client may represent the combined requests of multiple users because of the use of proxies or multi-user machines. A single user may also be represented by multiple unique client identifications as a result of dynamic Internet Protocol allocation (common in dialup connections, but also sometimes used for infrequently used local area network connections).
Most proxy and origin servers record only a small portion of each HTTP request and/or response, and even when they support the extended log format , they are usually not configured to record more than that shown in Figure 1. Logs generated by httpd servers sometimes contain browser and referrer, but these are often not associated with particular requests (as in the case of log O2). The browser header can provide information as to the client capabilities and might help explain the client distribution of HTTP/1.1 support, or the effect of well-known browser bugs. The browser identification is also useful when cleaning logs of crawler activity. Examples of referrer tag use are shown in the next section.
Caches that prefetch on the basis of the contents of the pages being served (termed content-based prefetching), such as CacheFlow  and Wcol , need at least to have access to the links within Web pages -- something that is not available from server logs. Even if page contents were logged (as in references [14, 27] and log P2), caches that perform prefetching may prefetch objects that are not on the user request logs and thus have unknown characteristics such as size and Web server response times.
When we replayed a relatively small trace log (as part of separate work), we found that the average hit size for the proxy cache was larger than the average miss size. When this happens, one is understandably suspicious. In this case, the average had moved because a few large requests were repeated approximately a dozen times. On further examination, this effect was seen to be the result of one user in an authoring mode reloading a page with some unusually large images many times while the page was under development. Without those images, the average hit and miss sizes for that trace were more reasonable. This anomaly demonstrates the utility of additional request headers; in this case, the knowledge and reuse of Pragma: no-cache would have eliminated the problem. Because logs generally do not include this kind of additional request header information, those requests that were originally forced to bypass the cache (with a Pragma: no-cache header) no longer have to do so in a replayed situation and thus artificially inflate the resulting hit rates. In fact, this is typical of the larger problem in that many logs do not show whether a response is cachable -- cookies are not present, URLs that are queries have their parameters stripped, and so forth.
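The inflation of hit rates can be illustrated with a toy replay. The sketch below assumes an unbounded cache and a trace of (URL, had-no-cache-header) pairs; both the trace and the function name `replay_hit_rate` are hypothetical and are not taken from any log analyzed here.

```python
from typing import List, Tuple


def replay_hit_rate(requests: List[Tuple[str, bool]], honor_no_cache: bool) -> float:
    """Replay (url, no_cache) pairs through an unbounded cache.

    When honor_no_cache is False, the replay behaves like a simulation
    driven by a log that never recorded the Pragma: no-cache header.
    """
    cache = set()
    hits = 0
    for url, no_cache in requests:
        forced_miss = honor_no_cache and no_cache
        if url in cache and not forced_miss:
            hits += 1
        cache.add(url)  # the object is (re)stored either way
    return hits / len(requests) if requests else 0.0


# Hypothetical trace: one user reloads a large image three times while
# authoring a page; another page is fetched twice normally.
TRACE = [
    ("/big.jpg", False),
    ("/big.jpg", True),
    ("/big.jpg", True),
    ("/big.jpg", True),
    ("/index.html", False),
    ("/index.html", False),
]

if __name__ == "__main__":
    print("header ignored:", replay_hit_rate(TRACE, honor_no_cache=False))
    print("header honored:", replay_hit_rate(TRACE, honor_no_cache=True))
```

With the header ignored, every reload counts as a hit. With it honored, only the repeated fetch of the second page does.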
This means that proxy logs should not be used for comparing the performance of the system that generated the log to others that use the log for simulation. Such comparisons  are unreasonable, because in most cases we do not know whether the origin servers support HTTP/1.1 or whether the client used a "refresh" function, and we do not know about cookies, and so on, all of which affect cachability in the real world but are not present in captured trace-based simulations.
Many logs do not show the service or transmission times. This deficiency makes it difficult to estimate the "thinking time" of a user accurately, or the user's bandwidth. Neither log O1 nor log O2 records this. Apache  has the option of recording this, but it is not used by default. Finally, note that most logs are missing some requests (e.g., an httpd log only shows requests to that httpd server; the typical proxy cache log shows only port 80 requests and ignores File Transfer Protocol, Secure Sockets Layer, etc., which are valid, but less common, Web transport mechanisms).
When additional content is available, analysis may provide new insights. For example, one might find that client prefetching based on the contents of the current page is valuable, because approximately 80% of requests are made from the current page. While this is consistent with a user study , it can also be confirmed by calculating the percentage of HTTP responses of type text/HTML (Hypertext Markup Language) that had requests with a Referrer: tag, as shown below:
This is evident in logs that show the referrer request tag, and generally the percentage of referrer tags over all requests was approximately 10 percentage points higher. The referrer tag can do more -- it can provide hints of the client's true browsing pattern (because many requests, such as [back], are normally handled internally by the browser cache).
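One way such a percentage might be computed from a full-content log is sketched below. The record layout (a dictionary with `content_type` and `request_headers` keys) and the sample records are assumptions made for illustration. On the wire the header name is spelled `Referer`, which is what the sketch matches.

```python
from typing import Dict, List, Optional


def referrer_share(records: List[Dict]) -> Optional[float]:
    """Percentage of text/html responses whose request carried a Referer header."""
    html = [r for r in records
            if r["content_type"].lower().startswith("text/html")]
    if not html:
        return None
    with_referrer = sum(
        1 for r in html
        if any(name.lower() == "referer" for name in r["request_headers"])
    )
    return 100.0 * with_referrer / len(html)


# Hypothetical records; a real log would supply these fields.
SAMPLE_RECORDS = [
    {"content_type": "text/html",
     "request_headers": {"Referer": "http://example.org/"}},
    {"content_type": "text/html; charset=iso-8859-1",
     "request_headers": {"referer": "http://example.org/a.html"}},
    {"content_type": "text/html",
     "request_headers": {"User-Agent": "ExampleBrowser/1.0"}},
    {"content_type": "text/HTML",
     "request_headers": {"Referer": "http://example.org/b.html"}},
    {"content_type": "image/gif",
     "request_headers": {"Referer": "http://example.org/"}},
]

if __name__ == "__main__":
    print(referrer_share(SAMPLE_RECORDS))
```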
Table 2: Additional statistics available from augmented proxy logs. Note that because trace P2 was collected by an HTTP/1.0 proxy, there could be no HTTP/1.1 responses. For comparison, note that Cáceres et al. report that over 30% of Internet service provider traces contained cookies
In Table 2, some additional statistics can be found that may be relevant for cache design or workload modeling. The cachability of a workload is also often of interest. A common mechanism of caches is the use of a stoplist on the URLs. A conservative cache operator might set the stoplist to contain "?", "cgi", and ".asp", which would cause the cache to never cache any URL with those substrings. In log P1, 8.3% of the URLs contained a stoplisted substring or resulted from a POST. In log P2, this value was 13.7%. Because both of these logs are augmented with all request and reply headers, we can obtain a better (although still incomplete) estimate of uncachable objects by also discarding those with "no-cache," "set-cookie," "max-age=0," "Expires: 0," and "Expires: Thu, 01 Jan 1970." For this augmented stoplist, our uncachable percentage rose to 19.3% and 31.6%, respectively. Finally, if we make the HTTP/1.0 assumption that cookies signify uncachable data as well, it rises to 26.1% and 54.4%, respectively. Note, however, that these are representative but incomplete statistics and analyses, for which additional content is possible.
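The three levels of stoplist described above might be applied as in the following sketch. It assumes each record is a (method, URL, header lines) tuple with raw "Name: value" header lines from both request and reply, matches substrings case-insensitively, and treats a request `Cookie:` header as the cookie signal. These choices and the sample records are illustrative assumptions, not the exact rules behind the percentages reported for P1 and P2.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[str, str, Sequence[str]]  # (method, url, raw header lines)

URL_STOPLIST = ("?", "cgi", ".asp")
HEADER_STOPLIST = ("no-cache", "set-cookie", "max-age=0",
                   "expires: 0", "expires: thu, 01 jan 1970")


def uncachable_by_url(method: str, url: str, headers: Sequence[str] = ()) -> bool:
    """Basic stoplist: POST requests and URLs containing a stoplisted substring."""
    return method.upper() == "POST" or any(s in url.lower() for s in URL_STOPLIST)


def uncachable_by_headers(method: str, url: str, headers: Sequence[str] = ()) -> bool:
    """Augmented stoplist: also inspect request and reply header lines."""
    if uncachable_by_url(method, url):
        return True
    text = "\n".join(headers).lower()
    return any(s in text for s in HEADER_STOPLIST)


def uncachable_with_cookies(method: str, url: str, headers: Sequence[str] = ()) -> bool:
    """Strictest rule: any request cookie also marks the response uncachable."""
    if uncachable_by_headers(method, url, headers):
        return True
    return any(h.lower().startswith("cookie:") for h in headers)


def percent(records: List[Record], rule: Callable[..., bool]) -> float:
    """Percentage of records that the given rule classifies as uncachable."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if rule(*r)) / len(records)


# Hypothetical records for illustration only.
SAMPLE_RECORDS: List[Record] = [
    ("GET", "http://example.org/index.html", ["Content-Type: text/html"]),
    ("GET", "http://example.org/search?q=x", []),
    ("GET", "http://example.org/news.html", ["Pragma: no-cache"]),
    ("GET", "http://example.org/shop.html", ["Cookie: id=1"]),
]

if __name__ == "__main__":
    for rule in (uncachable_by_url, uncachable_by_headers, uncachable_with_cookies):
        print(rule.__name__, percent(SAMPLE_RECORDS, rule))
```

Each rule includes the previous one, so the uncachable percentage can only rise from one level to the next, as it does in the figures quoted above.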
Real-time prefetching systems rely on the existence of "thinking time" -- the time between page requests -- to provide time in which to prefetch the objects likely to be requested next. Past research [12, 14] suggests a heavy tailed distribution of thinking times with a mean of 30 seconds. Logs P1 and P2 have averages of 79.4 and 42.6 seconds, respectively, for the thinking time between HTML page requests (using the traditional 30-minute threshold for session breaks). When calculated as the time between requests, the average rises slightly, to 82.8 and 46.8 seconds, respectively.
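A simple way to estimate mean thinking time from a log is sketched below. It assumes the input is a list of (client, timestamp in seconds) pairs for HTML page requests and treats any gap longer than the 30-minute threshold as a session break. Whether gaps should run from request start or from response completion is left open here; the sketch uses whatever timestamps it is given. The sample data and the name `mean_think_time` are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple

SESSION_BREAK = 30 * 60  # seconds; gaps longer than this start a new session


def mean_think_time(requests: Iterable[Tuple[str, float]],
                    session_break: float = SESSION_BREAK) -> Optional[float]:
    """Mean gap between consecutive page requests of the same client.

    Gaps that exceed session_break are treated as session boundaries
    and excluded from the average.
    """
    by_client = defaultdict(list)
    for client, timestamp in requests:
        by_client[client].append(timestamp)

    gaps = []
    for times in by_client.values():
        times.sort()
        for previous, current in zip(times, times[1:]):
            gap = current - previous
            if gap <= session_break:
                gaps.append(gap)
    return sum(gaps) / len(gaps) if gaps else None


# Hypothetical page requests: (client, seconds since start of trace).
SAMPLE_REQUESTS = [
    ("a", 0), ("a", 20), ("a", 80), ("a", 5000),
    ("b", 10), ("b", 50),
]

if __name__ == "__main__":
    print(mean_think_time(SAMPLE_REQUESTS))
```

Because thinking times are reported to be heavy tailed, the mean is sensitive to the chosen threshold, and a median or full distribution may be more informative.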
Figure 2: The distribution of the number of links per non-error HTML page in trace P1.
Figure 2 shows the distribution of the number of links per page from trace P1, which has an average of 25.4 unique, non-self-referential links per HTML page (and does not include embedded tags, e.g., images and sounds). Note that this histogram reflects the distribution of pages requested by users, versus the more or less static distribution of pages on the Web (such as described by Bray ).
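Counting unique, non-self-referential links in a logged page body could look like the sketch below, which uses only the Python standard library. It assumes links are the `href` attributes of `<a>` tags, resolves them against the page URL, and ignores fragments. How the P1 figures treated other link-bearing tags or non-HTTP schemes is not stated, so those details are assumptions.

```python
from html.parser import HTMLParser
from typing import Set
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute, fragment-free targets of <a href=...> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":  # embedded objects such as <img> are not links
            return
        for name, value in attrs:
            if name == "href" and value:
                target, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.add(target)


def unique_links(page_url: str, html: str) -> Set[str]:
    """Unique link targets on a page, excluding links back to the page itself."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    self_url, _fragment = urldefrag(page_url)
    return collector.links - {self_url}


# Hypothetical page body for illustration.
SAMPLE_URL = "http://example.org/index.html"
SAMPLE_HTML = (
    '<a href="a.html">A</a> <a href="a.html#top">A again</a> '
    '<a href="#sec">self</a> <a href="http://other.example/">O</a> '
    '<img src="x.gif">'
)

if __name__ == "__main__":
    print(sorted(unique_links(SAMPLE_URL, SAMPLE_HTML)))
```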
Figure 3: The distribution of the distance in terms of the number of requests back that the current page could have been prefetched.
A more complex analysis is shown in Figure 3. Prefetchable pages are those that are considered unlikely to have adverse side effects (e.g., no cgi). It shows that more than 62% of prefetchable pages can be reached by examining the links of the current page and its immediate predecessor in the request stream.
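One plausible reading of this distance measure is sketched below: for each requested page, find the nearest earlier request in the same stream whose page links to it. The sketch assumes a single client's ordered request stream and a precomputed map from page URL to its set of link targets. The names `prefetch_distances` and `share_within` and the sample data are invented, and the computation behind Figure 3 may differ in detail.

```python
from typing import Dict, List, Optional, Set


def prefetch_distances(stream: List[str],
                       links: Dict[str, Set[str]]) -> List[Optional[int]]:
    """For each request, the number of requests back to the nearest linking page.

    A distance of 1 means the immediately preceding page links to this one.
    None means no earlier page in the stream links to it.
    """
    distances: List[Optional[int]] = []
    for i, url in enumerate(stream):
        found = None
        for back in range(1, i + 1):
            if url in links.get(stream[i - back], set()):
                found = back
                break
        distances.append(found)
    return distances


def share_within(distances: List[Optional[int]], limit: int) -> float:
    """Fraction of requests reachable from one of the last `limit` pages."""
    if not distances:
        return 0.0
    reachable = sum(1 for d in distances if d is not None and d <= limit)
    return reachable / len(distances)


# Hypothetical stream: page A links to B and C; D is reached some other way.
STREAM = ["A", "B", "C", "D"]
LINKS = {"A": {"B", "C"}, "B": set(), "C": set()}

if __name__ == "__main__":
    d = prefetch_distances(STREAM, LINKS)
    print(d, share_within(d, 2))
```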
This paper has described a number of common deficiencies in proxy and origin server logs. Although some of these omissions may represent a conscious decision toward privacy, others are likely to be simply a function of system defaults. In any case, the lack of information can seriously affect the resulting characterization of Web traffic using those logs.
For most analyses, server logs that are augmented with all request and reply headers can provide stronger and more detailed characterizations than those typically made available. For prefetching systems, even full-content logs may be insufficient. In general, having more information facilitates the creation of more accurate theories and models and is necessary for postulating and evaluating more complex caching mechanisms, such as content-based prefetching.
Therefore, we recommend performing additional studies with larger full-content traces, because the results presented here are only representative of what can be determined with correct and complete logs. We also recommend that proxies pass as much header information as provided by the client so that smart systems upstream can utilize it.
Finally, we have argued that proxy logs cannot be used for comparing the performance of the system that generated the log to others that use the log for simulation. Such comparisons are unreasonable because factors that affect cachability in the real world are missing from the logs, and therefore from trace-based simulations.