The World Wide Web is another domain in which operational collection and analysis of statistics is vital to the support of services. As in our NSFNET analysis work, we have explored the utility of operationally collected web statistics, generally in the form of HTTP logs. We analyzed two days of queries to the popular Mosaic server at NCSA to assess the geographic distribution of transaction requests. The wide geographic diversity of query sources and the popularity of a relatively small portion of the web server file set present a strong case for deploying geographically distributed caching mechanisms to improve server and network efficiency.
We analyzed the impact of caching query results within the geographic zone from which each request originated, measured as the reduction in both the number of transactions with the main server and the volume of bytes it transfers. We found that, for the two days we analyzed, a cache document timeout as low as 1024 seconds (about 17 minutes) would have saved between 40% and 70% of the bytes transferred from the central server. We investigated a range of timeouts for flushing documents from the cache, outlining the tradeoff between bandwidth savings and memory/cache management costs. The implications of this tradeoff need further exploration in the face of possible future usage-based pricing of backbone services that may connect several cache sites.
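To make the timeout experiment concrete, the following is a minimal sketch of how such a replay could be structured. It is not the analysis code used for the results above. The record layout `(timestamp_s, zone, url, size_bytes)`, the function name `simulate_zone_caches`, and the choice to measure the timeout from the moment a document was last fetched from the central server are all assumptions made for illustration.

```python
from collections import defaultdict


def simulate_zone_caches(requests, timeout_s):
    """Replay log records against one hypothetical cache per geographic zone.

    `requests` is an iterable of (timestamp_s, zone, url, size_bytes) tuples,
    assumed to be sorted by timestamp. A cached copy is served when it was
    fetched from the central server no more than `timeout_s` seconds earlier
    (an assumed timeout policy). Returns a pair:
    (bytes served without any caching, bytes still fetched from the server).
    """
    fetched_at = defaultdict(dict)  # zone -> {url: time of last server fetch}
    total_bytes = 0
    server_bytes = 0
    for ts, zone, url, size in requests:
        total_bytes += size
        last_fetch = fetched_at[zone].get(url)
        if last_fetch is not None and ts - last_fetch <= timeout_s:
            continue  # cache hit: the central server sends nothing
        # Cache miss or expired copy: fetch from the server and remember when.
        fetched_at[zone][url] = ts
        server_bytes += size
    return total_bytes, server_bytes


# Tiny made-up log, purely illustrative (not data from the study).
example_log = [
    (0, "west", "/a.html", 1000),     # first request: miss
    (100, "west", "/a.html", 1000),   # same zone, within timeout: hit
    (200, "east", "/a.html", 1000),   # different zone: miss
    (2000, "west", "/a.html", 1000),  # copy older than 1024 s: miss
]
total, from_server = simulate_zone_caches(example_log, timeout_s=1024)
saved_fraction = 1 - from_server / total  # 0.25 for this toy log
```

Running the same replay over a range of `timeout_s` values would trace out the bandwidth side of the tradeoff; the memory and cache management cost would need a separate measure, such as the number of documents held per zone.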
Other issues that caching inevitably poses include how to redirect queries initially destined for a central server to a preferred cache site. The preference of a cache site may be a function of not only geographic proximity, but also current load on nearby servers or network links. Such refinements in the web architecture will be essential to the stability of the network as the web continues to grow, and operational geographic analysis of queries to archive and library servers will be fundamental to its effective evolution.
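As one illustration of a preference function that combines proximity and load, the sketch below scores each candidate cache site by great-circle distance plus a load penalty and picks the lowest score. The `CacheSite` type, the `pick_cache_site` function, the site names, and the `load_weight_km` parameter are hypothetical; the text above does not prescribe any particular selection rule, and link load could be folded in the same way.

```python
import math
from dataclasses import dataclass


@dataclass
class CacheSite:
    name: str
    lat: float   # degrees
    lon: float   # degrees
    load: float  # assumed load measure: fraction of capacity in use, 0.0 to 1.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres on a spherical Earth (radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def pick_cache_site(client_lat, client_lon, sites, load_weight_km=5000.0):
    """Return the site with the lowest combined cost.

    Cost = distance in km + load_weight_km * load, so a fully loaded site is
    treated as if it were `load_weight_km` farther away (an arbitrary,
    illustrative weighting).
    """
    def cost(site):
        distance = great_circle_km(client_lat, client_lon, site.lat, site.lon)
        return distance + load_weight_km * site.load

    return min(sites, key=cost)


# Hypothetical sites: one nearby but busy, one distant but mostly idle.
sites = [
    CacheSite("site-near", lat=40.0, lon=-88.0, load=0.9),
    CacheSite("site-far", lat=33.0, lon=-117.0, load=0.1),
]
chosen = pick_cache_site(40.0, -88.0, sites)
```

With the default weight the busy nearby site loses to the distant idle one; setting `load_weight_km=0` reduces the rule to pure geographic proximity.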
For very heavily accessed servers, one must evaluate the relative benefit of establishing mirror sites, which could provide easier access but at the cost of extra (and distributed) maintenance of equipment and software. However, arbitrarily scattered mirror sites will not be sufficient. The Internet's sustained explosive growth calls for an architected solution to the problem of scalable wide area information dissemination. While increasing network bandwidths help, the demand from a rapidly growing user population attempting to access widely popular pools of data throughout the network will continue to outstrip network and server capacity. The need for more efficient bandwidth and server utilization transcends any single protocol such as FTP, HTTP, or whatever protocol next becomes popular.
We have proposed to develop and prototype wide area information provisioning mechanisms that support both caching and replication, using the NSF supercomputer centers as `root' caches. The goal is to facilitate the evolution of U.S. information provisioning with an efficient national architecture for handling highly popular information. A nationally sanctioned and sponsored hierarchical caching and replicating architecture would be ideally aligned with NSF's mission, serving the community by offering a basic support structure and setting an example that would encourage other service providers to maintain such operations. Analysis of web traffic patterns is critical to effective cache and mirror placement, and ongoing measurement of how these tools affect, or are affected by, web traffic behavior is an integral part of making them an effective Internet resource.