Chris Weider, Consultant
Bunyip Information Systems, Inc.
This paper discusses the possible near-term (before 2001) future of search on the Internet, including techniques for massively distributed search such as that used in Whois++ and Harvest, and an analysis of the long-term success of centralization techniques such as those used by DEC's Alta Vista service.
It also discusses the metadata framework necessary to build scalable solutions to the search problem, and the techniques necessary to provide search in an environment which contains a mix of completely open and access-controlled resources.
The Internet contains the most massively distributed set of data in history. The sheer scale of the Internet means that search solutions which have worked in the past are no longer helpful, and that entirely new classes of problems require novel approaches to locating relevant resources. While this problem has been alleviated somewhat by search services such as Lycos and Alta Vista, there is still a long way to go before searching on the Internet becomes efficient and effective. A solution to the problem of Internet search must be able to scale to a system orders of magnitude larger than the system today, and one which contains a far vaster collection of access-controlled, for-fee, and semi-private resources.
One critical question for Internet search is: "Should services be centralized or distributed?" By a centralized service, I mean that the data is kept in a single server, even if that server happens to be widely replicated. By a distributed service, I mean that the indexing information for a set of resources is logically distributed across a set of servers, and techniques for transmitting the query to the relevant servers have been implemented. Each of these techniques has tradeoffs which we will examine in Section 2.
A second critical question is: "How can these services be built in a way that is friendly to the Internet?" There are a number of protocols in use on the Internet (the protocols used in the World Wide Web chief among them) which are actively hostile to efficient use of scarce resources such as long-haul bandwidth. This paper presents some principles for new services that will reduce this sort of congestion.
A third critical question is: "How can these services be deployed?" This paper discusses the infrastructure that will need to be developed to allow robust search.
The last question this paper addresses is: "How do we search in an access-controlled environment?" This will be critical if we are to avoid the loss of availability of resources as more and more semi-private information spaces are deployed on the Internet.
The centralized search model, used by services such as Lycos and Alta Vista, creates a vast index to most of the resources available on the Internet and serves it through a single machine. There may be copies of this index replicated at various places on the Net, but each is served through a single machine as well. The index is created by downloading the individual documents, indexing them, and then throwing them away.
The distributed search model, used by services such as Whois++ and Harvest, creates smaller indexes placed "closer" to the documents they index. Additional tiers of indexes create intermediate indexes with larger and larger scope, and it is possible to create an index which covers everything by putting a "top" index server on the intermediate indexes. The following diagram shows how this technique works.
In this diagram, Index Server A indexes Base Level Servers 1 and 2; Index Server B indexes Index Server A and Base Level server 3; Index Server C indexes Base Level Servers 4, 5, and 6, and Index Server D indexes Index Servers B and C. Thus, a query issued to Server D will search all of the base level servers.
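The tiered arrangement described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Whois++ or Harvest protocol: the class names, the server names, the sample documents, and the use of a plain set of terms as the summary passed up the tree are all hypothetical. A real index server would also store the summaries it receives rather than recompute them on every query.

```python
from dataclasses import dataclass


@dataclass
class BaseServer:
    """A base-level server holding documents as {doc_id: text}."""
    name: str
    documents: dict

    def terms(self):
        # Every distinct lowercase word held by this server.
        return {w for text in self.documents.values() for w in text.lower().split()}

    def search(self, term):
        term = term.lower()
        return sorted(
            (self.name, doc_id)
            for doc_id, text in self.documents.items()
            if term in text.lower().split()
        )


@dataclass
class IndexServer:
    """An index server that forwards a query only to children that can answer it."""
    name: str
    children: list

    def terms(self):
        # Summary passed upward: the union of the children's terms.
        # (Recomputed on each call here; a real server would cache it.)
        return set().union(*(child.terms() for child in self.children))

    def search(self, term):
        hits = []
        for child in self.children:
            if term.lower() in child.terms():
                hits.extend(child.search(term))
        return sorted(hits)


# The topology from the text: A covers 1 and 2, B covers A and 3,
# C covers 4, 5, and 6, and D covers B and C.
s1 = BaseServer("base1", {"d1": "whois directory service"})
s2 = BaseServer("base2", {"d2": "harvest gatherer notes"})
s3 = BaseServer("base3", {"d3": "directory of mailing lists"})
s4 = BaseServer("base4", {"d4": "harvest broker setup"})
s5 = BaseServer("base5", {"d5": "usenet archive"})
s6 = BaseServer("base6", {"d6": "multimedia catalog"})
a = IndexServer("A", [s1, s2])
b = IndexServer("B", [a, s3])
c = IndexServer("C", [s4, s5, s6])
d = IndexServer("D", [b, c])
```

A query such as `d.search("harvest")` reaches every base-level server that holds the term, while `c.search("directory")` is confined to servers 4, 5, and 6 and finds nothing there. Choosing the entry point is thus one way to restrict the scope of a search.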
The centralized model may at first glance seem to have all the advantages over the distributed model. The index is all in one place, so is likely to give much more rapid response than the distributed model. The index is probably easy to replicate, so issues of congestion and server overloading can be avoided. The organization providing the index has complete control over the server and can set whatever policies they like. And centralization provides a single point for the collection of usage data.
However, a closer look will show that the distributed model has compelling advantages as well. The first is that centralization is probably not going to scale very well as the number of information generators on the Internet (people and machines) increases and as the amount of information required on each resource increases in size and complexity. A simple scaling argument suffices to show that an order of magnitude increase in the number of computers acting as information generators forces an order of magnitude increase in the centralized index. And as we evolve to a deployable framework for metadata on the Internet, the amount and richness of metadata will also add an order of magnitude to the index data. While this metadata may be some small percentage of the size of the document for large text documents, it is likely to be a much larger percentage of the size of the document for small text documents such as e-mail messages and Usenet News postings. And the amount and types of metadata required to truly index multimedia documents (most multimedia indexing today simply indexes textual metadata associated with each document) are still unknown, but likely to be large. All this will add (I believe intolerably) to the size of a centralized index. It will, of course, add even more to the size of a distributed index, but since it is distributed, the demands on the servers participating in the index can be kept under control.
The second is that centralization will be harder and harder to keep up to date. If centralized services continue to download the entire Net to their home machines, the required bandwidth and processing power will grow more rapidly than the existing tools can handle in a timely fashion. Distributed indexing services can update their individual indexes by looking at a much smaller amount of data, and can thus be much more efficient at staying up to date. There are still some open questions on how rapidly indexing changes can propagate through an indexing structure, although we can probably tune the index update protocols to provide almost any update behavior.
The third is that centralized services force you to search everything. This is their strength and their weakness. For example, it is extremely difficult to limit searches to documents from North America. Distributed indexing technologies allow the creation of geographical or topical index servers which can then be combined into larger and larger collections. The arrangement and coverage of the index servers can themselves be a powerful tool for focusing and restricting queries. Achieving these effects with centralized services will require a substantial amount of user education to use the specialized search terms and navigational techniques required to focus the queries. And it is very likely that specialized index collections will proliferate on the Net, so we should take advantage of them if at all possible.
There are a number of other problems with centralized services but we will address them when we discuss searching in an access-controlled environment.
Resources on the Net are getting bigger as more and more multimedia goes online. Individuals retrieve these resources from all over the Net, and this means that in the current environment, each of these resources must traverse the long-haul channels one time for each access. This causes massive congestion and real service degradation in single-threaded network paths such as the link from the United States to Europe. Caching is increasingly seen as a way to ameliorate these problems, but there are a number of issues which critically impact search and have not yet been examined in detail.
The first of these is locating cached copies of documents found through a search. The current cache strategies typically cache documents at a proxy server which accepts the resource request, retrieves the resource, keeps a copy, and delivers it to the user. These resources are identified by the uniform resource locator (URL) used to access the document. Proxy servers are typically deployed one to an organization or one per firewall server. This helps reduce the amount of traffic on the Internet, and for large organizations such as America Online, the savings can be substantial. However, there are some trends which are causing this strategy to be less and less effective as time goes on, and there are some techniques which may help caching on those objects which are indeed cacheable.
One weakness in the current strategy is that organizations whose proxy servers may be very close together cannot share their caches. For example, a copy of a document homed in the U.S. may have been downloaded to Australia, but only one Australian organization can use the cached copy. One way to solve this problem is to create a caching infrastructure that provides a distributed cooperative cache. However, two major components are missing for this to work. One component is a globally deployed infrastructure for location-independent names, whether uniform resource names (URNs) or some other global namespace. With these services, access to resources is made through the location-independent name, which is mapped into a location-dependent name (such as a URL) and then retrieved. Since access is through the location-independent name, the mapping to a URL can be updated with the locations of all the cached copies. There are several services in this arena which are starting to be deployed, most notably OCLC's Persistent URL (PURL) service and CNRI's Handle service. However, these services need to be much more widely deployed if they are going to help address this problem. The second missing component is some sort of network metric for "distance" between a client and a resource; without this the client is just randomly picking one copy from a set of cached copies with no real idea of where each resource is. This lack has the potential to completely defeat attempts at caching. Current proxy services get around this because a given client funnels all requests through one server, so a resource is either on the proxy server or must be retrieved from its original location.
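To make the two missing components concrete, the following Python sketch pairs a name-to-locations mapping with a pluggable distance estimate. It is an illustration only and does not reflect how the PURL or Handle services work. The `NameResolver` class, the `nearest_copy` function, the example name and URLs, and the round-trip figures are all hypothetical, and the distance metric is supplied by the caller precisely because no agreed network metric exists.

```python
from dataclasses import dataclass, field


@dataclass
class NameResolver:
    """Maps a location-independent name to every known location of the resource."""
    locations: dict = field(default_factory=dict)

    def register(self, name, url):
        # Called for the original copy and again for each cache holding a copy.
        self.locations.setdefault(name, []).append(url)

    def resolve(self, name):
        # Return a copy so callers cannot alter the resolver's own list.
        return list(self.locations.get(name, []))


def nearest_copy(resolver, name, distance):
    """Pick the location with the smallest caller-supplied distance estimate."""
    candidates = resolver.resolve(name)
    if not candidates:
        raise KeyError(name)
    return min(candidates, key=distance)


# Hypothetical example: a document homed in the U.S. with a cached copy in Australia.
resolver = NameResolver()
resolver.register("urn:example:report-42", "http://origin.example.us/report-42")
resolver.register("urn:example:report-42", "http://cache.example.au/report-42")

# Assumed round-trip estimates in milliseconds, as seen by an Australian client.
rtt_ms = {
    "http://origin.example.us/report-42": 320,
    "http://cache.example.au/report-42": 25,
}
best = nearest_copy(resolver, "urn:example:report-42", lambda url: rtt_ms[url])
```

With these assumed figures the Australian client is directed to the Australian cache. Without the `distance` argument the client could do no better than pick from `resolve()` at random, which is the failure mode described above.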
The second of these is that in many cases the resource retrieved is the result of a database search, form submission, or other server interaction, so the resource is unique to that transaction. Caching it would thus be next to useless, as it is highly unlikely that a given search will be exactly duplicated. This will increase as more and more work is done with interactive scripts that provide personalized resources. This trend causes two problems with respect to searching. The contents of a resource or database may not be adequately described to the indexing service because so much of the contents are generated on demand, thus making it impossible to find relevant resources of this type. Also, even when a cached result would be useful again, a user may be unable to replicate the original query precisely enough to produce a match, and current clients have no way of determining the contents of the cache on the proxy server. The only way I can see of handling this problem of unique resources is to provide as much metadata as possible about the contents of the resource and to make it available to the various search engines.
In any case, the entire field of document caching is screaming out for a serious detailed analysis. And it should be done as soon as possible because we are at a period in the Net's evolution where much of the necessary data is obtainable; once access controls and privacy considerations become prevalent, it may be impossible to determine trends and gather data. It may very well be the case that what we learn now will not be relevant five years from now, but this will at least give us a foundation for understanding the evolution of information flows on the Internet.
With the exception of the large services such as Lycos, Yahoo, Alta Vista, and 411, search on the Internet is no more effective than it was two years ago. It is true that we can search vast chunks of the Net now, but the tools that we bring to bear are only slightly more sophisticated. A large part of what is missing is a robust metadata framework for Internet resources. At this point, our search tools give us very little information about a resource; more information would allow a richer search and better determination of whether a resource was useful or not before we retrieved it.
A robust metadata framework for the Internet would consist of a number of components, not all of which have to be in place immediately to provide a major advance over the services which are now provided. The most important component is simply the generation and binding of metadata of some sort to individual resources.
Most resources have some metadata which can be easily discovered simply by examining a copy of the resource. This metadata includes the size of the resource, its format, and in many cases the date of creation. This information is typically computed and made available as ancillary information to search results. This type of information is typically not searchable and, given how little discrimination it provides, it is typically not worth revising search engines to support it. Work has been underway for more than a year in various areas to provide some sort of metadata adhering to resources. This work includes simple schema to provide a consistent set of metadata attributes, common carrier formats for metadata, and attempts to provide techniques for building metadata into the resources themselves.
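As an illustration of how little effort this easily discovered metadata takes to compute, here is a minimal Python sketch that derives size, format, and a date from a local copy of a resource. The function name and dictionary keys are hypothetical. The format is only guessed from the filename extension, and the sketch reports the last-modified time rather than a creation date, because a creation date is not available portably from the filesystem.

```python
import mimetypes
import os
from datetime import datetime, timezone


def intrinsic_metadata(path):
    """Return metadata discoverable from a local copy of a resource."""
    info = os.stat(path)
    # guess_type works from the filename extension only; it does not read the file.
    guessed_type, _encoding = mimetypes.guess_type(path)
    return {
        "size_bytes": info.st_size,
        "format": guessed_type or "application/octet-stream",
        # Last-modified time in UTC, used here as a stand-in for a creation date.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }
```

None of these three values says anything about what the resource is about, which is why richer, author-supplied attributes are needed on top of them.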
Metadata schema have a long and rich history in the digital library community. There are schema sets for bibliographical data, scientific and technical data, geospatial data, and many more. However, most of these are either too focused or considered far too heavyweight to provide a good basis for building metadata for standard Internet resources. Consequently, a meeting was held in the spring of 1995 which brought together a number of experts from the cataloging, digital libraries, and Internet communities, designed to build and promote a very simple common schema which would provide a first step towards metadata for all. The result of that meeting, the Dublin Core, consists of 12 simple attributes which can be easily built into tools and filled out by the author of the resource or by someone else. Some experimentation has been done using the Dublin Core, and a second meeting was held in April 1996 to further the work done so far, including techniques for embedding the Dublin Core in Hypertext Markup Language (HTML) documents. The results of this meeting are being referred to as the Warwick (pronounced Warrick) Framework, and the report should be available by the time you read this. There seems to be a fair bit of momentum towards using this schema set more widely in the Internet. It remains to be seen whether the Warwick Framework will catch on, but the outlook is optimistic.
There are two approaches to common carrier formats for metadata: the URC work which has been done in the Internet Engineering Task Force (IETF), and a carrier format developed as a part of the Warwick Framework. It's quite likely that the two approaches will interoperate, and that there will be a number of protocol-specific approaches as well. Specifications are under development to embed these approaches into individual protocols.
Other components of a robust metadata infrastructure include structured search tools which can provide this metadata as a useful addition to existing search techniques. Tools such as Whois++ and Harvest are already being used to create experimental systems in this area. More advanced components would include tools for browsing metadata, adding thesauri to the tools to provide more robust searches, and multilingual capabilities. Work will proceed on these components when the basic infrastructure is in place.
One striking feature of the current search facilities on the Internet is the fact that they all assume that the information they index is freely available, or at the very least can be indexed publicly. While this is certainly true at the moment, we will see an increasing amount of information locked off in private or semi-private "intranets." And, as a charging infrastructure is deployed on the Internet, we will also see an increasing number of resources which are not free but are publicly available. Searching in this environment will be substantially different from search in the current environment.
The critical question is whether the cost or access control functions provided by a given local server need to be maintained by servers which index it. In most cases it would seem that the cost function should typically not be maintained by the index server; this might well involve double billing for the same information or perhaps a sophisticated transaction protocol to pass along the information that a given access had already been paid for. However, if there is an actual monetary payout to get the information that an index server wishes to index, it is likely that there will be at least a few indexes which charge for access.
The access control function is much more difficult. The infrastructure necessary to maintain security contexts in a global environment seems to be extremely heavy, particularly since for at least the foreseeable future security contexts are likely to be "locally" defined for information. One possible solution, simply providing separate servers for each security context, sounds good but may be very hard to maintain in practice. This will be even more true when most business-to-business communications are done through Net tools; there may well be a separate security context for each customer or partner in a relationship.
A second consideration for access control is trying to determine whether a given document can be reconstructed from the index provided for it. This may prevent people from providing any information at all for their resources unless the entire indexing mesh guarantees the security of the information.
One possible solution to all these problems is the concept of an "indexing proxy." This would be an interface to the information resources of an organization which would provide conditioned indices of the internal information to index servers, and would be responsible for ensuring that no information was made searchable that shouldn't be. This would allow more flexible search in a mixed environment. However, this approach does have some potential drawbacks; if charging is required for access rather than for a successful search or retrieval, it is possible to have rogue proxies that claim to have everything just to rake in the access fees.
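One way to picture what an indexing proxy hands to outside index servers is the following Python sketch. It is a simplified illustration under stated assumptions rather than a specification. The `Document` record, its per-document `indexable` flag, and the `blocked_terms` list are hypothetical stand-ins for whatever policy an organization would actually apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    indexable: bool  # hypothetical policy flag set by the owning organization


def conditioned_index(documents, blocked_terms=frozenset()):
    """Build a term -> doc_id index that exposes only what policy allows."""
    index = {}
    for doc in documents:
        if not doc.indexable:
            continue  # export nothing at all about restricted documents
        # Export an unordered set of terms, not the text itself.
        for term in set(doc.text.lower().split()):
            if term in blocked_terms:
                continue  # withhold sensitive terms even for indexable documents
            index.setdefault(term, []).append(doc.doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical internal collection.
docs = [
    Document("pub1", "quarterly product catalog", indexable=True),
    Document("priv1", "merger catalog draft", indexable=False),
    Document("pub2", "product pricing", indexable=True),
]
exported = conditioned_index(docs, blocked_terms=frozenset({"pricing"}))
```

Here the restricted document contributes nothing to the exported index, and the blocked term is withheld even though the document containing it remains findable by its other terms. Exporting unordered terms discards word order and frequency, which makes reconstructing a document from its index entries harder, though this sketch makes no guarantee on that point.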
The infrastructure necessary to do a robust search on the Internet is just starting to be deployed. It will probably take another two years or more before we have a rich metadata infrastructure deployed, and another year or so past that to get search tools to use it effectively. We will see a big increase in the number of resources which charge for access to data, and that will require us to decide if the Internet will maintain its current freely available search tools as well. And none of the more advanced search tools, such as "agents," "knowbots," or "natural language query," will be possible on the Internet until this basic framework is deployed.
Chris Weider is an independent consultant on Internet directory services and information architecture. He is a member of the Internet Architecture Board and has been a leader in Internet information systems architecture and design since 1990. He is also well known for his work on directory services protocols, particularly his creation (with Peter Deutsch, Jim Fullton, and Simon Spero) of the WHOIS++ directory service. He has chaired a number of IETF working groups, most recently the Integration of Internet Information Resources Working Group. He received B.Sc. degrees in Mathematics and in Computer Science from the University of Missouri in 1987, and an M.A. in Mathematics from the University of Michigan in 1992.