Pedro Falcão Gonçalves <email@example.com>
Silvio Lemos Meira <firstname.lastname@example.org>
Ana Carolina Salgado <email@example.com>
Universidade Federal de Pernambuco
The growth of the World Wide Web challenges the available mechanisms for global information indexing and searching. As a consequence, keeping general-purpose centralized index bases up to date and complete becomes increasingly difficult. This paper presents a distributed mobile code-based architecture to deal with the problem of indexing and searching the Web. We describe the system's structure; define and discuss its underlying concepts of Web views and digital neighborhoods; and show how a distributed, scalable, and interoperable solution is achieved. The current and future versions of the prototype are discussed, and its platform-independent features are highlighted.
The size and growth of the World Wide Web (Web) challenges the indexing and searching mechanisms currently in use [1,2]. These mechanisms are generally global (i.e., intended to cover the whole Web) and centralized (i.e., the processing of documents and indices is done in a single point of the network). Related problems exist in the integration of digital libraries [4,5], with respect to the amount of information to be indexed. Recently, several architectures have been proposed to deal with this problem from a networked information retrieval perspective [6,5,7,8].
There are several popular publicly accessible indexing and searching systems on the Web (e.g., [9,10,11]). These systems generally rely on software robots [12,13] that retrieve and process every Web page they find, extracting reference information (indices) about it. The indices are then stored in an index base (IB), which generally offers keyword- and expression-based search services.
Centralized global indexing systems, however, do not scale with Web size. Considering the amount of data stored on the Web, the basic demands of processing workload, storage space, and network bandwidth far exceed the ability of a single indexing robot to cover the whole Web and be kept up to date. From the Web server perspective, the proliferation of Web robots causes extra load, and robots are among the most demanding users at many sites. For global network traffic, this also means wasted bandwidth, as various robots retrieve the same documents for indexing and repeat the operation later to check for updates. Nevertheless, users expect complete and up-to-date indices.
This paper proposes a distributed architecture to deal with the problem of information indexing and searching in distributed environments. The Web is selected for a case study of the distributed information system environment. The proposed system is open and scalable, and possesses a potential for collaboration with other systems through interoperability.
In Limitations of the Centralized Approach, the main limitations of the centralized approach to Web indexing and searching are discussed. The Proposed Distributed Architecture defines Web views and the intrinsic concepts of digital neighborhoods, and presents a distributed approach to tackle the problem. In Scalability of the Proposed Architecture, the scalability of the proposed structure is discussed. The working and future versions of the prototype are described in The Working Prototype and Future Extensions, and Conclusion adds concluding remarks.
The centralized approach to Web indexing and searching relies on an indexing robot and an IB. Various copies of that robot may be running from different network locations, but the IB is still based on a unique database schema. In addition, the IB may be replicated in different points of the network to improve response time (e.g., one copy in the United States and another in Europe). Therefore, this approach is centralized in nature, although there may be some degree of parallelism in the way indexing and searching are performed.
Figure 1 illustrates the architecture of these centralized systems. The Web server layer represents the Web and its documents, which contain the information to be indexed. The index server layer contains traditional centralized Web indexing and searching systems whose indexing scope is the entire Web. The user layer corresponds to people who access index servers (typically using a Web browser), and application programs that interact with index servers (e.g., Metacrawler, the Digisearch System, and Fusion Search).
This centralized approach, adopted by most Web indexing and searching systems, has limited scalability to Web growth. Three aspects of these limitations are discussed below.
The usual strategy consists of retrieving Web documents and locally processing them to extract indices. To index the entire Web, it is necessary to retrieve and process every existing Web document. To be kept up to date, the whole process should be redone periodically. This strategy has clear limitations in scaling up with Web growth.
Indices are usually stored in a database under a single schema. Although various servers may be used, the schema is usually centralized. With the current Web size and growth, it is increasingly more difficult to store index references to every Web page in a single database.
Searching in IBs
If the indices are stored in a single database, queries will tend to be more expensive as the data volume grows in that database. In principle, the system should be able to respond efficiently to requests from many interactive users with reasonable response time. Therefore, there are also limitations with this aspect in scaling with Web size.
The distributed architecture presented here is based on a set of autonomous modules for indexing the Web: the index servers (ISs). Each IS has its own IB and focuses on a particular scope within the Web. Each IS also runs a Web robot that continuously traverses the Web to look for pages that fall within its scope, and therefore should be indexed. ISs cooperate by exchanging indices.
A higher level of abstraction is built on top of the index servers: the index brokers (Brokers). They form another set of autonomous modules that select the appropriate ISs to respond to user requests, and then manage search sessions, possibly involving multiple ISs. This structure is depicted in Figure 2.
As shown in Figure 2, each Web server WSi relies on a database (DB) where its Web pages are stored; each index server ISi has an index base (IB); and each index broker Bi has a scope base (SB), where scope information about known ISs and pointers to other brokers are kept. Index servers request pages from Web servers for indexing, and index brokers request indices from index servers to respond to users' search requests. Index servers cooperate by exchanging indices (e.g., IS1 and IS3 in Figure 2), and index brokers cooperate by exchanging metainformation about index servers (e.g., B1 and B2 in Figure 2).
Two goals motivated development of the Web views concept (Definition 6) for the problem of indexing and searching: (1) to achieve computational efficiency, through an architecture that allows for a computationally feasible solution of this problem, and (2) to achieve effectiveness, by providing the means to locate documents relevant to users' queries. In order to embody these goals, the definition of Web views takes into consideration the four concepts of digital distance, as stated in Definitions 1 through 4.
Definition 1: DD by geographic location (DDGL).
The DDGL between two documents is the physical distance, in kilometers (km), between the two Web servers where the documents are stored. This concept may be extended to deal with more complex shapes than a circle in order to model more meaningful regions, such as states and countries.
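A minimal sketch of DDGL, assuming each Web server's position is given as a (latitude, longitude) pair; the haversine formula is one standard way to obtain the great-circle distance in kilometers (the paper does not prescribe a particular formula):

```python
import math

def ddgl(server_a, server_b):
    """Digital distance by geographic location (Definition 1): distance,
    in km, between the two Web servers hosting the documents.  Servers
    are modeled here as (latitude, longitude) pairs in degrees; the
    haversine formula below is an illustrative choice."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*server_a, *server_b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # mean Earth radius ~6371 km

# Example: servers in Recife and Washington, DC (approximate coordinates)
print(round(ddgl((-8.05, -34.9), (38.9, -77.04))))
```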
Definition 2: DD by network location (DDNL).
The DDNL between two documents is calculated according to the Internet Protocol (IP) addresses of the Web servers where the documents are stored, as given in Table 1, following the standard IP network class definitions. The concept of network distance may be adapted to deal with other definitions of network locality (e.g., IPng).
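Since Table 1 is not reproduced here, the sketch below uses illustrative distance values (0 through 4, by length of the shared dotted-quad address prefix); the paper's actual table may assign different values:

```python
def ddnl(ip_a, ip_b):
    """Digital distance by network location (Definition 2), computed from
    the IP addresses of the two Web servers.  The 0-4 scale below (one
    step per differing address octet) is an assumption standing in for
    Table 1, not the paper's actual values."""
    shared = 0
    for octet_a, octet_b in zip(ip_a.split("."), ip_b.split(".")):
        if octet_a != octet_b:
            break
        shared += 1
    return 4 - shared  # same host -> 0 ... no shared octet -> 4

print(ddnl("150.161.2.10", "150.161.2.10"))  # same server -> 0
print(ddnl("150.161.2.10", "150.161.7.33"))  # same class-B network -> 2
```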
Definition 3: DD by hypertext location (DDHTL).
The DDHTL between two documents is the length of the shortest known hypertext path between the two documents in the Web. The DDHTL from any document to itself is zero.
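A breadth-first search over a known fragment of the Web's link graph is one way to compute this shortest known path; the graph representation below is an assumption for illustration:

```python
from collections import deque

def ddhtl(links, src, dst):
    """Digital distance by hypertext location (Definition 3): length of
    the shortest known hypertext path from src to dst.  `links` maps
    each document URL to the URLs it links to.  Returns None when no
    path is known."""
    if src == dst:
        return 0  # distance from any document to itself is zero
    dist = {src: 0}
    queue = deque([src])
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, ()):
            if target not in dist:
                dist[target] = dist[doc] + 1
                if target == dst:
                    return dist[target]
                queue.append(target)
    return None  # no known hypertext path

web = {"a.html": ["b.html"], "b.html": ["c.html"], "c.html": []}
print(ddhtl(web, "a.html", "c.html"))  # shortest known path has length 2
```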
Definition 4: DD by content (DDC).
The DDC between two documents is calculated according to a function Fddc: (Doc x Doc) --> N, where Doc is the set of all Web documents, and N is the set of all natural numbers. The DDC between two documents with identical contents is zero, and there may be documents with different contents whose DDC is also zero, depending on the Fddc function chosen.
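One possible Fddc, sketched here from the term-list comparison idea discussed later in the paper, counts the indexing terms present in one document but not the other; this is an illustrative choice, since Definition 4 admits many functions:

```python
def fddc(terms_a, terms_b):
    """One possible Fddc: (Doc x Doc) -> N, computed here from each
    document's list of indexing terms.  Identical term lists give zero;
    two documents with different raw contents but the same indexing
    terms also give zero, as Definition 4 allows."""
    a, b = set(terms_a), set(terms_b)
    return len(a ^ b)  # size of the symmetric difference, a natural number

print(fddc(["web", "index"], ["web", "index"]))   # identical terms -> 0
print(fddc(["web", "index"], ["web", "search"]))  # -> 2
```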
Given the definitions of digital distance, a simple way to characterize a scope is as a digital neighborhood (DN).
Definition 5: Digital neighborhood (DN) of type X.
The DN of type X and radius R around a reference Web document is the space comprising all Web documents whose DD of type X to the reference document is less than or equal to R, where X denotes a single type of DD (i.e., DDGL, DDNL, DDHTL, or DDC).
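Definition 5 reduces to a membership test, sketched below over a toy set of candidate documents and an assumed distance function:

```python
def digital_neighborhood(docs, dd, reference, radius):
    """Digital neighborhood of radius R around a reference document
    (Definition 5): every candidate document whose digital distance to
    the reference, under the distance function `dd`, is at most R.
    `dd` may be any one of the four digital distances."""
    return {doc for doc in docs if dd(reference, doc) <= radius}

# Toy example: documents modeled as integer positions on a line,
# with the absolute difference standing in for a digital distance.
docs = {0, 3, 7, 12}
print(digital_neighborhood(docs, lambda a, b: abs(a - b), 3, 5))  # {0, 3, 7}
```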
The servers' locations, geographical and topological, are used in the definitions of DDGL and DDNL, respectively. This allows for the identification of groups of Web servers that are either physically located within a given geographic region or connected to a given IP subnetwork. Indexing modules that focus on their geographical area or network neighborhood avoid the need to use long-distance backbones for indexing, and have a smaller number of pages to index.
The documents' content is used in the definitions of DDHTL and DDC. This accounts for indexing effectiveness by introducing measures of similarity between documents based on their content and also on the hypertext links connecting them. Indexing modules can then be specialized in given subject areas. Techniques from the area of information retrieval may be used to choose the Fddc function of Definition 4. For instance, a simple function to compute the distance between two documents can be based on the comparison of lists of indexing terms from these documents. Furthermore, the relevance of hypertext distance (DDHTL) comes from the use of the structure given by authors to Web documents. This structure constitutes an important way to identify correlated information [24,25].
Web documents can be characterized by: (a) server location; (b) document content; (c) the object in which the document is stored (i.e., an ordinary file, database entry, hypertext document, etc.); and (d) metainformation extracted from the document's content (e.g., title and language). Attributes (c) and (d) constitute a document's representation scheme.
Given the definitions of DN and representation scheme, a broader way to characterize scopes may be defined as Web Views.
Definition 6: Web view.
A Web view is a composition of DNs optionally filtered by a set of restrictions on Web pages. These restrictions apply to the attributes used in the document representation scheme.
An example of search scope that could be modeled by Web views is:
Consider all documents that are similar in content to the document in "http://somehost/x.html", and that are stored in servers at most 200 km away from Washington, DC.
Restrictions such as "only documents updated later than 01.01.97" and "only documents written in French" could be added, as stated in Definition 6.
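The search scope above can be sketched as code. The representation-scheme fields (URL, language, update date) and the sample documents are illustrative assumptions; the DNs are modeled as plain document sets already computed by the distance functions:

```python
from collections import namedtuple

# Illustrative representation scheme: URL, language, last update date
Doc = namedtuple("Doc", "url language updated")

def web_view(composed_dns, restrictions):
    """A Web view (Definition 6): a composition of DNs (combined here
    into one document set beforehand, e.g. by intersection) filtered by
    restrictions on the representation-scheme attributes."""
    return {d for d in composed_dns if all(r(d) for r in restrictions)}

x = Doc("http://somehost/x.html", "fr", "1997-05-01")
y = Doc("http://other/y.html", "en", "1998-01-10")
z = Doc("http://far/z.html", "fr", "1996-12-01")

content_dn = {x, y}  # DN by content around x.html (assumed result)
geo_dn = {x, z}      # DN by geographic location, radius 200 km (assumed)

view = web_view(content_dn & geo_dn,
                [lambda d: d.language == "fr",
                 lambda d: d.updated > "1997-01-01"])
print(view)  # only x.html lies in both DNs and passes both restrictions
```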
Since the modules are autonomous, the installation and configuration of each module is done independently. Yet modules should cooperate in order to share resources and optimize performance. To preserve autonomy, selection of server modules for cooperation is done dynamically, which also provides a degree of fault tolerance. Below, we analyze in more detail the three levels of abstraction of the proposed system: index servers, index brokers, and user interface/application program.
The index server layer
Each index server (IS) collects referential information about Web pages and stores it in a local IB. Figure 3 gives further details of ISs and their components. An IS's indexing scope determines which pages are of interest and should be indexed by the local robot. ISs also cooperate with each other to share indices, consequently saving indexing effort.
Figure 3 shows two index servers, IS1 and IS2, with indexing scopes given by the Web views WV1 and WV2, respectively. A nonempty intersection between WV1 and WV2 (i.e., WV1 ∩ WV2 ≠ ∅) provides an opportunity for the ISs to cooperate in their indexing tasks. In the given case, the two ISs have agreed that: (1) IS1 will index its entire scope WV1; (2) IS2 will index only the portion of its scope that does not intersect with WV1 (i.e., only WV2 - WV1); and (3) IS1 will provide IS2 with the indices within the intersection.
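The agreement can be sketched over scopes modeled as plain document sets (an assumption; real scopes are Web views manipulated through the Web view algebra):

```python
def plan_cooperation(wv1, wv2):
    """Index-sharing agreement of Figure 3: IS1 indexes all of WV1;
    IS2 itself indexes only WV2 - WV1; IS1 provides IS2 with the
    indices in the intersection WV1 & WV2."""
    return {"is1_indexes": wv1,
            "is2_indexes": wv2 - wv1,
            "is1_serves_to_is2": wv1 & wv2}

plan = plan_cooperation({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
print(plan["is2_indexes"])        # IS2 avoids reindexing p2 and p3
print(plan["is1_serves_to_is2"])  # indices replicated from IS1 to IS2
```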
The decision about which IS modules will provide indices for replication is first made when an IS is set up. At this point, a broker is invoked to search for ISs whose scopes have nonempty intersections with the new IS's scope. If a nonempty intersection is found (Figure 3), the recently set up module enters operation mode. If the IS that is currently serving indices fails, the client IS restarts the configuration process (on the basis of the previous configuration) in order to replace the failed server. If no equivalent set of server ISs is found to replace the one that failed, the client IS may decide to index a larger part of its scope itself.
In order to identify intersections between scopes, an algebra of Web views is needed. A simplified version of such an algebra is presented in [26,27], which allows for the verification of the inclusion relation between Web views.
The index broker layer
An index broker (broker) keeps referential information about ISs and their scopes. This information evolves as new ISs are installed, reset, or closed. Brokers also keep information about other brokers, so that they can exchange IS references.
Given a search request on a specified scope, the invoked broker tries to find a set of ISs whose scopes can be combined to cover the requested scope (the algebra of Web views mentioned previously in this section provides support for these operations). If the invoked broker fails to accomplish this task, it then tries to collect additional IS references from other brokers. This is how brokers exchange information.
As mentioned before, new ISs request service from a broker at configuration time. This is how new ISs become visible to brokers. A similar strategy is deployed when new brokers are installed. Provided that a set of brokers and ISs is bootstrapped at the installation of any system's module (IS or broker), the dynamics of referential information updates among brokers avoids the isolation of modules. This means that if an IS exists whose scope can be used by a given broker to perform a query, this broker will eventually find the IS by recursively expanding its list of references to other brokers and querying their respective IS's references.
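The recursive expansion described above can be sketched as a breadth-first search over the broker graph. The data structures are toy assumptions: each broker holds a list of (IS name, scope) references and a list of peer brokers, with scopes as plain document sets rather than Web views:

```python
from collections import deque

def find_index_servers(start_broker, brokers, requested_scope):
    """Broker lookup sketch: the invoked broker first tries its own IS
    references; while they cannot cover the requested scope, it expands
    recursively through peer brokers.  `brokers` maps each broker name
    to a (list of (IS name, scope) pairs, list of peer names) tuple."""
    covered, chosen = set(), []
    seen, queue = {start_broker}, deque([start_broker])
    while queue and not requested_scope <= covered:
        broker = queue.popleft()
        servers, peers = brokers[broker]
        for name, scope in servers:
            if scope & (requested_scope - covered):  # IS adds new coverage
                chosen.append(name)
                covered |= scope & requested_scope
        for peer in peers:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return chosen if requested_scope <= covered else None

brokers = {
    "B1": ([("IS1", {"a", "b"})], ["B2"]),
    "B2": ([("IS2", {"c", "d"})], []),
}
print(find_index_servers("B1", brokers, {"a", "c"}))  # -> ['IS1', 'IS2']
```

B1 alone covers only document "a"; by expanding to its peer B2, the broker finds IS2 and completes the coverage, mirroring how a reachable IS is eventually found.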
The user interface application program layer
User interfaces may be pieces of mobile code typically running on top of Web browsers, or any other ordinary program acting as user interface to Web searching. The external access point to system functionalities is the broker. Each broker has its own URL and can be accessed via the HTTP protocol; therefore, standard Web access mechanisms apply.
The use of mobile code in the user interface is attractive for several reasons: (1) interaction with the user improves, since the mobile code runs on the user's station; (2) part of the search session's state may be kept on the user's station, so that search refinement operations can be partially performed locally, with possible performance gains; (3) a simplified broker can run on the user's station to improve efficiency, with the client connecting directly to ISs; and (4) the standard benefits of mobile code apply, such as shifting part of the processing load from the server to the client, operating without the installation of client software (the mobile code is transparently transferred to the client station), and multiplatform capability.
The architecture proposed in this paper provides an efficient and scalable environment for information indexing and searching in the Web. The following four standpoints are considered to support this argument.
The information provider
From the standpoint of the information provider, efficiency means the ability to deliver up-to-date information to users at reasonable computational cost. This can be achieved by setting up a local IS whose scope covers the local Web servers that contain the information to be made available. From then on, there is no need to have several robots as the most demanding users of the local information, because the local IS offers indexes ready to be used. This structure scales with the proliferation of Web indexing robots.
The Web search engines
Web search engines provide users with up-to-date indices and the most complete coverage possible. With the proposed structure, search engines will be able to negotiate with several ISs and reduce the scope they are required to index. Therefore, this structure makes possible the allocation of computational resources to perform necessary work instead of redundant indexing. Coping with Web growth becomes easier, since it is no longer necessary for a global search engine to retrieve and index every existing Web page. Instead, it is possible to use indexes from other ISs to the largest extent possible.
The Web user
With improved efficiency and scalability for the information provider and the search engines, better-quality services can likely be offered to Web users trying to fulfill their information needs. Besides the improved quality, users will no longer need to find specific search engines that specialize in their areas of interest. Instead, the system will help users properly specify an area of interest and, by using Web views, transparently decide which ISs can provide better indices to resources in that area. As the number of indexing and searching systems grows, search performance will improve, reducing search complexity for the user.
Redundant indexing represents a waste of network bandwidth. The Web's growth and the proliferation of Web indexing systems aggravate this problem. With the proposed architecture, redundant indexing is radically reduced and the exchange of indices between modules is used instead. Data compression may still be used to reduce network traffic. This being the case, Web indexing applications would not add considerable costs to the network.
A prototype implementation is currently in operation, running under the Solaris operating system and using a relational DBMS (RDBMS) to store the IB. Two other versions are under development, one for Windows NT with an RDBMS, and another for Solaris with an object-oriented DBMS.
A sample running module using the Brazilian Internet as its indexing scope is publicly accessible at http://www.di.ufpe.br/~pfg/bright.html. The indexing robot and the mobile code-based interface are written in Java, for portability and its object-oriented features. The development of the next version includes the investigation, implementation, and testing of cooperating modules running on heterogeneous platforms in an open distributed architecture.
The platform chosen for communication between modules is Java RMI. To provide for independence from a particular DBMS, ODBC was the framework chosen for interaction between the application and the DBMS. To account for interoperability, ROJ is under consideration, as it provides integration between RMI and CORBA. The CORBA implementation under consideration for the prototype is Iona's Orbix.
The architecture presented in this paper and its underlying concepts of Web views and digital neighborhoods provide a means to handle the scaling problem due to Web size in the task of indexing and searching information. It was shown how cooperating modules are able to index limited portions of the Web more efficiently, and also how they exchange indices when needed. As a consequence, users, servers, and indexing services can save computational resources for themselves and for the network.
By working cooperatively, a community of specialized modules introduces a higher level of abstraction to the problem of locating information in the Web. On top of a network of information (the Web), a network of indices working cooperatively and consistently, in an open distributed framework, constitutes an efficient way to deal with the problem of information indexing and searching in the Web.