
A Framework for Developing Information Indexing, Filtering, and Searching Tools on the Internet

José CARDOZA <j.sanchez@computer.org>
Pedro F. GONÇALVES <pfg@di.ufpe.br>
Alessandro LIMA <aml@di.ufpe.br>
Luciana VALADARES <lvp@di.ufpe.br>
Cynthia TERCERO <c.tercero@computer.org>
Silvio L. MEIRA <srlm@di.ufpe.br>
Ana Carolina SALGADO <acs@di.ufpe.br>
Fabio Q.B. da SILVA <fabio@di.ufpe.br>
Universidade Federal de Pernambuco
Brazil

Abstract

Many information-indexing, filtering, and searching tools have been and continue to be independently built for the Web and intranets, with redundant software development efforts and low intersystem runtime cooperation.

Runtime cooperation among such tools represents an important possibility for saving both local and global communication and computational resources, whereas development-time cooperation should reduce project costs. This paper presents an object-oriented development framework for this class of tools that explores both development-time and runtime cooperation. We present the structure and implementation of the proposed framework, show how it can be used to support the development of cooperative systems, and discuss case studies of its use.


Introduction

Many Information Indexing, Filtering, and Searching Tools (IIFSTs) have been and continue to be independently built for the World Wide Web (Web) and intranets, with considerable duplication of software development efforts. Furthermore, as each of these systems generally works alone, there is a significant amount of redundant data transfer, processing, and storage [1, 2], which makes them computationally inefficient from a global perspective, especially with the current Web size and growth.

This paper presents a framework for developing IIFSTs with two main goals: (1) to facilitate the development of IIFSTs themselves and (2) to foster runtime cooperation among IIFSTs.

To meet the first goal, the framework follows a modular object-oriented design to implement indexing, filtering, and searching operations. Using the Web as a case study, the prototype focuses on the development of robots to index Web pages, filtering systems to act on new pages and route them to potentially interested users, and search engines to query collections of indices. Although the prototype focuses on the Web, it can also be used to develop IIFSTs for other collections of hyperdocuments.

Two classes of IIFSTs are publicly available on the Web: service providers and software providers. Service providers offer searching [3, 4, 5] and/or filtering [31] interfaces that users access directly to request information. Software providers offer software for Web sites and intranets [6, 7, 8, 32]. Some software providers also offer software development kits along with their software [9, 10, 11]. Although these kits offer callable subroutine libraries, they generally provide little or no extensibility [12]. As described above, our framework provides extensibility and reusability for the development of IIFSTs. In this paper, we discuss how to extend it and comment on case studies of its use.

The second goal is of special importance to systems intended to process huge amounts of data and which are willing to negotiate and share resources with similar systems. Such systems generally work to the limit of their computational power, so sharing resources to minimize redundant work is an important way to reduce costs [13]. There will also be savings in the global network load if groups of IIFSTs work together to avoid redundant indexing and filtering. Therefore, this approach is more scalable to network size and number of IIFSTs and users than the traditional centralized approach.

Cooperation among IIFSTs is being pursued in various ways: standards have been proposed and are under development [14, 15]; broker-based Networked Information Retrieval systems have been implemented and evaluated [16, 34, 35, 36]; and other systems that make ad hoc use of publicly available Web indices exist as well [24, 25]. The framework proposed in this paper may be extended to incorporate the cooperation support functionality of a given standard or system, making it easier to develop systems that comply with these desired characteristics and thus to build IIFSTs that cooperate with other systems to save computational resources.

The following sections present and discuss in detail the proposed framework architecture and show how it can be extended and used to support runtime cooperation. We also present and discuss results of case studies of the framework in use and add concluding remarks and comments on future work.

Framework architecture

Tools for indexing, filtering, and searching information on the Web share commonalities that make possible the definition of a framework for developing this kind of tool. The LIGHTS (Locating Information In Generic HyperText Spaces) framework described in this paper can be adapted during the analysis, design, and implementation of such systems. The framework is implemented entirely in the Java programming language [33].

Modeling Web content

As a first step toward a usable framework, we need to know what kinds of information exist on the Web and how to identify them. The Web consists of a set of distributed objects, stored in a variety of formats and accessible via the HTTP protocol through a URL [29]. The protocol allows the definition of an object's content type by using the MIME (Multipurpose Internet Mail Extensions) type specification [22].

Each MIME type has specific subtypes that identify object contents using the format type/subtype. For instance, an object with identifier text/html has content (or type) text and format (or subtype) HTML. Different content types require specific handling mechanisms; it is desirable, however, that these mechanisms apply both to current types and to new, emerging ones.

Figure 1 shows an object-oriented model of Web content. This model is based on an abstraction of the many different content types that can be found on the Web and uses the Rumbaugh notation [28] for modeling classes and their relations. The WebObject class represents every object accessible through the Web (e.g., documents, videos, sounds, images, multimedia, applets). This class may be extended to include more specialized elements, such as HTML documents or GIF images. This class and the other classes in the models are abstract in the Java sense [33].


Figure 1. An object-oriented model for Web content

The Document class is a specialization of WebObject and a generalization of various MIME subtypes. This class is also abstract and is used to model the different document formats found on the Web. Documents are composed of terms, as indicated by the Term class. Instances of Term may be used, for example, to represent content keywords with associated frequencies and other attributes.

The appropriate handling of a document's content must be implemented by its specializations, such as HTMLDocument or PostscriptDocument, which must provide the proper content handler.

The ContentHandler abstract class represents a high-level interface through which a content handler can be associated with each kind of object. This interface must be extended by appropriate content handlers developed for each content type. The framework uses the interface so that a WebObject can obtain its own content; specializations of this class should implement methods to handle image, text, or any other content type. The object content is obtained through the getContent method. Note that this class was modeled on the Java ContentHandler class [26], provided as a mechanism for obtaining information from URLs.
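
To make the model concrete, the following minimal Java sketch shows one possible shape for these classes. Only the class names, the getContent method, and the handler relationship come from the model above; the remaining fields and signatures are our assumptions.

    // Sketch of the Web content model classes (fields and signatures beyond
    // getContent are assumptions; the model specifies only the class roles).
    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    abstract class ContentHandler {
        // Returns the object's content in a type-specific representation.
        abstract Object getContent(URL url) throws IOException;
    }

    abstract class WebObject {
        protected URL url;                 // where the object lives
        protected ContentHandler handler;  // type-specific content handler

        void setURL(URL url) { this.url = url; }
        void setContentHandler(ContentHandler handler) { this.handler = handler; }

        // Obtain the object content by delegating to the handler.
        Object getContent() throws IOException { return handler.getContent(url); }
    }

    class Term {
        String word;      // the indexed term
        int    frequency; // number of occurrences in the document
        Term(String word, int frequency) { this.word = word; this.frequency = frequency; }
    }

    abstract class Document extends WebObject {
        protected List<Term> terms = new ArrayList<Term>(); // extracted terms
        // Analyze the content and extract terms, links, and so on.
        abstract void parseContent() throws IOException;
    }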

Framework structure

As discussed above, the framework must deal with various types of objects. Figure 2 shows the general structure of the proposed framework.


Figure 2. Framework structure

With the exception of the DataBase class, all classes in this model are abstract. The DataBase class establishes an interface with a database system; it creates the database handles and hides the details of how the connection is defined and which database system is used. This abstraction is possible, for instance, through the JDBC standard interface to database systems [30].

PersistentObject declares methods to insert, update, delete, and query WebObjects in the database. Implementations of this framework must supply specific data structures in the database system to match the content types it should handle.

The WebObject class represents any object that can be stored in the database, with each type of object having its own content handler, as explained above.

WebRobot implements the active component of the framework that does the information indexing and/or filtering. This class basically works with sets of WebObject and PersistentObject instances, accessing objects from the network and processing their contents to extract information to be indexed/filtered.
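
Building on the model classes sketched earlier, the persistence and robot abstractions could take the following shape; the method names appear in this section, but the signatures are our assumptions.

    // Sketch of the persistence and robot abstractions.
    abstract class PersistentObject {
        protected DataBase db; // the concrete DataBase class from figure 2

        abstract void insert(WebObject o);
        abstract void update(WebObject o);
        abstract void delete(WebObject o);
        abstract WebObject getNextObject(); // next object awaiting processing
    }

    abstract class WebRobot {
        protected PersistentObject store; // where indexed objects are kept
        // Main indexing/filtering loop, implemented by concrete robots.
        abstract void goIndex();
    }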

Extending the framework

The process of extending the framework to implement real indexing/filtering systems is simple and flexible and consists of three basic steps:

  1. Identify the objects to be modeled (i.e., the objects that are relevant to the application to be developed);
  2. Map these objects to the corresponding classes on the framework, implement the abstract methods required by the selected framework classes, and add new specific methods if needed; and
  3. According to the database system to be used, write a database scheme to hold the classes needed and then implement the corresponding abstract methods.

To illustrate how this works in practice, we use as an example the development of a Spider [27], a special robot that collects information from the Web for a variety of purposes, such as building an index to assist users with keyword-oriented searching. Spiders use hyperlinks embedded in Web pages to traverse the Web automatically. The example is limited to collecting and processing HTML documents. Figure 3 shows the general architecture of a Spider system; note that the spider component is just one entity in the system.


Figure 3. A Spider system architecture

Identifying the objects

As a first step, we perform an object-oriented abstraction to identify the collection of objects that are part of the system. For the Spider system in figure 3, we can identify three main objects: HTML documents, the spider component, and a database.

In this simplified model, HTML documents represent all documents that can be found on the Web and are formatted using the HTML standard. The Spider represents the core of the system; it will be in charge of processing the HTML documents. Finally, the database component is the element that will be used to store the information processed by the spider.

Mapping the objects to the framework

In the first step we identified the objects of the Spider system; now we map those objects to the framework. In figure 1 we can see that HTML documents correspond to the Document class, which in turn is a specialization of the WebObject class. The spider component is a direct specialization of the WebRobot class, and the database component can be a specialization of the PersistentObject class. These basic classes are extended by implementing their abstract methods and adding other desired characteristics.

HTML documents are specialized documents formed by structured text (HTML, JavaScript). To represent them we create a class, HTMLDocument, a subclass of the Document class from figure 1. To retrieve the contents of a Web page, we provide an implementation of the ContentHandler class as a subclass, HTMLContentHandler, whose getContent method handles all the steps needed to retrieve the page (establishing an HTTP connection, loading the page contents, and so on). When establishing a connection, getContent identifies itself to Web servers; for example, the following header states the robot's name and version: "User-Agent: Spider/1.0".
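
A possible implementation of getContent, using the standard java.net API and sending the User-Agent header just mentioned, could look like the sketch below; this is our illustration, not the system's actual code.

    // Sketch: fetching a page over HTTP while identifying the robot.
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    class HTMLContentHandler extends ContentHandler {
        Object getContent(URL url) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "Spider/1.0"); // name and version
            StringBuilder page = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();
            return page.toString(); // the raw HTML text
        }
    }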

Next we implement the methods setURL and setContentHandler to configure the page with its URL and ContentHandler. Finally, we provide the implementation of the method parseContent, which analyzes the contents of the page and extracts all the information needed (links, keywords, etc.).
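
For illustration, parseContent could extract the embedded hyperlinks with a simple pattern match; a real system would use a proper HTML parser, so the sketch below is deliberately naive.

    // Sketch: a naive parseContent that collects href links from the page.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class HTMLDocument extends Document {
        private static final Pattern HREF =
                Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
        List<String> links = new ArrayList<String>(); // URLs to visit later

        void parseContent() throws IOException {
            String html = (String) getContent(); // via HTMLContentHandler
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            // keyword/frequency extraction into the terms list would go here
        }
    }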

Having an object that represents Web pages, we now need an object that can save and retrieve information while hiding the internal structures of the database. To achieve this we create a subclass of PersistentObject (see fig. 1) named HTMLPersistentObject, which is explained in the next section.

Finally, we create the class Spider, a subclass of WebRobot (see fig. 1). The most important point here is the implementation of the method goIndex, which controls the indexing process. It first retrieves an unprocessed HTMLDocument object from the database using the method getNextObject, then processes the object's content using parseContent, and finally saves it into the database with the insert method of the HTMLPersistentObject object.
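
Putting the pieces together, the Spider's indexing loop can be sketched as follows:

    // Sketch of the Spider's main indexing loop.
    class Spider extends WebRobot {
        Spider(PersistentObject store) { this.store = store; }

        void goIndex() {
            HTMLDocument doc;
            while ((doc = (HTMLDocument) store.getNextObject()) != null) {
                try {
                    doc.parseContent(); // fetch and analyze the page
                    store.insert(doc);  // save the extracted information
                } catch (java.io.IOException e) {
                    // skip unreachable pages and keep crawling
                }
            }
        }
    }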

Figure 4 below shows the hierarchy of the Spider system after extending the framework.


Figure 4. The hierarchy of the Spider system

The database

Finally, the last step is the design of the database that will store the information. The HTMLPersistentObject class, introduced in the previous section, interfaces with the database.

It has to implement the following methods:

  • insert, update, delete: these methods take an HTMLDocument object and insert, update, or delete it in the database, respectively.
  • getNextObject: this method retrieves from the database the next HTMLDocument object to be analyzed; the Spider uses it to determine which object to process next. The robot's traversal heuristic is defined in this method, so, depending on the implementation, we can have a depth-first algorithm or another approach.

Note that switching to a different database requires altering only this class, not the rest of the project. Before implementing this class, a database scheme must be designed that can hold all the information to be saved, whatever kind of database is chosen (relational, object-oriented, etc.). We omit the database scheme here because it is implementation-specific. A sketch of one possible implementation follows.
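
The sketch below shows a relational getNextObject that picks the oldest unprocessed URL, which yields a breadth-first traversal. The paper names JDBC [30] as the database mechanism, but the "pages" table and its columns are hypothetical.

    // Sketch: a JDBC-backed getNextObject with a FIFO (breadth-first)
    // heuristic. The "pages" table and its columns are hypothetical.
    import java.net.URL;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    class HTMLPersistentObject extends PersistentObject {
        private final Connection conn; // obtained through the DataBase class

        HTMLPersistentObject(Connection conn) { this.conn = conn; }

        WebObject getNextObject() {
            try {
                Statement st = conn.createStatement();
                ResultSet rs = st.executeQuery(
                    "SELECT url FROM pages WHERE processed = 0 ORDER BY id");
                if (!rs.next()) return null; // nothing left to crawl
                HTMLDocument doc = new HTMLDocument();
                doc.setURL(new URL(rs.getString("url")));
                doc.setContentHandler(new HTMLContentHandler());
                return doc;
            } catch (Exception e) {
                return null;
            }
        }

        void insert(WebObject o) { /* INSERT the document, its terms and links */ }
        void update(WebObject o) { /* UPDATE the stored document */ }
        void delete(WebObject o) { /* DELETE the stored document */ }
    }

A depth-first variant would only change the ordering clause or use an explicit stack, which is exactly the kind of heuristic decision this method encapsulates.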

Implementing cooperation

The framework just presented includes an object-oriented interface to implement cooperative systems. This interface allows systems to request and provide indexing, searching, and filtering services from/to each other by using Web Views (WVs) [1, 13] as an underlying partitioning scheme for the Web.

Web views and digital neighborhoods

A WV object defines a composition of Digital Neighborhoods (DNs) [2], each of which refers to a set of Web documents according to geographic, hypertext, network, and/or content criteria. For instance, a WV to cover all the documents about summer in Pernambuco, Brazil, could be structured as follows:

DN1 = { type: content; concept: summer; radius: 3 }
DN2 = { type: geographic; place: Recife, Brazil; radius: 200 }
WV1 = ( DN1 and DN2 )

In the example above, the attribute type defines the type of DN, and radius defines the region covered by the DN around its reference point. The region defined by a radius is calculated according to specific metrics for the given DN type. Thus, the setting "radius: 3" restricts DN1 to documents with contents related to summer, according to some document similarity metric. In DN2, "radius: 200" may be interpreted as "documents stored in Web sites at most 200 km away from Recife." Furthermore, the notion of radius (i.e., distance) corresponds to hyperlink path lengths in hypertext DNs and to the IP protocol subnetwork scheme [17] in network DNs.

WVs may be built with arbitrarily long boolean expressions using AND, OR, and NOT operators, parentheses, and DN operands. This structure allows WV objects to determine inclusion relations and calculate intersections between pairs of WVs. The algorithms for these calculations are based on the direct comparability of DNs of the same type [2] and on reductions of boolean expressions to the equivalent disjunctive normal form [18]. The following example illustrates the use of these algorithms.
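
In code, a WV can be represented as a boolean expression tree with DNs at the leaves; the representation below is our own sketch, not the framework's published implementation.

    // Sketch: WVs as boolean expression trees over DN leaves.
    abstract class WVExpr { }

    class DN extends WVExpr {
        String type;   // geographic, hypertext, network, or content
        String ref;    // reference point (place, concept, site, ...)
        int    radius; // region covered around the reference point
        DN(String type, String ref, int radius) {
            this.type = type; this.ref = ref; this.radius = radius;
        }
    }

    class And extends WVExpr { WVExpr l, r; And(WVExpr l, WVExpr r) { this.l = l; this.r = r; } }
    class Or  extends WVExpr { WVExpr l, r; Or(WVExpr l, WVExpr r)  { this.l = l; this.r = r; } }
    class Not extends WVExpr { WVExpr e;    Not(WVExpr e) { this.e = e; } }

The WV1 example above would then be built as new And(new DN("content", "summer", 3), new DN("geographic", "Recife, Brazil", 200)), and the inclusion and intersection algorithms operate by normalizing such trees to disjunctive normal form.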

Let Gi denote a geographic DN, Hj a hypertext DN, Nk a network DN, and Cl a content DN, and let V1 and V2 be two WVs, where:

V1 = ( G1 OR H1 ) AND ( G1 OR C1 )
V2 = N1 OR ( C2 AND C3 )

The equivalent disjunctive normal forms for V1 and V2 are:

V1 = G1 OR ( H1 AND C1 )
V2 = N1 OR C4

where C4 is the single content DN obtained by directly combining C2 AND C3. Now suppose it is detected (by direct comparison between DNs of the same type, here content) that C1 < C4 (i.e., C1 is included in C4). It follows that ( H1 AND C1 ) < C4; hence ( H1 AND C1 ) is included in the intersection between V1 and V2. Therefore, ( H1 AND C1 ) can be used as a cooperation zone between the systems whose WVs are V1 and V2.

Provided that there are algorithms to determine intersections between WVs and that each IIFST may be configured to cover a given WV, then the WV formalism may be used to support cooperation between IIFSTs.

A simple Web View-based cooperation scheme

For a group of systems to cooperate, they need to (1) be able to communicate with each other, (2) know what services may be requested, and (3) know how to request these services. Our framework provides a simple cooperation scheme based on RMI [19] and WVs to support this functionality. A general view of the RMI-based architecture is shown in Figure 5.


Figure 5. RMI-based cooperation architecture

Each system willing to offer services should make a server object of the class CoopServer available via RMI. This object implements the CoopInt interface, outlined below:

    interface CoopInt {
        Server[]    findOutActiveServers();
        WV          getWebView();
        Service[]   listAvailableServices();
        QueryResult runQuery(Query query);
        boolean     regUserProfile(UserProfile profile, WV view);
        boolean     deliverFilteringResult(User user);
        boolean     regShareIndexing(WV view);
        boolean     deliverSharedIndexing(IndexSet indices, WV view);
        boolean     regShareDataBase(WV view);
    }

The first two methods exchange control information among servers. The methods runQuery, regUserProfile, and deliverFilteringResult allow for the execution of ad hoc retrieval and filtering tasks between servers. regShareIndexing and deliverSharedIndexing are used to share indexing efforts over a given WV; that is, the requesting server stops running an indexing robot for the given WV and instead periodically imports the indices from the remote server. The method regShareDataBase registers the requesting server's interest in making regular use of the remote server's database; in effect, it requests permission to issue a potentially high number of queries within the scope of the given WV on a regular basis.

When seeking help to lower its local workload, a cooperating server uses the first three methods above to try to find servers that offer resources of interest, that is, servers that (1) are active, (2) offer the services it needs, and (3) cover the desired WV, either totally or partially.
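
This discovery step can be sketched from the client side as follows. The registry name, and the assumption that CoopInt extends java.rmi.Remote with methods declaring RemoteException (as RMI requires), are ours; the paper only outlines the interface.

    // Sketch: locating a cooperating server via RMI and inspecting its WV.
    import java.rmi.Naming;

    class CoopClient {
        public static void main(String[] args) throws Exception {
            // Registry name "CoopServer" and host are hypothetical.
            CoopInt server = (CoopInt) Naming.lookup("//remote.host/CoopServer");
            Service[] services = server.listAvailableServices();
            WV remoteView = server.getWebView();
            // If remoteView covers (part of) the local WV and the needed
            // service is offered, register for cooperation, e.g.:
            // server.regShareIndexing(cooperationZone);
        }
    }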

Extending the basic cooperation scheme

Given its modularity, its object-oriented design, and its Java implementation, the whole framework, and specifically the cooperation scheme, may be extended in various ways. Some of the possibilities are:

  • RMI: may be replaced by another communication mechanism (e.g., sockets or CORBA);
  • WVs: other ways to define scopes for systems and users may be used; and
  • Services: the list of services in the CoopInt interface may be extended with new services.

The same strategies to extend the general framework also apply to extending/reusing the cooperation functionality.

Case studies

An object-oriented architecture for integrating Web-based information systems with unstructured data

Traditional information systems store the information they gather in database systems that impose some kind of structure (relational, object-oriented, hierarchical). Such structures impose limitations when information that cannot be treated in a structured way must be processed and stored. With the emergence of Web-based information systems, new alternatives for processing information have been proposed, allowing unstructured information of many types to be integrated into these systems and expanding their range of action.

We have designed and implemented an object-oriented architecture to integrate Web-based information systems with unstructured information. The architecture is based upon the client/server paradigm, with the server composed of three basic elements: a dedicated HTML page indexing subsystem built upon the LIGHTS framework, a query subsystem, and another subsystem responsible for updating and handling the data that is sent by the Web-based information system being integrated.

The dedicated indexing subsystem used in this architecture implements Single Term Indexing methods [23], meaning that the terms being processed are single words, with words performing grammatical functions (such as pronouns and articles) being eliminated using a generic Stop List structure [20] that supports English, Spanish, and Portuguese.
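
A minimal sketch of single-term extraction with stop-list elimination is given below; the stop words shown are a tiny illustrative subset of a real trilingual list.

    // Sketch: single-term indexing with a stop list (tiny subset shown).
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    class SingleTermIndexer {
        // A real stop list would cover English, Spanish, and Portuguese.
        private static final Set<String> STOP = new HashSet<String>(
                Arrays.asList("the", "a", "of", "el", "la", "de", "o", "um"));

        static String[] indexTerms(String text) {
            return Arrays.stream(text.toLowerCase().split("\\W+"))
                         .filter(w -> !w.isEmpty() && !STOP.contains(w))
                         .toArray(String[]::new);
        }
    }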

The query subsystem is based upon Boolean Searching [20], with queries specified as boolean expressions composed of terms connected by the logical operators AND, OR, and NOT. The query model can be extended to support other query methods, and the results generated by the system can be used to formulate more complex queries in the Web-based information system, which acts as a client in the proposed architecture.
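
Since boolean queries over an inverted index reduce to set operations on posting lists, the evaluation can be sketched as follows; the index structure is our assumption.

    // Sketch: AND/OR/NOT as set operations over posting lists, where the
    // index maps each term to the set of ids of documents containing it.
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class BooleanSearch {
        static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
            Set<Integer> r = new HashSet<Integer>(a); r.retainAll(b); return r;
        }
        static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
            Set<Integer> r = new HashSet<Integer>(a); r.addAll(b); return r;
        }
        static Set<Integer> not(Set<Integer> all, Set<Integer> a) {
            Set<Integer> r = new HashSet<Integer>(all); r.removeAll(a); return r;
        }
        // "summer AND NOT winter" over an index and a universe of doc ids:
        static Set<Integer> example(Map<String, Set<Integer>> index, Set<Integer> all) {
            return and(index.get("summer"), not(all, index.get("winter")));
        }
    }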

An information-filtering system for the Web

Recife is an information system designed to filter and recommend documents collected from the Web. It is composed of three main modules: the collection module, an extension of the LIGHTS framework; the selection module; and the presentation module, which presents recommendations and gathers feedback from users.

In the current prototype implementation, Recife works with HTML documents that are compared with a given user's long-term interest model. Recommendations are based upon interests, similarities among users' areas, user-supplied documents, and the data gathered by the collection module. The similarities are calculated using a parameterized variation of the Vector Space Model algorithm [21].
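
In the Vector Space Model, documents and interest models are compared as term-weight vectors; the textbook cosine measure is sketched below. Recife's parameterized variation is not detailed in this paper, so the sketch shows only the standard form.

    // Sketch: cosine similarity between a document vector and a user-profile
    // vector, the core comparison of the Vector Space Model [21].
    import java.util.Map;

    class VectorSpace {
        static double cosine(Map<String, Double> doc, Map<String, Double> profile) {
            double dot = 0, normDoc = 0, normProf = 0;
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                Double w = profile.get(e.getKey());
                if (w != null) dot += e.getValue() * w; // shared terms only
                normDoc += e.getValue() * e.getValue();
            }
            for (double w : profile.values()) normProf += w * w;
            if (normDoc == 0 || normProf == 0) return 0;
            return dot / (Math.sqrt(normDoc) * Math.sqrt(normProf));
        }
    }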

The Recife system combines personalized and collaborative information-filtering techniques, allowing users to share, evaluate, comment on, and display the system's recommendations. Users can also use recommendations as samples for future selections. The display module integrates facilities for viewing system recommendations, editing the user model, and viewing recommendations made either by or to users who share some area of interest. A version with a Portuguese interface is available at http://www.di.ufpe.br/~recife.

Conclusions and future work

The framework presented in this paper provides an object-oriented approach that models Web content and establishes baselines for the construction of Web information indexing/filtering tools. The model has been successfully used in the development of such tools, validating its extensibility and flexibility. As shown in previous sections, the way the framework can be extended gives developers enough flexibility to implement functionality specific to their working domain. The framework can therefore also be used as a basis for developing indexing and filtering systems for other hyperdocument collections.

The model allows cooperation in sharing computational resources as a means of reducing storage and processing costs. This functionality is especially useful in systems that must deal with huge volumes of data. Future applications of this framework include the geographical localization of information and Web pages and user surveys for electronic commerce.

The latest version of the prototype will be publicly available from http://www.di.ufpe.br/~bright/lights by the time of the conference and will continue to evolve through the validation of its basic indexing, filtering, and searching functions and through more advanced applications such as electronic commerce and digital libraries.

References

  1. Gonçalves, P.F., Salgado, A.C., Meira, S.L. "A Distributed Mobile-Code based Architecture for Information Indexing, Searching and Retrieval in the World-Wide Web." Proceedings of the INET'97, http://www.isoc.org/inet97/proceedings/A7/A7_2.HTM.
  2. Gonçalves, P.F., Meira, S.L., Salgado, A.C. "Digital Neighbourhoods: Partitioning the Web for Information Indexing and Searching." In Olivé, A., Pastor, J.A. (Eds.) Springer Verlag, Lecture Notes in Computer Science 1250, pp. 289-302, June 1997, http://www.di.ufpe.br/~bright/publications/caise97.html.
  3. AltaVista, http://altavista.digital.com/.
  4. Excite, http://excite.com/.
  5. HotBot, http://hotbot.com/.
  6. AltaVista, "AltaVista Search Intranet eXtensions 97," http://altavista.software.digital.com/search/5/index.htm.
  7. Litras, S.W. "DB/Text Intranet Spider pulls your intranet together." Intranet World, August 1997, http://www.inmagic.com/reviews/rev-info.htm.
  8. Excite, "EWS -- Excite for Web Servers," http://www.excite.com/navigate/prodinfo.html.
  9. Fulcrum, "Knowledge Builder Toolkit -- developing custom applications," in Fulcrum Knowledge Network -- Product Overview, http://www.fulcrum.com/english/products/KNProd3.htm#05.
  10. PLS, "Callable Personal Librarian (CPL)," http://www.pls.com/products/cpl/.
  11. Verity, "Search Engine and Developer Kits," http://www.verity.com/products/customiz.html.
  12. Watt, D. "Programming Language Concepts and Paradigms." Prentice Hall, 1990.
  13. Gonçalves, P.F.; Meira, S.L.; Salgado, A.C. "BRight: A Distributed System for Web Information Indexing and Searching," Proc. Brazilian Symposium of Software Engineering (SBES'97), pp. 331-345, Fortaleza-CE, Brazil, October 13-17, 1997, http://www.di.ufpe.br/~bright/publications/sbes97.html.
  14. Gravano, L.; Chang, C.-C.K.; García-Molina, H.; Paepcke, A. "STARTS: Stanford Proposal for Internet Meta-Searching." Proc. of the 1997 ACM SIGMOD International Conference on Management of Data, 1997, http://www.bright.org.br/pfg/Qualif/Review_Dec97/ir/GravanoEtAl97.zip.
  15. W3C, "Technical Reports and Publications," http://w3c.org/TR/.
  16. Dupuy, F.; Pensivy, S. "Experience Report from a TINA Experimentation using ORBs and the Web." In Connolly, D.; Soley, R.M. (eds.) Joint W3C/OMG Workshop on Distributed Objects and Mobile Code. Boston, MA, 1996. html from BRight, html from w3.org.
  17. Tanenbaum, A.S. Computer Networks. McGraw-Hill, 1992.
  18. Chang, C.-L.; Lee, R.C-T. "Symbolic Logic and Mechanical Theorem Proving." Boston, Academic Press, 1973.
  19. Sun, "Remote Method Invocation Specification," http://java.sun.com/products/jdk/1.1/docs/guide/rmi/spec/rmiTOC.doc.html.
  20. Frakes, W.B. and Baeza-Yates, R. "Information Retrieval: Data Structures and Algorithms." Prentice Hall, 1992.
  21. Lee, D. L., Chuang, H. and Seamons, K.; "Document Ranking and the Vector Space Model." IEEE Software Vol. 14, No. 2, March/April 1997.
  22. MIME Media types specification. RFC 1521. ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/media-types
  23. Gudivada, V.N., Raghavan, V.V., Grosky, W.I., and Kasanagottu, R. "Information Retrieval on the World Wide Web." IEEE Internet Computing, Vol. 1, No. 5, pp. 58-68, September/October 1997.
  24. MetaCrawler. http://www.metacrawler.com
  25. Mother Load. http://www.cosmix.com/motherload
  26. The Java JDK API: http://www.javasoft.com/products/jdk/1.1/docs/api/packages.html
  27. Cheong, F. "Internet Agents: Spiders, Wanderers, Brokers and Bots." New Riders Publishing, 1996.
  28. Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., and Lorensen, W. "Object-Oriented Modeling and Design." Prentice Hall, 1991.
  29. The HTTP Protocol. http://www.w3c.org/Protocols
  30. Siple, M. "The complete guide to Java programming: JDBC, ODBC & SQL." McGraw Hill. 1997.
  31. My excite. http://my.excite.com/
  32. Recife Filtering System. http://www.di.ufpe.br/~recife/
  33. Flanagan, D. "Java in a Nutshell." O'Reilly & Associates, 1997.
  34. Gravano, L.; García-Molina, H. "Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies." Technical Note STAN-CS-TN-95-21, Computer Science Dept., Stanford Univ., Stanford, CA, 24pp., 1995.
  35. Boles, D.; Dreger, M.; Grossjohann, K. "MeDoc Information Broker - Harnessing the Information in Literature and Full Text Databases." Proc. Networked Information Retrieval Workshop, SIGIR'96, 5pp., 1996. html from Oldenburg.
  36. Crowder, G.; Nicholas, C. "Resource Selection in CAFE: an Architecture for Network Information Retrieval." Computer Science and Electrical Engineering Dept., Univ. of Maryland, Baltimore County, MD, 1996.
