The Authors Catalogue Their Documents for a Light Web Indexing

Davide Musella <davide@jargo.itim.mi.cnr.it>
Computer Science Department
Università degli Studi, Milan, Italy
Tel. +39 (0)2 70643271

Marco Padula <padula@nerve.itim.mi.cnr.it>
Istituto per le Tecnologie Informatiche Multimediali
CNR, Milan, Italy
Tel. +39 (0)2 70643271

Abstract

The growth of the World Wide Web (WWW) has unleashed a quantity of software and information resources that has in turn generated a demand for further development. This positive feedback calls for tools that can index the mass of information now available to users so that it can be fully exploited. Current mechanisms for information indexing lack the support in cataloguing that could be supplied by a strict definition of meta-information and adequate tools for its circulation. This paper examines the potential of Hypertext Markup Language (HTML) document cataloguing. On the basis of an investigation of the definition and implementation implications of this method, we propose a grammar for the HTML 3.0 META tag, some guidelines for handling meta-information on Hypertext Transfer Protocol (HTTP) servers, and some rules by which search agents use only this meta-information to index documents, decreasing data flow across the Net. We focus our attention on the retrieval and cataloguing of documents available on the WWW.

1. Introduction

The synergism of current technologies for digital data processing and management, networking, multimedia and hypermedia systems, and user-system interaction has spurred the development of working environments that go beyond the simple satisfaction of user needs: the electronic memory is being replaced by the information network, with all its circulating and concentrated resources for information and computation, to which every user may contribute by modifying and updating data and offering his own products. This produces dramatic innovations, both technological and social, regarding the sources of information; the instruments and procedures developed and sold for their processing, organization, and circulation; and the role of the persons and institutions dedicated to these activities. The macroscopic aspect of all this is perhaps best seen in the development of the WWW environment for the use and management of Internet resources (the Internet memory): in astonishingly little time, the number of documents available has grown to several tens of millions. With the creation of systems for the automatic handling of information, information science has entered the institutions (libraries; museums; publishers of newspapers, periodicals, and books; television; cinema) involved in the organization and dissemination of our social memory, bringing powerful tools to support them in their undertakings. Today the situation is changing again: it is now these institutions that are projected into the world of circulating information, where they can satisfy their information needs and with which they partially overlap. Their evolution, passing through the various phases of automation of archives management, is now moving toward the construction of management systems that integrate the autonomous archives found in the various nodes of the network and give the user hypertextual techniques to reorganize documents. At this stage, it is the user who takes charge of organizing knowledge according to a cultural project or logic that he himself chooses or establishes [1]. It is evident that this requires the conception and construction of new tools, for searching and collecting documents of interest on the one hand and for personalizing information on the other [2,3].

The lack of efficient tools for the recovery of the desired information would mean the collapse of the organization of memory. Modern cataloguing methods identify documents with parameters such as author, title, and subject, providing the user with various accesses for document retrieval [4]. The Internet Memory could become the basis of an institution for the organization of our collective social memory; what is lacking most is a satisfactory method for cataloguing documents and the tools for handling such a mass of data. We focus here on the problem of cataloguing and retrieving documents on the Internet.

As was the case in conventional information retrieval [5], two approaches, the manual and the automatic, are currently taking shape. In the first, documents are collected, classified, and indexed by human operators and then organized and circulated by hypertextual instruments (virtual libraries, catalogues). But the enormous amount of data available, as well as the rapidity with which these data change and their short life span, makes manual cataloguing unfeasible, considering the human resources that would be required even in highly specialized domains alone, especially for keeping archives up to date.

As for the automatic approach, some tools, called search agents or robots, do exist for network information retrieval, but both the extraction of the data that define the contents of the document and the wide bandwidth used by these robots present problems.

Almost all the available robots examine all the documents that can be accessed from an initial sample of documents and generate an indexing structure through their full-text processing [5]. It is, therefore, necessary to retrieve the entire document to be indexed, and as the agents must complete the cataloguing in a short time, they absorb a great quantity of network resources for processing, increasing the network workload to critical levels.

Moreover, notwithstanding the large amount of data moved, the results obtained are only approximate, owing to the poor characterization of the documents examined. This lack of proper cataloguing results in the circulation of poor-quality information. What is needed is a cataloguing method that assigns documents keys that agree semantically with their contents, increasing the quality of the information in circulation while decreasing its entropy.

Some partially successful attempts have been made to formulate behavioral rules [6] for the agents so that they cannot overload the HTTP servers [7]. Intelligent agents for automatic searching [4] that can cope with the infinite variety of structural patterns in different documents represent a feasible way of solving this problem.

2. Light indexing

The WWW has become a fascinating gateway to telematics and the electronic publishing world for beginners, who no longer need the expert credentials that until recently distinguished Internet users. The WWW has become a phenomenon of such proportions because of the strong demand of these new end users, who can now access the very latest technologies. Their needs must be taken into account when designing new tools or modifying existing ones. Consequently, the modifications to WWW document management tools that we propose here ask of authors, who cannot be presumed to be skilled in computer science, only a simple cataloguing activity on their own documents, reserving the more technical tasks for the HTTP servers and the robots.

Some proposals for HTML document cataloguing that lean toward lighter indexing (i.e., that control the network workload) have already been discussed, as in Raggett [8], where the HTML META tag, which summarizes document contents, is defined. Unfortunately, these proposals generally lack a strict grammar that precisely defines terms and meanings. Moreover, they do not explicitly include the META tag in the cataloguing method. This explains why the META tag is neither used by document writers nor managed by HTTP servers and why, as a consequence, the authors of robots have adopted other mechanisms to create indexing structures and limit data flow.

Cataloguing based on the META tag does not require that the author of a document be adept in formalizing document contents. It is more effective than automatic cataloguing performed by search agents, and when coupled with a simple indexing task that does not make use of tables or reference lists, it can provide for retrieval that is both effective and efficient.


Figure 1. HEAD and GET requests.

Indexing based on META tag interpretation moves little data, and these data can be attached to those normally transmitted in reply to an HTTP HEAD request (which refers to the properties of a file and not to its contents). In fact, a reply to a HEAD request contains information such as Content-Encoding, Content-Type, or Last-Modified; integrating this with the META tag contents produces a data pack sufficient for satisfactory document indexing (Figure 1).
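
For illustration, a reply to a HEAD request for a catalogued document might then look like the following sketch. The standard header lines are those of HTTP/1.0 [9]; the Author and Keywords lines are hypothetical headers generated from the document's META tags (Section 3.1), and all values are invented for the example.

    HTTP/1.0 200 OK
    Content-Type: text/html
    Content-Length: 5120
    Last-Modified: Mon, 15 Jan 1996 10:00:00 GMT
    Author: Pennac, Rossi
    Keywords: Italy Product, Italy Tourism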

The kernel of the problem is how to get the writers to catalogue their documents and include the description of the contents in the document itself, and how to have the servers support the robots by transmitting only the description, not the whole document.

The reluctance of writers to commit themselves to this task could be overcome by editors that require, and assist in, the compilation of the META tag; by HTTP servers that filter the documents, blocking those that have not been catalogued; and by browsers that ignore or warn of ill-composed documents when viewing is requested.

So many different editors are available today that the idea of having all the models currently in use modified to enforce the desired constraints is admittedly highly improbable. But only two or three browsers share almost the entire market, so it would be feasible and affordable to agree on the adoption of a standard behavior.

To render such a policy effective, we propose a procedure, suited to various contexts, that, besides decreasing disk and CPU usage, can deal with servers' different needs and can in the future be extended to similar methodologies applied to other communication protocols and to file formats other than HTML.

3. Guidelines for designing the protagonists of the indexing process

Our proposal covers three main aspects:

3.1 The META tag

As we are concerned here with HTML documents, we focus on the META tag as defined in HTML 3.0 [8]. The purpose of the META tag is to insert meta-information into HTML documents, but its present definition is loose enough to let the document author formulate instructions that are ambiguous to a parser. For example, in the following cases,

    <META HTTP-EQUIV = "author" CONTENT = "Pennac, Rossi">

    <META HTTP-EQUIV = "writer" CONTENT = "Pennac, Rossi">
the same meaning can be associated with both instructions, despite their different syntax. To prevent this, we have defined a minimal set of properties of the HTTP-EQUIV attribute with which to catalogue a document.

The HTTP-EQUIV attribute has been defined to bind a property to an HTTP response header. HTTP servers read the content of the document HEAD, generate response headers corresponding to the properties specified by the HTTP-EQUIV attribute, and assign each property the value specified by the CONTENT attribute. This provides document authors with a mechanism for identifying the information that should be included in the response headers of an HTTP request.
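
As a minimal sketch of this mapping (ours, not part of the HTML or HTTP specifications), a server might extract HTTP-EQUIV/CONTENT pairs from the document HEAD and turn them into response header lines as follows; the regular expression is only an approximation for well-formed META tags, and the reserved-name check anticipates the conflict rule given at the end of this subsection.

    import re

    # Approximate extraction of HTTP-EQUIV/CONTENT pairs from the HEAD of an
    # HTML document; a real server would use a proper HTML parser.
    META_RE = re.compile(
        r'<META\s+HTTP-EQUIV\s*=\s*"([^"]+)"\s+CONTENT\s*=\s*"([^"]*)"\s*>',
        re.IGNORECASE)

    # Properties typically generated by the HTTP server itself (e.g.,
    # Content-Encoding, Expires, Last-Modified); META tags that try to
    # redefine them are ignored.
    RESERVED = {"content-encoding", "content-type", "expires", "last-modified"}

    def meta_to_headers(html_head):
        headers = []
        for name, value in META_RE.findall(html_head):
            if name.lower() in RESERVED:
                continue            # conflict: reserved for the server
            headers.append(f"{name}: {value}")
        return headers

    # Example with the META tag shown above:
    print(meta_to_headers('<META HTTP-EQUIV = "author" CONTENT = "Pennac, Rossi">'))
    # ['author: Pennac, Rossi']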

If the semantics of the HTTP response headers is known and unambiguous [10], robots can interpret and exploit them on the basis of a well-defined mapping, whether or not the DTD says anything about it.

HTTP header names are not case-sensitive and can be specified by any text string, but, to avoid the ambiguity of the previous example, we have defined a fixed set of property names [10].

The attribute CONTENT is used to assign a value to a property. When it is used with HTTP-EQUIV, its value may be a Boolean expression: the AND operator is represented by the SPACE (ASCII 32) and the OR operator by the COMMA (ASCII 44), with AND processed before OR. The spaces between a comma and a word are ignored; for example, in

    <META HTTP-EQUIV= "Keywords" CONTENT= "Italy Product, Italy Tourism">
the expression Italy Product, Italy Tourism means Italy AND (Product OR Tourism).
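
A minimal sketch (our illustration, not a prescribed algorithm) of how a robot might interpret such a CONTENT expression, applying AND before OR as specified above:

    # Parse a CONTENT keyword expression in which the comma is the OR
    # operator, the space is the AND operator, and AND binds more tightly
    # than OR. The result is a list of alternatives (OR), each of which is
    # a set of terms that must all be present (AND).
    def parse_keywords(content):
        alternatives = []
        for group in content.split(","):              # OR: comma-separated groups
            terms = {t for t in group.split() if t}   # AND: space-separated terms
            if terms:
                alternatives.append(terms)
        return alternatives

    def matches(content, document_terms):
        # A document matches if at least one AND-group is fully contained
        # in its own set of terms.
        return any(group <= document_terms for group in parse_keywords(content))

    expr = "Italy Product, Italy Tourism"
    print(parse_keywords(expr))    # [{'Italy', 'Product'}, {'Italy', 'Tourism'}]
    print(matches(expr, {"Italy", "Tourism"}))   # True: Italy AND Tourism
    print(matches(expr, {"Product"}))            # False: Italy is missing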

The document's author may introduce further properties; he must simply avoid using terms like Content-Encoding, Expires, and Last-Modified, which are typically generated by the HTTP server and included in the meta-information of its reply to a HEAD request.

The information contained in the TITLE tag could be used to index a document and would therefore have to be inserted as a META tag property (for example, <META HTTP-EQUIV="title" CONTENT="This is the title">); but, to avoid useless redundancy, we prefer not to add a specific title property to the META tag grammar and instead to treat the TITLE tag itself as a true META tag.

Server implementations must be consistent with the set of properties presented above and must resolve possible conflicts by ignoring any META tag whose HTTP-EQUIV attribute specifies a property reserved for server-generated response headers.

3.2 The HTTP servers

HTTP servers can be redesigned to integrate a HEAD reply with meta-information in two principal ways. In the first, the HEAD of the requested file is parsed to extract the document's meta-information. The meta-information of the more frequently requested documents (the frequency threshold being defined by the user) may be cached to decrease the workload; the server then parses only documents not listed in the cache. In the second design, the meta-information is read from a file created by the server, either by extracting the meta-information from the document or by duplicating it. A single meta-file can be produced for each document, for all the documents in the same server subdirectory, or for all the documents managed by the server.
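
A minimal sketch of the first design, under our own assumptions: every requested document is cached (rather than only those above a frequency threshold), the cache is validated against the file's modification time, and the META extraction is the approximate one sketched in Section 3.1.

    import os, re

    META_RE = re.compile(
        r'<META\s+HTTP-EQUIV\s*=\s*"([^"]+)"\s+CONTENT\s*=\s*"([^"]*)"\s*>',
        re.IGNORECASE)

    _cache = {}   # path -> (modification time, extracted header lines)

    def meta_headers_for(path):
        # Parse the requested document on demand, reusing the cached result
        # as long as the file has not been modified.
        mtime = os.path.getmtime(path)
        cached = _cache.get(path)
        if cached and cached[0] == mtime:
            return cached[1]                       # document unchanged
        with open(path, encoding="latin-1") as f:
            headers = [f"{name}: {value}" for name, value in META_RE.findall(f.read())]
        _cache[path] = (mtime, headers)            # refresh the cache entry
        return headers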

At present, the general functioning of an HTTP server comprises these operations:

  1. Receives request
  2. Searches for the requested data
  3. Transfers the reply according to the requested procedure (HEAD, GET, etc.)

In particular, if the request is of the HEAD type, operation 2 becomes an extraction of the properties of the file, such as its length, last modification date, and content type.

We propose that this operating sequence be modified as follows (Figure 2); a sketch of the resulting HEAD handling is given after the list:

  1. Receives HEAD request
  2. Searches for requested data
  3. Accesses the meta-information
  4. Extracts the information to form a reply of the HEAD type (file length, last modification date)
  5. Includes the meta-information among the fields to be sent in reply to the HEAD request
  6. Transfers the reply
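
A minimal sketch (ours) of steps 3-5, assuming the second design of the previous subsection: the meta-information has already been duplicated by the server into a companion file holding one Name: value line per property, named here with the .meta extension used in the submission procedure below.

    import os, time

    def head_reply_fields(doc_path):
        # Step 4: properties that do not require opening the document.
        st = os.stat(doc_path)
        fields = [
            "Content-Type: text/html",
            f"Content-Length: {st.st_size}",
            "Last-Modified: " + time.strftime("%a, %d %b %Y %H:%M:%S GMT",
                                              time.gmtime(st.st_mtime)),
        ]
        # Steps 3 and 5: access the stored meta-information and add it to
        # the fields sent in reply to the HEAD request.
        meta_path = doc_path + ".meta"
        if os.path.exists(meta_path):
            with open(meta_path) as f:
                fields += [line.strip() for line in f if line.strip()]
        return fields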

The HEAD reply formulated by currently available servers includes only information about a document that can be obtained without opening it, since it does not concern the document's contents.


Figure 2. Management of a GET request, management of a HEAD request, and proposed new method for the management of a HEAD request.

But to retrieve the data in the META tags, the document must be opened and parsed to extract the meta-information. As this operation could overload the server, we have designed a method to avoid it. The following procedure is based on the duplication of the meta-information and its storage in a file external and specific to the document. This procedure puts the author in control of document cataloguing (which remains in HTML format), but requires that he manage only one file.

  1. The author has completed his document and catalogues it.
  2. The author calls for document submission to offer the document to WWW users; if some meta-information is lacking or ill formed, the HTTP server sends a diagnostic message.
  3. After updating, the document must be submitted again.
  4. Upon receipt of a document submission, the server parses the document, extracts its meta-information, and stores this in a file, which is in the same directory as the document, can not be modified by the author, and is invisible to him. The file has the same name as the document, with the extension .meta.
  5. The server updates the SFT (Submitted Files Table) with references to all the submitted documents (a sketch of steps 4 and 5 is given after this list); an unsubmission command is used to remove a document from the list of submitted documents. The creation of the SFT.txt file has the advantage of increasing the security of the server, since it makes available only documents that the authors have expressly made public. The SFT may be obtained by any robot that requests it.
  6. After receiving an HTTP request, the server checks to see if the document has been submitted. If it has, a GET response is composed with the whole document or a HEAD response with the information in the .meta file. If the document has not been submitted, or if it has been erased but not unsubmitted, the response is an error message.
  7. The server periodically runs a cron process to update the .meta files and correct irregular situations.
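
A minimal sketch (ours) of steps 4 and 5 above. The file names (SFT.txt, <document>.meta) follow the text; the layout of SFT.txt as one submitted path per line, and the use of a standard HTML parser, are our assumptions.

    import os
    from html.parser import HTMLParser

    class MetaCollector(HTMLParser):
        # Collect (HTTP-EQUIV, CONTENT) pairs from the META tags of a document.
        def __init__(self):
            super().__init__()
            self.pairs = []
        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                d = dict(attrs)                  # attribute names arrive lowercased
                if "http-equiv" in d and "content" in d:
                    self.pairs.append((d["http-equiv"], d["content"]))

    def submit(doc_path, sft_path="SFT.txt"):
        collector = MetaCollector()
        with open(doc_path, encoding="latin-1") as f:
            collector.feed(f.read())             # step 4: parse the document
        if not collector.pairs:
            raise ValueError("missing meta-information: document refused")  # cf. step 2
        with open(doc_path + ".meta", "w") as meta:   # step 4: store next to the document
            meta.writelines(f"{name}: {value}\n" for name, value in collector.pairs)
        with open(sft_path, "a+") as sft:        # step 5: update the SFT
            sft.seek(0)
            if doc_path not in {line.strip() for line in sft}:
                sft.write(doc_path + "\n")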

Since the SFT.txt may include references to files of different formats, the proposed method could also be adapted to index non-HTML files (texts, images, programs) [13] for which a cataloguing procedure has been defined and, consequently, a .meta file created.

So far we have considered the indexing of documents stored on a single server, but this is not the only task of a robot. It must also obtain the list of the external links contained in these documents in order to discover the addresses of further servers to be queried.

To avoid having to parse a document requested by a robot in order to extract its external links, the server must also create and maintain an RSA.txt (Reachable Server Address) file, which collects the addresses of the documents managed by remote servers that are referenced by the local documents. Step 4 of the previous procedure must be modified to include the extraction of the links from the submitted document and the updating of RSA.txt; step 7 must be modified to detect irregular situations that affect RSA.txt when a submitted document is simply removed without being unsubmitted.
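
A minimal sketch (ours) of the additional work in step 4: collecting the external links of a submitted document so that they can be appended to RSA.txt. What exactly an RSA.txt entry contains (full document addresses or only server addresses) is our assumption; here we keep the absolute URLs that point outside the local server.

    from html.parser import HTMLParser
    from urllib.parse import urlparse

    class LinkCollector(HTMLParser):
        # Collect the HREFs of anchor tags that point to other servers.
        def __init__(self, local_host):
            super().__init__()
            self.local_host = local_host
            self.external = set()
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href") or ""
                host = urlparse(href).netloc
                if host and host != self.local_host:    # keep only external links
                    self.external.add(href)

    def external_links(html_text, local_host):
        collector = LinkCollector(local_host)
        collector.feed(html_text)
        return sorted(collector.external)

    print(external_links('<A HREF="http://www.w3.org/pub/WWW/">W3C</A>',
                         "jargo.itim.mi.cnr.it"))
    # ['http://www.w3.org/pub/WWW/']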

3.3 The robots

To fully exploit the potential of the proposed indexing methodology, the request mechanisms of the robots that act as WWW search agents [14,15,16] must be adapted. They formulate GET requests to obtain the SFT table and the RSA list, with which they plan their navigation through the WWW, and HEAD requests to collect the information to store in their indexing structures.

In this way they no longer saturate the bandwidth with requested documents, and they also lessen the load on their own machines, since it is no longer necessary to perform a full-text analysis to extract the characteristic information from each document.

Under this methodology, the robot executes the following steps:

  1. Requests the robots.txt file to check the standard for robot exclusion [17] and, consequently, whether access is permitted
  2. Requests the SFT.txt table, in order to have the complete list of HTML files to index on the WWW server
  3. Requests all the .meta files associated with the documents referred to in the SFT table
  4. Requests the RSA.txt list in order to plan navigation beyond the server
  5. Passes to a new server

If the SFT.txt and RSA.txt files are missing, the robot does not index that particular server. In this way the robots function much more rapidly and obtain results that are distinctly superior to those achievable by conventional means.
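
A minimal sketch (ours) of the per-server loop described above, written against the proposed conventions. The locations of SFT.txt and RSA.txt at the server root, and their layout as one entry per line, are our assumptions; a real robot must also honour the robots.txt exclusion file [17], and error handling is deliberately minimal.

    from urllib.parse import urljoin
    from urllib.request import Request, urlopen

    def fetch(url, method="GET"):
        # Return the reply headers and body for a GET or HEAD request.
        with urlopen(Request(url, method=method)) as reply:
            return reply.headers, reply.read().decode("latin-1", "replace")

    def index_server(base_url, index, to_visit):
        try:
            _, sft = fetch(urljoin(base_url, "/SFT.txt"))    # step 2: submitted files
            _, rsa = fetch(urljoin(base_url, "/RSA.txt"))    # step 4: reachable servers
        except OSError:
            return              # SFT.txt or RSA.txt missing: do not index this server
        for path in filter(None, map(str.strip, sft.splitlines())):
            doc_url = urljoin(base_url, path)
            headers, _ = fetch(doc_url, method="HEAD")       # step 3: meta-information only
            index[doc_url] = dict(headers)                   # store in the indexing structure
        to_visit.extend(filter(None, map(str.strip, rsa.splitlines())))   # step 5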

Besides enhancing the speed and correctness of the indexing procedures, because they would be based on a standard, we believe that implementing the methodology presented here would simplify the code used by indexing agents.

4. Future developments

We believe that the future of cataloguing lies in extending the methodology proposed here for the WWW to more general contents of the Internet. The concept of meta-information must be extended to all types of files and documents, so that the information needed for indexing may be extracted from any type of file.

This will surely mean that the procedure an author follows to catalogue a document must be the same for any type of content, without, however, introducing too many complications. Any operation required of the user must be easy to apply, since the Internet, and the WWW in particular, will increasingly have users without specific technical qualifications.

A HEAD-like response that includes the meta-information should also be implemented for the other protocols used to manage documents on the network, such as FTP, NNTP, and Gopher.

We believe that generalizing the methodology presented here is a step of extreme importance in the direction of a global information system. Regulation of this kind is not only desirable, it is a necessity. The network, daily traversed by innumerable requests advanced by robots, demands it. The quality of the information transmitted demands it: information that cannot be traced cannot be known and remains an end unto itself.

We suggest that the proposed modifications would be a small price to pay for efficient service.

References

  1. M. Ricciardi, "Testi virtuali e tradizione letteraria," in G. Baldissone (ed.), Biblioteca: Metafore e progetti, Angeli, Milan, 1994.
  2. D. Berleant, H. Berghel, "Customizing information: Part 1, Getting what we need, when we need it," IEEE Computer, vol. 27, no. 9, pp. 96-98, 1994.
  3. D. Berleant, H. Berghel, "Customizing information: Part 2, How successful are we so far?" IEEE Computer, vol. 27, no. 10, pp. 76-78, 1994.
  4. J. D. Bolter, The Computer, Hypertext, and the History of Writing, Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1991.
  5. G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill International Book Company, Singapore, 1983.
  6. M. Koster, "Guide for Robot Writers," Nexor Corp., http://webnexor.co.uk/mak/doc/robots/guidelines.html.
  7. Special issue on Intelligent Agents, CACM, vol. 37, no. 7, 1994.
  8. D. Raggett, "Hypertext Markup Language Specification 3.0," work in progress of the HTML working group of the Internet Engineering Task Force, http://www.w3.org/pub/WWW/MarkUp/html3/Contents.html.
  9. T. Berners-Lee, R. Fielding, F. Nielsen, "Hypertext Transfer Protocol," work in progress of the HTTP working group of the Internet Engineering Task Force, http://www.w3.org/hypertext/WWW/Protocols/HTTP1.0/draft-ietf-http-spec.html.
  10. D. Musella, "The META Tag of HTML," draft-musella-html-metaTag-02.txt, work in progress, http://jargo.itim.mi.cnr.it/documentazione/draft-musella-html-metaTag-02.txt, January 1996.
  11. ISO639 Language Codes, "Codes for the representation of names of languages," 1988.
  12. ISO3166 Country Codes, "Codes for the representation of names of countries," 1988.
  13. T. Krauskopf, J. Miller, P. Resnick, W. Treese, "Label Syntax and Communication Protocols," work in progress, http://www.w3.org/pub/WWW/PICS/labels-960303.html, March 1996.
  14. D. Eichmann, "Ethical Web Agents," Proceedings of the Second WWW Conference, Chicago, October 1994.
  15. D. Eichmann, "Advances in Network Information Discovery and Retrieval," submitted to International Journal of Software Engineering and Knowledge Engineering.
  16. O. Etzioni, N. Lesh, R. Segal, "Building Softbots for UNIX (Preliminary Report)," University of Washington, Seattle, November 1992.
  17. M. Koster, "Standard for Robot Exclusion," http://info.webcrawler.com/mak/projects/robots/norobots.html.