Louis-Claude Paquin, Ph.D., Professor
Associate Member of the UNESCO-BELL Chair
for Communications and International Development
Université du Québec à Montréal
This paper offers a solution to the problem of information overload caused by the massive flow of documents on the Internet. The volume keeps increasing to the point where it is now calculated in terabytes (one page equals 2 kilobytes). Our ability to process and absorb information, on the other hand, remains limited, resulting in an exformation phenomenon (i.e., an accumulation of information left unprocessed for want of time and skill). Worse still, the increase in the volume of information can lead to a narrowing of focus, a tunnel effect. Of course, many systems are available to help locate documents through keywords. Some, like Lycos, are remarkably effective, considering the vast volume of material they cover. However, this ease of access to documents obscures the difficulty of finding relevant information. How can one be sure that all relevant documents have been scanned? Why are so many irrelevant documents ferreted out?
Analytical processing is unavoidable. If the operation is not carried out before the documents are circulated, it has to be done by users who must keep reformulating their request and scanning vast amounts of extraneous material to achieve a result that is not necessarily satisfactory. Information scientists agree on the need to prepare the documents to be circulated by marking their logical structure (parts, sections, subsections, etc.) and conceptual indexing. Both forms of document preparation have until now been carried out semi-manually by professionals, entailing substantial costs and delays.
The Information Communication Assistance System (ICAS), which we are now developing, is a workflow, i.e., a set of computer procedures designed to speed up and support human decisions in the low-cost processing of existing documents to allow satisfactory access to their content. ICAS helps perform two operations: (1) converting the word processor's typographical codes (character type, paragraphs, indents, tables, etc.) into information on the documents' logical structure and (2) revealing the terminology used in such documents, i.e., all terms (often made up of several words) designating concepts in the reference field.
The identification and marking of the documents' logical structure are achieved by the following modules:
The approach used to reveal terminology can be called "mixed" because it combines the brute force of a statistical co-occurrence detection algorithm, which tracks the frequency of certain word groups within a given documentary space, with a restriction based on morphological categories. The expressions are arranged in ascending order, from the shortest to the longest. A hypertext-type interface makes it possible to navigate from one term to another and to forward the selected term to the server's search module.
In short, to allow for the consistent, faster preparation of documents by less skilled users, ICAS uses the techniques of uncertain reasoning, supervised learning, direct-manipulation graphic interfaces, and natural language processing. This research and development project, estimated at $1.5 million, has been granted a $500,000 subsidy by the Quebec government. Here is a detailed description of the ICAS modules and their underlying assumptions:
The converter of the binary formats characteristic of word processors feeds the processing chain. The converter's function is to standardize the format of any given product and make it legible to the analyzer. The format resulting from the conversion of a binary file is an ASCII file with SGML markers. The conversion applies to both accented characters and typographic codes: justification, bold, italics, tabs, indents, font and size changes, etc. This module also converts tables by turning the codes of the matrix structure (cells, lines, columns) into SGML. It converts vectorial formats (diagrams and graphs) into bitmap images after extracting and marking out character strings for location purposes. The module is based on the following assumptions: (1) An SGML conversion of the codes mentioned above can be achieved without any loss of information, (2) a common, program-independent markup makes it possible to standardize all subsequent processing, and (3) the character strings contained in the first line and the first column of tables, as well as in graphs and diagrams, provide a very good indication of the concepts processed.
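The conversion step can be pictured as a token-rewriting pass over the word processor's output. The sketch below is a hypothetical illustration, assuming a tokenized input of typographic codes and character data; the SGML tag names are invented for the example and are not the actual ICAS markup.

```python
# Hypothetical sketch of the typographic-code converter. The input is assumed
# to be a list of (kind, value) tokens; the SGML tags are illustrative only.

SGML_MAP = {
    "para": "<P>",
    "bold.on": "<HI REND='bold'>", "bold.off": "</HI>",
    "italic.on": "<HI REND='italic'>", "italic.off": "</HI>",
}

def convert(tokens):
    """Replace typographic-code tokens with SGML markers; pass text through."""
    out = []
    for kind, value in tokens:
        if kind == "code":
            out.append(SGML_MAP.get(value, ""))
        else:  # plain character data
            out.append(value)
    return "".join(out)

tokens = [("code", "para"), ("code", "bold.on"), ("text", "Heading"),
          ("code", "bold.off"), ("text", " body text")]
print(convert(tokens))  # <P><HI REND='bold'>Heading</HI> body text
```

The point of the pass is assumption (2) above: once every product's codes are rewritten into one common markup, all downstream modules can ignore where the file came from.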
The use of SGML to mark out a document requires a document type definition (DTD) specifying the marks and their syntax for a given class of documents. Thus, a DTD will indicate that the document includes parts, that the parts include sections, and that the sections include an optional heading and subsections, which in turn include paragraphs. Designing a DTD is a complex operation that calls for training in formal languages and textual analysis. It is a major obstacle, if not to the use of SGML, at least to the maximization of its potential. Most of the time, DTDs are arrived at by altering existing DTDs. DTDs are very difficult to read and understand because they are made up of a series of definitions, first of simple components, then of increasingly complex components built from less complex ones.
The idea behind ICAS is not to design a module that would make any DTD applicable to any type of document structure or component. The only structure we are interested in is the logical structure of the document, i.e., its subdivision into parts, sections, subsections, etc. The scope of the problem is thus reduced to proportions that make it possible to draw up a list of relevant categories and design an assistance module for creating a DTD for the logical structure of documents. This module includes the following elements: a direct-manipulation graphic interface to identify the document's various structural components, a generator of conversion rules, and a compiler of DTD component definitions.
The graphic interface displays an ASCII file whose typographic codes bear SGML markings. The user merely has to select a part of the document corresponding to a component and identify its category and attributes with the help of menus. The attributes determine whether the component is compulsory or optional, unique or repeatable. Thus, a section may or may not include a heading, but this heading is always unique. The categories already identified are displayed in the left margin for confirmation. This identification operation proceeds from the simplest components to increasingly complex ones. From this categorization, two processing operations are carried out: the composition of the DTD and the basic rules for converting typographic codes into components of the logical structure of the document. Here is an example of the conversion rules: A stand-alone character string followed by a paragraph is a subheading; a stand-alone bold character string followed by a paragraph or a subheading is a section heading.
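The two conversion rules quoted above can be encoded, for illustration, as a small classification function. The segment attributes (`standalone`, `bold`) and the category names are assumptions made for this sketch, not the module's actual rule language.

```python
# Illustrative encoding of the two example conversion rules. The feature
# names and category labels are invented for this sketch.

def classify(segment, next_category):
    """Assign a logical-structure category from typographic features."""
    # Stand-alone bold string followed by a paragraph or a subheading:
    if segment["standalone"] and segment["bold"] and \
            next_category in ("paragraph", "subheading"):
        return "section-heading"
    # Stand-alone string followed by a paragraph:
    if segment["standalone"] and next_category == "paragraph":
        return "subheading"
    return "paragraph"

print(classify({"standalone": True, "bold": True}, "paragraph"))   # section-heading
print(classify({"standalone": True, "bold": False}, "paragraph"))  # subheading
```

Ordering matters here: the more specific rule (bold) is tried first, mirroring the way the learning module builds rules from increasingly complex components.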
The design of this module is based on three assumptions: (1) The technique of learning by example is the best strategy to facilitate the development of DTDs and the construction of the rules converting typographic codes into the logical structure, (2) a direct-manipulation graphic interface is the most effective tool for learning by example, and (3) it is possible to build a DTD compiler from a formal syntax.
The body of documents to be distributed makes up a documentary space. This space can be ordered by a classification scheme that helps in accessing document data. To classify documents, this module is equipped with a direct-manipulation graphic interface allowing the registration of a new document in the documentary space, which is represented as a tree. This module can also be used to reorganize an existing documentary space. Documents are represented by one type of icon and the intermediate levels of the tree by another. Possible editing operations are the insertion and deletion of a document, the creation and deletion of an intermediate level, and the moving of documents or intermediate levels. Thus, when a subgroup of documents, represented by a portion of the tree, is selected and moved, all documents belonging to this subgroup are selected and moved along with it. Focusing operations facilitate the handling of the tree: backward zooms hide the lower part of the tree to offer an overall view, and forward zooms, conversely, allow a close view of part of the tree. The basic assumption of this module is that taxonomies are easier to grasp and manage in graphic form.
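The editing operations on the tree can be sketched as follows. The node structure and function names are illustrative, assuming only what the text states: moving an intermediate level carries all the documents under it.

```python
# Minimal sketch of the documentary-space tree; names are invented for the
# example. Moving a level moves every document that hangs under it.

class Node:
    """A level of the tree, or a document when it has no children."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def detach(root, target):
    """Remove target from wherever it hangs in the tree; return success."""
    if target in root.children:
        root.children.remove(target)
        return True
    return any(detach(child, target) for child in root.children)

def move(root, subtree, new_parent):
    """Reattach a subtree (and all documents under it) to a new level."""
    if detach(root, subtree):
        new_parent.children.append(subtree)

contracts = Node("contracts", [Node("doc1"), Node("doc2")])
root = Node("space", [contracts, Node("regulations")])
move(root, contracts, root.children[1])      # file "contracts" under "regulations"
print([c.name for c in root.children])       # ['regulations']
```

The zoom operations described above would then be view-level filters on this same structure, hiding or revealing subtrees without editing them.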
This module automatically performs three operations on the documents: (1) segmenting the components of the logical structure, (2) identifying and marking the components of the logical structure, and (3) identifying references. These operations are carried out on the basis of the DTD and the rules, produced by the learning module, for converting typographic codes into components of the logical structure. It should be recalled that the conversion rules take typographic codes as their premise: If a change in character type or size or a negative indent is noted, the segment will be marked as a "subheading."
The assumption at the basis of this module is that the use of uncertain reasoning techniques allows for tolerance in triggering the conversion rules while providing an indication of "certainty" that will guide human intervention for later corrections. Thus, the analyzer is an expert system equipped with cumulative certainty, enabling it to work in "background noise" situations, i.e., when the typographic signs necessary to detect a component of the structure are not all there or when several conflicting interpretations are plausible. An example of a "background noise" situation is a segment of a line and a half. It can be a paragraph if subheadings do not include numeric identifiers, use the same font as the rest of the text, and start on the next line like paragraphs. Expert systems handling uncertain knowledge can qualify their diagnosis of such situations and assist manual disambiguation. Thus, in the previous example, the expert system would propose a heading with a 75 percent degree of certainty and a paragraph with a 50 percent degree of certainty. With an expert system, it is also possible in such conflicts of interpretation to formulate metarules that take into account past decisions, probabilities of occurrence, etc.
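The cumulative certainty described here resembles the classic MYCIN-style combination of certainty factors, a standard technique in uncertain reasoning. The sketch below assumes that style of combination; the rule certainties are invented for the example and are not taken from the ICAS knowledge base.

```python
# Sketch of MYCIN-style cumulative certainty (an assumption about the
# combination scheme; the individual certainties below are illustrative).

def combine(cf_a, cf_b):
    """Combine two positive certainty factors supporting the same hypothesis."""
    return cf_a + cf_b * (1 - cf_a)

# Two independent typographic cues each suggest "subheading":
cf = combine(0.6, 0.4)   # e.g., a font change, then a negative indent
print(round(cf, 2))      # 0.76
```

Each competing hypothesis (heading, paragraph, etc.) accumulates its own certainty; the analyzer can then mark the winning interpretation while flagging close conflicts, like the 75/50 case above, for manual disambiguation.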
Here is a diagram showing the operations performed by the analyzer:
In addition to segmenting the document into components of the logical structure and identifying and marking them, the analyzer tracks down references. The tracking is done through rules, also evolved from learning. This learning is achieved by selecting expressions indicating a reference. The expressions are converted into patterns that the analyzer applies to the document. For instance, the word "article" or its abbreviation "art.," in italics, followed by a number, indicates a reference to laws, regulations, contracts, or collective agreements. As with the components of the logical structure, the accumulation of certainty ratios allows the detection of potential references in "background noise" situations. In addition to being marked, the references detected are recorded in a table that will allow a link to be established once the segment corresponding to the reference has been processed.
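A learned expression such as "article"/"art." followed by a number can be compiled into a pattern along the following lines. This is a hypothetical sketch: the italics cue carried by the SGML markup is omitted for simplicity, and the sample sentence is invented.

```python
import re

# Hypothetical pattern compiled from the learned reference expression
# "article"/"art." followed by a number (the italics condition is omitted).
REF = re.compile(r"\b(?:article|art\.)\s+(\d+)", re.IGNORECASE)

text = "As provided in art. 12 and Article 47 of the agreement ..."
print(REF.findall(text))  # ['12', '47']
```

Each match would then be entered into the reference table with its position, so that the link can be resolved once the target segment has itself been processed.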
This interactive module's function is to validate the components of the logical structure identified and the markers attributed to them in the automatic analysis performed previously. Its function is also to add indications, which could not be included automatically, giving access to data. These indications are associative links with other segments of the same document or other documents in the same documentary space.
This module's interface is designed to facilitate human intervention, as much through ergonomics adapted to the task as through contextual help. The interface includes a segment display area with a scroll bar to validate the segmentation. If the segmentation is unsatisfactory, the user can merge several segments into one or break up a segment into several components through block selection. In the latter case, the analyzer is applied to the new segments to edit the marking. In addition, the references detected are highlighted in the segment text, and the user can either invalidate them or add new ones. The other areas of the interface are devoted, respectively, to a sequential segment identifier used for addressing links that may have been established with other segments, to the segment's level in the document structure, to the heading of the segment, and to the identifiers of the linked segments.
The segment heading, if present and detected through typographic signs in the document, is displayed in the area assigned for this purpose. Otherwise, the area is left blank. The user can accept the heading detected as is or change it. It is important to note that, most of the time, the headings given to document subdivisions are ambiguous and incomplete. They are usually chosen for the sake of concision within the context of a linear reading of the paper document. The subheadings given to segments therefore need to be supplemented with the subheadings and headings of sections, sometimes even with the headings of parts. The interface includes options enabling the user to add to the segment heading one or more headings of higher levels in the structure. It should be noted that when a heading is changed, the original heading is preserved for the integrity of the archive.
Thus, the table of contents offered upon accessing the information is made up a posteriori and does not correspond, in most cases, to the table of contents of the paper version. In addition to the compositionality of headings discussed earlier, the tables of contents of paper documents very seldom reach down to the segment level. They indicate broad divisions and leave it to the reader to peruse them to find the segments.
A final area contains the identifiers of the other segments linked to the segment on display. These links are established individually by the user. To do this, the validation module offers the facilities of a document server to inquire and navigate in the documentary space being built. The user can make an inquiry either from the terminology (cf. following modules) or from the table of contents. Following an examination of the passages located, the user can establish a link between certain segments and the segment being checked.
The assumptions at the basis of this module are the following: (1) An entirely successful automatic text analysis is impossible, (2) it is much easier and faster to perform a check through a direct-manipulation interface, (3) the headings of the components of a paper document must be reviewed for electronic distribution, and (4) associative links cannot be established automatically and satisfactorily on the basis of identical character strings unless the conceptual relation is very strong.
The purpose of this module is to offset the shortcomings of word-based searches on a document server, which produce "silence" (relevant segments missed) and "noise" (irrelevant segments retrieved), without resorting to complex solutions such as conceptual indexing. The underlying assumptions are the following: (1) A terminology survey listing multiterms enables the user to restrict "noise" when searching a server from the words of the document, (2) a mixed strategy for locating multiterms, based at once on a calculation of the rate of occurrence and on a restriction of morphological categories, is more effective, and (3) only the user is in a position to validate the relevant multiterms.
The preferred solution lies in a survey of the terminology present in a given documentary space. By terminology, we mean all the terms designating concepts in the field of reference. There are two types of terms: uniterms and multiterms. Uniterms, which correspond to a single word, are the least frequent; because they serve to form many multiterms, they generate "noise" when used to locate information. Multiterms are expressions composed of several words whose combined meaning differs from their individual meanings. The expression "word processor" is a good example of a multiterm.
The strategy employed in this original approach can be called "mixed" in that it is based on the brute force of a co-occurrence algorithm to which a restriction of morphological categories is applied. Co-occurrence is the sequential or simultaneous repetition of a certain group of words in a given documentary space. For every pole word in this documentary space, the co-occurrence algorithm finds all the expressions in which the word appears at a frequency exceeding an arbitrarily set threshold. The expressions are ordered from the shortest to the longest. The results of a co-occurrence algorithm are very bulky, since the same expressions come back for each of the words forming them. Thus, an expression containing the pole words "holidays," "right," and "taken" will be listed under each of those three words.
All multiterms are tracked down by a co-occurrence algorithm. However, they are not the only expressions tracked down. In fact, multiterms form but a low percentage of the expressions detected. They can be regarded as co-occurrences that carry a complete meaning by themselves. This is why restrictions must be applied to the co-occurrences detected. The first restriction is applied at detection time through the use of an antidictionary that prevents the creation of lists of expressions for words that cannot designate a concept, such as conjunctions, articles, and other grammatical words. The effect of this restriction is to reduce considerably the calculation time and the bulk of the results. The second restriction aims at eliminating pole words that do not designate concepts as well as insignificant or incomplete occurrence contexts. To do this, the results are screened by a morphological analyzer. All words that are not common or proper nouns are excluded from the pole position. Then the expressions whose formation is incomplete or inappropriate are excluded, i.e., those whose initial word's morphological category is neither a noun nor an adjective or whose final word's morphological category is not a noun, an adjective, a participle, or an infinitive.
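A toy version of the mixed strategy, reduced to bigrams, might look like the following. The antidictionary here stands in for both restrictions, since a real morphological category check would require a tagger; the word list, threshold, and sample text are invented for the example.

```python
from collections import Counter

# Toy sketch of the mixed strategy: brute-force bigram co-occurrence counts,
# filtered by an antidictionary of grammatical words (standing in for the
# morphological-category restriction, which would need a real tagger).

ANTIDICTIONARY = {"the", "of", "a", "and", "to", "is"}

def candidate_multiterms(words, threshold=2):
    """Return bigrams frequent enough and free of grammatical words."""
    pairs = Counter(zip(words, words[1:]))
    return sorted(
        " ".join(pair)
        for pair, freq in pairs.items()
        if freq >= threshold and not (set(pair) & ANTIDICTIONARY)
    )

text = ("the word processor saves the file and the word processor "
        "formats the file").split()
print(candidate_multiterms(text))  # ['word processor']
```

Even on this toy scale the two effects described above are visible: the antidictionary discards frequent but meaningless pairs like "the file," and the threshold discards one-off sequences like "processor saves."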
The proposed approach is completely automatic and, therefore, does not entail any labor cost. This approach would be unsatisfactory if it were aimed at making up a list of terms in the field because other factors must be taken into account to select well-formed terms. However, it is particularly well adapted to the formulation of a data search on a document server because it enables the user to restrict the "noise" (i.e., irrelevant segments) at will. Indeed, choosing a short multiterm will result in detecting a larger number of segments while choosing a long multiterm will focus the search.
The result of the previous terminological analysis is an alphabetical list of terms, called main expressions or headers. A list of occurrence contexts, called subordinate expressions, is attached to each of the main expressions. The terminology navigation module provides an interface that facilitates the handling of the list of main expressions and the list of subordinate expressions attached to them. The screen is divided into two windows, each including a scroll bar to locate the desired expression. The main expressions are displayed permanently in the left window while the subordinate expressions of the main expression selected appear in the right window.
To quickly locate a main expression, the user keys in the initial letters of the expression. When the main expression is located, it is displayed in the left window and the list of subordinates of this expression is displayed in the right window. In the subordinate expression, the main expression is replaced by two dashes. While viewing the various subordinate expressions, the user can, even before formulating a request, make a restriction that will save time in validating the segments obtained following a request to the server. When a subordinate expression is selected, it is displayed in the lower area. To launch a request to the server for all segments containing the selected subordinate expression, the user has only to validate the selection.
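The keyed lookup of main expressions can be sketched as a prefix search over a sorted list. The sample expressions and subordinate lists below are invented for the illustration, as is the two-dash convention shown in the data.

```python
import bisect

# Sketch of the keyed lookup: main expressions are kept sorted, and typing
# the initial letters jumps to the first matching entry. The entries and
# subordinate expressions are invented examples.

MAIN = sorted(["collective agreement", "word processor", "work flow"])
SUBORDINATES = {"word processor": ["-- format conversion", "typographic codes of the --"]}

def locate(prefix):
    """Return the first main expression starting with prefix, plus its subordinates."""
    i = bisect.bisect_left(MAIN, prefix)
    if i < len(MAIN) and MAIN[i].startswith(prefix):
        return MAIN[i], SUBORDINATES.get(MAIN[i], [])
    return None, []

head, subs = locate("word")
print(head)  # word processor
```

Because the list is sorted, `bisect_left` lands directly on the first candidate, which matches the described behavior of jumping to the expression as the user types its initial letters.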
The interface also makes it possible to select in the lower area displaying the full subordinate expression a term that will become the main expression and whose subordinate expressions will appear in the right window. This navigation tool enables the user to enter the terminological universe of a given corpus of documents. Likewise, the user can access information whose exact terminology may not be known. This module is based on the assumption that, language being very productive, it is very difficult to predict a priori the multiterms of documents that will be deemed relevant. It is better then to have a tool that facilitates navigation through a list of candidates and to have this tool be handled by the user, who searches the information and who can recognize the multiterms of interest.
The server feed module operates at several levels and deals with several types of documents, all marked in SGML: document segments, following validation; the tables of contents of documents, following the validation of all their segments; the classification plan of the documentary space, after all the documents have been processed; and the terminology of the documentary space, likewise after all the documents have been processed. Once the indications added to a segment by the analyzer have been validated to the user's satisfaction and links have been added, the segment is added to the server and its heading is added to a table of contents. Once an entire document has been processed, its table of contents is added to the server. Once all documents have been registered in the classification plan, analyzed, and validated, the classification plan is added to the server. Likewise, once the terminology has been extracted, the result is added to the server.
Here, in conclusion, is a diagram showing the ICAS modules and