Christian Fluhr <firstname.lastname@example.org>
Dominique Schmit <email@example.com>
Commissariat à l'Énergie Atomique
This paper describes the architecture and working principles of a system that can index documents (and parts of documents) in several languages and is able to retrieve information using a query in a single language. The system is based on automatic linguistic processing of both documents and queries; general unilingual and bilingual reformulation rules are used to reduce the gap between user needs expressed by the query and document contents. The system includes an interface between HTTP and SPIRIT servers and is accessible through standard WWW clients. The design methodology and architecture details of this type of interface are also presented.
Keywords: information retrieval systems, multilingual full-text databases, cross-lingual interrogation, language engineering, distributed full-text databases on the Internet, database gateways.
Many full-text corpora consist of documents in more than one language. This is the case for scientific and technical documents from non-English-speaking countries and for documents in full-text databases that are generally in at least two languages. It is also the case for countries that have several official languages (Canada, Belgium, Switzerland, Tunisia, European Union, etc.). On the Internet, interesting information can be found on pages not written in English, and it is necessary to have systems that can effectively index texts in several languages and that enable users to search through texts of various languages in their mother tongue.
From 1990 to 1994, a European ESPRIT project, European Multilingual Information Retrieval (EMIR), demonstrated the feasibility of cross-lingual interrogation. The languages accepted were French, English, and German. Russian is now being implemented with UNESCO funding and Dutch will follow, with Belgian funding. From 1994 to 1996, the results of this project were integrated into a commercial client-server information retrieval system, Syntactic and Probabilistic Indexing and Retrieval of Information in Texts (SPIRIT).
In 1996, an interface was developed between a WWW server and a SPIRIT server so that SPIRIT databases could be queried through a standard WWW browser. Initially, the databases were in different languages. The language of each database had to be homogeneous and multibase interrogation was not possible.
Because many of our databases are bilingual (English and French) and because single documents often appear in several languages (e.g., summaries in two languages), we have developed a prototype of a distributed cross-lingual information retrieval system based on SPIRIT-W3 architecture, which has been generalized for this use.
SPIRIT is based on linguistic processing that is performed both on database texts and on queries. This processing is robust and not domain-dependent. The level of linguistic parsing is morphosyntactic. Morphological analysis is simple because of the use of a full-form dictionary. Syntactic analysis is divided into a disambiguation tool (tagger) and a dependency relation analysis. After the morphosyntactic analysis is done, words and compounds are normalized and stop words are eliminated according to their part of speech.
Statistical processing is applied to the input texts once the linguistic processing is complete. It assigns a weight to each normalized word or compound. This weight reflects the information carried by the word, that is, how helpful the word is for selecting the relevant documents. The weight is used during interrogation to rank the documents according to their semantic proximity to the query.
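The weighting and ranking idea can be illustrated with a minimal sketch. The paper does not give SPIRIT's actual formula, so the inverse-document-frequency weight used here is only an assumption chosen because it rewards rare terms; the function names `term_weights` and `rank` are hypothetical.

```python
import math


def term_weights(doc_terms):
    """Assign a weight to each normalized term.

    doc_terms: list of sets of normalized terms, one set per document.
    Assumption: an IDF-style weight stands in for SPIRIT's own statistics.
    """
    n_docs = len(doc_terms)
    doc_freq = {}
    for terms in doc_terms:
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Terms found in few documents carry more information.
    return {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}


def rank(query_terms, doc_terms, weights):
    """Rank documents by the summed weight of the terms shared with the query."""
    scored = []
    for doc_id, terms in enumerate(doc_terms):
        score = sum(weights[t] for t in query_terms & terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document number.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored]
```

Documents that share no term with the query are left out of the ranking rather than listed with a zero score.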
The reformulation tool is used during the query processing. The role of this tool is to reduce the gap between the user vocabulary contained in the query and the vocabulary used to convey the same ideas in the database. This tool is also used to make cross-lingual reformulation. The reformulation has a large domain-independent element which can be used in all types of databases. It is also possible to add domain reformulation knowledge as a supplement. Many inferred words are not relevant but the comparison mechanism that gives priority to documents having the greatest intersection with the query is generally sufficient to choose, for example, between different translations. The last tool is the intersection evaluator that identifies the concept intersection between the query and the texts and computes a weight for this intersection.
We will describe the functions of the different parts of SPIRIT, which link a query to relevant parts of a text and minimize the links to irrelevant ones. Suppose that the query contains the French phrase "le décollage de la fusée" and some documents contain the phrase "the rocket lift-off." In processing the document, the linguistic parsing step recognizes "rocket" as a noun, and "lift-off" as a noun. If "lift off" has been entered into the idiom dictionary as a single word with the normalized form "lift-off," all occurrences ("lift-off", "lift off ", and even "liftoff") are considered the same concept. In this case, the compound "rocket lift-off" has also been recognized.
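The idiom normalization described above can be sketched as follows. The table `IDIOM_VARIANTS` and the function `normalize_tokens` are hypothetical stand-ins for the idiom dictionary; the real SPIRIT dictionary format is not described in the paper.

```python
import re

# Hypothetical idiom dictionary: surface variant -> normalized form.
IDIOM_VARIANTS = {
    "lift off": "lift-off",
    "liftoff": "lift-off",
}


def normalize_tokens(text):
    """Lowercase the text, map idiom variants to one form, and tokenize."""
    text = text.lower()
    for variant, canonical in IDIOM_VARIANTS.items():
        # Word boundaries prevent rewriting inside longer words.
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    # Hyphenated forms such as "lift-off" are kept as a single token.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text)
```

With this table, "the rocket lift-off", "the rocket lift off " and "the rocket liftoff" all yield the same token list, so the three spellings index the same concept.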
The query is processed by the same linguistic processing, now using the French language environment. In "le décollage de la fusée," "décollage" is clearly a noun from the syntactical point of view. "Fusée" can be a noun or a past participle of the verb "fuser." Tagging solves the problem because in this context, "fusée" can only be a noun. Of course, the compound "décollage de la fusée" has also been recognized.
Now comes the reformulation process. "Décollage" as a noun can be translated as take-off, lift-off, unsticking; "fusée" as a noun can be translated as rocket, missile, fuse, spindle, stub axle. Because of the tagger, all translations compatible with the part of speech "verb" are eliminated. That means that to burst forth, to gush, to spurt out, to fly out, to stream out, etc. are not proposed for searching in the target language documents.
As you can see, there are several translations and some are semantically incorrect. To eliminate ambiguity, we use the database itself. In the first step, translations that are not in the database index are eliminated. In this way, translations that are not relevant to the domain tend to be eliminated. That is probably the case for unsticking as the translation of "décollage" if the database deals with aeronautics and space. It is probably also the case for fuse, spindle, and stub axle as translations of "fusée." Filtering by the database index is not sufficient to eliminate all incorrect translations, especially if the database content is not technology-specific but covers a wider scope. The second way of filtering out the wrong translations is by the use of words and compounds contained in the "best" documents. This filtering works well if the query has an answer in the database. It is sometimes difficult to be sure that the document that has the best intersection with the query (i.e., the intersection with the best weights) is a relevant document. In these cases, we assume that the intersection between the query concepts and the documents is sufficient, and we only keep the translations that are in these best documents.
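The first filtering step (part of speech, then database index) can be sketched as below. The `TRANSFER` table reuses the translations quoted in the text; its layout, the function name `candidate_translations`, and the sample index are assumptions made for illustration only.

```python
# Hypothetical transfer dictionary keyed by (lemma, part of speech).
TRANSFER = {
    ("décollage", "noun"): ["take-off", "lift-off", "unsticking"],
    ("fusée", "noun"): ["rocket", "missile", "fuse", "spindle", "stub axle"],
    ("fuser", "verb"): ["burst forth", "gush", "fly out", "stream out"],
}


def candidate_translations(tagged_query, index_terms):
    """Return, per query lemma, the translations that survive both filters.

    tagged_query: list of (lemma, part_of_speech) pairs produced by the tagger.
    index_terms: set of normalized terms present in the target database index.
    """
    result = {}
    for lemma, pos in tagged_query:
        # Looking up by (lemma, pos) discards translations of other readings,
        # e.g. the verb "fuser" when "fusée" was tagged as a noun.
        candidates = TRANSFER.get((lemma, pos), [])
        # Translations absent from the database index are dropped.
        result[lemma] = [t for t in candidates if t in index_terms]
    return result
```

For an aerospace database whose index contains "rocket", "missile", "lift-off" and "take-off" but not "unsticking" or "spindle", only the first four terms would be kept.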
For example, if there are documents that have one or more translations of each original query word, we can consider these translations the best ones. In our example, we have two words but these words are linked in a dependency relation, which means they form a compound. Of course, if we can establish that the same dependency relation exists in the source language and in the target one, it will confirm that the translation is the best one.
This is not so easy because word order is not always the same in the source and target language. It is necessary to use transformation (or reordering) rules to produce "rocket lift-off" from "décollage (de la) fusée". In this way, the cross-lingual reformulation process follows SPIRIT. If we suppose that the best documents only contain "rocket lift-off," it is now possible to use this feedback to increase the precision of the answer by searching only with the best translations: "rocket" and "lift-off" and "rocket lift-off." In some cases, this kind of filtering is too drastic. It can eliminate translations that are synonyms of the best translations and that are correct from the semantic point of view. That is why it is possible at this level to make a unilingual target language reformulation to reintroduce "missile", for example, as a possible good keyword to search.
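Filtering by the best documents can be sketched as follows. This is a simplified illustration, not SPIRIT's algorithm: "best" is taken here to mean the documents covering the largest number of query words, whereas the paper ranks intersections by weight. The function name and data layout are hypothetical.

```python
def filter_by_best_documents(candidates, documents):
    """Keep only the translations that occur in the best-matching documents.

    candidates: dict mapping each source word to its candidate translations.
    documents: dict mapping a document id to its set of normalized terms.
    Assumption: best documents are those covering the most source words.
    """
    def coverage(terms):
        # Number of source words with at least one translation in the document.
        return sum(
            1 for translations in candidates.values()
            if any(t in terms for t in translations)
        )

    best = max((coverage(terms) for terms in documents.values()), default=0)
    if best == 0:
        # The query has no answer in the database, so nothing can be filtered.
        return [], {word: list(ts) for word, ts in candidates.items()}

    best_docs = [d for d, terms in documents.items() if coverage(terms) == best]
    kept = {
        word: [t for t in ts if any(t in documents[d] for d in best_docs)]
        for word, ts in candidates.items()
    }
    return best_docs, kept
```

A unilingual reformulation step could then be run on the kept translations to reintroduce synonyms such as "missile".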
This filtering by the best documents can also be used in the case of double translation. That means translation from language A to B and then B to C. This is important from a cost point of view. It is not possible to expect that there will be transfer dictionaries from any language to any other one. Transfer dictionaries from Finnish to Greek can be hard to find but Finnish to English and English to Greek can be found.
It is very important to have a strong filtering in the pivot language (English in the preceding example) because the remaining wrong translations may result in retrieval of many irrelevant documents. It is important that a relevant document be found in the pivot language in order to permit the filtering to work. This means that the pivot language must be chosen for each database according to the existence of the maximum number of documents in the pivot language covering the largest subject field.
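The effect of filtering in the pivot language can be shown with a small sketch. The two dictionaries below (French to English, English to German) are illustrative entries only, and `pivot_translate` is a hypothetical helper; the set `pivot_filter` stands for the pivot-language terms that survived filtering by the best documents.

```python
# Illustrative transfer dictionaries (assumed entries, not SPIRIT data).
FR_TO_EN = {"fusée": ["rocket", "fuse", "spindle"]}
EN_TO_DE = {
    "rocket": ["Rakete"],
    "fuse": ["Sicherung", "Zünder"],
    "spindle": ["Spindel"],
}


def pivot_translate(word, src_to_pivot, pivot_to_target, pivot_filter=None):
    """Translate from language A to C through pivot language B.

    pivot_filter: optional set of pivot terms to keep; None disables filtering.
    """
    pivot_terms = src_to_pivot.get(word, [])
    if pivot_filter is not None:
        pivot_terms = [p for p in pivot_terms if p in pivot_filter]
    targets = []
    for pivot_term in pivot_terms:
        for target in pivot_to_target.get(pivot_term, []):
            if target not in targets:  # keep order, drop duplicates
                targets.append(target)
    return targets
```

Without the filter, every wrong pivot translation multiplies into further wrong target translations, which is why strong filtering in the pivot language matters.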
The use of the system follows this typical scenario:
Such a scenario is very interactive. In practice, to cope with network latency in particular and to be consistent with other Web applications, Steps 3 and 4 are combined and a partial list of documents is also shown in Step 4.
In order for WWW clients to gain access to SPIRIT servers, we had to solve two categories of problems: a gateway problem of mapping a session-oriented protocol (like SPIRIT) to a stateless-oriented protocol like HTTP, and an application development (or customization) problem in order to be able to integrate the database with a Web service and quickly add new sources.
The overall program architecture is a widely used one, known as a "three-tier architecture": an HTTP daemon, a set of CGI programs that forms SPIRIT-W3, and native database services through a SPIRIT server (and possibly a SQL service by Sybase).
The problem of gateways has already been tackled by numerous authors (e.g., Perrochon, Salomon, and Barta) and vendors. We present the results of our experience. First, the different states of the application and associated database engines have to be carefully analyzed in order to estimate what the accessible state variables are, what the sizes involved in a state are, what the cost (in terms of CPU and/or bandwidth) to go from one state to another is, and what the limits in terms of storage space for the states are (e.g. maximum number of open connections, maximum size of URL, and disk space). This analysis is greatly facilitated if you can access the database engine through a (sufficiently detailed) application programming interface.
Then transitions between states must be checked in different scenarios. Modality is very difficult to achieve with a browser (or a combination of browser and cache), so the application (on the server side) must be prepared to handle all sorts of requests. Three cases encountered are: a "normal" transition (e.g., going from a list of concept intersections to a list of documents), an "abnormal" transition but one in which a "graceful" recovery can be achieved (e.g., by ignoring user preferences after a time out), and an "error" transition in which the transition requested cannot be achieved and the user is directed to an initialization state. At the end of this analysis you can specify the role of each part of the application.
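The three transition cases can be sketched as a small server-side check. The state names, the `ALLOWED` table, the session layout, and the timeout value are all hypothetical; they only illustrate how a normal, a recoverable, and an error transition might be told apart.

```python
from enum import Enum


class Outcome(Enum):
    NORMAL = "normal"
    RECOVERED = "recovered"  # abnormal, but handled gracefully
    ERROR = "error"          # user is sent back to the initial state


# Hypothetical table of transitions the application accepts.
ALLOWED = {
    "init": {"query_form"},
    "query_form": {"concept_classes"},
    "concept_classes": {"document_list"},
    "document_list": {"document", "concept_classes"},
    "document": {"document_list"},
}


def handle_request(session, requested_state, now, timeout=1800):
    """Classify a request and update the session accordingly.

    session: dict with keys 'state', 'last_seen', 'preferences',
             or None when the server no longer knows the session.
    """
    if session is None:
        return Outcome.ERROR, "init"
    if requested_state not in ALLOWED.get(session["state"], set()):
        session["state"] = "init"
        return Outcome.ERROR, "init"
    if now - session["last_seen"] > timeout:
        # Graceful recovery: serve the page but ignore stale preferences.
        session["preferences"] = {}
        outcome = Outcome.RECOVERED
    else:
        outcome = Outcome.NORMAL
    session["state"] = requested_state
    session["last_seen"] = now
    return outcome, requested_state
```

Because the browser can replay any URL at any time, the check runs on every request rather than relying on the order in which pages were served.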
In our case, we chose to develop one CGI program for each step of a session (choosing a database, submitting a query form, getting lists of results, displaying a document, etc.). This simplifies development and handling of transitions. Next, one has to choose (or combine) the various means that can be used to keep state information:
We needed to be able to tailor the presentations (and navigation scenarios) of several databases and to publish them in a short time. Therefore, we chose a presentation-driven model to configure a SPIRIT-W3 application. Each module that makes up the application is configured by a file written in an extension to HTML. All HTML (and even embedded scripts) can be used to define page layout. We added new tags for the verbs the language needed. These tags follow SGML constraints, so by extending a common DTD for HTML, we can still have a first level of validation (using a validating parser and/or an SGML editor) of the configuration file and of HTML output.
The main tags added are:
<SPI_PRT> with the attributes field name, transcoding of charset and end of lines, HTML tagging for highlighting terms.
<A HREF> tag and URL with fields or variable substitution, by a container <SPI_URL>..</SPI_URL>, the attributes of which allow specifying the URL to generate by a printlike syntax. This allows dynamically generating links to other modules of SPIRIT-W3, passing state id and other parameters, as well as transforming some fields into links (e.g. changing an e-mail address into a
<SPI_WHILE>..</SPI_WHILE> construction, the attributes of which specify which result set (classes of concept, list of documents, etc.) to use, and optional grouping and sorting criteria (e.g., sorting by journal issue or by year).
The language can also manipulate variables that are:
Each module uses a set of configuration files and selects the appropriate one at run time based on a set of parameters: database name, language of user interface (inherited by accept-language header or specified), character set to use on output, and a custom parameter (e.g., to allow different presentations of the same document).
Finally, the mapping between a query form and the actual query
of the database is achieved by a special syntax of the form field
names: the name of the form field specifies the type of query
(natural language, Boolean equality, and comparison) and the
list of database fields to which the query is applied. For
example, a form field of name
specifies a search of its content in natural language in the fields
ABSTRACT of the database.
The EMIR project proved the feasibility of the interrogation in one language of a database containing documents in another language. In our real applications, the problem is a little more complicated. Databases contain documents in several languages (mainly French and English) and even the description of a document can be in more than one language. For example, the title can be in two languages, the summary in two languages, and so forth.
The SPIRIT servers can only manage one language at a time. That is why we have decided to extend the SPIRIT-W3 interface to have a multibase capability. This multibase capability solves our problem, which is to be able to interrogate, at the same time, databases that are in different languages. Each multilingual database is split into as many separate databases as it contains languages. Multilingual database access is thus converted into multibase access, possibly with a cross-lingual reformulation.
The user can identify a logical database to interrogate. This database is the union of different databases, each of which can be a homogeneous-language part of a multilingual one. The interface sends the request to the various databases with an indication of the query language (each database knows its own target language). The answers are merged and a weight is recomputed for each word to simulate the word weights in the logical database. Documents are grouped into classes of concept intersection and the result is sorted according to the relevance given to each class. The answer presented to the user is usually a list of titles. Full information can be obtained by a link. In the case of documents containing information in several languages, the document is recomposed with information coming from the different unilingual databases.
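The merging step can be sketched as follows. The paper does not state how weights are recomputed, so this sketch assumes each database reports its size and per-concept document counts and that an IDF-style weight is recomputed over the totals; `merge_answers` and the answer layout are hypothetical. Concepts are identified by the query concept they match, independently of the database language.

```python
import math
from collections import defaultdict


def merge_answers(answers):
    """Merge per-database answers into ranked classes of concept intersection.

    answers: list of dicts with keys
        'n_docs'   - number of documents in that database,
        'doc_freq' - dict concept -> number of documents containing it,
        'hits'     - dict document id -> set of query concepts it contains.
    """
    total_docs = sum(a["n_docs"] for a in answers)
    total_df = defaultdict(int)
    for a in answers:
        for concept, df in a["doc_freq"].items():
            total_df[concept] += df
    # Weights simulate those of the single logical database (assumed formula).
    weights = {
        c: math.log(total_docs / df) + 1.0 for c, df in total_df.items() if df > 0
    }

    # Documents sharing the same set of query concepts form one class.
    classes = defaultdict(list)
    for a in answers:
        for doc_id, concepts in a["hits"].items():
            classes[frozenset(concepts)].append(doc_id)

    ranked = sorted(
        classes.items(),
        key=lambda kv: (-sum(weights.get(c, 0.0) for c in kv[0]), sorted(kv[0])),
    )
    return [(sorted(concepts), sorted(docs)) for concepts, docs in ranked]
```

Each returned pair gives the concepts that define a class and the documents, from any of the unilingual databases, that fall into it.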
In the original SPIRIT-W3 design, we assumed that only one database was queried at a time, that database having a known structure (list of fields with associated type). To achieve multidatabase and multilingual capabilities, we added an abstraction layer that permitted a "view" of a set of native databases as a single "logical" database. A view defines fields and how these fields are related to fields in actual ("physical") databases. A field has a name and a set of properties:
The relation between databases allows for definition of how to express queries and combine results. There are two operations at the database level: join and concatenation.
LIBRARY = join(LIBRARY_F.REFERENCE, LIBRARY_E.REFERENCE) (which means that when a record in database LIBRARY_F has the same value for field REFERENCE as a record in database LIBRARY_E, the association of the two records forms one record in the logical database LIBRARY). In the context of the join, we then define the resulting fields by operators on the elementary fields; we have the following operators:
LIBRARY.TITLE_F = LIBRARY_F.TITLE_F means that to manipulate (display or query) field TITLE_F, it is sufficient to apply the same operation to field TITLE_F of database LIBRARY_F.
LIBRARY.AUTHOR = same(LIBRARY_F.AUTHORA, LIBRARY_E.AUTHORB) means that it is guaranteed that the fields AUTHORA and AUTHORB contain the same data in corresponding records of databases LIBRARY_F and LIBRARY_E, so the SPIRIT-W3 engine can use them in an equivalent manner. It can only be used if the attributes of both fields are compatible. This is used for display: displaying field AUTHOR (of LIBRARY) means displaying AUTHORA or AUTHORB but not both; the engine will choose which real field to use based on optimization (e.g., already in the cache). It is also particularly useful for queries over a multilingual database: when one criterion is factual (meaning that all records in the result set must meet the requirement, as in YEAR=1997) and another is textual, it is far more efficient to have the corresponding AND operation checked by the database engine; therefore, the factual field must be present in every single-language associated database.
LIBRARY.ABSTRACT = union(LIBRARY_F.ABSTRACT_F, LIBRARY_E.ABSTRACT_E) allows fields in different databases to be manipulated under one name. It is used extensively on multilingual databases: the (logical) field ABSTRACT is multilingual and can be queried. The query can be transparently routed to the real databases with different reformulations as defined by the pairs (language of query, language of (real) database). If used for display, it concatenates the two (or more) fields before display, hence the two fields must have the same attributes.
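The view operators can be sketched with plain dictionaries. The field and database names come from the example above; the `VIEW` table layout, the short database keys "F" and "E", the language codes, and the functions `join`, `display` and `route_query` are assumptions made for this illustration, not the SPIRIT-W3 configuration syntax.

```python
# Hypothetical in-memory description of the logical database LIBRARY.
DB_LANGUAGE = {"F": "fr", "E": "en"}
VIEW = {
    "TITLE_F": ("direct", [("F", "TITLE_F")]),
    "AUTHOR": ("same", [("F", "AUTHORA"), ("E", "AUTHORB")]),
    "ABSTRACT": ("union", [("F", "ABSTRACT_F"), ("E", "ABSTRACT_E")]),
}


def join(records_f, records_e, key="REFERENCE"):
    """Pair records of LIBRARY_F and LIBRARY_E that share the same key value."""
    by_key = {rec[key]: rec for rec in records_e}
    return [
        {"F": rec, "E": by_key[rec[key]]} for rec in records_f if rec[key] in by_key
    ]


def display(logical_record, field):
    """Return the text to display for a logical field of a joined record."""
    operator, sources = VIEW[field]
    if operator in ("direct", "same"):
        # For "same", any source holds the data; the first one is used here.
        db, name = sources[0]
        return logical_record[db][name]
    if operator == "union":
        # Union concatenates the real fields before display.
        return "\n".join(logical_record[db][name] for db, name in sources)
    raise ValueError("unknown operator: " + operator)


def route_query(field, query_language):
    """List the real fields to query and the reformulation pair for each."""
    _, sources = VIEW[field]
    return [(db, name, (query_language, DB_LANGUAGE[db])) for db, name in sources]
```

For a French query on ABSTRACT, `route_query` yields one unilingual request (fr, fr) for LIBRARY_F and one cross-lingual request (fr, en) for LIBRARY_E.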
During the spring and summer of 1996, a prototype was developed that gave promising results with our library catalogue. We are developing a final version and in mid-1997 will start operating several databases: library catalogues, a database of titles (and summaries) of 3,000 scientific journals (data from the British Library), a catalogue of publications issued by our organization's research scientists, and a catalogue of controlled Internet resources. For the last application, pages downloaded from servers identified as good-quality ones in our scope will be indexed whatever their language (French, English, German, and Russian).
Since there is a huge audience for these applications, we anticipate abundant feedback for evaluation of application effectiveness. These results will be reported during the conference.
Moving from evaluating a prototype to placing a real-sized application in the hands of thousands of users is a hard task, but it is the only way to prove the value of the research. There is a strong need in organizations like ours to access information that merges two or more languages. We hope that the service we will provide to our users will cover their needs. If the first application is a success, the approach will be generalized to all our bilingual databases. Experiments will also continue for expansion into other languages.