SPIRIT-W3: A Distributed Cross-Lingual Indexing and Search Engine

Christian Fluhr <fluhr@tabarly.saclay.cea.fr>
Dominique Schmit <dschmit@tabarly.saclay.cea.fr>
Philippe Ortet
Faïza Elkateb
Karine Gurtner
Commissariat à l'Énergie Atomique
France

Abstract

This paper describes the architecture and working principles of a system that can index documents (and parts of documents) in several languages and retrieve information from them using a query written in a single language. The system is based on automatic linguistic processing of both documents and queries; general unilingual and bilingual reformulation rules are used to reduce the gap between the user needs expressed by the query and the document contents. The system includes an interface between HTTP and SPIRIT servers and is accessible through standard WWW clients. The design methodology and architecture details of this type of interface are also presented.

Keywords: information retrieval systems, multilingual full-text databases, cross-lingual interrogation, language engineering, distributed full-text databases on the Internet, database gateways.

1 The context
2 The background
3 Brief description of the SPIRIT indexing and retrieval technology
4 How the multilingual reformulation works
5 Architecture of SPIRIT server accessibility by standard Internet clients
- The gateway problem: Keeping context and collaborating with caches
- Developing applications using an extension to HTML
6 Architecture of cross-lingual access to SPIRIT databases
7 Experimentation
8 Conclusion
9 References

1 The context

Many full-text corpora consist of documents in more than one language. This is the case for scientific and technical documents from non-English-speaking countries and for documents in full-text databases that are generally in at least two languages. It is also the case for countries that have several official languages (Canada, Belgium, Switzerland, Tunisia, European Union, etc.). On the Internet, interesting information can be found on pages not written in English, and it is necessary to have systems that can effectively index texts in several languages and that enable users to search through texts of various languages in their mother tongue.

2 The background

From 1990 to 1994, a European ESPRIT project, European Multilingual Information Retrieval (EMIR), demonstrated the feasibility of cross-lingual interrogation. The languages accepted were French, English, and German. Russian is now being implemented with UNESCO funding and Dutch will follow, with Belgian funding. From 1994 to 1996, the results of this project were integrated into a commercial client-server information retrieval system, Syntactic and Probabilistic Indexing and Retrieval of Information in Texts (SPIRIT).

In 1996, an interface was developed between a WWW server and a SPIRIT server so that SPIRIT databases could be queried through a standard WWW browser. Initially, the databases were in different languages. The language of each database had to be homogeneous and multibase interrogation was not possible.

Because many of our databases are bilingual (English and French) and because single documents often appear in several languages (e.g., summaries in two languages), we have developed a prototype of a distributed cross-lingual information retrieval system based on SPIRIT-W3 architecture, which has been generalized for this use.

3 Brief description of the SPIRIT indexing and retrieval technology

SPIRIT is based on linguistic processing that is performed both on database texts and on queries. This processing is robust and not domain-dependent. The level of linguistic parsing is morphosyntactic. Morphological analysis is simple because it relies on a full-form dictionary. Syntactic analysis is divided into a disambiguation tool (tagger) and a dependency relation analysis. After the morphosyntactic analysis is done, words and compounds are normalized and stop words are eliminated according to their part of speech.

Statistical processing is then applied to the input texts, using the output of the linguistic processing. It assigns a weight to each normalized word or compound. This weight reflects the information carried by the word, that is, how useful the word is for selecting relevant documents. The weight is used during interrogation to rank the documents according to their semantic proximity to the query.

The reformulation tool is used during the query processing. The role of this tool is to reduce the gap between the user vocabulary contained in the query and the vocabulary used to convey the same ideas in the database. This tool is also used to make cross-lingual reformulation. The reformulation has a large domain-independent element which can be used in all types of databases. It is also possible to add domain reformulation knowledge as a supplement. Many inferred words are not relevant but the comparison mechanism that gives priority to documents having the greatest intersection with the query is generally sufficient to choose, for example, between different translations. The last tool is the intersection evaluator that identifies the concept intersection between the query and the texts and computes a weight for this intersection.
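The weighting and intersection steps described above can be illustrated with a minimal sketch. The paper does not give the weighting formula, so the IDF-style weight below is an assumption, as are the function names and the toy documents; only the idea of ranking by the weighted intersection with the query comes from the text.

```python
import math
from collections import Counter

def term_weights(docs):
    """Weight each normalized term; an IDF-style formula is ASSUMED here."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    return {term: math.log(1 + n / count) for term, count in df.items()}

def rank(query_terms, docs, weights):
    """Rank documents by the summed weight of the terms shared with the query."""
    scored = []
    for doc_id, terms in docs.items():
        shared = set(query_terms) & set(terms)
        if shared:
            scored.append((sum(weights[t] for t in shared), doc_id))
    # Highest intersection weight first; ties broken by document id.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Toy database of normalized words and compounds (hypothetical content).
DOCS = {
    "d1": ["rocket", "lift-off", "rocket lift-off"],
    "d2": ["rocket", "engine"],
    "d3": ["aircraft", "take-off"],
}
WEIGHTS = term_weights(DOCS)
RANKED = rank(["rocket", "lift-off", "rocket lift-off"], DOCS, WEIGHTS)
```

In this sketch a rarer term such as "lift-off" receives a higher weight than "rocket", and the document sharing the compound with the query is ranked first.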

4 How the multilingual reformulation works

We will describe the functions of the different parts of SPIRIT, which link a query to relevant parts of a text and minimize the links to irrelevant ones. Suppose that the query contains the French phrase "le décollage de la fusée" and some documents contain the phrase "the rocket lift-off." In processing the document, the linguistic parsing step recognizes "rocket" as a noun, and "lift-off" as a noun. If "lift off" has been entered into the idiom dictionary as a single word with the normalized form "lift-off," all occurrences ("lift-off", "lift off", and even "liftoff") are considered the same concept. In this case, the compound "rocket lift-off" has also been recognized.

The query is processed by the same linguistic processing, now using the French language environment. In "le décollage de la fusée," "décollage" is clearly a noun from the syntactical point of view. "Fusée" can be a noun or a past participle of the verb "fuser." Tagging solves the problem because in this context, "fusée" can only be a noun. Of course, the compound "décollage de la fusée" has also been recognized.

Now comes the reformulation process. "Décollage" as a noun can be translated as take-off, lift-off, unsticking; "fusée" as a noun can be translated as rocket, missile, fuse, spindle, stub axle. Because of the tagger, all translations compatible with the part of speech "verb" are eliminated. That means that to burst forth, to gush, to spurt out, to fly out, to stream out, etc. are not proposed for searching in the target language documents.

As you can see, there are several translations and some are semantically incorrect. To eliminate ambiguity, we use the database itself. In the first step, translations that are not in the database index are eliminated. Next, translations that are not relevant to the domain are eliminated. That is probably the case for unsticking as the translation of "décollage" if the database deals with aeronautics and space. It is probably also the case for fuse, spindle, and stub axle as translation of "fusée." Filtering by the database index is not sufficient to eliminate all incorrect translations, especially if the database content is not technology-specific but covers a wider scope. The second way of filtering out the wrong translations is by the use of words and compounds contained in the "best" documents. This filtering works well if the query has an answer in the database. It is sometimes slightly difficult to be sure that the document that has the best intersection with the query (i.e., the intersection with the best weights) is a relevant document. In these cases, we believe that the intersection is sufficient between the query concepts and the documents, and we only keep the translations that are in these best documents.
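The two filtering steps can be sketched as follows. This is a simplified illustration, not the SPIRIT implementation: the transfer dictionary and documents are toy data, and the "best" documents are taken to be those covering the largest number of query words, whereas the system described here uses weighted intersections.

```python
# Hypothetical transfer dictionary, restricted to the noun readings kept by the tagger.
TRANSFER = {
    "décollage": ["take-off", "lift-off", "unsticking"],
    "fusée": ["rocket", "missile", "fuse", "spindle", "stub axle"],
}

# Toy target-language database: document id -> set of normalized index terms.
DOCS = {
    "d1": {"rocket", "lift-off", "rocket lift-off"},
    "d2": {"missile", "guidance"},
    "d3": {"fuse", "circuit"},
}

def filter_by_index(candidates, docs):
    """Step 1: drop translations that never occur in the database index."""
    index = set().union(*docs.values())
    return {w: [t for t in ts if t in index] for w, ts in candidates.items()}

def filter_by_best_documents(candidates, docs):
    """Step 2: keep only translations present in the best-matching documents.

    ASSUMPTION: "best" means covering the most query words (unweighted).
    """
    def coverage(terms):
        return sum(any(t in terms for t in ts) for ts in candidates.values())

    best = max(coverage(terms) for terms in docs.values())
    if best == 0:
        return candidates  # no answer in the database: nothing to filter with
    best_terms = set().union(
        *(terms for terms in docs.values() if coverage(terms) == best)
    )
    return {w: [t for t in ts if t in best_terms] for w, ts in candidates.items()}

AFTER_INDEX = filter_by_index(TRANSFER, DOCS)
AFTER_BEST = filter_by_best_documents(AFTER_INDEX, DOCS)
```

With this toy data, the index filter removes "take-off", "unsticking", "spindle", and "stub axle", and the best-document filter then keeps only "lift-off" and "rocket".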

For example, if there are documents that contain at least one translation of each original query word, we can consider these translations the best ones. In our example, we have two words but these words are linked in a dependency relation, which means they form a compound. Of course, if we can establish that the same dependency relation exists in the source language and in the target one, it will confirm that the translation is the best one.

This is not so easy because word order is not always the same in the source and target language. It is necessary to use transformation (or reordering) rules to produce "rocket lift-off" from "décollage (de la) fusée". In this way, the cross-lingual reformulation process follows SPIRIT. If we suppose that the best documents only contain "rocket lift-off," it is now possible to get feedback to increase the precision of the answer by searching only with the best translations: "rocket" and "lift-off" and "rocket lift-off." In some cases, this kind of filtering is too drastic. It can eliminate translations that are synonyms of the best translations and that are correct from the semantic point of view. That is why it is possible at this level to make a unilingual target language reformulation to reintroduce "missile", for example, as a possible good keyword to search.

This filtering by the best documents can also be used in the case of double translation. That means translation from language A to B and then B to C. This is important from a cost point of view. It is not possible to expect that there will be transfer dictionaries from any language to any other one. Transfer dictionaries from Finnish to Greek can be hard to find but Finnish to English and English to Greek can be found.

It is very important to have a strong filtering in the pivot language (English in the preceding example) because the remaining wrong translations may result in retrieval of many irrelevant documents. It is important that a relevant document be found in the pivot language in order to permit the filtering to work. This means that the pivot language must be chosen for each database according to the existence of the maximum number of documents in the pivot language covering the largest subject field.
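The effect of filtering in the pivot language can be shown with a small sketch. The dictionaries are toy French-English and English-German examples chosen for illustration (they are not the project's dictionaries), and `pivot_terms` is a hypothetical name for the set of translations retained from the best pivot-language documents.

```python
def translate(words, dictionary):
    """Collect all translations of the given words, without duplicates."""
    out = []
    for word in words:
        for target in dictionary.get(word, []):
            if target not in out:
                out.append(target)
    return out

def pivot_translate(word, a_to_b, b_to_c, pivot_terms=None):
    """Translate A -> B -> C, optionally filtering in the pivot language B."""
    pivot = translate([word], a_to_b)
    if pivot_terms is not None:
        pivot = [t for t in pivot if t in pivot_terms]
    return translate(pivot, b_to_c)

# Toy transfer dictionaries (hypothetical, for illustration only).
FR_EN = {"fusée": ["rocket", "missile", "fuse", "spindle"]}
EN_DE = {
    "rocket": ["Rakete"],
    "missile": ["Rakete", "Flugkörper"],
    "fuse": ["Sicherung"],
    "spindle": ["Spindel"],
}

UNFILTERED = pivot_translate("fusée", FR_EN, EN_DE)
FILTERED = pivot_translate("fusée", FR_EN, EN_DE, pivot_terms={"rocket", "lift-off"})
```

Without the pivot filter, every wrong English translation contributes its own German translations; with it, only the translations of "rocket" reach the final target language.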

5 Architecture of SPIRIT server accessibility by standard Internet clients

The use of the system follows this typical scenario:

1. Select a database. This allows users to specify the scope of their search (e.g., books available in Saclay Central Library, technical internal reports, or description of lab activities).
2. Send a multicriterion query. For example, find relevant documents for "rocket lift-off" and date of publication >= 1996.
3. Get results of the query analysis (i.e., an approximation of what the system will look for in the database). At this stage, indication of unknown terms (i.e., known neither by language dictionaries nor by the database itself) can lead the user to cancel the query or modify it.
4. Get concept intersections between the query and the database sorted by relevance.
5. Get a (compact) list of documents for a (set of) concept intersections.
6. Display one (or several) of the documents with highlighted search terms.

Such a scenario is very interactive. In practice, to cope with network latency in particular and to be consistent with other Web applications, Steps 3 and 4 are combined and a partial list of documents is also shown in Step 4.

In order for WWW clients to gain access to SPIRIT servers, we had to solve two categories of problems: a gateway problem of mapping a session-oriented protocol (like SPIRIT) to a stateless-oriented protocol like HTTP, and an application development (or customization) problem in order to be able to integrate the database with a Web service and quickly add new sources.

The overall architecture is the widely used "three-tier architecture": an HTTP daemon, a set of CGI programs that forms SPIRIT-W3, and native database services through a SPIRIT server (and possibly an SQL service by Sybase).

The gateway problem: Keeping context and collaborating with caches

The problem of gateways has already been tackled by numerous authors (e.g., Perrochon, Salomon, and Barta) and vendors. We present the results of our experience. First, the different states of the application and associated database engines have to be carefully analyzed in order to estimate what the accessible state variables are, what the sizes involved in a state are, what the cost (in terms of CPU and/or bandwidth) to go from one state to another is, and what the limits in terms of storage space for the states are (e.g. maximum number of open connections, maximum size of URL, and disk space). This analysis is greatly facilitated if you can access the database engine through a (sufficiently detailed) application programming interface.

Then transitions between states must be checked in different scenarios. Modality is very difficult to achieve with a browser (or a combination of browser and cache), so the application (on the server side) must be prepared to handle all sorts of requests. Three cases encountered are: a "normal" transition (e.g., going from a list of concept intersections to a list of documents), an "abnormal" transition but one in which a "graceful" recovery can be achieved (e.g., by ignoring user preferences after a time out), and an "error" transition in which the transition requested cannot be achieved and the user is directed to an initialization state. At the end of this analysis you can specify the role of each part of the application.
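As an illustration of the three cases, the following sketch classifies a request according to the context still available on the server. The step names, the `REQUIRES` table, and the choice of "preferences" as the only recoverable item are assumptions made for the example; they are not taken from the SPIRIT-W3 code.

```python
from enum import Enum

class Outcome(Enum):
    NORMAL = "normal"        # everything needed is available
    RECOVERED = "recovered"  # something optional is missing: use defaults
    RESET = "reset"          # cannot proceed: back to the initialization state

# Hypothetical steps and the context items each one needs.
REQUIRES = {
    "choose_database": set(),
    "intersections": {"database", "query", "preferences"},
    "documents": {"database", "query", "intersection", "preferences"},
}
# Items that can be replaced by defaults (e.g., preferences lost after a time out).
OPTIONAL = {"preferences"}

def classify(step, context):
    """Classify a request for `step` given the stored context (None if expired)."""
    missing = REQUIRES[step] - set(context or {})
    if not missing:
        return Outcome.NORMAL
    if missing <= OPTIONAL:
        return Outcome.RECOVERED
    return Outcome.RESET

FULL = {"database": "LIBRARY", "query": "rocket lift-off",
        "intersection": 1, "preferences": {"lang": "fr"}}
NO_PREFS = {k: v for k, v in FULL.items() if k != "preferences"}
```

Because the browser is not modal, the classification depends only on what the request needs and what is stored, never on which request came last.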

In our case, we chose to develop a CGI program that related to each step of a session (choosing a database, submitting a query form, getting lists of results, displaying a document, etc.). This simplifies development and handling of transitions. Next, one has to choose (or combine) the various means that can be used to keep state information:

Inside the client, through the use of a "long" URL accessed by the GET method (like AltaVista's http://.../query?stq=10&q=Fluhr):
When a state transition happens, the client receives a new page, with a new URL. This technique has the advantage of simplicity. It has no particular problems with caches and bookmarks (the URL defines the state exactly). Its main drawback is that storage space is limited (a few hundred bytes) and that the entire state must traverse the network. Of course, this method only requests transition from an initialization state to the state described, so the application has to go through intermediate states internally if the transition cannot be achieved directly and that can be very costly, even impossible if user interaction is needed.
The analogous method is through (often hidden) fields transmitted by a POST method. The POST method causes problems with caches, particularly on browsers (cf. the famous question: "Repost form data?"), and today we do not know of any bookmark that can keep the associated fields with a form. So, while POST is the best (or only) method when a certain amount of data has to be transferred, it cannot always be used in place of GET.
Inside the middleware between the HTTP server and the native database server:
This is the solution most often used. A token (identifier) is associated with the state stored and this identifier is passed to the client. On subsequent requests, the client retransmits this token to the server (usually on the URL or by hidden fields). Care must be taken that this token represent a state and not only a session (or user), otherwise inconsistencies rapidly appear. Due to the nonmodality of the browser, you cannot presuppose the order of interaction between client and server, so if a request leads to state N (for session S), you cannot presuppose that the last request led to state N-1.
What is stored depends entirely on the analysis of the system, and of the possibilities of the underlying database services. One has to combine storage of previous-user-supplied data (in order to be able to replay a sequence of events) with look ahead (continuing the most probable scenario). Look ahead (a classical technique on cache devices for memory or disk) can lead to improved overall performance; for example, we use it to store a partial list of documents when the system calculates (and sends back) the list of concept intersections. There is a strong probability that the user will ask for the details of the first intersection (or the next one) after viewing it. To summarize, the database server will service the application by going through different states (e.g. N, P, Q); the choice of what is kept has to be made in order to minimize the cost of changing the server from an initialization (or stable) state to state N, from initialization to state P, or from initialization to state Q, because in this model the server always starts from a single known state (no memory).
Inside the native database server:
This can be the most efficient use of the native service by minimizing the steps to be replayed by it. Implementation can be quite difficult due to two problems. First, very often the database server can keep contexts if its client keeps a network connection open, so you have to solve the problem of connecting the right WWW client to the right connection (one simple solution consists of spawning a separate server on a port specific to the WWW client so that the client may connect directly to the appropriate connection, but this solution has drawbacks that prevent it from being widely used) and to take care of time out or the maximum number of simultaneous connections to the database server. Second, if the system has several states, this solution is efficient only if you can easily either keep multiple states on the server (which is almost never the case) or can go through a minimum of steps from state P to state N (assuming that the last request has left the server in state P and the current request leads to state N).
Inside caches:
Caches can store intermediate results that will be reused several times (e.g., in our application, the list of documents resulting from a query). Some precautions must be taken in order to use them efficiently:
- Ensure unique URLs for each different set of data sent back (it is always better if the same data have the same URL).
- Help the cache by correctly dating the information. The header "expires" is useful but can lead to surprising results when the machine housing the cache (very often the WWW browser) does not have the correct time or the correct time zone (MS DOS and Win 3.1x default to GMT+8!). The client assumes that the document has expired as soon as it arrives and tries to reload it on the first occasion. This leads to an enormous increase in traffic and several other problems (such as printing the result of a POST request). Therefore, we cannot use "expires" for documents that are too short-lived.
- Be tolerant of an "out-of-date" URL. When the URL includes a token corresponding to a state stored on the server, that URL may be recalled after the corresponding state has been deleted from the server (due to time out and garbage collection). The application should handle this problem "gracefully" by returning to a "reasonable" default state.
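The middleware option, combined with tolerance of out-of-date URLs, can be sketched as a small state store. This is a minimal sketch under stated assumptions: the class and function names, the time-to-live value, and the shape of the stored state are hypothetical. The points taken from the text are that a token identifies a state (not a session or user) and that an unknown or expired token leads to a default state rather than an error.

```python
import secrets
import time

class StateStore:
    """Keeps application states on the server, each under its own token."""

    def __init__(self, ttl_seconds=1800, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._states = {}

    def save(self, state):
        # A fresh token per state: two states of one session get two tokens.
        token = secrets.token_urlsafe(16)
        self._states[token] = (self._clock(), dict(state))
        return token

    def load(self, token):
        entry = self._states.get(token)
        if entry is None:
            return None
        created, state = entry
        if self._clock() - created > self._ttl:
            del self._states[token]  # garbage-collect the expired state
            return None
        return dict(state)

DEFAULT_STATE = {"step": "choose_database"}

def resume(store, token):
    """Return the stored state, or a reasonable default if it is gone."""
    state = store.load(token)
    return dict(DEFAULT_STATE) if state is None else state
```

The token would typically be placed in the URL or in a hidden form field, so that each distinct set of data sent back to the client also has a distinct URL.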

Developing applications using an extension to HTML

We needed to be able to tailor the presentations (and navigation scenarios) of several databases and to publish them in a short time. Therefore, we chose a presentation-driven model to configure a SPIRIT-W3 application. Each module that makes up the application is configured by a file written in an extension to HTML. All HTML (and even embedded scripts) can be used to define page layout. We added new tags for the verbs the language needed. These tags follow SGML constraints, so by extending a common DTD for HTML, we can still have a first level of validation (using a validating parser and/or an SGML editor) of the configuration file and of HTML output.

The main tags added are:

Field or variable substitution, by an empty element <SPI_PRT> with the attributes field name, transcoding of charset and end of lines, HTML tagging for highlighting terms.
Construction of <A HREF> tag and URL with fields or variable substitution, by a container <SPI_URL>..</SPI_URL>, the attributes of which allow specifying the URL to generate by a printlike syntax. This allows dynamically generating links to other modules of SPIRIT-W3, passing state id and other parameters, as well as transforming some fields into links (e.g. changing an e-mail address into a "mailto:" URL).
Conditional output based on test of field or variable, with a <SPI_IF>, <SPI_ELSE> and </SPI_IF> construction.
Iterative loop over a result set by means of a <SPI_WHILE>..</SPI_WHILE> construction, the attributes of which specify which result set (classes of concept, list of documents, etc.) to use, and optional grouping and sorting criteria (e.g., sorting by journal issue or by year).
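To make the substitution idea concrete, here is a minimal sketch of how an empty <SPI_PRT> element could be expanded from a record. The attribute name FIELD is an assumption (the attribute syntax is not given in this paper), and the sketch ignores transcoding and highlighting.

```python
import html
import re

# ASSUMPTION: the field name is carried by an attribute called FIELD.
SPI_PRT = re.compile(r'<SPI_PRT\s+FIELD="([^"]+)"\s*>', re.IGNORECASE)

def render(template, record):
    """Replace each <SPI_PRT FIELD="..."> with the escaped field value."""
    def substitute(match):
        value = record.get(match.group(1), "")  # missing field -> empty output
        return html.escape(str(value))
    return SPI_PRT.sub(substitute, template)

PAGE = render('<H1><SPI_PRT FIELD="TITLE"></H1><P><SPI_PRT FIELD="YEAR"></P>',
              {"TITLE": "Rockets & missiles"})
```

Escaping the substituted value keeps the generated page valid HTML even when a field contains characters such as "&" or "<".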

The language can also manipulate variables that are:

Defined by the system (such as number of documents in a result set or in the database, strings defining concept intersections, loop indexes, etc.); and
Defined by the application developer; we can store user preferences for a session and retrieve them as needed, or by passing data on a URL (or form fields), we can "chain" several queries into two or more databases (the result of one query being used as some of the criteria for another query: a mechanism analogous to the join, but taking place between document databases).

Each module uses a set of configuration files and selects the appropriate one at run time based on a set of parameters: database name, language of user interface (inherited by accept-language header or specified), character set to use on output, and a custom parameter (e.g., to allow different presentations of the same document).

Finally, the mapping between a query form and the actual database query is achieved by a special syntax of the form field names: the name of the form field specifies the type of query (natural language, Boolean equality, and comparison) and the list of database fields to which this query applies. For example, a form field of name "T:TITLE,ABSTRACT" specifies a search of its content in natural language in the fields TITLE and ABSTRACT of the database.
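A form field name of this kind can be decoded with a few lines of code. The sketch below only knows the "T" code from the example above; the codes used for the other query types are not given here, so they are passed through unchanged. The function names are hypothetical.

```python
# Only "T" (natural language) is documented in the example; other codes are unknown.
QUERY_TYPES = {"T": "natural language"}

def parse_field_name(name):
    """Split 'T:TITLE,ABSTRACT' into a type code and a list of database fields."""
    code, sep, fields = name.partition(":")
    targets = [f.strip() for f in fields.split(",") if f.strip()]
    if not sep or not code or not targets:
        raise ValueError(f"not a query field name: {name!r}")
    return code, targets

def build_criteria(form):
    """Turn submitted form fields into (query type, database fields, text) triples."""
    criteria = []
    for name, text in form.items():
        code, targets = parse_field_name(name)
        criteria.append((QUERY_TYPES.get(code, code), targets, text))
    return criteria

CRITERIA = build_criteria({"T:TITLE,ABSTRACT": "rocket lift-off"})
```

Encoding the target fields in the form field name lets a page designer add or change search forms without touching the CGI programs.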

6 Architecture of cross-lingual access to SPIRIT databases

The EMIR project proved the feasibility of the interrogation in one language of a database containing documents in another language. In our real applications, the problem is a little more complicated. Databases contain documents in several languages (mainly French and English) and even the description of a document can be in more than one language. For example, the title can be in two languages, the summary in two languages, and so forth.

The SPIRIT servers can only manage one language at a time. That is why we have decided to extend the SPIRIT-W3 interface to have a multibase capability. This multibase capability can solve our problem, which is to be able to interrogate, at the same time, databases that are in different languages. The multilingual databases are split into as many separate databases as there are languages in the original databases. So at this time, multilingual database access is converted into a multibase access, possibly with a cross-lingual reformulation.

The user can identify a logical database to interrogate. This database is the union of different databases, each of which can be a homogeneous-language part of a multilingual one. The interface sends the request to the various databases with an indication of the query language (the database knows the target one). The answers are merged and a weight is recomputed for each word to simulate the word weights in the logical database. Documents are grouped into classes of concept intersection and the result is sorted according to the relevance given to each class. The answer presented to the user is usually a list of titles. Full information can be obtained by a link. In the case of documents containing information in several languages, the document is recomposed with information coming from the different unilingual databases.
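The merging step can be sketched as follows. The structure of an answer (database size, per-concept document frequencies, hits) and the IDF-style weight are assumptions made for illustration; the paper states only that weights are recomputed to simulate those of the logical database and that documents are grouped into classes of concept intersection sorted by relevance.

```python
import math
from collections import defaultdict

def merge(answers):
    """Merge per-database answers into relevance-sorted classes of documents.

    Each answer is a dict with (ASSUMED layout):
      'size': number of documents in the database,
      'df':   query concept -> number of documents containing it,
      'hits': document id -> set of query concepts found in it.
    """
    total = sum(a["size"] for a in answers)
    df = defaultdict(int)
    for a in answers:
        for concept, count in a["df"].items():
            df[concept] += count
    # Recompute weights as if all databases were one (IDF-style, assumed).
    weights = {c: math.log(1 + total / n) for c, n in df.items() if n}

    classes = defaultdict(list)
    for a in answers:
        for doc_id, concepts in a["hits"].items():
            classes[frozenset(concepts)].append(doc_id)

    def relevance(item):
        return sum(weights.get(c, 0.0) for c in item[0])

    return sorted(classes.items(), key=relevance, reverse=True)

FRENCH = {"size": 100, "df": {"fusée": 10, "décollage": 5},
          "hits": {"F1": {"fusée", "décollage"}, "F2": {"fusée"}}}
ENGLISH = {"size": 300, "df": {"fusée": 30, "décollage": 5},
           "hits": {"E1": {"fusée", "décollage"}, "E2": {"décollage"}}}
MERGED = merge([FRENCH, ENGLISH])
```

In this sketch, concepts are identified by the original query words, so hits from the French and English databases fall into the same classes.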

In the original SPIRIT-W3 design, we assumed that only one database was queried at a time, that database having a known structure (list of fields with associated type). To achieve multidatabase and multilingual capabilities, we added an abstraction layer that permitted a "view" of a set of native databases as a single "logical" database. A view defines fields and how these fields are related to fields in actual ("physical") databases. A field has a name and a set of properties:

Field is displayable or not. If the field is displayable:
- appearance is modified by query (e.g., with relevant words highlighted) or not;
- all attributes are useful to render the content: language, encoding used in field such as plain text, HTML, special formats such as encoding of subscript/superscript, TeX-like formulas, special entities, and character set used.
Field may be used in a query or not. If the field may be queried:
- general type of query: natural language or factual (unweighted match);
- for natural language: language of the content;
- for facts: type such as character, dates, number, and splitting in subfields or not.
Field may be used as a unique key or not (each record in a SPIRIT database has at least a primary key).

The relation between databases allows for definition of how to express queries and combine results. There are two operations at the database level: join and concatenation.

Join

The matching criterion is expressed as equality of two (or more) unique keys in the two (or more) databases related as in LIBRARY = join(LIBRARY_F.REFERENCE, LIBRARY_E.REFERENCE) (which means that when a record in database LIBRARY_F has the same value for field REFERENCE as a record in database LIBRARY_E, the association of the two records forms one record in the logical database LIBRARY). In the context of the join, we then define the resulting fields by operators in the elementary fields; we have the following operators:

equal: LIBRARY.TITLE_F = LIBRARY_F.TITLE_F means that to manipulate (display or query) field TITLE_F, it is sufficient to apply the same operation to field TITLE_F of database LIBRARY_F.
same: LIBRARY.AUTHOR = same(LIBRARY_F.AUTHORA, LIBRARY_E.AUTHORB) means that it is guaranteed that the fields AUTHORA and AUTHORB contain the same data in corresponding records of database LIBRARY_F and LIBRARY_E, so the SPIRIT-W3 engine can use them in an equivalent manner. It can only be used if the attributes of both fields are compatible. This is used for display: displaying field AUTHOR (of LIBRARY) means display AUTHORA or AUTHORB but not both; the engine will choose which real field to use based on optimization (e.g., already in the cache). It is also particularly useful for queries over a multilingual database: when one criterion is factual (meaning that all records in the result set must meet the requirement, as in YEAR=1997) and another is textual, it is far more efficient to have the corresponding AND operation checked by the database engine; therefore, the factual field must be present in every single-language associated database.
union: LIBRARY.ABSTRACT = union( LIBRARY_F.ABSTRACT_F, LIBRARY_E.ABSTRACT_E) allows fields in different databases to be manipulated under one name. It is used extensively on multilingual databases: the (logical) field ABSTRACT is multilingual and can be queried. The query can be transparently routed to the real databases with different reformulations as defined by the pairs (language of query, language of (real) database). If used for display, it concatenates the two (or more) fields before display, hence the two fields must have the same attributes.
union-xor: is simply a variant of union where it is guaranteed that only one of the fields is not empty at a given time. It allows optimization on display operations.
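The join and three of its field operators can be illustrated on the LIBRARY example. This is a simplified sketch for display only: records are plain dictionaries keyed by REFERENCE, the sample values are invented, and the "same" operator simply takes the French field, whereas the engine described above chooses on optimization grounds.

```python
def join_library(library_f, library_e):
    """Build logical LIBRARY records from LIBRARY_F and LIBRARY_E.

    Both inputs map REFERENCE -> record; only references present in both
    databases produce a logical record.
    """
    logical = {}
    for ref in sorted(library_f.keys() & library_e.keys()):
        f, e = library_f[ref], library_e[ref]
        logical[ref] = {
            # equal: the logical field is exactly one physical field.
            "TITLE_F": f["TITLE_F"],
            # same: both fields hold the same data, so either one may be used.
            "AUTHOR": f["AUTHORA"],
            # union: for display, concatenate the non-empty physical fields.
            "ABSTRACT": " ".join(
                part for part in (f["ABSTRACT_F"], e["ABSTRACT_E"]) if part
            ),
        }
    return logical

# Invented sample records.
LIBRARY_F = {
    "R1": {"TITLE_F": "Le décollage de la fusée", "AUTHORA": "Dupont",
           "ABSTRACT_F": "Résumé."},
    "R2": {"TITLE_F": "Autre titre", "AUTHORA": "Martin", "ABSTRACT_F": ""},
}
LIBRARY_E = {
    "R1": {"AUTHORB": "Dupont", "ABSTRACT_E": "Summary."},
}
LIBRARY = join_library(LIBRARY_F, LIBRARY_E)
```

For queries rather than display, a union field would instead route the query to each physical database with the reformulation appropriate to its language.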

Concatenation

Records coming from different databases (those databases being native databases or the result of a join) are considered different. The primary key of the resulting database is the primary key of the corresponding record, possibly prefixed by a string unique to the underlying database in order to ensure uniqueness. The definitions of fields are the same as for a join (with the exception of the "same" operator, which cannot be used). If the databases do not have the same structure, the missing fields in one record are considered empty.
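Concatenation is simpler than the join and can be sketched in a few lines. The database names, the ":" separator in the prefixed key, and the sample records are assumptions for the example.

```python
def concatenate(databases):
    """Concatenate databases given as name -> {primary key -> record}.

    Keys are prefixed by the database name to keep them unique, and fields
    missing from a record are treated as empty.
    """
    fields = sorted({
        field
        for records in databases.values()
        for record in records.values()
        for field in record
    })
    logical = {}
    for name, records in databases.items():
        for key, record in records.items():
            logical[f"{name}:{key}"] = {f: record.get(f, "") for f in fields}
    return logical

# Invented sample data: the same primary key "1" exists in both databases.
CATALOGUE = concatenate({
    "REPORTS": {"1": {"TITLE": "Internal report", "YEAR": "1996"}},
    "JOURNALS": {"1": {"TITLE": "Journal article", "ISSUE": "12"}},
})
```

Both records survive under distinct keys, and each one exposes the full set of fields, with empty values where its own database had none.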

7 Experimentation

During the spring and summer of 1996, a prototype was developed that gave promising results with our library catalogue. We are developing a final version and in mid-1997 will start operating several databases: library catalogues, a database of titles (and summaries) of 3,000 scientific journals (data from the British Library), a catalogue of publications issued by our organization's research scientists, and a catalogue of controlled Internet resources. For the last application, pages downloaded from servers identified as good-quality ones within our scope will be indexed whatever their language (French, English, German, or Russian).

Since there is a huge audience for these applications, we anticipate abundant feedback for evaluation of application effectiveness. These results will be reported during the conference.

8 Conclusion

Moving from evaluating a prototype to placing a full-scale application in the hands of thousands of users is a hard task, but it is the only way to prove the value of the research. There is a strong need in organizations like ours to access information that merges two or more languages. We hope that the service we will provide to our users will cover their needs. If the first application is a success, the approach will be generalized to all our bilingual databases. Experiments will also continue for expansion into other languages.

9 References

Barta, R. A., and Hauswirth, M., Interface-parasite Gateways. Proceedings of the Fourth International WWW Conference, December 1995, Boston.
Debili, F., Fluhr, C., and Radasoa, P., About reformulation in fulltext IRS, Conference RIAO 88, MIT Cambridge, March 1988, A modified text has been published in "Information processing and management" Vol. 25, No. 6 1989, pp. 647-657.
Fluhr, C., Multilingual Information, Pacific Rim International Conference on Artificial Intelligence (PRICAI), "AI and large-scale Information", Nagoya, 14-16 November 1990.
Fluhr, C., and Radwan, Kh., Fulltext databases as lexical semantic knowledge for multilingual interrogation and machine translation, EWAIC'93 Conference, Moscow, 7-9 September 1993.
Fluhr, C., Mordini, P., Moulin, A. and Stegentritt, E., EMIR Final report, ESPRIT project 5312, DG III, Commission of the European Union, October 1994.
Fluhr, C., Schmit, D., Ortet, P., Elkateb, F., Gurtner, K., and Semenova, V., Distributed multilingual information retrieval, MULSAIC Workshop, ECAI96 Conference, Budapest, 12-16 August 1996.
Gachot, D., Lange, E., and Yang, J., The SYSTRAN NLP Browser: An Application of Machine Translation Technology in Multilingual Information Retrieval, Cross-Linguistic Information Retrieval Workshop, SIGIR'96, 18-22 August, Zurich, Switzerland.
Landauer, T. K., and Littman, M. L., Fully Automatic Cross-Language Document Retrieval Using Latent Semantic Indexing, (1990) in Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, UW Centre for the New OED and Text Research, Waterloo, Ontario, Canada.
Perrochon, L., Translation servers: gateways between stateless and stateful information systems. Network Services Conference, November 1994, London.
Radwan, Kh., Foussier, F., and Fluhr, C., Multilingual access to textual databases. RIAO'91 Conference, April 1991, Barcelona.
Radwan, Kh., and Fluhr, C., Textual database lexicon used as a filter to resolve semantic ambiguity, application on multilingual information retrieval, 4th annual symposium on document analysis and information retrieval, 24-26 April 1995, Las Vegas.
Salomon, M., A Simple Server Architecture for HTTP Sessions. Fifth International WWW Conference, May 1996, Paris.
