A Multilingual (Chinese, English) Indexing, Retrieval, and Search Engine

Wen-hui ZHANG (whzhang@cnnic.net.cn)
China Internet Network Information Center
P.R.China
Hua-lin QIAN (qian@oak.cnnic.net.cn)
Computer Network and Information Center, Chinese Academy of Sciences
P.R.China
Wei MAO (mao@cnnic.net.cn)
Computer Network and Information Center, Chinese Academy of Sciences
P.R.China
Guo-nian SUN (sun@cnnic.net.cn)
Computer Network and Information Center, Chinese Academy of Sciences
P.R.China

Abstract

This paper presents the concepts, technologies, algorithms, and detailed measures used to provide highly relevant, nearly complete, and up-to-date multilingual information searching, indexing, and retrieval. The system architecture is also described.

Introduction

Over the past several years, the Internet revolution throughout the computing world, catalyzed largely by the World Wide Web (WWW), has enabled the widespread dissemination of information worldwide. Web search engines have proliferated to meet people's need to find and access external information. However, with the explosive growth of dynamic and massive Web content, search engines often return too many yet incomplete references, most of which turn out to be irrelevant and some of which are stale. In addition, more and more Internet users come from non-English-speaking countries, and Web documents can be found in languages other than English, so a multilingual search engine is strongly required if users are to retrieve information from multilingual databases. Having originated in the West, existing search engines were not developed with multilingual capability in mind. This problem is particularly acute for Chinese, a non-romanized language. The Internet is now positioned to be an international mechanism for communication and information exchange, and it is indispensable to develop systems that can efficiently search, index, and retrieve multilingual information, especially information in users' mother tongues and in English.

Based on a key project of the Chinese Academy of Sciences completed in June 1998, this paper presents our solution to the above problems. The languages currently accepted are Simplified Chinese (GB) and English; Traditional Chinese (Big5) and Traditional Chinese (EUC) will be added in the near future.

Our system consists of two subsystems: an organization-based subsystem and a Web-page-based subsystem. The organization-based subsystem is developed specifically for users to find out whether a certain organization in China has its own Web site, along with detailed information about that organization. The Web-page-based subsystem is developed for searching information in Web documents.

We developed our Cspider to traverse the Web on the China Internet so that nearly all Web servers, each with almost all of its pages, are represented in our database. We create cooperating, distributed Cspiders that share the work of finding things on the Web without duplicating each other's efforts.

Users conduct searches through a uniform query interface for both Chinese and English; the character-encoding detection module determines the language of the query and sends the request to the database supporting that language. A set of information retrieval functions that natively support Chinese and English searches is available: Boolean queries (and, or, not), phrase queries, fuzzy queries, proximity queries, stemming, language parsing, and spelling-error override.

To remedy the problem of irrelevant results, we extract attributes and form a resource profile for each gathered document. The resource profiles are classified into different categories and indexed with keywords. When users submit their queries, they also specify the categories in which the expected results are likely to fall. Moreover, a weighting algorithm determines relevance and ranks results in order of relevancy.

To keep information up to date, a time-to-live (TTL) value is set for each stored object. An object remains unchanged until it expires; the expiration of the TTL indicates that the related document must be checked again by invoking the Cspider. Periodically, the Cspider retrieves new Web information and suppresses duplicates.

The Web-page Based Subsystem Structure

As depicted in Fig.1, the system consists of four parts: the information gatherer, the search engine, the query server, and the scheduler. The information gatherer runs at distributed sites, collecting Web documents within a particular topological scope, generating a resource profile for each document through the summarizing module, and classifying the resource profiles into different categories. To speed up processing, as many search engines as there are categories run concurrently, each serving an individual category. A category-based search engine fetches and indexes the resource profiles of its category from all information gatherers. The query server accepts a user's query, invokes the related search engines, and organizes the retrieved results they return. The scheduler is responsible for scheduling and dispatching tasks to the distributed gatherers and search engines. To accommodate the constantly growing number of Web servers and the volume of online information, we have paid particular attention to developing a flexible and extensible architecture, so that the system is capable of on-the-fly reconfiguration.

Fig.1 The Web-page Based Subsystem Structure

Information Gatherer

The gatherer's main tasks are collecting pages, generating summaries, and classifying information into categories. The Cspider is responsible for the first task: downloading documents and revisiting pages to update information. The summarizing module extracts significant information judged to be representative of a given document and forms the resource profile. The resource profiles are then sorted into different categories.

Cspider

According to IP network numbers, the whole China Internet is partitioned into several topological scopes when the scheduler is configured. A set of Cspiders is initialized by the scheduler. Each Cspider is instructed by the scheduler to work on a particular scope and collects the documents that fall within it. The Cspider is also given one or several "Top" URLs from which to start its traversal. Moreover, the scheduler propagates every Cspider's scope to the other Cspiders.
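
As an illustration, the scope test can be expressed as a lookup over IP network prefixes. The following minimal sketch assumes a scheduler-supplied table mapping each Cspider to the CIDR prefixes it owns; the names and prefixes shown are hypothetical.

    import ipaddress

    # Hypothetical scope table assigned by the scheduler:
    # Cspider name -> list of CIDR prefixes it is responsible for.
    SCOPES = {
        "cspider-1": [ipaddress.ip_network("159.226.0.0/16")],
        "cspider-2": [ipaddress.ip_network("202.112.0.0/16")],
    }

    def owner_of(host_ip):
        """Return the Cspider responsible for a host, or None if outside all scopes."""
        addr = ipaddress.ip_address(host_ip)
        for spider, networks in SCOPES.items():
            if any(addr in net for net in networks):
                return spider
        return None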

The Cspider generates a visiting profile from the "Top" URLs, scope, search depth, and other parameters. Later on, the visiting profile can be dynamically expanded with URLs sent from other Cspiders. The Cspider traverses Web pages in breadth-first order: it downloads the page indicated by a known URL, marks it as processed, extracts any outbound links, and caches the downloaded document for later summarizing and categorization. Non-indexable files, such as pictures, sounds, PostScript, or binary data, are not processed. Taking one of the links that leads to a new page within the local scope, the Cspider repeats the above process. Before visiting a Web site, the Cspider checks the predefined scope to determine whether the site is within the local scope. If not, the Cspider sends a request to the Cspider that is working on the right scope.
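
The traversal itself is an ordinary breadth-first walk. The sketch below captures the behavior just described; download(), extract_links(), in_local_scope(), and forward_to_peer() are hypothetical stand-ins for the Cspider's HTTP, parsing, scope-check, and inter-spider routines.

    from collections import deque

    def crawl(top_urls, in_local_scope, download, extract_links, forward_to_peer):
        queue = deque(top_urls)          # visiting profile seeded with "Top" URLs
        visited = set()
        while queue:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)             # mark as processed
            if not in_local_scope(url):
                forward_to_peer(url)     # hand off to the Cspider owning that scope
                continue
            page = download(url)         # cached for later summarizing/categorizing
            if page is None:             # non-indexable file (image, sound, binary)
                continue
            for link in extract_links(page):
                if link not in visited:
                    queue.append(link)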

When people update their Web servers, documents may be moved or deleted, the content of Web pages may change, and hypertext links may break. We designed a strategy to update the resource profiles periodically. For each downloaded page, the information gatherer keeps a unique identifier, a TTL, and other values extracted from the resource profile. The TTL value determines how often the Cspider should check the site. When the gatherer determines that a URL should be revisited, it marks the URL as needing to be traversed again. The Cspider then receives revisiting commands from the gatherer, and it first issues a HEAD request for the page before downloading the whole document. Some servers return information in the HTTP response header, including the length in bytes and the last modification date of the resource; this information can be used to decide whether the resource profile needs to be updated. If a Cspider finds that some pages are unreachable, it informs the related search engines.
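
A minimal version of the revisit check, using only the Python standard library; the stored profile fields are assumptions. Note that some servers omit the Content-Length or Last-Modified headers, in which case the page must simply be re-fetched.

    import urllib.request

    def needs_refresh(url, stored_length, stored_last_modified):
        """Issue a HEAD request and compare headers against the resource profile."""
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                length = resp.headers.get("Content-Length")
                modified = resp.headers.get("Last-Modified")
        except OSError:
            return True   # unreachable: report to the search engines / re-fetch
        if length is None or modified is None:
            return True   # headers missing; fall back to a full download
        return length != stored_length or modified != stored_last_modified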

Encoding Detection

Chinese and other East Asian languages encode each character with multiple bytes to guarantee sufficient coding space. Chinese characters are widely used in East Asia, mainly in Chinese, Japanese, and Korean. On the Internet there are three popular coding systems for Chinese: GB, Big5, and EUC. In mainland China, the GB2312 code is a national standard and is supported by all Chinese systems. In Taiwan and Hong Kong, Big5 is more commonly used than any other code.

The Chinese language contains a large set of characters, several thousand of which are in common use. All of them can be represented by GB codes. Each Chinese character consists of two bytes, with the most significant bit of each byte set to 1; this enables us to distinguish between Chinese and English characters. Special characters in Chinese documents, such as punctuation, symbols, Roman letters, and digits, are represented by GB codes too, which means they occupy two bytes instead of one as in English documents.
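
The byte-level distinction can be exploited directly. The sketch below splits a GB-coded byte stream into Chinese and English runs by testing the most significant bit, as described above.

    def split_gb_ascii(data):
        """Yield ('zh', two_bytes) or ('en', one_byte) units from a GB2312 stream."""
        i = 0
        while i < len(data):
            if data[i] & 0x80:           # MSB set: first byte of a GB character
                yield ("zh", data[i:i + 2])
                i += 2
            else:                        # plain ASCII (English) byte
                yield ("en", data[i:i + 1])
                i += 1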

Word Segmentation

Word segmentation is necessary for languages such as Chinese and Japanese, which do not put white space between words. Chinese text therefore cannot be used directly to index or search, as English text can, unless each sentence is first segmented into words. To perform retrieval in Chinese, a crucial step is to segment a sentence into meaningful words.

To segment a sentence into words, we implement a procedure based on a dictionary containing general Chinese words, an idiom list, and professional word lists for special fields of study. In the procedure, the special characters in Chinese documents are used to split a sentence into sub-sentences. A Chinese stop-word list is created and used to break each sub-sentence into word groups, and the words in a word group are sequentially scanned to match the dictionary.

To further separate a word group into meaningful words, there are three basic methods.

a) Longest match: the word group is sequentially scanned against the dictionary, and the longest matched string is taken as a word. The remaining part of the word group is processed in the same way.

b) Shortest match: the word group is sequentially scanned against the dictionary, and the first matched string is taken as a word. The remaining part of the word group is processed in the same way.

c) Overlap match: two matched words may overlap each other across the matching boundary.

Longest match generates fewer and longer words; when a user submits a string shorter than these words, no match may be found. Shortest match causes another problem, mis-segmentation, and may also lead to no matching result. For instance, for a word group C1-C2-C3-C4-C5-C6, shortest match may produce C1-C2, C3-C4, C5-C6 when the correct segmentation is C1-C2-C3, C4-C5-C6. Overlap matching may produce C1-C2, C2-C3, C3-C4, C4-C5, C5-C6, which guarantees precision. In our case, we combine the shortest match and the overlap match, as sketched below.
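
The following toy sketch shows shortest match and overlap match over a word group; the dictionary entries and the maximum word length are stand-ins for the real dictionary.

    DICT = {"\u8ba1\u7b97\u673a", "\u8ba1\u7b97", "\u7b97\u673a", "\u7f51\u7edc"}   # hypothetical entries
    MAX_WORD = 4   # length of the longest dictionary entry, in characters

    def shortest_match(group):
        """Forward scan; at each position take the first (shortest) dictionary hit."""
        words, i = [], 0
        while i < len(group):
            for j in range(i + 1, min(i + MAX_WORD, len(group)) + 1):
                if group[i:j] in DICT:
                    words.append(group[i:j])
                    i = j
                    break
            else:                          # no match: emit a single character
                words.append(group[i])
                i += 1
        return words

    def overlap_match(group):
        """Emit every dictionary word at every start position, overlaps allowed."""
        hits = []
        for i in range(len(group)):
            for j in range(i + 1, min(i + MAX_WORD, len(group)) + 1):
                if group[i:j] in DICT:
                    hits.append(group[i:j])
        return hits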

Summarizing

This module extracts information from HTML documents cached on local disk to generate resource profiles.

Sentences in an HTML document are not equally important. Page titles, headings, hyperlink anchor names, words in bold face, words in italics, and words in the first sentence of every list item are more important than other words or sentences in the same document, so HTML tags are indicators of information importance. Based on HTML tags, we classify the sentences enclosed by tag pairs into different importance levels and formulate the document descriptor. Page titles, headings, and hyperlink anchor names are ranked as level 1. For a short document, all words except the above items are ranked as level 2. For a long document, the sentences enclosed by HTML tags such as 'th', 'caption', 'pre', and 'li' are also in level 2; the sentences enclosed by tags such as 'tr' and 'td' are in level 3, and so on. If a tag-enclosed passage is within 30 words, it is kept as a whole; otherwise, the first paragraph and the first sentence of each other paragraph are kept.
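
A minimal rendering of this level assignment as a tag-to-level table; the tag set follows the text above, and treating unlisted tags as the lowest level shown is an assumption.

    # Importance levels keyed by HTML tag, per the scheme described above.
    TAG_LEVEL = {
        "title": 1, "h1": 1, "h2": 1, "h3": 1, "a": 1,
        "th": 2, "caption": 2, "pre": 2, "li": 2,
        "tr": 3, "td": 3,
    }

    def level_of(tag):
        """Return the importance level for a tag; unlisted tags default to 3."""
        return TAG_LEVEL.get(tag.lower(), 3)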

The resource profile consists of the following fields: URL, identifier, title, update time, TTL, size, language, category, last-visit time, keywords, and document descriptor. This format has some similarity to other meta-information definitions, and we will follow whatever meta-information standard evolves. The resource profiles are stored in a relational database management system (RDBMS).

Most Web documents in China are in Chinese, but some are in English, and some are bilingual (mainly English and Chinese). In the bilingual case, a document is split into as many resource profiles as there are languages used.

The structure of the descriptor field is shown as follows:

     Descriptor
         level 1:
         level 2:
         level 3:
         ...
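
Rendered as a record type, the resource profile might look like the sketch below. The field names follow the list above; the concrete types are assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class ResourceProfile:
        url: str
        identifier: str
        title: str
        update_time: str          # last modification date reported by the server
        ttl: int                  # seconds until the document must be rechecked
        size: int                 # bytes
        language: str             # e.g. "zh-GB" or "en"
        category: str
        last_visit_time: str
        keywords: list = field(default_factory=list)
        descriptor: dict = field(default_factory=dict)   # level -> sentences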

Categorizing

No matter how good the search and ranking algorithm is, it does not guarantee that references with high relevance scores are actually relevant to the user's need. A user often has a specific search interest, while search engines return many references little related to the user's field of interest. For instance, when a computer science student submits the keyword 'ATM' to a search engine, he expects the returned results to be pages about Asynchronous Transfer Mode rather than Automated Teller Machines. One approach to alleviating this problem is to focus the search on the relevant categories; this dramatically reduces the number of references returned and increases relevance. Most users know which categories their interests fall into.

Our primary objective is to create a more useful retrieval system by narrowing the search scope to a set of Web pages related to a certain type of knowledge. Our method is to classify the collected information into main categories defined by a number of experts from cataloging, digital libraries, and other related fields.

An intelligent categorizing module classifies Web documents into particular categories based on the knowledge base. It extracts a list of significant phrases and keywords from each document, using techniques that depend on the language of the document, and then compares the extracted high-quality keywords with the predefined set of representatives of each category.

Benefiting from the results of summarizing, we obtain sentences sorted by importance in the descriptor field of the resource profile. For Chinese resource profiles, the document descriptor is first segmented into white-space-delimited words by the word segmentation module. The intelligent module then assigns each word a weight based on its occurrence frequency and its level. Words with high weight values are saved in the resource profile as keywords.
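
The exact weighting formula is not given here; the sketch below shows one plausible scheme that combines occurrence frequency with assumed per-level weights.

    from collections import Counter

    LEVEL_WEIGHT = {1: 3.0, 2: 2.0, 3: 1.0}   # assumed: higher levels weigh more

    def keyword_weights(descriptor):
        """descriptor: {level: [word, ...]} -> {word: weight}, highest first."""
        weights = Counter()
        for level, words in descriptor.items():
            for word in words:
                weights[word] += LEVEL_WEIGHT.get(level, 1.0)
        return dict(weights.most_common())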

The knowledge base consists of two parts: representatives of each category and rules. The complexity of the knowledge base depends on the number of categories.

Query Server

The procedure of retrieving information is as follows.

(1) User submits a query, with search options such as preferred categories, to the query server.

(2) The encoding detection module is invoked to detect the language of the query.

(3) The query is reformulated by the query server and passed to appropriate search engines.

(4) The search engines work in parallel and return ranked results to the query server. Each result is hypertext-linked to the Web site that contains the expected document.

(5) The query server reorganizes the results and returns them to the user.

Search Engine

Each search engine performs indexing and retrieval for a particular category. It periodically retrieves the resource profiles matching its category, as a compressed stream, from every information gatherer. When new resource profiles have arrived or some URLs are out of date, the search engine chooses a good opportunity to update its index. Upon receiving a request from the query server, the search engine performs searches based on the pre-created index and returns the results, listed in relevancy order, to the query server. The results contain the URL pointing to the information the user wants, together with the time it was last fetched, the title of the document, and a short summary of the document.

Indexing

The process of indexing is to generate index objects and then create an inverted index over the resource profiles. In our multilingual architecture, the search engine has as many index tables as there are languages. The inverted index table has the following structure regardless of the language of the index objects:

Fig.2 The structure of the inverted index

Thus, when a user submits a query containing the word "network", the corresponding resource profiles are retrieved; the term frequencies (tf) are used in calculating the ranking score, while the positions are used to determine whether consecutive query terms constitute a phrase.
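
One way to realize such an inverted index is a table mapping each term to a postings list of (document identifier, term frequency, position list); this layout is inferred from the description above, since the figure itself is not reproduced here.

    from collections import defaultdict

    index = defaultdict(list)   # term -> [(doc_id, tf, [positions]), ...]

    def add_document(doc_id, terms):
        """Add a document's index objects (in order) to the inverted index."""
        positions = defaultdict(list)
        for pos, term in enumerate(terms):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))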

Generating index objects differs between Chinese and English. The tasks of the English-specific module are relatively easy. First, English words are reduced to a canonical stem by the removal of suffixes. There are also lists of "noise words" (or "stop words") and "noise stems" for English: words in a resource profile are compared exactly against the noise words and, after stemming, against the noise stems, and a word is ignored if a match occurs. Thus common invariant words, and common stems, can be excluded. Index objects are put in lower case, except for words with all letters capitalized.
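
A minimal sketch of this pipeline; the noise-word and noise-stem lists are illustrative subsets, and NLTK's PorterStemmer stands in for the stemmer actually used (cf. ref. 7).

    from nltk.stem.porter import PorterStemmer

    NOISE_WORDS = {"the", "a", "an", "of", "and"}   # illustrative subset
    NOISE_STEMS = {"veri", "use"}                    # illustrative subset

    def index_objects(words):
        """Yield index objects: stop-word filtered, stemmed, lower-cased
        (all-capital words such as 'ATM' are kept as-is)."""
        stemmer = PorterStemmer()
        for word in words:
            if word.lower() in NOISE_WORDS:
                continue
            stem = stemmer.stem(word.lower())
            if stem in NOISE_STEMS:
                continue
            yield word if word.isupper() else stem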

Unlike English, Chinese words are not separated by spaces, so there are inherent difficulties in generating index objects. The word segmentation module performs this hard task: using the segmentation techniques described above, we generate index objects from the descriptor field of the resource profiles. Both words and single characters can be indexed.

Retrieving

The search engine employs the so-called TFxIDF search and ranking algorithm, which assigns relevance scores to Web documents using the vector space model. Retrieved documents are ranked in decreasing order of similarity to the query.
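
For concreteness, a minimal unnormalized TFxIDF scorer over the postings layout sketched earlier; a full vector-space implementation would also divide by the document vector norms to obtain cosine similarity.

    import math

    def tfidf_scores(query_terms, index, n_docs):
        """index: term -> [(doc_id, tf, positions)]; returns {doc_id: score}."""
        scores = {}
        for term in set(query_terms):
            postings = index.get(term, [])
            if not postings:
                continue
            idf = math.log(n_docs / len(postings))   # rarer terms weigh more
            for doc_id, tf, _ in postings:
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
        return dict(sorted(scores.items(), key=lambda kv: -kv[1]))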

Query strings can consist of phrases in any of the supported languages, as well as Boolean operators (AND, OR, NOT). For a phrase search (e.g., computer networks), the system first retrieves all resource profiles containing all of the constituent words (e.g., computer, networks); the retrieved documents are then checked for adjacency of these words.
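
The adjacency test can be carried out on the stored position lists. The sketch below checks whether consecutive query terms occur at consecutive positions in a document.

    def phrase_in_doc(term_positions):
        """term_positions: one position list per query term, in phrase order.
        Returns True if the terms occur at consecutive positions."""
        candidates = set(term_positions[0])
        for positions in term_positions[1:]:
            candidates = {p for p in positions if p - 1 in candidates}
            if not candidates:
                return False
        return True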

A fault-tolerant search mechanism using an algorithm developed by V.I. Levenshtein supports spelling-error override searches for English information.
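
The classical Levenshtein edit distance (ref. 6) underlies such a mechanism; matching a query term against indexed terms within some small distance threshold is an assumption about how the override is applied.

    def levenshtein(a, b):
        """Minimum number of insertions, deletions, and substitutions
        needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[len(b)]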

Some other searches, such as fuzzy searches and proximity searches, are also supported.

The Organization Based Subsystem

Because the China Internet Network Information Center (CNNIC) is exclusively responsible for domain name registration throughout China, we maintain all of the registration information generated from the registration forms filled in by each applicant. Most items on the registration forms request both English and Chinese information. To make efficient use of this invaluable information, we developed the organization based subsystem to make the information public and to provide search functions. One feature of this system worth mentioning is that it provides two ways of searching: 1) users can submit a keyword or phrase query to find the organizations whose name or brief introduction contains the keywords or phrase; 2) users can submit a domain name to search for information. The other feature is that we not only provide searching over the information each applicant registered, but also add new information gathered by our detecting module.

The subsystem structure is depicted in Fig.3. The extracting module picks up information contained in the domain name registration forms submitted by applicants, such as the organization's English name, Chinese name, brief introduction, contact information, and responsible-person information. The detecting module decides whether a Web site associated with a domain name exists by trying to connect to the host whose name is generated from the domain name, such as www.domainname or web.domainname. The registration information database consists of the information provided by both the extracting and detecting procedures.

Fig.3 The structure of the organization based subsystem
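
A minimal sketch of the detecting module's existence test: attempt a TCP connection to the conventional host names derived from the domain. The port and timeout values are assumptions.

    import socket

    def website_exists(domain, timeout=5.0):
        """Return True if a Web host answers at www.<domain> or web.<domain>."""
        for host in ("www." + domain, "web." + domain):
            try:
                with socket.create_connection((host, 80), timeout=timeout):
                    return True
            except OSError:
                continue
        return False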

The search engine employs indexing, retrieval, and word segmentation technologies similar to those described for the Web-page based subsystem to perform searches based on the pre-created index and return results to the query server. The results contain the URL pointing to the organization's Web site, the organization's name, and its contact information. The search engines update their indexes every day, as the registration information database is itself updated daily.

The query server is responsible for analyzing the query: if the user submits a query by domain name, the query server passes the query directly to the specified search engine; if the user submits a query by keywords or phrase, the query server invokes the encoding detection module to detect the language and then passes the query to the appropriate search engine.

Future Work

So far, we have incorporated only Chinese GB and English. We will integrate Traditional Chinese Big5 and EUC-TW into our multilingual environment.

Furthermore, we will augment our system with Natural Language Processing (NLP) capability. An NLP system can make text segmentation more accurate, categorization more intelligent, and the references returned to users more relevant.

References

  1. K. Moore, Representation of Non-ASCII Text in Internet Message Headers, RFC 1342, 1992.
  2. G. Salton and C. Buckley, Improving Retrieval Performance by Relevance Feedback, Journal of the American Society for Information Science, 41(4), pp. 288-297, 1990.
  3. I. A. Macleod, P. Martin, and B. Nordin, A Design of a Distributed Full Text Retrieval System, in Proceedings of the 1986 ACM Conference on Research and Development in Information Retrieval, pp. 131-137, 1986.
  4. Y. Wei, Y. Zhang, J. Li, J. Ding, and Y. Jiang, ASCII Printable Characters-Based Chinese Character Encoding for Internet Messages, RFC 1842, 1995.
  5. G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, MA, 1989.
  6. V. I. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Physics Doklady, Vol. 10, 1965.
  7. M. F. Porter, An Algorithm for Suffix Stripping, Program, Vol. 14, pp. 130-137, 1980.
  8. B. Pinkerton, Finding What People Want: Experiences with the WebCrawler, in Proceedings of the Second International World Wide Web Conference, 1994.
  9. U. Pfeifer, N. Fuhr, and T. Huynh, Searching Structured Documents with the Enhanced Retrieval Functionality of freeWAIS-sf and SFgate, Computer Networks and ISDN Systems, 27(7), pp. 1027-1036, 1995.
  10. J. He, J. Xu, A. Chen, J. Meggs, and F. C. Gey, Berkeley Chinese Information Retrieval at TREC-5: Technical Report, December 1996.
  11. A. Smeaton and R. Wilkinson, Spanish and Chinese Document Retrieval in TREC-5, April 4, 1997.
  12. S. Soderland, Learning to Extract Text-based Information from the World Wide Web, in Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97).
  13. B. Pinkerton, Finding What People Want: Experiences with the WebCrawler, in Proceedings of the First International Conference on the World Wide Web, Geneva, Switzerland, May 1994.
  14. R. A. Barta and M. Hauswirth, Interface-parasite Gateways, in Proceedings of the Fourth International WWW Conference, Boston, December 1995.
  15. S. Weibel et al., OCLC/NCSA Metadata Workshop Report, March 1995, http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html
  16. C. M. Bowman et al., The Harvest Information Discovery and Access System, in Proceedings of the Second International WWW Conference, pp. 763-771, Chicago, 1994.
  17. E. Selberg and O. Etzioni, Multi-Service Search and Comparison Using the MetaCrawler, in Proceedings of the Fourth International WWW Conference, Boston, December 1995, http://www.w3.org/pub/Conferences/WWW4/Papers/169
  18. M. Koster, Robots in the Web: Threat or Treat?, ConneXions, 9(4), April 1995.
  19. B. S. Wynar, Introduction to Cataloging and Classification, 6th ed., Libraries Unlimited, Littleton, Colorado, 1980.
  20. D. Hardy, M. Schwartz, and D. Wessels, Harvest User's Manual, University of Colorado, Boulder, USA, April 1995, http://harvest.cs.colorado.edu/harvest
  21. M. Koster, ALIWEB - Archie-Like Indexing in the Web, in Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994.
  22. R. Armstrong, D. Freitag, T. Joachims, and T. Mitchell, WebWatcher: A Learning Apprentice for the World Wide Web, 1995.
  23. G. Nicol, The Multilingual World Wide Web, 16 October 1996, http://www.ebt.com/docs/multling.html
  24. G. Mischael, Multilingual Databases and Global Internet Integration, http://www.isoc.org/isoc/whatis/conferences/inet/97/proceedings/A8/A8_2.HTM