Based on a key project of the Chinese Academy of Sciences completed in June 1998, this paper presents our solution to the above problems. The languages currently accepted are Simplified Chinese and English; Traditional Chinese (Big5) and Traditional Chinese (EUC) will be added in the near future.
Our system consists of two subsystems: an organization-based subsystem and a Web-page-based subsystem. The organization-based subsystem is developed specifically for users to find out whether a certain organization in China has its own Web site and to obtain detailed information about that organization. The Web-page-based subsystem is developed for users to search for information in Web documents.
We developed our Cspider to traverse the Web on the China Internet and ensure that nearly all Web servers, each with almost all of its pages, are represented in our database. We create cooperating, distributed Cspiders that share the work of finding things on the Web without duplicating each other's efforts.
Users conduct searches through a uniform query interface for both Chinese and English; the character encoding detection module determines the language and sends the request to the database supporting the corresponding language. A set of information retrieval functions that natively support Chinese and English searches is available: Boolean query (and, or, not), phrase query, fuzzy query, proximity query, stemming, language parsing, and spelling error override.
To remedy the problem of irrelevant results in information searching, we extract attributes and form a resource profile for each searched document. The resource profile is classified into different categories and indexed with keywords. When users submit their queries, they should also specify the categories in which the expected results are likely to fall. Moreover, a weighting algorithm is designed to determine relevance and rank results in order of relevancy.
In order to maintain up-to-date information, a time-to-live (TTL) value is set for each stored object. An object remains unchanged until it expires. The expiration of the TTL indicates that the related document must be checked again by invoking the Cspider. Periodically, the Cspider retrieves new Web information and suppresses duplicates.
Fig.1 The Web-page Based Subsystem Structure
The Cspider generates a visiting profile with the "Top" URLs, scope, search depth, and other parameters. Later on, the visiting profile can also be dynamically expanded with URLs sent from other Cspiders. The Cspider traverses Web pages in breadth-first order: it downloads the page indicated by a known URL, marks it as processed, extracts any outbound links, and caches the downloaded document for later summarization and categorization. Non-indexable files, such as pictures, sounds, PostScript, or binary data, are not processed. Taking one of the links that leads to a new page within the local scope, the Cspider repeats the above process. Before visiting a Web site, the Cspider checks the predefined scope to determine whether the site lies within the local scope. If not, the Cspider sends a request to the corresponding Cspider that is working on the right scope.
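To make the traversal concrete, the following is a minimal sketch of a breadth-first crawl; the scope check, depth limit, and helper names (e.g. cache_for_summary) are illustrative assumptions rather than the actual Cspider implementation.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from html.parser import HTMLParser

NON_INDEXABLE = ('.gif', '.jpg', '.png', '.wav', '.ps', '.zip', '.exe')

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a downloaded page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def cache_for_summary(url, page):
    pass  # placeholder for the summarizing/categorizing stage

def crawl(top_urls, local_scope, max_depth=3):
    """Breadth-first traversal of pages whose host lies inside local_scope."""
    queue = deque((url, 0) for url in top_urls)   # visiting profile: "Top" URLs
    visited = set()
    out_of_scope = []                              # to be forwarded to other Cspiders
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)                           # mark as processed
        if urlparse(url).netloc not in local_scope:
            out_of_scope.append(url)               # hand over to the right Cspider
            continue
        if url.lower().endswith(NON_INDEXABLE):    # skip non-indexable files
            continue
        try:
            page = urlopen(url, timeout=10).read().decode('utf-8', 'ignore')
        except OSError:
            continue
        cache_for_summary(url, page)               # cache for later summarization
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:                  # extract outbound links
            queue.append((urljoin(url, link), depth + 1))
    return out_of_scope
```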
When people update their Web servers, documents may be moved or deleted, the content of Web pages may change, and hypertext links may be broken. We designed a strategy to update the resource profile periodically. For each downloaded page, the information gatherer keeps the unique identifier, TTL, and other values extracted from the resource profile. The TTL value determines how often the Cspider should check the site. When the gatherer determines that a URL should be revisited, it marks that URL as needing to be traversed again. The Cspider then receives the revisiting command from the gatherer and first issues a HEAD request for the page before downloading the whole document. Some servers return the information contained in the HTTP response header, including the length in bytes and the last modification date of the resource. This information can be used to determine whether the resource profile needs to be updated. If a Cspider finds that some pages are unreachable, it informs the related search engines.
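A minimal sketch of this revisit check, assuming the server returns Content-Length and Last-Modified headers; the stored values passed in are hypothetical field names from the resource profile.

```python
from urllib.request import Request, urlopen

def needs_redownload(url, stored_length, stored_last_modified):
    """Return True if the page appears to have changed since the last visit."""
    try:
        response = urlopen(Request(url, method='HEAD'), timeout=10)
    except OSError:
        return None   # unreachable: report to the related search engines
    length = response.headers.get('Content-Length')
    modified = response.headers.get('Last-Modified')
    if modified and modified != stored_last_modified:
        return True
    if length and int(length) != stored_length:
        return True
    return False      # headers unchanged: keep the existing resource profile
```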
The Chinese language contains a large set of characters, several thousand of which are in common use. All characters can be represented by GB codes. Each Chinese character consists of two bytes, with the most significant bit of each byte set to 1. This enables us to distinguish between Chinese and English characters. There are also special characters in Chinese documents, such as punctuation, symbols, Roman letters, and digits, which are represented by GB codes as well; this means they occupy two bytes instead of one as in English documents.
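The following sketch illustrates how the high bit of each byte separates GB-encoded Chinese characters from ASCII English text; it is a simplification of the encoding detection module, not its actual code.

```python
def split_gb_and_ascii(data: bytes):
    """Split a GB2312-encoded byte string into Chinese and ASCII runs."""
    chinese, english = [], []
    i = 0
    while i < len(data):
        if data[i] & 0x80:                      # high bit set: first byte of a GB code
            chinese.append(data[i:i + 2])       # each Chinese character is two bytes
            i += 2
        else:                                   # plain ASCII byte
            english.append(data[i:i + 1])
            i += 1
    return b''.join(chinese), b''.join(english)

def looks_chinese(query: bytes) -> bool:
    """Treat a query as Chinese if any byte has its most significant bit set."""
    return any(b & 0x80 for b in query)
```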
To segment a sentence into words, we implement a procedure based on a dictionary containing general Chinese words, an idiom list, and professional word lists for special fields of study. In this procedure, the special characters in Chinese documents are used to split a sentence into sub-sentences. A Chinese stop-word list is created and used to break each sub-sentence into groups of words, and the words in a word group are sequentially scanned to match the dictionary.
For further separation of a word group into meaningful words, there are three basic methods.
a) Longest match: the word group is sequentially scanned against the dictionary, and the longest matched string is taken as a word. The remaining part of the word group is processed in the same way.
b) Shortest match: the word group is sequentially scanned against the dictionary, and the first matched string is taken as a word. The remaining part of the word group is processed in the same way.
c) Overlap match: two matched words may overlap each other across the matching boundary.
Longest match generates fewer and longer words; when a user submits a string shorter than these words, there may be no matching result. Shortest match causes another problem, mis-segmentation, and may also lead to no matching result. For instance, for a group of characters C1-C2-C3-C4-C5-C6, shortest match produces C1-C2, C3-C4, C5-C6, while the correct segmentation is C1-C2-C3, C4-C5-C6. Using overlap match, the result C1-C2, C2-C3, C3-C4, C4-C5, C5-C6 may be produced, which ensures that the correct words are not missed. In our case, we combine the shortest match and the overlap match.
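The sketch below combines shortest match and overlap match on the C1-C6 example; single letters A-F stand in for Chinese characters, and the toy dictionary and maximum word length stand in for the actual word and idiom lists.

```python
DICTIONARY = {'AB', 'BC', 'CD', 'DE', 'EF', 'ABC', 'DEF'}
MAX_WORD_LEN = 4   # assumed upper bound on word length, in characters

def shortest_match(group, dictionary=DICTIONARY):
    """From the current position, take the first (shortest) dictionary match."""
    words, i = [], 0
    while i < len(group):
        for j in range(i + 1, min(i + MAX_WORD_LEN, len(group)) + 1):
            if group[i:j] in dictionary:
                words.append(group[i:j])
                i = j
                break
        else:
            i += 1                       # no match: skip one character
    return words

def overlap_match(group, dictionary=DICTIONARY):
    """Collect every dictionary word at every position; words may overlap."""
    return [group[i:j]
            for i in range(len(group))
            for j in range(i + 1, min(i + MAX_WORD_LEN, len(group)) + 1)
            if group[i:j] in dictionary]

def segment(group):
    """Combine the two strategies."""
    return sorted(set(shortest_match(group)) | set(overlap_match(group)))

# shortest_match('ABCDEF') alone gives ['AB', 'CD', 'EF']; adding overlap match
# also yields 'BC', 'DE', 'ABC', 'DEF', so a query for 'ABC' or 'DEF' still matches.
```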
Sentences in an HTML document are not equally important. Page titles, headings, hyperlink anchor names, words in boldface, words in italics, and words in the first sentence of every list item are more important than other words or sentences in the same document. HTML tags are thus indicators of information importance. Based on HTML tags, we classify sentences enclosed by tag pairs into different importance levels and formulate a document descriptor. Page titles, headings, and hyperlink anchor names are ranked as level 1. For a short document, all words other than the above items are ranked as level 2. For a long document, the sentences enclosed by HTML tags such as 'th', 'caption', 'pre', and 'li' are also in level 2, the sentences enclosed by tags such as 'tr' and 'td' are in level 3, and so on. If a tag-enclosed sentence is within 30 words, it is kept as a whole; otherwise, the first paragraph and the first sentence of each other paragraph are kept.
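A rough sketch of this tag-based ranking follows; the exact tag-to-level mapping beyond the tags named above, and the default level for untagged text, are assumptions.

```python
from html.parser import HTMLParser

TAG_LEVEL = {
    'title': 1, 'h1': 1, 'h2': 1, 'h3': 1, 'a': 1,   # titles, headings, anchor names
    'th': 2, 'caption': 2, 'pre': 2, 'li': 2,        # level 2 (long documents)
    'tr': 3, 'td': 3,                                # level 3
}

class LevelCollector(HTMLParser):
    """Assign each text fragment the level of the innermost known enclosing tag."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.levels = {}                 # level -> list of text fragments
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass
    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # default level 4 for text not enclosed by any of the known tags
        level = next((TAG_LEVEL[t] for t in reversed(self.stack) if t in TAG_LEVEL), 4)
        self.levels.setdefault(level, []).append(text)

def document_descriptor(html):
    collector = LevelCollector()
    collector.feed(html)
    return collector.levels
```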
The resource profile consists of the following fields: URL, Identifier, Title, Update Time, TTL, Size, Language, Category, Last-visit Time, Keywords, and Document Descriptor. This format is similar to other meta-information definitions, and we will follow whatever meta-information standard evolves. The resource profiles are stored in a relational database management system (RDBMS).
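A minimal sketch of the resource profile as a record; the field names follow the list above, while the field types are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ResourceProfile:
    url: str
    identifier: str
    title: str
    update_time: str
    ttl: int                        # time-to-live, e.g. in days
    size: int                       # in bytes
    language: str                   # e.g. 'zh' or 'en'
    category: str
    last_visit_time: str
    keywords: List[str] = field(default_factory=list)
    descriptor: Dict[int, List[str]] = field(default_factory=dict)  # level -> sentences
```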
Most Web documents in China are in Chinese, but some are in English. Sometimes Web documents are bilingual (mainly English and Chinese); in this case, a document is split into as many resource profiles as there are languages used.
The structure of the descriptor field is shown as follows:
Descriptor = level 1 : level 2 : level 3 : ...
Our primary objective is to create a more useful system for information retrieval by narrowing the search scope to a set of specific Web pages related to a certain type of knowledge. Our method is to classify the collected information into main categories, which are defined by a number of experts from cataloging, digital libraries, and other related fields.
An intelligent categorizing module is developed to classify Web documents into particular categories based on the knowledge base. It extracts a list of significant phrases and keywords from each document, using different techniques depending on the language of the document, and then compares the extracted high-quality keywords with the predefined set of representatives of each particular category.
Benefiting from the summarization results, we obtain sentences sorted by their importance in the descriptor field of the resource profile. For Chinese resource profiles, the document descriptor is first segmented into white-space-separated words by the word segmentation module. The intelligent module then assigns each word a weight based on its occurrence frequency and importance level. Words with high weight values are saved in the resource profile as keywords.
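The sketch below illustrates one way the frequency and level information could be combined; the level multipliers and the cut-off are assumptions, since the exact weighting formula is not fixed here.

```python
from collections import Counter

LEVEL_WEIGHT = {1: 3.0, 2: 2.0, 3: 1.0}     # assumed multipliers per importance level

def extract_keywords(descriptor, top_n=10):
    """descriptor maps an importance level to a list of already segmented words."""
    scores = Counter()
    for level, words in descriptor.items():
        for word in words:
            scores[word] += LEVEL_WEIGHT.get(level, 0.5)   # frequency weighted by level
    return [word for word, _ in scores.most_common(top_n)]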
The knowledge base consists of two parts: representatives of each category and rules. The complexity of the knowledge base depends on the number of categories.
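As a rough illustration, the following sketch compares extracted keywords with category representatives by simple set overlap; the category names, representative sets, and scoring rule are placeholders, not the actual knowledge base or its rules.

```python
CATEGORY_REPRESENTATIVES = {
    'computer science': {'network', 'database', 'algorithm', 'software'},
    'education': {'university', 'course', 'student', 'admission'},
}

def categorize(keywords, representatives=CATEGORY_REPRESENTATIVES):
    """Return the category whose representative set overlaps the keywords most."""
    best, best_score = None, 0
    for category, reps in representatives.items():
        score = len(reps & set(keywords))
        if score > best_score:
            best, best_score = category, score
    return best
```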
(1) User submits a query, with search options such as preferred categories, to the query server.
(2) The encoding detection module is invoked to detect the language of the query.
(3) The query is reformulated by the query server and passed to appropriate search engines.
(4) The search engines work in parallel and return ranked results to the query server. Each result is hypertext-linked to the Web site that contains the expected document.
(5) The query server reorganizes the results and returns them to the user.
Fig.2 The structure of the inverted index
Thus, when a user submits a query containing the word "network", the corresponding resource profiles are retrieved; the term frequencies (tf) are used in calculating the ranking score, while the positions are used to determine whether consecutive query terms constitute a phrase.
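A minimal sketch of such an inverted index with positions follows; the data layout is an assumption, but it illustrates how term frequencies and positions support phrase checking.

```python
from collections import defaultdict

def build_index(documents):
    """documents maps a document id to its list of index objects (words)."""
    index = defaultdict(dict)                      # term -> {doc_id: [positions]}
    for doc_id, words in documents.items():
        for position, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(position)
    return index                                   # tf = len(index[term][doc_id])

def phrase_hits(index, terms):
    """Documents in which the query terms occur contiguously, in order."""
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, positions in index[terms[0]].items():
        for p in positions:
            if all(doc_id in index[t] and p + k in index[t][doc_id]
                   for k, t in enumerate(terms)):
                hits.add(doc_id)
    return hits
```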
Generating index objects differs for Chinese and English. The tasks of the English-specific module for generating index objects are relatively easy. First, English words are reduced to a canonical stem by removal of suffixes. There are also lists of "noise words" (or "stop words") and "noise stems" for English. Words in the resource profile are compared exactly against the noise words and, after stemming, against the noise stems; a word is ignored if a match occurs. Thus common invariant words and common stems can be excluded. Index objects are converted to lower case, except for words with all letters capitalized.
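The sketch below illustrates this English pipeline with a toy suffix-stripping stemmer and placeholder noise lists; the actual stemming rules and word lists are not reproduced here.

```python
NOISE_WORDS = {'the', 'a', 'an', 'of', 'and', 'or', 'to', 'in'}   # placeholder stop words
NOISE_STEMS = {'thing', 'use'}                                     # placeholder noise stems
SUFFIXES = ('ing', 'edly', 'ed', 'es', 's', 'ly')                  # crude suffix removal

def stem(word):
    """Reduce a word to a rough canonical stem by stripping a known suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def english_index_objects(words):
    objects = []
    for word in words:
        if word.lower() in NOISE_WORDS:                    # exact match against noise words
            continue
        token = word if word.isupper() else word.lower()   # keep all-caps words as-is
        token = stem(token)
        if token in NOISE_STEMS:                           # match stems against noise stems
            continue
        objects.append(token)
    return objects
```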
Unlike English, Chinese words are not separated by spaces, so there are inherent difficulties in generating index objects. The word segmentation module is developed to handle this hard task. Using word segmentation techniques, we generate index objects from the descriptor field of the resource profiles. Both words and single characters can be indexed.
The query strings can consist of phrases in any of the supported languages, as well as Boolean operators (AND, OR, NOT). For a phrase search (e.g. computer networks), the system searches for all resource profiles containing all of the constituent words (e.g. computer, networks). The retrieved documents are then checked for adjacency of these words.
A fault-tolerant search mechanism using an algorithm developed by V. I. Levenshtein is employed to support spelling error override searches for English information. Some other searches, such as fuzzy searches and proximity searches, are also supported.
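A minimal sketch of the Levenshtein edit distance underlying the spelling error override is given below; the acceptance threshold is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def fault_tolerant_lookup(query_term, index_terms, max_distance=1):
    """Accept index terms within a small edit distance of the (possibly misspelled) query."""
    return [t for t in index_terms if levenshtein(query_term, t) <= max_distance]
```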
The subsystem structure is depicted in Fig. 3. The extracting module picks up information contained in the domain name registration forms submitted by applicants, such as the organization's English name, Chinese name, brief introduction, contact information, and responsible person information. The detecting module decides whether a Web site associated with a domain name exists by trying to connect to hosts whose names are generated from the domain name, such as www.domainname or web.domainname. The registration information database consists of the information provided by both the extracting and detecting procedures.
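A minimal sketch of the detecting module's connectivity test, assuming the conventional host prefixes named above and a plain TCP connection attempt; the port and timeout are assumptions.

```python
import socket

def website_exists(domain, prefixes=('www', 'web'), port=80, timeout=5):
    """Return the first reachable host name generated from the domain, or None."""
    for prefix in prefixes:
        host = f'{prefix}.{domain}'
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue
    return None
```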
Fig.3 The structure of the organization based subsystem
The search engines employ the same indexing, retrieval, and word segmentation technologies described for the Web-page based subsystem to perform searches on the pre-created index and return results to the query server. The results contain the URL pointing to the organization's Web site, the organization's name, and contact information. The search engines update their index every day, as the registration information database is updated daily.
The query server is responsible for analyzing the query: if a user submits a query by domain name, the query server passes the query directly to the specified search engine; if a user submits a query by keywords or phrases, the query server invokes the encoding detection module to detect the language and then passes the query to the appropriate search engine.
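A rough sketch of this routing logic; the engine names and helper functions are illustrative assumptions.

```python
def detect_encoding(query: bytes) -> str:
    """Treat the query as Chinese if any byte has its most significant bit set."""
    return 'zh' if any(b & 0x80 for b in query) else 'en'

def route_query(query: bytes, by_domain_name: bool):
    if by_domain_name:
        return dispatch('domain-name-engine', query)     # passed through directly
    return dispatch(detect_encoding(query), query)       # Chinese or English engine

def dispatch(engine: str, query: bytes):
    """Placeholder for forwarding the query to the named search engine."""
    return engine, query
```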
Furthermore, we will augment our system with Natural Language Processing (NLP) capability. An NLP system can make text segmentation more accurate, make categorization more intelligent, and return more relevant references to users.