A Web Search Engine for Indexing, Searching and Publishing Arabic Bibliographic Databases

Ibrahim A. Al-Kharashi ( kharashi@kacst.edu.sa )
Computer and Electronics Research Institute
King Abdulaziz City for Science and Technology
P. O. Box 6086, Riyadh 11442
kharashi@kacst.edu.sa
Phone: 481-3273
Fax: 481-3274

Abstract

With the recent introduction of the Internet to the Arab world, there is an increasing need for methods and tools to publish Arabic content. Although Web pages in Arabic script constitute a very small portion of the World Wide Web, tools are needed to enable non-HTML databases and textual data in Arabic script to be indexed, searched and published.

This paper introduces the basic architecture of a search engine capable of indexing, searching and displaying results through the Web. It details the internal data structure, methods for handling Arabic text, and searching capabilities. In addition, the paper emphasizes different techniques used for presenting Arabic text on non-Arabic environment systems.

The proposed engine provides two ways of locating records in a bibliographic database. The first method enables the user to browse through predefined categories (subject or alphabetical listings). The second method allows free text searching against the fielded database content. Results are displayed as a hit list Web page, and the whole record is displayed on the user's browser screen upon selection of an entry.

Keywords:

Search Engines, Arabic Language Processing, Bibliographic Databases.

Table of Contents

Introduction

Internet Search Engines

Bibliographic Database on the Internet

Arabic Language and the Internet

Arabic Language Presentation

Arabic Code Pages

Arabic Morphology

Solutions for the Internet

Data Structure for Arabic bibliographic Database

Header File

Index-Record File

Key-Post File

Directory File

Search Engine Functionality

Data preparation

Indexing

Searching

Displaying

References

Introduction

King Abdulaziz City for Science and Technology (KACST) [1] is a scientific institution established to promote research and development in science and technology in the Kingdom of Saudi Arabia. KACST hosts several scientific databases in this field. The databases include a terminology databank, a manpower database, and several bibliographic databases. Among the bibliographic databases is the national bibliographic database, which consists of more than 100,000 records. This database is physically split into an Arabic and an English database. The Arabic collection contains more than 35,000 records citing papers, books, reports, etc. written in Arabic and related to scientific development in the Kingdom of Saudi Arabia.

The bibliographic databases and other collections are processed and accessed through a proprietary, limited information indexing and retrieval system running on an IBM mainframe. The raw data fed to the indexing system is produced by a data entry system built on a relational database management system running on Unix-based hardware.

The bibliographic databases define 36 fields, most of which are used by the English bibliographic database. Only about twenty fields are commonly used by the Arabic collection. Table 1 lists the defined fields with some statistics on their usage within the Arabic collection. Figure (1) shows a simple Arabic bibliographic record from the collection.

This paper presents a search engine used to index, search and display the content of the Arabic bibliographic database on the Internet. It details the internal data structure, methods for handling Arabic text, and searching capabilities. In addition, the paper emphasizes different techniques used for presenting Arabic text on systems without an Arabic environment.

Table 1.

Figure (1) Sample Arabic bibliographic record.

Internet Search Engines

In general, major Internet search engines allow users to browse the content of the Internet using two different methods. In the first method, the search engine, known as an index-based search engine, locates and fetches all accessible Internet resources. It then extracts most or all of the useful and meaningful keywords. These processes are automated using an Internet robot. The keyword list, or index, is made available to Internet users, who search for resources using known keywords or phrases. Very large and well-known index-based search engines include Altavista [2], Hotbot [3] and Northern Light [4].

In the second method, a search engine known as an Internet directory categorizes the available Internet resources in a hierarchical manner. Such directories are updated and maintained by humans and are generally precise in presenting related information to the user. Yahoo [5] is one of the oldest and biggest directories available on the net.

Bibliographic Database on the Internet

Bibliographic databases are commonly published and accessed through library automation systems. CD-ROM is a widely used medium for publishing and distributing commercial bibliographic databases. Major players in the CD-ROM publishing business are Dialog Corporation [6] and SilverPlatter Information, Inc. [7].

Nowadays, many organizations offer their bibliographic databases through the Internet. These organizations include academic libraries, special organizations, commercial bookstores, etc. Furthermore, a special protocol, namely Z39.50 (the American National Standard Information Retrieval Application Service Definition and Protocol Specification for Open Systems Interconnection) [8], has started to serve the bibliographic community of libraries and bibliographic utilities [9].

Organizations with collections in languages other than English use different techniques to publish their collections on-line, on CD-ROM or on the Internet. As far as Arabic bibliographic collections are concerned, few techniques have been adopted. The next section briefly discusses some techniques used for handling the Arabic language in different computing environments.

Arabic Language and the Internet

Arabic, one of the Semitic languages, is written from right to left. Its alphabet has twenty-eight consonant letters, three of which (alif ا, waw و and ya ي) also serve as long vowels. Optionally, one of three short vowels (fatha, damma and kasra) can be placed after a character where ambiguity in pronunciation and/or meaning might arise. In fully vowelized Arabic text, the absence of a vowel is indicated by the sokon (silence) symbol. In certain cases, a doubled letter is replaced by a single letter with the tashdeed (strengthening) sign placed over it.

Arabic Language Presentation

In the early days of computing, processing the Arabic language required the use of romanization or transliteration techniques, in which each character in the Arabic text is substituted by one or more Latin characters [10] [11]. Then came special dumb terminals capable of handling the entry and display of Arabic (and bilingual) text. With the introduction of personal computers in the early 1980s, many solutions were introduced to support the Arabic language, including hardware-based and software-based Arabization solutions [12]. Nowadays, personal computers with advanced operating systems, such as Microsoft Windows and Macintosh OS, can support multiple languages with various fonts.

In its simplest written form, each Arabic letter occurs in up to four presentation forms depending on its position within the word (initial, medial, final or isolated). A contextual analysis algorithm is usually used to determine the shapes of printed or displayed Arabic text [13].
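The choice among the four forms can be illustrated with a minimal sketch. It is not the algorithm of [13]: it assumes unvowelized text and relies on a hand-picked set of letters that never connect to the following letter. The names `RIGHT_JOINING_ONLY`, `NON_JOINING` and `contextual_forms` are hypothetical.

```python
# Letters that connect to the preceding letter only (alef variants, waw
# with hamza, teh marbuta, dal, thal, reh, zain, waw, alef maksura).
# This set is a simplifying assumption, not a complete joining table.
RIGHT_JOINING_ONLY = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648\u0649"
)
NON_JOINING = {"\u0621"}  # hamza always stands alone


def contextual_forms(word):
    """Return the presentation form name chosen for each letter of `word`."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        # Connected to the previous letter if that letter can join forward.
        joined_prev = (
            prev is not None
            and ch not in NON_JOINING
            and prev not in RIGHT_JOINING_ONLY
            and prev not in NON_JOINING
        )
        # Connected to the next letter if this letter can join forward.
        joined_next = (
            nxt is not None
            and ch not in RIGHT_JOINING_ONLY
            and ch not in NON_JOINING
            and nxt not in NON_JOINING
        )
        if joined_prev and joined_next:
            forms.append("medial")
        elif joined_prev:
            forms.append("final")
        elif joined_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaf, teh, alef, beh: the word "kitab" (book)
print(contextual_forms("\u0643\u062A\u0627\u0628"))
```

In this example the alef takes its final form and, because it cannot connect forward, the closing beh is left isolated.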

Arabic Code Pages

Due to the lack of standardization and the fast growth of the information business in the Arab region, over 25 Arabic code pages (each defining the character set needed for Arabic information processing and exchange) were invented [14] [15]. At present, only a few code pages are in use, including the MS-Windows Arabic code page CP-1256, which became a de facto standard, and the international standard ISO-8859-6 used by the Arabized Macintosh system. With the introduction of the Unicode character set, the Universal Multiple-Octet Coded Character Set (UCS) [16], it seems that hardware vendors, software developers and users are eventually going to settle on a single code page.
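The practical consequence of competing code pages can be seen in a short sketch that encodes one word under CP-1256 and ISO-8859-6 using Python's standard codecs (`cp1256` and `iso8859_6`). The helper name `transcode` is hypothetical and is not part of the engine described here.

```python
# The word "kitab" (kaf, teh, alef, beh) as Unicode code points.
word = "\u0643\u062A\u0627\u0628"

cp1256_bytes = word.encode("cp1256")     # MS-Windows Arabic code page
iso_bytes = word.encode("iso8859_6")     # ISO-8859-6

# Same text, different byte sequences: the code page must be known
# before stored bytes can be interpreted.
print(cp1256_bytes.hex(), iso_bytes.hex())


def transcode(data, source, target):
    """Re-encode bytes from one code page to another by way of Unicode."""
    return data.decode(source).encode(target)


converted = transcode(cp1256_bytes, "cp1256", "iso8859_6")
print(converted == iso_bytes)
```

Going through Unicode as a pivot means that each additional code page needs only one mapping table rather than one per pair of code pages.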

Arabic Morphology

Morphologically, the Arabic language is very rich and is based on a root-pattern structure. Most Arabic words are generated from a finite set of roots (about 7000) that are transformed into stems using one or more patterns (about 120). In theory, a single Arabic root can generate hundreds of words (nouns, verbs, …). An Arabic word may appear in a hundred shapes in normal text through the addition of certain suffixes and prefixes (most of which correspond to stopwords in English).

Stripping affixes and normalizing words is an essential part of any information retrieval or search engine system. Normalization is done through a stemming algorithm. Linguistically, normalization of an Arabic word goes through a time-consuming process known as morphological analysis. The process has two distinct stages. In the first stage, the analyzer strips all prefixes and suffixes and reduces the word to its singular form. In the second stage, the analyzer produces the root of the word. In most practical cases, to increase the quantity of retrieved records without decreasing their quality, it is preferable to use the stem of the word rather than the root for indexing and searching [17].
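The first stage (affix stripping) can be approximated by a very small light stemmer. The sketch below is only an illustration under strong assumptions: the affix lists are short and hand-picked, the input is unvowelized, and no reduction to the singular form or root extraction is attempted. `PREFIXES`, `SUFFIXES`, `MIN_STEM` and `light_stem` are hypothetical names, not the analyzer used by the engine.

```python
# Hypothetical affix lists, longest first. A real morphological analyzer
# uses far richer rules and dictionaries.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه"]
MIN_STEM = 3  # never cut a word below three letters


def light_stem(word):
    """Strip at most one prefix and one suffix from an unvowelized word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


# "al-mu'allimun" (the teachers) loses the definite article and the
# plural ending.
print(light_stem("المعلمون"))
```

The minimum-length guard is a crude protection against cutting letters that belong to the word itself, which is the main source of error in affix stripping without a dictionary.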

Solutions for the Internet

The Internet evolved in an environment that uses the Latin alphabet for communication and processing. The need for tools to support non-Latin languages is on the rise. For the Arabic language, many solutions have been suggested for handling and processing Arabic script on the Internet, including solutions for display and browsing, processing and searching, and communication and e-mail [18] [19] [20] [21] [22].

Data Structure for Arabic bibliographic database

For each bibliographic database, a set of files must be defined, including header, index-record, key-post and directory files. The following is a detailed description of each group of files.

The header file defines basic attributes of the database and its fields. Among the attributes defined for the database are: encryption and compression flags, the code page used to encode the database text, a database stamp, several date and time stamps, phrase/directory separator symbols, and the database name and its textual description (in both Arabic and English).

In addition, the header file provides a detailed description of each bibliographic field. A database may contain up to 250 fields, which is adequate for most bibliographic usage. Each field in the database is identified by its name and its attributes. The field name holds the abbreviated and full names of the field in both Arabic and English. The field attributes are an array of bits that describes and controls the behavior of the system during indexing, searching and displaying. The attributes are organized into four groups, each briefly described below.

Field Type group: defines the data type of the field. A field can be textual, a phrase and/or a directory entry. From a textual field, all valid Arabic/English keywords (excluding stopwords) are extracted. For fields of type phrase, two or more adjacent tokens are extracted and added to the keyword list. Extracting phrases in this way is very useful when searching for abbreviated names and symbols (e.g. K.S.A for Kingdom of Saudi Arabia). Finally, a separate listing is generated for each field of type directory. This allows users to browse the content of the database by navigating through different entries such as authors, publishers, subjects, etc. For fields of type phrase and directory, the token separator symbols are defined in the header file.

Load and Index group: controls the process of creating the database and its indices. It controls whether the content of a specific field is to be included in the database (copied from the raw data file to the index-record files). It also indicates whether the content of the field is indexable and whether to exclude stopwords.

Search group: defines whether the field content is searchable and whether it is searched by default. The engine supports field search (the user can specify which field(s) to search). If no field is specified in the query, all default fields are searched.

Display group: determines the way in which the engine displays results of a given query or request. It defines what field(s) to use for displaying results of a query or request. It also controls the orientation (left to right or right to left) of the field text when displayed on the browser screen and whether to use the abbreviated or full name of the field.
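The four attribute groups lend themselves to a bit-flag representation. The sketch below shows one possible layout using Python's `enum.IntFlag`. The individual flag names and bit positions are assumptions made for illustration, since the paper does not list the actual bits.

```python
from enum import IntFlag


class FieldAttr(IntFlag):
    """Hypothetical bit layout for the four attribute groups."""
    # Field Type group
    TEXT = 1 << 0
    PHRASE = 1 << 1
    DIRECTORY = 1 << 2
    # Load and Index group
    LOAD = 1 << 3
    INDEX = 1 << 4
    SKIP_STOPWORDS = 1 << 5
    # Search group
    SEARCHABLE = 1 << 6
    DEFAULT_SEARCH = 1 << 7
    # Display group
    SHOW_IN_HITLIST = 1 << 8
    RIGHT_TO_LEFT = 1 << 9
    ABBREVIATED_NAME = 1 << 10


# A hypothetical author field: browsable, indexed, searched by default.
author = (FieldAttr.PHRASE | FieldAttr.DIRECTORY | FieldAttr.LOAD
          | FieldAttr.INDEX | FieldAttr.SEARCHABLE
          | FieldAttr.DEFAULT_SEARCH | FieldAttr.RIGHT_TO_LEFT)


def default_search_fields(fields):
    """Return the names of fields searched when a query names no field."""
    wanted = FieldAttr.SEARCHABLE | FieldAttr.DEFAULT_SEARCH
    return [name for name, attrs in fields.items()
            if (attrs & wanted) == wanted]


fields = {"AUT": author, "NOT": FieldAttr.TEXT | FieldAttr.LOAD}
print(default_search_fields(fields))
```

Packing the attributes into one integer per field keeps the header compact and lets each processing stage test only the bits it cares about.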

Index-Record files describe the layout of the actual bibliographic data in the database. For each bibliographic record in the database, an entry is defined in the Index file. The index entry gives the number of bytes in the bibliographic record and its offset in the Record file. A record in the Record file holds the content of a single bibliographic record, preceded by a fixed-length header. Each cell in this header contains the character count of one field in the bibliographic record. The text of each field is terminated by a null character. Figure (2) shows a simplified layout of the Index-Record files.

Figure (2) Layout of the Index-Record files.
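The Index-Record layout can be sketched with Python's `struct` module. The cell widths (16-bit counts, 32-bit offsets and lengths), the byte order, the three-field record and all function names are assumptions chosen for the example. A single-byte code page is assumed, so character counts equal byte counts.

```python
import struct

NUM_FIELDS = 3                                  # assumed; up to 250 allowed
HEADER = struct.Struct("!%dH" % NUM_FIELDS)     # one 16-bit count per field
INDEX_ENTRY = struct.Struct("!II")              # (offset, length in bytes)
ENCODING = "cp1256"                             # single-byte code page


def pack_record(fields):
    """Serialize a record: count header, then null-terminated field texts."""
    counts = [len(text) for text in fields]
    body = b"".join(text.encode(ENCODING) + b"\x00" for text in fields)
    return HEADER.pack(*counts) + body


def build_files(records):
    """Return (index_bytes, record_bytes) for a list of records."""
    index, data = bytearray(), bytearray()
    for fields in records:
        packed = pack_record(fields)
        index += INDEX_ENTRY.pack(len(data), len(packed))
        data += packed
    return bytes(index), bytes(data)


def read_record(index, data, record_id):
    """Fetch one record through its fixed-size index entry."""
    offset, length = INDEX_ENTRY.unpack_from(
        index, record_id * INDEX_ENTRY.size)
    raw = data[offset:offset + length]
    counts = HEADER.unpack_from(raw)
    fields, pos = [], HEADER.size
    for count in counts:
        fields.append(raw[pos:pos + count].decode(ENCODING))
        pos += count + 1                        # skip the null terminator
    return fields


records = [["\u0643\u062A\u0627\u0628", "", "1998"], ["abc", "de", "f"]]
index, data = build_files(records)
print(read_record(index, data, 1))
```

Because every index entry has the same size, record number n is located with one multiplication and one seek, without scanning the Record file.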

Key-Post files store the indexing information of a single database. As shown in Figure (3), all unique keywords and phrases (extracted from the text of the database) are stored in ascending order in the Key file. Information about the occurrences and locations of each keyword or phrase is stored in the Post file. Posting information contains the record and field ids, the word/phrase count within the field and the offset of the starting character of the keyword/phrase within the field (used for keyword/phrase highlighting during record display).

Figure (3) Layout of the Key-Post files.
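A minimal in-memory analogue of the Key-Post files is sketched below. It keeps a sorted key list that is searched by bisection and a posting list per key. The tuple layout, the whitespace tokenization and the function names are assumptions. In particular, the "word/phrase count" is interpreted here as the ordinal position of the word within the field.

```python
import bisect
from collections import defaultdict


def build_key_post(records):
    """records: {record_id: {field_id: text}} -> (sorted keys, postings)."""
    post = defaultdict(list)
    for rec_id, fields in records.items():
        for field_id, text in fields.items():
            pos = 0
            for word_no, token in enumerate(text.split()):
                offset = text.index(token, pos)   # character offset in field
                pos = offset + len(token)
                # Posting: record id, field id, word position, char offset.
                post[token].append((rec_id, field_id, word_no, offset))
    keys = sorted(post)                           # ascending, as in the Key file
    return keys, post


def lookup(keys, post, term):
    """Binary-search the sorted key list, then return the term's postings."""
    i = bisect.bisect_left(keys, term)
    if i < len(keys) and keys[i] == term:
        return post[term]
    return []


records = {
    1: {"TIT": "arabic search engine"},
    2: {"TIT": "search arabic text"},
}
keys, post = build_key_post(records)
print(lookup(keys, post, "search"))
```

The stored character offset is what allows a matched keyword to be highlighted when the full record is displayed.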

Directory files store the directory information for the database. One or more files are created for each field that is to be viewed through the directory. The directory can be a simple listing of all entries of a specific field (e.g. author listing), or a multi-level listing (e.g. subject listing).

Search Engine Functionality

The search engine provides two ways of locating records in the bibliographic database. The first method enables users to browse through predefined categories (subject or alphabetical listings). The second method allows free text searching against the fielded database content. Results are displayed as a hit list Web page, and the whole record is displayed on the user's browser screen upon selection of an entry.

Figure (4) shows the overall process of the search engine. Raw bibliographic data is first arranged into the easily accessed format described in the previous section. The extracted records are then indexed and arranged in a searchable format. User queries are processed and the results are sent back as a Web page. The following is a description of the engine's processing cycle.

Figure (4) Overall process for the search engine.

Data preparation

The original data, in the form of bibliographic records, is stored as plain text in a flat file. Each record is terminated by a record delimiter. The text of each field is preceded by a field name tag. The field name can be a three-character short name or a full name of up to 17 characters. The full name can be enclosed in double quotes if needed. The field content can span multiple lines.

During the extraction of bibliographic records, a small data structure holds the information of one complete bibliographic record at a time. That structure is processed twice: once to update the index-record files, and once to extract tokens and update the unsorted keyword/phrase file. Extraction of field text from the raw data is controlled by a set of flags defined in the header file.
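A reader for such a flat file might look like the sketch below. The paper does not give the exact syntax, so several details are assumptions made only for the example: a line containing `***` as the record delimiter, a colon between the field name tag and the text, and leading whitespace marking a continuation line. `RECORD_DELIMITER` and `parse_records` are hypothetical names.

```python
RECORD_DELIMITER = "***"    # assumed delimiter line


def parse_records(text):
    """Yield one {field_name: text} dict per record in a flat-file dump."""
    record, current = {}, None
    for line in text.splitlines():
        if line.strip() == RECORD_DELIMITER:
            if record:
                yield record
            record, current = {}, None
        elif line[:1].isspace() and current is not None:
            # Continuation line: field content may span several lines.
            record[current] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            current = name.strip().strip('"')   # full names may be quoted
            record[current] = value.strip()
    if record:
        yield record


sample = (
    'TIT: Arabic search\n'
    '  engines\n'
    '"Publication Year": 1998\n'
    '***\n'
    'AUT: Someone\n'
    '***\n'
)
for rec in parse_records(sample):
    print(rec)
```

Yielding one record at a time mirrors the idea of holding a single complete bibliographic record in memory while it is processed.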

Indexing

The content of an index-able field will be used to generate keywords and/or phrases. The process is controlled by set of attributes assigned to each character in the used code page.

Entries in the code page are divided into four groups: Arabic, English, Numeric and Control. A single entry can belong to more than one group (for instance, the period "." is classified as both numeric and control). The grouping is implemented as an attribute table with 256 entries, each corresponding to one character in the code page in use, as shown in Figure (5). This classification makes it very easy to extract tokens from plain text. The process is shown in Figure (6).

Figure (5) Attribute table.

Figure (6) Token Extraction process.
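The attribute table and the token scan can be sketched as follows. The classification rules (which characters count as Arabic, and the treatment of every unclassified entry as a control) and the handling of a trailing period are assumptions made for the example, not the contents of Figures (5) and (6).

```python
ARABIC, ENGLISH, NUMERIC, CONTROL = 1, 2, 4, 8


def build_attribute_table(encoding="cp1256"):
    """Classify each of the 256 code page entries into one or more groups."""
    table = [CONTROL] * 256
    for code in range(256):
        ch = bytes([code]).decode(encoding, errors="replace")
        if "\u0621" <= ch <= "\u0652":          # letters, kashedah, vowels
            table[code] = ARABIC
        elif ch.isascii() and ch.isalpha():
            table[code] = ENGLISH
        elif ch.isascii() and ch.isdigit():
            table[code] = NUMERIC
    table[ord(".")] = NUMERIC | CONTROL         # period is in two groups
    return table


def extract_tokens(data, table):
    """Split encoded bytes into (start_offset, token_bytes) pairs."""
    tokens, start, kind = [], None, 0

    def emit(end):
        # Drop trailing characters that are also controls (a final period).
        while end > start and table[data[end - 1]] & CONTROL:
            end -= 1
        tokens.append((start, data[start:end]))

    for i, code in enumerate(data):
        attrs = table[code]
        if start is None:
            if not attrs & CONTROL:             # pure word character opens
                start, kind = i, attrs
        elif not attrs & kind:                  # group changed: close token
            emit(i)
            start, kind = (None, 0) if attrs & CONTROL else (i, attrs)
    if start is not None:
        emit(len(data))
    return tokens


table = build_attribute_table()
text = "Vol. 3.5 \u0643\u062A\u0627\u0628 1998."
tokens = extract_tokens(text.encode("cp1256"), table)
print([(pos, tok.decode("cp1256")) for pos, tok in tokens])
```

Because the period is tagged as both numeric and control, it stays inside "3.5" but does not attach to "Vol" and is dropped from the end of "1998".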

Token extraction is followed by a normalization process. After the normalized token list is sorted, the key-post files are generated. Normalization is controlled by the following flags:

nf_upper: To convert English token to upper case.

nf_lower: To convert English token to lower case.

nf_kashedah: To remove Kashedah from Arabic tokens. Arabic word processors and text editors allow users to decorate Arabic text by inserting Kashedah (hex DC in the CP-1256 code page) between the characters of a word. The presence of this symbol in the keyword list would badly affect the sorting order of the token list and hence make searching very difficult.

nf_vowel: To remove all vowels (including SHADDAH) from Arabic token.

nf_space: To reduce multiple spaces in a phrase token to a single space.

nf_stem: To reduce Arabic/English token to its stem.

nf_root: To reduce Arabic token to its root.
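A minimal sketch of flag-driven normalization follows. It covers only the character-level flags. nf_stem and nf_root are left out because they depend on the morphological analyzer. The enum `NormFlag`, the function `normalize` and the exact range of vowel marks removed are assumptions, and the tokens are handled as Unicode strings rather than code page bytes.

```python
import re
from enum import Flag, auto


class NormFlag(Flag):
    NF_UPPER = auto()
    NF_LOWER = auto()
    NF_KASHEDAH = auto()
    NF_VOWEL = auto()
    NF_SPACE = auto()


KASHEDAH = "\u0640"                         # byte DC in CP-1256
# Tanween, short vowels, shaddah and sokon marks (assumed range).
VOWELS = re.compile("[\u064B-\u0652]")


def normalize(token, flags):
    """Apply the enabled normalization steps to one token."""
    if NormFlag.NF_KASHEDAH in flags:
        token = token.replace(KASHEDAH, "")
    if NormFlag.NF_VOWEL in flags:
        token = VOWELS.sub("", token)
    if NormFlag.NF_SPACE in flags:
        token = re.sub(r" {2,}", " ", token)
    if NormFlag.NF_UPPER in flags:
        token = token.upper()
    elif NormFlag.NF_LOWER in flags:
        token = token.lower()
    return token


# "kitab" decorated with two kashedahs and a fatha.
decorated = "\u0643\u0640\u0640\u062A\u064E\u0627\u0628"
plain = normalize(decorated, NormFlag.NF_KASHEDAH | NormFlag.NF_VOWEL)
print(plain == "\u0643\u062A\u0627\u0628")
```

The same flags would have to be applied to query terms at search time. Otherwise a decorated or vowelized query would not match the normalized keys.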

Searching

The structure of the search engine allows for two methods of accessing the bibliographic database: navigating and searching. Navigation is achieved by zooming in from one category level in a given field to a more detailed level. Navigating through the author field, for instance, starts by selecting one of the letters of the alphabet. A sub-list of author names starting with that letter is then displayed. Users can then choose to display a sub-list of authors starting with another letter, show the next sub-list for the same letter, or display the list of records associated with one of the listed authors.

The other method of accessing the database is keyword search. The structure of the key-post files allows for different types of searching, including single keyword search, Boolean search (using AND, OR and NOT), field search (the specified keyword must exist in the text of the specified field(s)), distance search (two keywords must be within a specific range of each other), and phrase search. The search result is displayed as a sub-list of short bibliographic records. The user can then request the next sub-list, display a detailed bibliographic record or submit another query.

In any type of search, it is allowed to use exact word, stem or root of a given keyword.
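The search types listed above map naturally onto set operations over posting lists. The sketch below uses a small hand-made posting table. The tuple layout (record id, field id, word position), the sample data and the function names are assumptions made for the example.

```python
# Postings: term -> list of (record_id, field_id, word_position).
POST = {
    "arabic": [(1, "TIT", 0), (2, "TIT", 1), (3, "ABS", 4)],
    "search": [(1, "TIT", 1), (2, "TIT", 0), (3, "ABS", 9)],
    "engine": [(1, "TIT", 2)],
}


def records(term, field=None):
    """Record ids containing `term`, optionally restricted to one field."""
    return {r for r, f, _ in POST.get(term, []) if field in (None, f)}


def within(term_a, term_b, distance):
    """Records where both terms occur in one field, `distance` words apart
    at most (distance search)."""
    hits = set()
    for ra, fa, pa in POST.get(term_a, []):
        for rb, fb, pb in POST.get(term_b, []):
            if (ra, fa) == (rb, fb) and abs(pa - pb) <= distance:
                hits.add(ra)
    return hits


def phrase(*terms):
    """Records where the terms appear consecutively in one field."""
    hits = set()
    for r, f, p in POST.get(terms[0], []):
        if all((r, f, p + k) in POST.get(t, [])
               for k, t in enumerate(terms)):
            hits.add(r)
    return hits


# Boolean operators become set operations: AND is &, OR is |, NOT is -.
print(records("arabic") & records("search"))
print(records("arabic") - records("engine"))
print(records("arabic", field="TIT"))
print(within("arabic", "search", 2))
print(phrase("arabic", "search"))
```

Searching by stem or root fits the same scheme: the query term is normalized the same way as the index keys before its postings are fetched.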

Displaying

Interaction between the user and the search engine is managed through a CGI-based interface. Queries and results are exchanged as HTML pages. Handling of Arabic script on the Web is detailed in [18][19][20]. The engine publishes search results as HTML pages using one of the supported code pages. The code page supported by the client machine is detected automatically.

If the client machine has no support for the code page used by the engine, the engine converts the content of the page on the fly to the client's code page and sends it. Where the client machine does not support any code page known to the engine, the results are sent to the user as an argument to a special Java applet. The Java applet manages the display of results on the client machine and handles Arabic data entry for subsequent requests.
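The three delivery paths (send as is, convert on the fly, fall back to the applet) can be summarized in a short decision function. How the engine actually detects the client's code page is not described, so the sketch simply takes a list of charset names accepted by the client as input. `ENGINE_CODE_PAGE`, `KNOWN_CODE_PAGES` and `prepare_page` are hypothetical.

```python
ENGINE_CODE_PAGE = "cp1256"
# Charset names a client might announce, mapped to Python codec names.
KNOWN_CODE_PAGES = {
    "windows-1256": "cp1256",
    "iso-8859-6": "iso8859_6",
    "utf-8": "utf_8",
}


def prepare_page(page_bytes, client_charsets):
    """Return (mode, charset, payload) for a page stored in the engine's
    code page, given the charsets the client is assumed to support."""
    accepted = [c.strip().lower() for c in client_charsets]
    # 1. The client supports the engine's own code page: send unchanged.
    for name in accepted:
        if KNOWN_CODE_PAGES.get(name) == ENGINE_CODE_PAGE:
            return "as-is", name, page_bytes
    # 2. The client supports another known code page: convert on the fly.
    for name in accepted:
        codec = KNOWN_CODE_PAGES.get(name)
        if codec is not None:
            text = page_bytes.decode(ENGINE_CODE_PAGE)
            return "converted", name, text.encode(codec, errors="replace")
    # 3. No common code page: hand the data to the applet.
    return "applet", None, page_bytes


word = "\u0643\u062A\u0627\u0628"
page = word.encode(ENGINE_CODE_PAGE)
print(prepare_page(page, ["ISO-8859-6"])[:2])
print(prepare_page(page, ["koi8-r"])[0])
```

Preferring the unchanged page avoids any loss from characters that exist in one code page but not in another.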

References

[1] King Abdulaziz City for Science and Technology, http://www.kacst.edu.sa

[2] Altavista, http://www.altavista.com/

[3] Hotbot, http://www.hotbot.com/

[4] Northern Light, http://www.northernlight.com/

[5] Yahoo, http://www.yahoo.com/

[6] Dialog Corporation, http://www.dialog.com/

[7] SilverPlatter Information, Inc., http://www.silverplatter.com/

[8] Maintenance Agency page for International Standard Z39.50, http://lcweb.loc.gov/z3950/agency/

[9] Z39.50 Client and Web Gateway Surveys, http://www.dstc.edu.au/RDU/reports/zreviews/

[10] System of Transliteration of Arabic Characters, http://arabic.wjh.harvard.edu/ref/translit/other/translit.htm

[11] Transliteration of Arabic Script, http://www.edesign.demon.co.uk/translit.htm

[12] Al-Muhtasib, H. A., Rasool, M. A., and Khayad, M. G., "A study of the Arabized Personal Computer System", Proceedings of the First King Saud University Symposium on Computer Arabization, April 1987, pp. 77-88.

[13] Al-Kharashi, I. A., "An efficient contextual analysis algorithm for Arabic text handling", The 12th National Computer Conference, King Saud University, Riyadh, Saudi Arabia, October 21-24, 1990, Vol. 2, pp. 465-473.

[14] Al-Badr, B., "Arabic Character Sets: Towards a Unified Standard", Report No. C-2, August 10, 1997, Computer and Electronics Research Institute, KACST, Riyadh.

[15] SEDCO Arabization Guide, SEDCO.

[16] International Organization for Standardization, ISO/IEC 10646-1:1993. International standard -- Information technology -- Universal multiple-octet coded character set (UCS) -- Part 1: Architecture and basic multilingual plane.

[17] Al-Kharashi, I. A., "Comparing Words, Stems, and Roots as Index Terms in an Arabic Information Retrieval System", Journal of the American Society for Information Science, Vol. 45, No. 8, September 1994, pp. 548-560.

[18] WWW Arabic utilities, http://www.ayna.com/download.html

[19] How to Read Arabic Text on the Internet, http://www.arabic2000.com/help/ehowa.html

[20] Al-Badr, Badr H., "Using the Internet in Arabic: Problems and Solutions", Inet98 Proceedings, 1998, http://www.isoc.org/inet98/proceedings/5f/5f_1.htm

[21] Al-Kharashi, I. A., "Towards an Ideal Arabic Search Engine on the Internet", The First Workshop on Internet Arabization Technique, King Saud University, 18 May 1997, pp. 15-17.

[22] Al-Kharashi, I. A., "Search Engines and Arabic Language", Kuwait Conference on the Information Highway, March 16-18, 1998.