INET Conferences
Using the Internet in Arabic: Problems and Solutions

Badr H. AL-BADR <badr@kacst.edu.sa>

Abstract

This paper addresses the support required for Arabic on the Internet in the fields of content, transport, client processing, and server processing. The problems in each category are discussed and the solutions are surveyed, with the new Internet protocols that facilitate using Arabic on the Internet being taken into consideration. One of the major problems that faces the use of Arabic is the plurality of character sets. Transporting Arabic text over the Internet is problematic because of its non-ASCII character sets. Major among the client-processing issues is display of Arabic text. The display features of Arabic text set it apart from other languages in several ways. These features necessitate specialized text-display algorithms. One of the most important server-processing issues for Arabic text is the problem of search and indexing. These operations are more involved in Arabic than in many other languages. Solutions have started to emerge with browsers and mail programs building on new Internet standards such as Multipurpose Internet Mail Extensions (MIME) and HTML 4.0. The trend towards Unicode also helps the exchange of Arabic on the Internet. The paper concentrates on the issues of character sets, bidirectional display, Arabic e-mail, Arabic Web browsing, and search and indexing of Arabic text on the Internet.
Introduction

For the Internet to be truly international, it must support the diverse languages of the world. Arabic, a language spoken by millions of people worldwide, is being increasingly used on the Internet, despite the obstacles it faces.

A major obstacle facing Arabization of the Internet is the lack of standards, particularly in the field of character sets. The Internet is a heterogeneous environment composed of different configurations of hardware and software (transport and network equipment). Standards are the way to get the different parties on the Internet to agree on how to format and exchange information.

Other obstacles were uncovered by a survey conducted at the beginning of 1997 on the perceived problems facing Arabic-speaking Internet/intranet users [1]. Users ranked the obstacles to greater Internet usage in the Arab world in the following order: weak telecom infrastructure, lack of Arabic content on the Internet, and lack of Arabic Internet access programs for the Web and for e-mail. (Most of the survey respondents were in Saudi Arabia.)

The support required for Arabic on the Internet can be categorized in the fields of content, transport, client processing, and server processing. A certain level of support is required in each category. The support required is not all unique to Arabic. In fact, internationalization (i18n) is an active field of research in Internet technology.

Arabic content (textual content, to be specific) relates to representing the data itself (using character sets) and to formatting it. Formatting is specified by Internet standards such as HTML in the case of World Wide Web pages, and RFC 822 and MIME in the case of e-mail messages. The transport protocol is HTTP (HyperText Transfer Protocol) for the Web and SMTP (Simple Mail Transfer Protocol) for e-mail.
Client processing includes generating, displaying, and interacting with Arabic text, while server processing includes storing, processing, searching, and providing Arabic content. Most of these issues are addressed below.

One of the major problems facing the use of Arabic is the plurality of character sets. Transporting Arabic text over the Internet is problematic because of its non-ASCII character sets.

Chief among the client-processing issues is the display of Arabic text. The display features of Arabic text set it apart from other languages in several ways. Arabic text is cursive, and the shapes of its characters depend on their position in the word. Most Arabic characters connect to one another when they are written in the same word. The directionality of Arabic text is also peculiar: while Arabic text is written right-to-left, Arabic numbers are written left-to-right. This feature, together with the frequent everyday need to combine Arabic and Latin text on the same line, necessitates the handling of bi-directional text. These features affect the display of Arabic text in mail programs and Web browsers.

One of the most important server-processing issues for Arabic text is the problem of search and indexing. These operations are more involved in Arabic than in many other languages.

The representation and transport problems are external to Arabic, meaning that they are not related to the features of Arabic text. Rather, they are byproducts of Internet protocols originating in the Western world, which uses Latin characters. These problems are shared with many other languages of the world. The display problems, by contrast, do relate to the features of Arabic, but they do not affect the transport of Arabic text.

Solutions have started to emerge with browsers and mail programs building on new Internet standards (such as MIME). The trend toward Unicode also helps the exchange of Arabic on the Internet.
Other interim solutions are frequently used, such as encoding text as graphics and relying on ad hoc rules in Web servers to guess the Arabic capabilities of browsers and send information accordingly. The trend set by the Internet standard setters toward internationalization of Internet protocols is also very encouraging (e.g., [2]).

Character sets

The character set plays a major role in any information processing or exchange system. Text, the building block for human-understandable information, is encoded on computers and transmitted through networks in the form of integers, because computers at their lowest levels can deal only with numbers. The character set serves as the table for conversion between the textual and numeric forms.

The term "character set" is often used to mean different things. The term and its related terms are defined in [3].
Knowing the name of the character set used is necessary for correct transport (or encoding, if needed) and for decoding the text at the other end. Completely specifying the parameters of a textual transmission requires both (1) a set of labels for specifying the character set, encoding scheme, and transfer syntax used, and (2) a technique for attaching these labels to the data. The labels are typically registered with the Internet Assigned Numbers Authority (IANA). Specifying the character set can be done through MIME headers, which will be shown below.

Arabic character sets

Here we review the history of Arabic character sets. In 1981, CUDAR-U appeared as the first standard Arabic character set (it used 7 bits per character). In 1982, the Arab Standards and Metrology Organization (ASMO) produced its first character set standard, ASMO-449 (7 bits). It became the basis for all subsequent standard sets; in that respect, its role is similar to that of ASCII for Latin characters. In 1986, ASMO standard 708 appeared (8 bits) and became the international standard ISO-8859-6 [4]. Since then, it has gained widespread acceptance and was used particularly in the Arabized Macintosh system.

In the 1980s, more than 20 Arabic code sets coexisted, most of which were p-code sets. With the spread of personal computers and the MS-Windows operating system in the 1990s, Microsoft's MS-Windows Arabic code page (MSCP-1256) became almost a de facto standard (a situation not unique to Arabic!). Microsoft opted not to use the standard 8-bit character set and developed its own to allow simultaneous use of Arabic and French and the use of display control characters.

Unicode

It is believed that the Universal Character Set (UCS, Unicode) has the potential to solve the problem of the plurality of character sets [5]. This position is supported by the following reasons: It is the strategic direction of major software and Internet developers.
It is also promoted on the Internet as the character set of choice in new protocols, while older protocols can use its various encodings. Finally, a study on the suitability of the Unicode representation of Arabic found that it is suitable for this task [6].

The Universal Character Set was developed jointly by the International Organization for Standardization (ISO-10646 [7]) and the Unicode Consortium [8]. The most important feature of this set is that it uses 32 bits to code virtually all characters of the world, and it codes over 35,000 characters. The Arabic coding in Unicode coincides with the ASMO-449 code page. UCS has various transfer encoding syntaxes, such as UCS-4 (32 bits per character), UCS-2 (16 bits), UTF-16 (multiple 16 bits), UTF-8 (multiple 8 bits), and UTF-7 (multiple 7 bits).

Unicode has detailed character property tables and algorithms (e.g., bi-directional text display), which are particularly suited for Arabic. Further, it provides characters for text directionality.

Arabic display issues

Rendering Arabic text

When discussing the display of characters, it is important to distinguish between the characters themselves and their visual representations, called glyphs. A character is an abstract letter, whereas a series of characters is visually represented as a series of glyphs. This is particularly important in Arabic, where the shapes of characters depend on the context.

To display Arabic text correctly, a context analysis program is needed to select the right shape of a character (glyph) depending on the context. The context is not necessarily the preceding and following characters only. Arabic script is highly decorative, and many ligatures (a single glyph for multiple characters) are used, especially in stylized fonts. This implies that Internet client programs that display Arabic (such as Web browsers) must employ contextual analysis or rely on an underlying operating system to do so.
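
The distinction between a character and its glyphs can be made concrete with the Unicode character database. The sketch below is illustrative only and is not a context analysis program: it assumes Python's standard `unicodedata` module and relies on the compatibility block Arabic Presentation Forms-B (U+FE70 to U+FEFF), in which each positional shape of a letter carries a decomposition such as `<initial> 0647`. The function name `presentation_forms` is hypothetical.

```python
import unicodedata

# Arabic Presentation Forms-B: compatibility code points, one per positional glyph.
FORMS_B = range(0xFE70, 0xFF00)


def presentation_forms(char):
    """Map a shape name (isolated/final/initial/medial) to the glyph code point of `char`."""
    target = f"{ord(char):04X}"
    forms = {}
    for code_point in FORMS_B:
        # Decompositions look like "<initial> 0647"; unassigned code points give "".
        parts = unicodedata.decomposition(chr(code_point)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(code_point)
    return forms


HEH = "\u0647"  # ARABIC LETTER HEH, a single character with several glyphs
for shape, glyph in presentation_forms(HEH).items():
    print(f"{shape:9} U+{ord(glyph):04X} {unicodedata.name(glyph)}")
```

A real renderer would choose among these shapes by examining the neighbouring letters (and would also handle ligatures), which this lookup does not attempt.
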
Finally, Arabic has a number of diacritic marks that are written above and below the characters to aid in pronunciation. The diacritics must naturally be displayed in their proper places.

Displaying bi-directional text

Arabic text is stored in logical (reading) order. Before it is displayed, it must be reordered correctly on the screen. This is an important issue because most computer systems are designed to display text left-to-right, and also because bi-directional text must often be displayed on the same text line (e.g., Arabic words and numbers).

Unicode defines a direction property for each character and provides a text directionality algorithm for the display of bi-directional text [8]. The directional property of Arabic and Hebrew characters is strong right-to-left, while the characters of other languages are strong left-to-right. The text directionality algorithm uses a set of directional ordering codes to influence the ordering of text. These codes are used for embedding one language into another (e.g., RLE) and for overriding the default direction of text (e.g., RLO). The algorithm is rather involved, so the details are left out.

Arabic e-mail

The specification of electronic mail on the Internet has two major components: mail transport and mail message format. Mail transport over the Internet is governed by the Simple Mail Transfer Protocol (SMTP) [9], which is an application-level protocol that runs over TCP, and by the newer Extended SMTP [10]. The format of Internet e-mail messages is specified in "RFC 822: Standard for the Format of ARPA Internet Text Messages" [11] and is updated by the MIME standard [12].

Both the transport and message format standards hinder the exchange of Arabic e-mail messages. The SMTP standard stipulates the transfer of ASCII-text messages and, in fact, older implementations enforce this. This means that non-ASCII (8-bit) characters are not guaranteed to be transported to their destination.
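
The risk to 8-bit text can be illustrated with a small sketch. It assumes, purely for illustration, a relay that enforces 7-bit transport by clearing the eighth bit of every byte; this is one possible behavior, not something the SMTP specification prescribes, and the helper `strip_high_bit` is hypothetical.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a hypothetical 7-bit-only relay that clears bit 8 of every byte."""
    return bytes(b & 0x7F for b in data)


text = "\u0633\u0644\u0627\u0645"       # the Arabic word "salam" (peace)
wire = text.encode("iso-8859-6")        # one byte per letter, every byte >= 0x80

damaged = strip_high_bit(wire)
print(wire.hex(" "))                    # d3 e4 c7 e5
print(damaged)                          # b'SdGe'
```

Every Arabic letter in ISO-8859-6 lies in the upper half of the 8-bit range, so under this assumption the whole word arrives as unrelated ASCII letters.
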
The message format standard RFC 822 specifies that a message has two main parts: the header and the body. The body is composed of lines, each consisting of 1,000 or fewer 7-bit US-ASCII characters. The header is composed of lines, each of which is a line of printable ASCII characters with the general format "field-name: field-body <CRLF>". An example of a header field is:

    Date: 13 Feb 88 1429 EDT

So the two problems in exchanging Arabic e-mail are (1) correctly transporting Arabic messages that are encoded using 8 bits, and (2) specifying the language and character set used in a particular message, since transporting e-mail does not involve a prior exchange of information about content (as happens in HTTP).

These problems are both solved by the MIME standard. MIME allows labeling and structuring message contents using RFC 822 headers, by introducing a new set of header fields that are added to the message header. By so doing, it allows the sending of binaries and non-ASCII text through e-mail by encoding them in ASCII. Further, it facilitates specifying the character set used in a message.

The MIME facilities important to our discussion are (1) encoding the message in 7 bits so that it is transported safely, (2) labeling the character set used in the message body, and (3) labeling the character set used in the message header. These facilities are discussed next.

Encoding message body

Using MIME, 8-bit content in the message body can be encoded using 7 bits. The transfer encoding syntax is specified in the header field "Content-Transfer-Encoding," which can take on the values 7bit, 8bit, binary, quoted-printable, and base64.
Base64 is a transfer encoding syntax that represents groups of 24 input bits as output strings of four encoded characters. The encoded characters are drawn from an alphabet of 64 ASCII characters. This encoding increases text size by about 33%. The following is an example of the header field of a message whose body is encoded in base64:

    Content-transfer-encoding: base64

Indicating body character set

Using MIME, the content of the message body is labeled using the special header field "Content-Type," which has, as a parameter, the character set specification field "charset." The following is an example of the header field of a message whose body uses the ISO-8859-6 character set:

    Content-Type: text/plain; charset=ISO-8859-6

Indicating header character set

Using MIME, the message header can contain non-ASCII text by using inline labeling [13], whose format is:

    "=?" charset "?" encoding "?" encoded-text "?="

An example of a non-ASCII header is:

    Subject: =?ISO-8859-6?B?SWYgeW91IGNhbiB=?=

where the encoding "B" refers to base64.

The names of character sets that are used in MIME headers must be registered with IANA [15]. The registered character set names for Arabic include: ISO-8859-6 (ASMO-708), ISO-9036 (ASMO-449), Windows-1256, and ISO-10646 (Unicode).

Extended SMTP

Going back to the transport of messages, Extended SMTP (ESMTP) [10] improves on SMTP by allowing the transport of 8-bit text through the "8BITMIME" extension. However, both sides must use ESMTP and must negotiate first. It might also be necessary to MIME-encode the message first.

Obviously, mail clients need to be MIME-compatible to benefit from the above-mentioned facilities. In fact, MIME is now used in most e-mail clients and Web browsers. In addition to supporting MIME, Arabized e-mail clients must be able to display Arabic text.
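
The three MIME facilities described above can be sketched together as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it uses Python's standard `email` package, the helper names `encoded_word` and `build_message` are hypothetical, and the sample text is invented for the example.

```python
import base64
from email.header import decode_header
from email.mime.text import MIMEText

CHARSET = "iso-8859-6"


def encoded_word(text, charset=CHARSET):
    """Build an inline-labeled header value of the form =?charset?B?base64-text?="""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset.upper()}?B?{payload}?="


def build_message(subject, body):
    """Return a MIME message whose body and Subject are labeled as ISO-8859-6."""
    # For this charset the standard library selects base64 as the body encoding.
    msg = MIMEText(body, "plain", CHARSET)
    msg["Subject"] = encoded_word(subject)
    return msg


greeting = "\u0645\u0631\u062d\u0628\u0627"  # the Arabic word "marhaba" (hello)
message = build_message(greeting, greeting)
print(message.as_string())                   # pure 7-bit ASCII on the wire

# The receiving side reverses both encodings using the attached labels.
raw, label = decode_header(message["Subject"])[0]
print(raw.decode(label) == greeting)
```

The printed message contains only ASCII characters, yet the charset labels in the Content-Type field and in the Subject field let the receiver recover the original Arabic text.
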
Several Arabized e-mail clients are now available, including: Sindbad from Sakhr, which is an Arabization layer for Netscape Navigator; Tango from Alis, which supports many languages simultaneously; and Exchange from Microsoft.

Arabic web browsing

The specification of the World Wide Web (WWW) system has two main components: the page transfer protocol (HyperText Transfer Protocol, HTTP) and the page description language (HyperText Markup Language, HTML). HTTP is an 8-bit clean protocol, meaning that it allows the transport of Arabic pages in 8-bit character sets. The major issues surrounding the use of Arabic on the Web are the labeling of the character set used and the marking up of Arabic pages in HTML.

The internationalization of HTML standard [16] introduced many new features that facilitate the use of Arabic on the Web. These features are now incorporated in HTML 4.0 [17], which is based on Unicode and is a W3C recommendation at the time of this writing. The new internationalization features relevant to Arabic include: (1) indicating the character set, (2) tagging of language, (3) markup of bi-directional text, and (4) controlling cursive joining behavior. These features are discussed next. This section concludes with a discussion of alternate Web Arabization techniques.

Indicating character set

Indicating the character set of a document may now be performed in three ways, described here in increasing order of priority. First, it can be specified in the "charset" attribute of the "A" (anchor) element, as in the following example:

    <A HREF="doc.html" CHARSET="ISO-8859-6"> ... </A>

which specifies that the document "doc.html" uses the character set "ISO-8859-6."

The second way is to use the "META" element in the HTML document header with the MIME-like content-type header, as in the following example:

    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-6">

which specifies that the HTML document uses the "ISO-8859-6" character set.
The third way is to specify the character set in the HTTP header sent ahead of the document [18], using MIME tags, as in the following example:

    Content-Type: text/html; charset=ISO-8859-6
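
The priority order among the three labeling methods can be sketched as a small resolution function. This is an illustration under stated assumptions rather than the behavior of any particular browser: the names `charset_of` and `effective_charset` are hypothetical, and the parsing of the Content-Type value borrows Python's standard `email.message.Message`, since the HTTP and META labels share the MIME syntax.

```python
from email.message import Message


def charset_of(content_type):
    """Extract the charset parameter from a Content-Type value, or None if absent."""
    holder = Message()
    holder["Content-Type"] = content_type
    return holder.get_content_charset()


def effective_charset(link_charset=None, meta_content=None, http_content_type=None):
    """Pick the label with the highest priority: HTTP header, then META, then link."""
    for content_type in (http_content_type, meta_content):
        if content_type:
            charset = charset_of(content_type)
            if charset:
                return charset
    return link_charset.lower() if link_charset else None


print(effective_charset(
    link_charset="windows-1256",
    meta_content="text/html; charset=ISO-8859-6",
    http_content_type="text/html; charset=UTF-8",
))
```

When all three labels are present and disagree, as in the call above, the HTTP header wins; removing it lets the META label apply, and the link attribute is consulted only as a last resort.
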
Language tagging

Language tagging is different from character set specification. It helps in performing high-level operations such as searching, sorting, hyphenation, and spell checking. Specifying the language of a text block is done through the new "lang" attribute, which can be part of most HTML elements, including the new "span" element, as in the following example:

    <span lang=ar>... Arabic text ...</span>

The language codes used with this attribute are defined in [19].

Bi-directional text markup

The HTML bi-directional specifications promote the use of the Unicode directional text display facilities. They stipulate that if the Web client (browser) claims to display bi-directional text, then it must use the Unicode algorithm. Text directionality is encoded in the directional property of the characters. Yet additional directional markup is specifically needed for direction-neutral text and for tables. HTML offers higher-level markup constructs to control text direction, which have a function identical to that of Unicode's direction characters.

An example where bi-directional text may require additional markup is that of neutral characters, such as determining the position of a double quote when it sits between an Arabic and a Latin letter. For that, two marks are defined: the left-to-right mark "&lrm;" and the right-to-left mark "&rlm;", which are invisible characters with no other effect.

The direction attribute "dir" indicates the base directionality of the text and can take either of the two values "LTR" or "RTL." It is attached to block-type elements such as <HTML>, <P>, <LI>, and <TD>, and sets the default value of the "ALIGN" attribute as well. It also affects the correct placement of bullets and aids in setting up bilingual tables. An example of an Arabic table cell is:

    <TD lang="ar" dir="rtl">... Arabic text...</TD>

Cursive joining behavior

To mark up unusual cases for cursive text, HTML offers the zero-width joiner "&zwj;" and the zero-width non-joiner "&zwnj;".
Here is the description from the HTML 4.0 [17] specification:

    The zwnj entity is used to block joining behavior in contexts where joining will occur but shouldn't. The zwj entity does the opposite; it forces joining when it wouldn't occur but should. For example, the Arabic letter "HEH" is used to abbreviate "Hijri", the name of the Islamic calendar system. Since the isolated form of "HEH" looks like the digit "five" as employed in Arabic script (based on Indic digits), in order to prevent confusing "HEH" as a final digit five in a year, the initial form of "HEH" is used. There is no following context (i.e., a joining letter), however, to which the "HEH" can join. The zwj character provides that context.

Forms

When filling in a form, the Web client needs to know what character set the server will accept. The "ACCEPT-CHARSET" attribute, which attaches to the "FORM" element, specifies the list of character encodings for input data that are accepted by the server processing this form. As an example:

    <FORM ACTION=... ACCEPT-CHARSET="ISO-8859-6, UCS-2">

There is yet to be an effective method for specifying the character set used in a filled-in form.

Other Arabization techniques

The confusion in the character set scene and the complexity of displaying Arabic text have restricted the growth of Web use in Arabic. Web publishers have resorted to creative techniques to overcome these problems.

Text as image techniques

The first solution that gained wide use was presenting text as graphics. Pages of text were converted to images, which could be displayed by most Web browsers. Needless to say, this solution suffers from many disadvantages, most notably the huge increase in page size (almost two orders of magnitude), which means slower page download and display. Also, it is not possible to perform text operations such as search, selection, copying, or editing on the text images, unless the image undergoes character recognition, which is not necessarily accurate.
This is in addition to the burden it places on the publisher of the pages in terms of increased storage space and complex publishing procedures.

A slight improvement using similar technology was to use individual character images and construct words and sentences from them instead of making an image for the whole text block [20]. The number of different images that may be used is limited to the number of Arabic character shapes (note that an Arabic character can have multiple shapes). This means that contextual analysis is performed to select the appropriate character images. This method reduces the download time needed, because the character images are stored in the browser and need not be downloaded each time. This method does not address, however, the need to edit the text and perform text operations such as searching and indexing.

Other publishers took another route by providing multiple versions of the same texts, such as an image version and text versions in different character sets.

Proxy conversion

Still another proposed solution is to have the Web server deduce the characteristics of the client (browser), particularly the supported character set, and then supply a version of the page that is suitable for the client. Such a solution is proposed in [21]. There, a proxy Web server translates the character set of Web pages on the fly. The pages could be stored in Unicode and converted on the fly to the required character set.

Java applet

A solution based on Java is to place the intelligence needed to display Arabic in a Java applet and to let it manage the display of Arabic text in the browser. This means having a Java applet running, which adds computation overhead and also limits text-processing options.

Automatic detection

Another technique that is used particularly in browsers is to add intelligence to infer the character set of a Web page by analyzing its contents, as in the Sindbad browser from Sakhr.
The browser then switches to the inferred character set and displays the page for the user. This involves some heuristics but appears to be relatively accurate, as there are only two major Arabic character sets to worry about. A similar technology is used in the Alta Vista search engine [22] to infer the language of a page. It is currently capable of inferring the language from a set of 25 languages (excluding Arabic).

In summary, to be Arabized the browser must have the ability to select a language, select a character set, load a font, accept Arabic input, and display bi-directional text. Browsers that provide these capabilities to different degrees are now available, including: Sindbad from Sakhr, which is an Arabization layer for Netscape Navigator; Tango from Alis, which supports many languages simultaneously; and Explorer from Microsoft.

Arabic search and indexing

With the huge amount of information on the Internet, search and indexing tools are crucial for locating specific resources and organizing information. The search and indexing of Arabic text is more involved than in other languages such as English. The paper by Al-Kharashi [23] provides extensive coverage of this topic.

Arabic is a derivational language in which words are derived from a root. Searching and indexing Arabic text (e.g., for searching the Web) must rely on the root of a word and not merely on its final form. Further, the same word can have more than 100 combinations of prefixes and suffixes, which in English would be separate preceding stop words such as "with" and "for." These numerous derived words, although related in meaning, are not necessarily consecutive in a word index, implying that a word and its derivations could have many entries spread throughout an index. Search systems for Arabic need to employ morphological analysis, which is an involved process and has its limitations. The large number of synonyms of Arabic words intensifies this problem greatly.
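
The effect of attached prefixes on an index can be sketched as follows. This is a deliberately naive prefix-stripping illustration, not the morphological analysis discussed above and not the method of [23]: the tiny prefix inventory, the `min_stem` threshold, and the function names are all assumptions made for the example, and the result is a surface form ("kitab"), not the root.

```python
from collections import defaultdict

WAW, FEH, BEH, LAM, KAF, ALEF = "\u0648", "\u0641", "\u0628", "\u0644", "\u0643", "\u0627"

# Assumed, deliberately tiny inventory: conjunctions, prepositions, definite article.
PREFIX_STAGES = ((WAW, FEH), (BEH, LAM, KAF), (ALEF + LAM,))


def strip_prefixes(word, min_stem=4):
    """Remove at most one prefix per stage while at least `min_stem` letters remain."""
    for stage in PREFIX_STAGES:
        for prefix in stage:
            if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
                word = word[len(prefix):]
                break
    return word


def build_index(documents):
    """Map each stripped word form to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[strip_prefixes(word)].add(doc_id)
    return index


KITAB = "\u0643\u062a\u0627\u0628"          # "kitab" (book)
documents = {
    1: ALEF + LAM + KITAB,                  # "al-kitab" (the book)
    2: WAW + ALEF + LAM + KITAB,            # "wal-kitab" (and the book)
    3: WAW + BEH + ALEF + LAM + KITAB,      # "wa-bil-kitab" (and with the book)
}
index = build_index(documents)
print(sorted(index[KITAB]))                 # [1, 2, 3]
```

Without the stripping step, the three documents would appear under three different index entries. A heuristic this simple also misfires on words whose first letters merely resemble a prefix, which is one reason fuller morphological analysis is needed.
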
Further, Arabic has a large number of combined word expressions. To search for such an expression, one needs to use a logical operator such as "and" or "near." In addition, current usage of Arabic includes many foreign words that are written in Arabic letters. These foreign words can be written in different ways, which leads to difficulties in retrieving them.

A few search and indexing systems that can handle Arabic text are currently available on the Internet. Some of the standard search engines on the Internet, such as Alta Vista [22] and Infoseek [24], allow entering Arabic keywords and searching for Arabic documents (assuming the use of an Arabized browser), although Arabic is not officially supported by them. Specialized Arabic site indexes of the Internet include Ayna [20] and Naseej [25]. Specialized Arabic search engines include Ayna [20] and Alidrisi [26]. The specialized Arabic sites provide varying native Arabic search capabilities; however, the coverage is still limited.

Summary

The Arabic language is being increasingly used on the Internet, despite significant obstacles. The paper has discussed the major problems and outlined how new international standards and new Internet protocols are helping to alleviate them.

One of the biggest problems is the issue of multiple character sets for representing Arabic. The use of MIME tags in specifying the name of the character set used has been discussed. The advent of Unicode will perhaps mean the replacement of the other character sets and thus a unified character set for all languages. However, this is not expected to happen anytime soon.

Arabic has some particular characteristics that require specialized display routines, including the need for contextual analysis to select the appropriate character shapes and the incorporation of a bi-directional text display routine to order the text correctly. The Unicode standard provides a bi-directional display algorithm.
The two most important Internet applications, e-mail and the WWW, must work in Arabic seamlessly. The paper discussed how MIME facilitates this for e-mail by allowing the specification of the character set and by encoding 8-bit messages in 7 bits for safe transport. The new HTML 4.0 specification also provides facilities for Arabic character set indication and for Arabic text markup.

Alternate techniques for Web Arabization were discussed as well. These are viewed as interim solutions until a simpler and more satisfying solution is found. It is our opinion that when Web servers adhere to HTTP standards and include character set information in the header for page transmission, and when Web browsers use that information and set up the pages accordingly, the problem will be solved.

The ability to search and index Arabic content on the Internet is crucial. Features of the Arabic language that relate to text search were mentioned, and available systems were listed.