Leong Kok Yong <email@example.com>
Tan Tin Wee <firstname.lastname@example.org>
Naa Govindasamy <email@example.com>
Lee Teck Chee <firstname.lastname@example.org>
Currently most of the content on the Web is mainly in Romanized characters. Content in non-Roman languages is limited and often found transliterated into Romanized form. However, to be a truly successful global information superhighway, information on the Internet must be delivered to users in the script of their native language. Until workable standards are agreed upon and accepted, ad hoc solutions abound, and may generally be classified into client-side and server-side approaches.
For client-side solutions, we used a bilingual font system to allow users proficient in their mother-tongue language and English to navigate through the Web easily. We have currently implemented this for the Tamil and Hindi languages as a proof of concept. For Chinese, Japanese and Korean, we have integrated code with helper applications such as Twinbridge (GB, Big5, HZ), UnionWay (GB, Big5, HZ, JIS, KSC) and WinMASS (GB, Big5, HZ, JIS, KSC, Unicode), and this proof-of-concept demonstrated in 1994 has subsequently led to Chinese newspapers coming online in Singapore.
Our server-side solutions enable the display of non-Romanized codes without the use of additional helper applications on the client-side browsers or modifications to the browser software. The concept is to automatically convert those codes saved in various encodings into images before the Web server sends out the HTML pages. This is network-expensive, and we are currently exploring Java applet technology as a solution.
Display of multilingual data alone is not enough. Users must be able to input search strings or fill in forms through the Web, and retrieve from multilingual databases. One solution is to make use of Java applets to control keyboard mappings of a different language and display the script as it is keyed in. We demonstrate this concept with a Tamil-input Java applet as a proof of concept, coupled with a WAIS-SF index of Tamil text files at the backend to show search and retrieval of Tamil text based on the Tamil input string.
In the past three years, the Internet revolution throughout the computing world, catalyzed largely by the World Wide Web (WWW), has enabled the widespread dissemination of information worldwide. However, much of this information is in English or in languages of Western origin (Bos, 1996). Presently, the Internet is positioned to be an international mechanism for communications and information exchange, the precursor of a global information superhighway. For this vision to be realized, one important requirement is to enable all languages to be technically transmissible via the Internet, so that when a particular society is ready to absorb Internet technology, the language capability comes prepackaged. This is a nontrivial multilingual information-processing problem.
A number of languages do not have written forms. Some do not have a standard computerizable script or a national language character set. Yet others are pictographic or ideographic, with distinctive characters numbering in the tens of thousands. Some are not written in the standard and familiar left-to-right format; rather, they are written from right to left (e.g., Hebrew, Arabic, Jawi and Urdu). Others may be written top-to-bottom (e.g., Mongolian, calligraphic Chinese and Japanese). Even if the fonts and glyphs can be generated, the standard computer keyboard has to be adapted for the user-friendly input of multilingual data, and for a number of languages a simple one-to-one mapping of keys to characters may not be sufficient to generate the character set. Additional applications or front-end processors have to be used. Alternatives such as voice-assisted input are emerging but have not yet reached the mass market for most non-European languages.
In the case of the graphical WWW, arguably the most sensationalized Internet application in the media, the trivial solution for display of multilingual script would be to digitize the printable or written language into a GIF or JPEG format, and deliver the script as an inlined image. In the long term, this is not an acceptable solution for several reasons: the information is not easily generated, the data are not machine-readable, the content is not indexable for textual search and retrieval, and it is not network-efficient. Working groups and committees in the Internet realm are currently addressing these issues. Here we present some solutions to the problem.
For the alphabetic or alphabet-like languages, which have character-set sizes that can fit easily into the extended ASCII space, the display of their script is straightforward. For the European languages, Cyrillic, Greek and Hebrew, as defined in ISO 8859, all the user has to do is to select the appropriate font during a Web browser session to render eight-bit encodings into the respective language script. These font sets are also bilingual in that the lower ASCII range is in English whereas the upper range is in the respective language. So a Web page can display both languages at the same time.
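The eight-bit bilingual scheme can be sketched concretely with one of the ISO 8859 sets. In this illustrative Python snippet, a single byte stream carries English in the lower ASCII range and Greek (ISO 8859-7) in the upper range:

```python
# One byte stream, two scripts: lower-range bytes are English, upper-range
# bytes are Greek under ISO 8859-7. Rendering with a font laid out on this
# charset shows both scripts on the same page.
data = bytes([0x48, 0x69, 0x20, 0xE1, 0xE2])  # "Hi " then two upper-range bytes
text = data.decode("iso8859-7")               # 0xE1, 0xE2 map to Greek alpha, beta
```

A browser session with an ISO 8859-7 font selected performs the same mapping visually: the font's lower range renders the Latin letters and its upper range renders the Greek ones.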
Using this idea, we have taken the Tamil language (an official language of Southern India, Malaysia and Singapore), created a character font set and mapped it to the extended ASCII range. The following describes how we have implemented client-side and server-side solutions for Tamil.
Like many other Indian languages, Tamil is a phonetic language. It consists of 12 vowels and 18 consonants. These 30 sounds are the initial sounds of the Tamil language and the basis for the Tamil alphabet. The 18 consonants joined with 12 vowels form the remaining 216 vowel-consonant characters. This was clearly explained as early as 250 B.C. in the first Tamil grammar book, Tholkappiam. Later on, when some Sanskrit sounds were incorporated, 5 new consonants were added to the Tamil alphabet.
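The counts above can be checked directly; this trivial sketch merely restates the arithmetic given in the text:

```python
# Counts of the Tamil alphabet as described above.
vowels = 12
consonants = 18
compound = consonants * vowels        # each consonant joined with each vowel
base_sounds = vowels + consonants     # the 30 initial sounds
total = base_sounds + compound        # 246 characters before the 5 Sanskrit additions
```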
We have designed a bilingual font set for the display of both Tamil and English simultaneously. This was done by making use of the upper extended ASCII character range for the Tamil characters, while retaining the basic English alphabet and punctuation intact in the lower ASCII range. This will allow most of the Web world in English (or other Romanized languages) to be traversed; at the same time, Tamil codes will be recognized and displayed correctly when they occur, without having to change font set. Figure 1 below shows the ASCII character map for the Tamil-English font set.
One important point to note is that the upper ASCII portion does not have enough code space to include all the possible Tamil character glyphs (>200). As such, we make use of the kerning feature built into the Postscript and the True-Type font technology to combine two Tamil characters into a new character glyph not found in the above ASCII table. With the combination of two simpler character glyphs to give a more complex glyph, we can then include the entire Tamil character set within one single font, together with the English alphabet. To allow users to input these Tamil characters, a corresponding keyboard layout mapping has been devised by mapping the keys on a normal English (QWERTY) keyboard to the extended ASCII range where these Tamil characters reside.
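The pair-combination idea can be sketched as follows: the snippet merges two adjacent extended-ASCII codes into one composite glyph when they form a known pair. The code points and pair table here are hypothetical illustrations, not the actual font's assignments:

```python
# Hypothetical kerning pairs: two simple Tamil codes combine into one
# complex glyph, so the full character set fits in one bilingual font.
KERN_PAIRS = {
    (0xC1, 0xE5): "KA+U",    # e.g. a consonant followed by a vowel sign
    (0xC2, 0xE5): "NGA+U",
}

def glyphs(byte_stream):
    """Resolve bytes into glyph names, merging known kerning pairs."""
    out, i = [], 0
    while i < len(byte_stream):
        b = byte_stream[i]
        if b < 0x80:
            out.append(chr(b))       # lower range: English passes through
            i += 1
        elif i + 1 < len(byte_stream) and (b, byte_stream[i + 1]) in KERN_PAIRS:
            out.append(KERN_PAIRS[(b, byte_stream[i + 1])])
            i += 2                   # two simple codes -> one composite glyph
        else:
            out.append("T%02X" % b)  # a plain Tamil glyph at this code point
            i += 1
    return out
```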
Based on the phonetic system, a phonetic keyboard for the Hindi language was developed by Mohan Thambi in India in 1983. This was subsequently adopted as the main keyboard for the Indian languages by the Department of Electronics (DOE) of India. However, this keyboard is based on the Devanagari script. Since Tamil is from the Dravidian language group rather than the Indo-Aryan group, to which numerous other Indian languages belong (e.g., Hindi, Marathi, Punjabi, Bengali) (Grimes, 1992), the DOE keyboard is not particularly suitable for the Tamil language.
To overcome the keyboard problem for the Tamil language, Naa Govindasamy began an investigation into the frequency of occurrence of the Tamil vowels and consonants used within the language. Based on this research, a Tamil phonetic keyboard layout was introduced for Tamil computing (Govindasamy, 1989), named the Singapore Tamil Keyboard (Govindasamy, 1994a). Later in September 1994, the name was changed to Kanian Keyboard at the First Computer-Tamil Conference at the Anna University, Madras, India (Govindasamy, 1994b).
The keyboard layout comprises the 12 basic Tamil vowels, placed on the left-hand side of the keyboard, and the 18 consonants. The 28 basic vowels and consonants are placed in the lower case of the keyboard, while the 2 least frequently occurring Tamil consonants are placed in the upper case along with the 5 Sanskrit-sound consonants. In modern Tamil, a vowel will not appear in the middle or at the end of a word; it appears only at the beginning. These basic rules were taken into account when the keyboard layout was designed. The advantage of this layout is that 99.5 percent of the time, Tamil characters can be typed without pressing the shift key at all. Moreover, the most frequently used vowels and consonants are placed at the home keys (the middle row of the keyboard). This allows the user to type 68 percent of Tamil words using only the home keys.
Because of its simplicity and its incorporation of Tamil grammar, this keyboard layout is very popular in Singapore and Malaysia and has been incorporated into numerous Tamil front-end processors and word processors, including a commercial version available from the author (Govindasamy <email@example.com>). All the Tamil newspapers in Singapore and Malaysia use this layout to input their Tamil-language data, and about 95 percent of Singapore's Tamil computer users use this keyboard method.
Through the experience with this keyboard, we recommend that designers of keyboards of other languages consider this approach of character frequency analysis to improve user input efficiency. We are applying this technique to other languages of Indo-China such as Khmer and Lao, in addition to providing support for their currently existing keyboards. Keyboard manager software (shareware) is also freely available, e.g., WinKeyB by Jarle Petterson <firstname.lastname@example.org> and Tavultesoft Keyboard Manager from the Summer Institute of Linguistics (http://www.sil.org). Users can define and customize the keyboards according to their needs using these tools, without having to write their own software.
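The frequency-analysis approach recommended above can be sketched in a few lines; the corpus and key assignments here are purely illustrative:

```python
from collections import Counter

HOME_KEYS = list("asdfghjkl;")  # home-row positions of a QWERTY keyboard

def propose_layout(corpus):
    """Assign the most frequent characters of a corpus to the home row."""
    freq = Counter(c for c in corpus if not c.isspace())
    ranked = [c for c, _ in freq.most_common(len(HOME_KEYS))]
    return dict(zip(HOME_KEYS, ranked))
```

Run over a large representative corpus of the target language, this yields the kind of placement that lets most words be typed from the home row, as the Kanian layout does for Tamil.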
For users to submit a string for search and retrieval, the browser software must be able to support alternative fonts in the input boxes of HTML forms. However, some popular browsers such as Netscape do not currently support this. Consequently, input of upper ASCII-range characters defaults to the basic system font and the selected upper extended ASCII character sets (in this case, Tamil fonts) are not displayed for input feedback. One way is to make use of front-end processors that will also intercept the selected encoding and display the corresponding language script on the screen. This will be described in detail for the nonalphabetic languages in the next section. Meanwhile, we have developed a server-side solution based on Java applets (http://java.sun.com), precompiled on the server, and retrievable by the user in bytecode for execution on the user's machine.
This Java applet system carries out the front-end processing so as to mimic the Kanian keyboard. We have demonstrated that it is possible to input simple strings for submission to a search and retrieval engine (http://irdu.nus.sg/tamilweb/search.html). This is done by intercepting keystrokes and passing them through a set of rules and mapping, in order to generate the correct glyph in the Java applet window. Upon submission, the encodings corresponding to the glyphs are submitted to the Web server for processing. This Java solution will work for Unix and Macintosh users in addition to PC users if their browsers support Java.
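At its core, the applet's front-end processing reduces to intercepting each keystroke and emitting the corresponding extended-ASCII code. The tiny table below is hypothetical, standing in for the full Kanian layout:

```python
# Hypothetical key-to-code table standing in for the Kanian keyboard map.
KEYMAP = {"k": 0xC1, "a": 0xB0, "u": 0xE5}

def intercept(keystrokes):
    """Turn raw QWERTY keystrokes into extended-ASCII Tamil codes;
    unmapped keys pass through as plain ASCII (the bilingual case)."""
    return bytes(KEYMAP.get(k, ord(k)) for k in keystrokes)
```

On submission, the resulting byte string is what the browser sends to the Web server as the query, exactly as if it had been typed into a form field.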
In addition, by means of a dynamic Java applet generator, we are in the process of creating a system of user-definable keyboard and user-supplied bitmaps of character fonts for any alphabetic language based on a user-defined encoding. In this way, we hope to build a repository of fonts and keyboard input systems with code interconvertibility so that when a standard is achieved for a particular language, content can be easily interconverted to the new format. Meanwhile, users can begin to input text in their own languages and start building up content in their languages on the Web.
For search and retrieval, the submitted string in extended ASCII for Tamil (and in English as well for bilingual searches) is parsed by the httpd server and submitted as a search string to any indexing engine that has multilingual capability. In the case of Tamil, we used a simple WAIS-SF indexer and demonstrated the utility. Hits will be returned in the same code, and displayed in the same way as described above, with bilingual capability. By the same token, we are exploring the use of this method for indexing the content of other languages including Hindi.
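The search step can be pictured with a toy stand-in for the WAIS-SF backend: documents are indexed in the same encoding as the submitted query, so hits come back in a code the client already displays. The file names and byte strings here are illustrative:

```python
# Toy index in place of WAIS-SF: match the query bytes against documents
# stored in the same extended-ASCII encoding.
INDEX = {
    "doc1.html": bytes([0xC1, 0xE5, 0x20, 0xC2]),   # illustrative encoded text
    "doc2.html": bytes([0xC2, 0xC3]),
}

def search(query):
    """Return the documents whose encoded text contains the query bytes."""
    return sorted(name for name, text in INDEX.items() if query in text)
```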
To enable future compatibility to Unicode (http://www.unicode.org) and/or other standards when HTML includes markup tags for international languages, the existing codes have to be converted to these future standards. To demonstrate this, we have developed converter programs to map this Tamil-English bilingual font encoding to the corresponding Unicode encoding. Until such time when HTML can support multilingual markup tags, and client browsers support international languages, these features will be useful. In fact, it has allowed the Tamil community in Singapore to join the ranks of Web publishers (http://irdu.nus.sg/tamilweb) without fear of obsolescence.
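The conversion itself is a straightforward table lookup, sketched below. The extended-ASCII source codes are hypothetical, while the Unicode targets are real Tamil code points:

```python
# Hypothetical font codes mapped to real Unicode Tamil code points.
TO_UNICODE = {
    0xB0: 0x0B85,   # -> U+0B85 TAMIL LETTER A
    0xC1: 0x0B95,   # -> U+0B95 TAMIL LETTER KA
}

def to_unicode(data):
    """Convert bilingual extended-ASCII bytes to a Unicode string;
    lower-range (English) bytes are preserved unchanged."""
    return "".join(chr(b) if b < 0x80 else chr(TO_UNICODE[b]) for b in data)
```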
The above integrated solution works smoothly for PC users only. For users who cannot or do not know how to select fonts, or for users on Unix or Macintosh platforms, we noted that the basic common denominator is the capability to display inlined images. Hence, we have designed a dynamic Tamil code-to-GIF converter that takes the extended character encodings on an HTML page and converts them on the fly, at the server, into inlined images. Any Web browser can thus view the bilingual Tamil and English text. The same technique can be applied to any other language; how this is done is described further in the context of the nonalphabetic languages, which follow next.
Ideographic or pictographic languages do not use alphabets to build their words. Instead, each character has a separate meaning. Consequently, these ideographic languages have character sets running into the thousands and tens of thousands. Examples include Chinese, Korean and Japanese Kanji, which have their roots in the Chinese Han language. Standard character sets and encoding standards abound. For example, on the Internet there are three popular coding systems for Chinese: Big-5 (Taiwan), GuoBiao (GB) (popular in Mainland China) and HZ (popular with e-mail and newsgroup postings) (please see http://www.ifcss.org). For Japanese, there is JIS, Shift-JIS and EUC (extended Unix code for Japanese) encoding methods supporting character sets such as JIS X0208-1990 and JIS X0212. For Chinese-Japanese-Korean Han unification, there is the Unicode/ISO 10646 international standard.
For such languages with large character sets and complex input systems, helper applications or front-end processor software intercept keystrokes on either specialized or standard keyboards. A conversion dictionary displays a list of candidate characters from which the user can select the correct ideographic character corresponding to the keyboard input. This process is typically based on phonetics or radicals. Conversion tables usually have tens of thousands of entries mapped to multiple-byte encodings.
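The dictionary-conversion step can be sketched as a candidate lookup. The two pinyin entries below are real examples, though a production table holds tens of thousands:

```python
# Phonetic input -> candidate ideographs; the user picks one by index.
CANDIDATES = {
    "zhong": ["中", "钟", "种", "重"],
    "wen":   ["文", "闻", "问", "稳"],
}

def convert(phonetic, choice):
    """Return the ideograph selected from the candidate list, or None."""
    options = CANDIDATES.get(phonetic, [])
    return options[choice] if 0 <= choice < len(options) else None
```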
Since 1994, we have demonstrated with Web pages of Chinese and subsequently Japanese and Korean that helper applications such as Twinbridge (Big5, GB), UnionWay (Big5, GB, HZ, JIS, KSC), and WinMASS (Unicode, Big5, GB, HZ, JIS, KSC) can be used in conjunction with ordinary Web browsers to render multilingual encoded data in the appropriate language. With this, we have put up Tang Dynasty Poems, Sun Tzu's Art of War and other Chinese texts online since mid-1994. By the end of 1995, using the method of helper applications, Singapore's local newspaper Lianhe Zaobao and the Guangzhou Ribao (http://www.asia1.com.sg) had come online as the first Asian Chinese newspapers on the Internet. But if users do not have such helper software, it is still possible to display the Chinese text using server-side solutions.
Our server-side solutions enable the display of various non-Romanized codes without the use of additional helper applications on the client-side browsers. The concept lies in automatically converting those codes saved in various encodings into images before the Web server sends out the HTML pages.
In the current implementation, we have httpd server cgi-bin programs (http://www.w3.org/hypertext/WWW/CGI/Overview.html) to automatically convert Chinese GB and HZ encodings, Japanese JIS encoding, Unicode encoding in a UTF-8 stream (for Chinese-Japanese-Korean-Tamil) and the Tamil-English bilingual ASCII encoding. Text pre-entered in any of these codes and embedded in HTML pages can be sent to our server for automatic on-the-fly conversion into images. On output, the server sends out a newly created HTML page with standard ASCII text preserved, but other language codes converted to GIF images. These composite GIFs can be natively displayed on any Web browser that supports inline images. We note similar recent work by Valentin Shopov and A. Kitauchi using DeleGate with Character-by-Inline-Image (CII) support for Bulgarian, Greek, Hebrew, Chinese, Japanese, Korean, and other languages (http://baka.aubg.bg).
In the following sections, the general term "x2gif" refers to the converters that include gb2gif, hz2gif, jis2gif, and uni2gif. Similarly, the term "xfilt" refers to the filters, including gbfilt, hzfilt, jisfilt and unifilt. The term gb refers to Chinese GB, hz to Hanzi, jis to Japanese JIS and uni to Unicode (UTF-8).
The auto code-to-GIF converter works as follows. It comprises two components: a filtering module and a converter module. The first module filters the HTML pages for special character encodings and encapsulates them with an <img src> tag, which in turn activates the second module, which converts the special encoding passed in its command-line arguments into GIF images.
The module that filters HTML pages for special encodings is written in Perl 4.036. It is a Common Gateway Interface (CGI) program that accepts a URL as its argument, and is invoked as follows:
http://irdu.nus.sg/cgi-bin/xfilt?<url>
where <url> is an absolute URL of an HTML document and the first portion refers to the cgi-bin filter module. Thus, the URL
http://irdu.nus.sg/cgi-bin/unifilt?http://www.iscs.nus.sg/unicode.html
will enable the cgi-bin program to convert the HTML document "unicode.html" residing on the host "www.iscs.nus.sg".
The given URL is first checked for malformation or invalidity. Given a valid URL, the module issues an HTTP request for the entire HTML document: it contacts the host Web server by opening a TCP socket connection to the specified port (or the default port 80) and sends a GET command with the document name specified in the URL. The HTML document stream is then filtered using a finite-state-machine algorithm. Normal 7-bit ASCII codes are treated as Romanized English text (including symbols and numerals). Any other extended/upper ASCII code, which has the most significant bit (MSB, the 8th bit) set, is treated as non-English-language encoding. The retrieved document is checked byte by byte for a high bit set, indicating a non-7-bit ASCII code. For a normal 7-bit ASCII byte (MSB not set), the module outputs the same byte read in. On encountering a byte with the MSB set, the module precedes the output with an extra tag <img src=/cgi-bin/x2gif? and then outputs all of the following extended ASCII bytes, which may be GB, HZ, JIS or UTF-8 codes, until a lower ASCII byte, at which point it ends the HTML tag with a closing bracket >. The module repeats these operations for the rest of the document.
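The filtering pass just described can be condensed into a short sketch. The real xfilt also fetches the document over HTTP and handles tag quoting; both are omitted here, and the byte stream is taken as given:

```python
# Copy 7-bit ASCII through unchanged; wrap each run of high-bit bytes
# in an <img src=/cgi-bin/x2gif?...> tag for the converter module.
def xfilt(html_bytes, converter="x2gif"):
    out = []
    run = bytearray()              # current run of MSB-set bytes

    def flush():
        if run:
            out.append('<img src=/cgi-bin/%s?"%s">'
                       % (converter, run.decode("latin-1")))
            run.clear()

    for b in html_bytes:
        if b < 0x80:               # MSB clear: plain ASCII ends any run
            flush()
            out.append(chr(b))
        else:                      # MSB set: part of a non-English encoding
            run.append(b)
    flush()
    return "".join(out)
```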
The second module comprises the x2gif components. Each of these C-language programs accepts bytecodes as command-line arguments and produces a GIF image on its standard output. For the two-byte encodings (GB/HZ/JIS/Unicode), the module takes two bytes at a time and performs a table lookup into a bitmap font file for the glyph with which to display the character. We use two bitmap font formats for the various supported languages: HBF (Hanzi Bitmap Format) and BDF (Bitmap Distribution Format). The Unicode/GB/HZ encodings use the more efficient HBF format, while the extended-ASCII Tamil-English encoding uses the BDF format. The font glyphs for each character are all assembled in memory before the raw image is converted to GIF format for output.
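As a concrete instance of the two-byte table lookup, GB2312 in its EUC form arranges all characters in a 94-by-94 table, with both bytes offset from 0xA1; a glyph's record in an HBF font file can then be located by simple arithmetic:

```python
# Row-major position of a GB2312 (EUC) character in the 94x94 code table,
# usable as an index into the corresponding HBF bitmap font records.
def gb_glyph_index(b1, b2):
    row, cell = b1 - 0xA1, b2 - 0xA1
    return row * 94 + cell
```

The HZ and JIS converters work the same way once their escape or shift sequences are unwrapped into the corresponding two-byte codes.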
The following series of figures gives a graphical illustration of one working example.
Figure 2 below shows one example of an original HTML document displayed on any browser without native support for multilingual capabilities.
The source for the above figure is:
<html>
<head><title>Sample page with multiple languages</title></head>
<body>
<br>Welcome
<br>欢迎
<br>歡迎
<br>ようこそ
<br>환영
<br>நல்வரவு
</body>
</html>
However, by passing the above HTML document to the cgi-bin program for filtering, the HTML document will instead look as follows in Figure 3.
The source for the above output is:
<html>
<head><title>Sample page with multiple languages</title></head>
<body>
<br>Welcome
<br><img src=/cgi-bin/uni2gif?"欢迎">
<br><img src=/cgi-bin/uni2gif?"歡迎">
<br><img src=/cgi-bin/uni2gif?"ようこそ">
<br><img src=/cgi-bin/uni2gif?"환영">
<br><img src=/cgi-bin/uni2gif?"நல்வரவு">
</body>
</html>
As can be seen from above, the Unicode encodings are preserved, but each run is encapsulated within an <img src> tag and passed to the cgi-bin program 'uni2gif' for preprocessing before being presented as a GIF image.
The advantage of this approach, in the absence of an international standard for marking up multilingual texts, is that two or more languages can be viewed simultaneously on the same Web page. However, this remote server approach adds a strain on the network and server load if carried out for large documents. This solution can be effectively implemented on Intranets with distributed servers carrying the load.
Currently, our research and development program is developing a Java applet that will allow any user on any platform that supports Java to input characters in Tamil, Chinese, Japanese, Korean and other languages. We will be creating a multilingual server that performs dynamic auto-code-to-GIF conversion for more languages. Moreover, we hope to expand our repertoire of Java applications to include user-customizable applets that let users specify their own language, font sets and encodings, so that they can start inputting data in their own languages expeditiously. As for the bilingual font systems for alphabetic languages, we will extend them to other languages such as Hindi, Sinhalese and Khmer, and to other Indo-Chinese languages.
To date, there is still no well-defined way of handling documents that contain multiple languages (Nicol, 1996), although efforts are underway at the World Wide Web Consortium, the Web Internationalization and Multilingualism (WInter) group and many others. As Nicol observes, "ad hoc solutions abound, which, if left unchecked, could lead to groups of users suffering in incompatible isolation, rather than enjoying the true interoperability alluded to by the very name of the World Wide Web." Yet so long as the groups working to internationalize the Internet lack the participation of native speakers and national representatives of the very languages targeted for support, ad hoc solutions for specific languages will continue to abound across the diversity of languages and countries touched by the Internet.
One reason is that, taking the example of character code standards and character-set mappings, existing and proprietary standards abound in the countries of use, each with a significant following. Imposing Internet standards that cannot interoperate with the formats used by existing software prevalent in these countries may lead to low acceptance levels and slow take-up rates, and consequently will delay the delivery of multilingual information on the Internet.
In this paper we have described some solutions to the problem and shown by proof of concept and rapid prototyping that they can serve for the moment, until the standardization process matures. We have shown that reading multilingual texts on a World Wide Web page is indeed possible, even without any font change or additional helper tools on the client browser side. We have indicated how, by using interconversion programs, migration to future standards will be possible when they arrive; hence the effort expended in building multilingual content right now will not be lost. Information providers of multilingual content can come on board immediately without much technical hindrance, and without serious fear of incompatibility or obsolescence. We hope this encourages more information providers to put up multilingual information on the World Wide Web.
We acknowledge the pioneering work of Mr. James Seng, who was one of the originators of the auto code-to-GIF idea. We thank him for his effort in building the early versions of the Chinese content and the GB, HZ and JIS converters.
Mr. Leong Kok Yong is an analyst programmer at the Internet R&D Unit (IRDU) in the National University of Singapore (NUS), specializing in the area of multilingualism on the Internet. He is a graduate of the Nanyang Technological University with Mr. Lee Teck Chee, who is presently with Creative Technology.
Dr. Tan Tin Wee is currently the head of the IRDU, NUS. He was previously the head of Technet Unit (now privatized and known as Pacific Internet), the pioneer Internet access provider in Singapore, and a senior lecturer in the Department of Biochemistry, both at the NUS. He obtained his B.A. from the University of Cambridge, M.Sc. from University College London, and Ph.D. from the University of Edinburgh, United Kingdom. He is an active technology proponent of the Internet in Singapore, having set up the first WAIS, Gopher, Web, CU-SeeMe, and Mbone sites in the region, and maintains a research interest in Bioinformatics and Biocomputing.
Mr. Naa Govindasamy is a lecturer in the National Institute of Education, Nanyang Technological University, Singapore. He teaches the Tamil language and literature as well as Tamil computing. He is the inventor of the Tamil Kanian Keyboard, which is widely used in Southern India and Southeast Asia. E-mail: email@example.com, firstname.lastname@example.org. URL: http://singnet.com.sg/~govin.
Mr. Leong Kok Yong <email@example.com>
Tel: (65) 772 8093 / (65) 772 3119. Fax: (65) 872 6205
National University of Singapore
10 Kent Ridge Crescent
Singapore 119260, Singapore