S. T. Nandasara <firstname.lastname@example.org>
University of Colombo
K. Y. Leong <email@example.com>
National University of Singapore
V. K. Samaranayake <firstname.lastname@example.org>
University of Colombo
T. W. Tan <email@example.com>
National University of Singapore
Sri Lanka is a multiracial society comprising a 74% Sinhala-speaking population and an 18% Tamil-speaking population. Sinhala, Tamil, and English are all official languages and are extensively spoken throughout this country of 18 million people.
In July 1996, Sri Lanka launched its national Web site (http://www.lk), initially consisting of information entirely in the English language because it is the language of commerce and the government's second language. Moreover, most people who have Internet access and are computer-literate are proficient in English. However, because of our population profile, it is absolutely essential that Internet data be made available in Sinhala and Tamil in order for the Internet to reach our masses.
Currently most content on the Web is in Romanized characters. Hence there is a need to develop trilingual Sinhala-Tamil-English Internet applications that allow Web display, e-mail interchange, and other essential Internet functions in all three languages.
In addition, keyboard input technology is being developed to allow users to download, search, and create new data and build their own content in each language. As demand increases, this will encourage information providers to design technology to put up quality information in all three languages so that Internet technology can take root in Sri Lanka.
In the past three years, the Internet revolution, catalyzed largely by the World Wide Web, has enabled the widespread dissemination of information worldwide. However, much of this information is in English or in languages of Western origin. Presently, the Internet is positioned to be an international mechanism for communications and information exchange, the precursor of a global information superhighway. For this vision to be realized, one important requirement is to enable all languages to be technically transmissible via the Internet, so that when a particular society is ready to absorb Internet technology, the language capability comes prepackaged. This is a non-trivial multilingual information processing problem.
Although Sri Lanka has had Internet connectivity only since 1995, the Internet has already had a significant impact on the country, and on the information technology (IT) industry in particular. It has received wide publicity in the media, and most people are at least aware of the term, even if they have no firsthand experience with it.
The first Internet e-mail service, LEARNmail, was inaugurated by the University of Moratuwa in April 1990. The connectivity to the Internet was provided by Sri Lankan volunteers at the University of California, Purdue, and Stanford University.
The first Internet service provider who actively offered services in the country was Lanka Internet Services Ltd., which commenced an online Internet service in June 1995. Sri Lanka Telecom, Lanka Communication Services, and Electrotec Network Services started their Internet services in late 1995. Lanka Educational And Research Network (LEARN) provided dedicated as well as dial-up services from early 1996.
The most common use of the Internet today is for e-mail, especially for international traffic. Services such as File Transfer Protocol (FTP) to transfer data files, Usenet conferencing, Internet relay chat, and Internet phone are also used, but less frequently. The other popular service is the World Wide Web (WWW), which is accessed for information, news, and entertainment.
The Sri Lanka national Web site was inaugurated in July 1996. This provides a central point of entry for information on the country.
In the case of the World Wide Web, arguably the most sensationalized Internet application, the trivial solution for display of multilingual script would be to digitize the printed or written language into GIF or JPEG format and deliver the script as inlined images. Our approach, however, is much more general and supports other types of applications as well.
Sinhala is a member of the Indo-Aryan family of languages and its script bears close structural resemblance to Thai and Malayalam scripts. The Sinhala writing system is a syllabary derived from the ancient North Indian script Brahmi and subsequently influenced by the Pallawa Grantha script of South India. The modern script used in writing Sinhala is unique to this language.
The Tamil script is used to write the Tamil language of Tamil-Nadu state in India, as well as in Singapore and parts of Sri Lanka and Malaysia, and to write minority languages such as Badaga. These South Indian languages are genetically unrelated to North Indian languages such as Hindi, Bengali, and Gujarati. The shapes of letters in the South Indian scripts are generally quite distinct from the shapes of letters in Devanagari and its related scripts. This is partly because the South Indian scripts were originally curved rather than square and blocklike.
The Tamil script has fewer consonants than other Indian scripts. It also lacks conjunct consonant forms. Instead of conjunct consonant forms, the virama (U+0BCD) is normally fully depicted in Tamil text.
Sinhala differs from all other Indo-Aryan languages in that it contains a pair of vowel sounds (U+0DD0 and U+0DD1 in the proposed Unicode Standard) that are unique to it. They resemble the vowel sounds at the beginning of the English words at and ant: the vowel in at is short, and the vowel in ant is long. The Sinhala alphabet has a pair of characters to represent these two sounds.
Another feature that distinguishes Sinhala from the sister Indo-Aryan languages is the presence of a set of five nasal sounds known as half-nasal or prenasalized stops.
Table 1 shows how these sounds are represented in modern Sinhala writing and Roman script.
The Sinhala alphabet consists of 61 symbols: 18 vowel symbols, 41 consonant symbols and 2 semi-consonant symbols.
<Sinhala-alphabet> ::= <Vowels><Consonants><Semi-consonants>
These symbols represent 40 sounds: 14 vowel sounds and 26 consonant sounds.
In Sinhala the 18 vowel symbols, unlike consonants, are used only at the beginning of words, as shown in Table 2.
The Sinhala alphabet possesses 41 consonants as given in Table 3.
Consonant modifiers (also known as character additions) are graphical signs always used in conjunction with consonants. The consonant modifiers of the Sinhala script occur in two different forms, as vocalic strokes and non-vocalic strokes.
The consonant modifiers, their names, and vowel representations are given in Table 4.
The Tamil alphabet (<Tamil-alphabet> ::= <Vowels><Consonants>) consists of 48 symbols: 22 consonants and 26 vowels. The symbols are given in Table 5.
Consonant modifiers in Tamil, as in Sinhala, are graphical signs used in conjunction with consonants. They can occur to the left of, to the right of, or above any Tamil consonant.
The Sri Lanka standard for the Sinhala character code for information interchange has been prepared to fall in line with the requirements laid down in ISO/IEC 10646 and submitted to the ISO. The code page reserved for Sinhala is U+0D80 - U+0DFF. Tamil has already been approved and its code table spans U+0B80 - U+0BFF.
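The two code ranges quoted above are enough to tell the scripts apart in mixed text. The following is a minimal sketch, not part of the system described here; the function names `script_of` and `scripts_in` are hypothetical and chosen only for illustration.

```python
# Code ranges as quoted in the text above.
SINHALA_RANGE = range(0x0D80, 0x0E00)  # U+0D80 - U+0DFF
TAMIL_RANGE = range(0x0B80, 0x0C00)    # U+0B80 - U+0BFF


def script_of(ch):
    """Classify one character as 'sinhala', 'tamil', or 'other'."""
    cp = ord(ch)
    if cp in SINHALA_RANGE:
        return "sinhala"
    if cp in TAMIL_RANGE:
        return "tamil"
    return "other"


def scripts_in(text):
    """Return the set of script labels that occur in a string."""
    return {script_of(ch) for ch in text}


# U+0D9A lies in the Sinhala range, U+0B95 in the Tamil range.
print(script_of("\u0D9A"), script_of("\u0B95"), script_of("k"))
```

A classifier of this kind could be used, for example, to pick a font or a rendering path for each run of text on a trilingual page.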
It is evident that the Sinhala character set consisting of vowels, semi-consonants, consonants, and consonant modifiers and the Tamil set consisting of vowels, consonants, and consonant modifiers have clear differences, mainly with respect to the size of characters. Some characters are much bigger than others, for instance. Their shapes also differ. Although the basic shape of characters is curved, some parts, such as the upper or lower part, may not be contained in the same line.
Unlike in English, most consonant modifiers in Indic languages can be positioned at different locations around the character. The modifiers for Sinhala can be classified as follows.
<Consonant-modifiers> ::= <Left-modifiers><Right-modifiers> <Upper-modifiers><Lower-modifiers>
For Tamil they can be classified into three groups.
<Consonant-modifiers> ::= <Left-modifiers><Right-modifiers> <Upper-modifiers>
In the Sinhala language, combinations of consonants and consonant modifiers produce different phonetic sounds. For example, the combination of the consonant (k) and consonant modifiers 1 to 11 given in Table 4 produces 17 different phonetic sounds for the character (ka). See Table 6 for these combinations. Character additions are generally used to represent vowels.
Table 6 - Combination of consonant (k) with consonant modifiers 1 to 11 of Table 4 in alphabetical order
The combination of the two semi-consonants and vowel signs provides 49 combinations of different phonetic sounds with consonant (k); the characters formed by using the consonant modifiers 1 to 13 of Table 4 also provide 51 combinations and are included to preserve their alphabetical order.
With 100 combinations per consonant (49 + 51), the total number of glyphs for combinations of Sinhala consonant modifiers and semi-consonants with the 41 consonants of Table 3 is 41 x 100 = 4100.
In Tamil, it is important to emphasize that in a font capable of rendering combinations of Tamil script, the set of glyphs is larger than the number of Tamil characters. However, the total fits into a 25x13 matrix, which amounts to 325 glyphs including vowels and consonant modifiers.
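The glyph counts above follow from simple arithmetic, which can be checked as below. The figures are taken directly from the text; the variable names are ours and purely illustrative.

```python
# Figures quoted in the text.
SINHALA_CONSONANTS = 41           # consonants of Table 3
SEMI_CONSONANT_COMBINATIONS = 49  # semi-consonants with vowel signs
MODIFIER_COMBINATIONS = 51        # consonant modifiers of Table 4

per_consonant = SEMI_CONSONANT_COMBINATIONS + MODIFIER_COMBINATIONS
sinhala_glyphs = SINHALA_CONSONANTS * per_consonant

# Tamil glyphs fit into a 25 x 13 matrix.
tamil_glyphs = 25 * 13

print(per_consonant, sinhala_glyphs, tamil_glyphs)  # prints: 100 4100 325
```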
As shown in Tables 7 and 8, the following vowels are always reordered in front of the previous consonant cluster in both Sinhala and Tamil.
In Tamil, the same effect occurs with the results of vowel splitting. This does not occur in Sinhala.
In both cases, the ordering of the element is unambiguous: the consonant (cluster) occurs first in the memory representation.
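The relationship between memory order and display order can be sketched for Tamil as follows. This is an illustration only, not the rendering code of the system described here. It assumes that the reordered vowel signs are U+0BC6, U+0BC7, and U+0BC8 of the Unicode Tamil block, that two-part vowel signs (U+0BCA, U+0BCB, U+0BCC) are split using Unicode canonical decomposition, and that a cluster is a single consonant; the name `visual_order` is hypothetical.

```python
import unicodedata

# Tamil vowel signs that are drawn to the left of the consonant.
PREBASE_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}


def is_tamil_consonant(ch):
    """Rough test: Tamil consonants occupy U+0B95 - U+0BB9."""
    return "\u0B95" <= ch <= "\u0BB9"


def visual_order(logical):
    """Move left-side vowel signs in front of the preceding consonant.

    Input is in memory (logical) order: consonant first, vowel sign second.
    """
    # NFD splits a two-part vowel sign into its left and right halves.
    chars = unicodedata.normalize("NFD", logical)
    out = []
    for ch in chars:
        if ch in PREBASE_SIGNS and out and is_tamil_consonant(out[-1]):
            out.insert(len(out) - 1, ch)  # place the sign before the consonant
        else:
            out.append(ch)
    return "".join(out)


# KA + vowel sign E is stored consonant-first but displayed sign-first.
assert visual_order("\u0B95\u0BC6") == "\u0BC6\u0B95"
# KA + two-part vowel sign O: left half moves in front, right half stays behind.
assert visual_order("\u0B95\u0BCA") == "\u0BC6\u0B95\u0BBE"
```

Because the stored text always keeps the consonant first, searching and sorting can work on the memory representation, and only the display step needs to know about reordering.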
In some cases, more complex vowel reordering will occur in the Sinhala language as shown in Table 9.
Link key/link code is used to combine two coded characters to generate link or joint formations. Short key/short code is used to create repaya and other conjunct formations. Invisible key/invisible function codes are used to delete a particular character or part thereof. Conjunct formations can be handled as given in Table 10.
The development of Unicode has provided an excellent base for constructing truly globalized software. After many years of software localization, there are many different code standards for different languages and even for the same language. Though these code standards will continue to exist in the immediate future, it is foreseen that once Unicode-based software and data appear, the other codes will die out.
Figure 1 shows the Sri Lanka national Web site home page with its multilingual content.
The national Web site, launched in July 1996, initially aimed to provide information entirely in English. Today, a wide range of English data on Sri Lanka, including the five daily English newspapers and information about the government, the central bank, and government offices such as the Department of Census, is easily accessible to any Internet user on the World Wide Web. The data will progressively be translated into Sinhala and Tamil. We have initiated a collaboration between the Institute of Computer Technology, University of Colombo, and the Internet Research and Development Unit of the National University of Singapore to pioneer trilingual Sinhala-Tamil-English Internet applications that allow for trilingual Web display, e-mail interchange, keyboard input, and other essential Internet functions.
The main objectives of the trilingual national Web site are given below.
The trilingual national Web site allows users to view the same piece of information in their preferred language, which could be English, Sinhala, or Tamil. It also allows users to search the database using their preferred language. For example, the user can use English or Sinhala or Tamil to search the record of a company, provided the company has English, Sinhala, and/or Tamil names.
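A record that carries its name in up to three languages could be searched as sketched below. This is an assumed data layout for illustration, not the actual database of the national Web site; the `Company` class, the `search` function, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Company:
    """A record with optional English, Sinhala, and Tamil names."""
    english: Optional[str] = None
    sinhala: Optional[str] = None
    tamil: Optional[str] = None

    def names(self):
        # Only the names that the record actually has.
        return [n for n in (self.english, self.sinhala, self.tamil) if n]


def search(records, query):
    """Return the records that have any name containing the query."""
    q = query.casefold()
    return [r for r in records if any(q in n.casefold() for n in r.names())]


# Invented sample records, for illustration only.
SINHALA_LANKA = "\u0DBD\u0D82\u0D9A\u0DCF"            # "Lanka" in Sinhala script
TAMIL_ILANKAI = "\u0B87\u0BB2\u0B99\u0BCD\u0B95\u0BC8"  # "Ilankai" in Tamil script
records = [
    Company(english="Lanka Example Ltd", sinhala=SINHALA_LANKA),
    Company(english="Other Example Co", tamil=TAMIL_ILANKAI),
]

print(len(search(records, "lanka")), len(search(records, TAMIL_ILANKAI)))
```

A record without a name in a given language is simply never matched by a query in that language, which mirrors the proviso stated above.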
Server-side solutions enable the display of Sinhala/Tamil non-Romanized codes without the use of additional helper applications in client-side browsers. The concept lies in automatically converting text saved in various encodings into images before the Web server sends out the HTML pages.
In the current implementation, both Sinhala and Tamil display over the World Wide Web have been prototyped using an automatic code-to-GIF conversion system (http://irdu.nus.sg/multilingual). The Sinhala and Tamil systems are available in both single-byte font code and double-byte Unicode. However, the true Sinhala version can only be accommodated using double-byte Unicode because Sinhala has a more complex structure.
Using the auto-GIF conversion system, text entered can be displayed in the form of images corresponding to the glyph. As this procedure may be network- and computer-intensive, work is in progress to explore the development of front end processing software that is fully Unicode-compliant and fully trilingual.
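The server-side rewriting step can be sketched as follows. This is only an illustration of the idea, not the code of the prototype. It assumes that pages are held as Unicode text, that a separate rendering program turns a text run into a GIF, and that this program is reachable at the invented path `/cgi-bin/text2gif`; the rendering itself is not shown.

```python
import re
from urllib.parse import quote

# Runs of Tamil (U+0B80 - U+0BFF) or Sinhala (U+0D80 - U+0DFF) characters.
INDIC_RUN = re.compile("[\u0B80-\u0BFF\u0D80-\u0DFF]+")

# Hypothetical address of a program that renders text as a GIF image.
RENDER_URL = "/cgi-bin/text2gif"


def to_image_html(html):
    """Replace each Sinhala/Tamil run with an inline image reference."""
    def replace_run(match):
        # quote() percent-encodes the run as UTF-8 bytes (an assumption
        # of this sketch, not something the prototype is known to do).
        return '<img src="{}?text={}" alt="">'.format(
            RENDER_URL, quote(match.group(0)))
    return INDIC_RUN.sub(replace_run, html)


print(to_image_html("<p>Hi \u0B95</p>"))
# prints: <p>Hi <img src="/cgi-bin/text2gif?text=%E0%AE%95" alt=""></p>
```

Romanized text passes through unchanged, so the browser needs no helper application. The cost is one image request per run, which is in line with the network and computing load noted above.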
Web page information providers can easily create multilingual Web pages if they create them using Unicode as their main internal code. The WWW is a good medium to demonstrate the power of Unicode to meet the global access requirement. Supporting Web pages coded in other code standards is inevitable. Though Unicode-supporting systems are emerging, there are still many users out there who are using tools that can only read one or a few of the existing industry codes.
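Supporting pages in an existing single-byte font code amounts to a table lookup into Unicode, as sketched below. The byte values in the table are invented for illustration and do not correspond to any real font code; a real converter would need the complete table for each supported code standard.

```python
# Hypothetical single-byte font code table (byte values are invented).
FONT_CODE_TO_UNICODE = {
    0xA1: "\u0B85",  # TAMIL LETTER A
    0xA2: "\u0B86",  # TAMIL LETTER AA
    0xB0: "\u0B95",  # TAMIL LETTER KA
}


def to_unicode(data):
    """Convert bytes in the hypothetical font code to a Unicode string."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))  # ASCII passes through unchanged
        else:
            # Unknown bytes become U+FFFD REPLACEMENT CHARACTER.
            out.append(FONT_CODE_TO_UNICODE.get(byte, "\uFFFD"))
    return "".join(out)


assert to_unicode(b"A\xa1\xb0") == "A\u0B85\u0B95"
```

With Unicode as the internal code, each legacy code standard needs only one such table at the boundary of the system.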