Multilanguage and Multicharset Web Server

Konstantin V. Chuguev <joy@urc.ac.ru>
Division of Wide-Area Networking Technology
Chelyabinsk Technical University
South Ural Regional Center of FREEnet
76, Prospekt Lenina, Technical University
454080, Chelyabinsk, Russia
Tel.: +7 (3512) 65-4992

Abstract

A problem with processing distributed text information in many non-English-speaking countries is the difference between the character sets used on different operating systems. There is therefore a great need for recoding between the charset of the information resource and that of the client's viewer, and the recoding should be as transparent to the user as possible. This is a complex problem for several reasons.

The most popular, widespread, and powerful Internet service today is the World Wide Web. Therefore, the most essential task associated with the common internationalization/localization problem is the creation of multilanguage/multicharset Web clients and servers.

This paper describes a system being developed at the South Ural Regional Center of FREEnet. The system consists of the following modules, which are relatively independent of each other:

A charset recoding library with some useful utilities (currently supporting only 8-bit fixed-width character sets)
A module for determining the client's and server's language and charset (valid for that language)
A set of patches for some (probably the most popular) Web servers

Introduction

At present, with the rapid development of the Russian wide area network infrastructure, a very important task is the creation and maintenance of information resources in all fields of activity. It is desirable to keep many of these resources in several languages, suitable for both Russian and foreign users.

A problem with processing distributed text information in many non-English-speaking countries is the difference between the character sets used on different operating systems (OS). For example, five different charsets are currently used in Russia (only letters of the Russian alphabet are shown):

CP866, also known as the MS-DOS alternative Cyrillic charset [1]
CP1251, MS Windows Cyrillic charset [2]
KOI8-R (or KOI8) on Unix systems (this is the de facto standard for electronic mail interchange in Russia) [RFC1489] (registered by IANA [RFC1700])
ISO 8859-5, the only Cyrillic charset supported in MIME and by large Unix manufacturers such as Sun Microsystems [RFC1345] (registered by IANA [RFC1700])
MacOS Cyrillic on Apple computers [3]

Furthermore, there are many other countries where more than one charset is used.
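To make the divergence concrete, the following minimal Python sketch prints the byte that encodes the capital Cyrillic letter A in each of the five charsets listed above. The codec names are those built into Python's standard library; they are an assumption of this illustration and are not part of the system described in this paper.

```python
# Codec names come from Python's standard library (illustration only).
RUSSIAN_CHARSETS = ["cp866", "cp1251", "koi8_r", "iso8859_5", "mac_cyrillic"]


def byte_values(char):
    """Return {charset name: byte value} for one character in each charset."""
    return {name: char.encode(name)[0] for name in RUSSIAN_CHARSETS}


if __name__ == "__main__":
    # U+0410 is CYRILLIC CAPITAL LETTER A.
    for name, code in byte_values("\u0410").items():
        print(f"{name:13s} 0x{code:02X}")
```

The same letter is stored under several different byte values, which is why text written on one system is unreadable on another unless it is recoded.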

The problem

There is therefore a great need for recoding between the charset of the information resource and that of the client's viewer, and the recoding should be as transparent to the user as possible. This is a complex problem for the following reasons:

Most existing Internet information and communication systems (FTP, Telnet, WAIS, old but still widespread [RFC822]-based e-mail and Usenet news implementations) have no concept of character sets or character encodings at all.
The MIME specification [RFC1521,RFC1522] does not help enough here, because its implementations support only a fixed set of character sets (e.g., only one, ISO-8859-5, of the five charsets used in Russia) and are hard to extend.
Although several organizations and groups in the Internet are working on the classification and formal definition of character sets and character encodings (http://www.unicode.org/, ftp://dkuug.dk/i18n/, [RFC1345], TERENA's Internationalisation of Network Services Working Group [4]), the results of their research are mainly theoretical. Of course, good and flexible tools cannot be built without a theoretical basis, but end users in Russia and in other countries need a working toolkit for charset recoding now. A compromise should therefore be made: such a toolkit should be given to them, even if its capabilities are rather limited.
Existing charset recoding tools available on the Internet do not meet users' needs, for various reasons. For example, GNU recode [5] exists only as a utility, not as a library (i.e., it has no API); it mixes up the concepts of character sets and usages (e.g., HTML and LaTeX charsets), does not know about charset description tables, and operates with recoding tables/functions only.

Also, although TERENA's C3 System pre-alpha project promises to be universal and flexible (see ftp://ftp.nada.kth.se/pub/i18n/c3/current-release/README-c3-ap45 for details), it has remained in the initial stage of development for more than a year. This is understandable, because the people involved in the TERENA Technical Programme are generally very busy, with most of them performing their TERENA work in spare moments from their regular jobs. However, there are occasionally opportunities to take things less seriously [6].

But users cannot wait.

The most popular, widespread, and powerful Internet service today is the World Wide Web. Therefore, the most useful task associated with the common internationalization/localization problem is the creation of multilanguage/multicharset Web clients and servers.

Solution

This paper describes a system being developed at the South Ural Regional Center of FREEnet (FREEnet is the Russian Network for Research, Education, and Engineering). The system consists of the following modules, which are relatively independent of each other:

A charset recoding library with some useful utilities (currently supporting only 8-bit fixed-width character sets)
A module for determining the client's and server's language and charset (valid for that language)
A set of patches for some (probably the most popular) Web servers

Charset recoding library

This library consists of three parts:

The library module itself, librecode.a, whose API is described in the recode.h header file. The API is very simple and does not aspire to become a standard. It supports only 8-bit fixed-width charsets for two reasons: (1) simplicity (one-byte-to-one-byte recoding), which suffices for the charsets of almost all non-Asian languages, and (2) ease of patching existing applications (Web servers and others).
The charset database, containing a description of each charset used in the system. Each charset is given in a simplified version of the format described in [RFC1345]. This format was chosen for the clear mnemonics it uses to represent each character. Here is an example:

   ISO_8859-5

      NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
      DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
      SP ! " Nb DO % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
      At A B C D E F G H I J K L M N O P Q R S T U V W X Y Z <( // )> '> _
      '! a b c d e f g h i j k l m n o p q r s t u v w x y z (! !! !) '? DT
      PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
      DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
      NS IO D% G% IE DS II YI J% LJ NJ Ts KJ -- V% DZ
      A= B= V= G= D= E= Z% Z= I= J= K= L= M= N= O= P=
      R= S= T= U= F= H= C= C% S% Sc =" Y= %" JE JU JA
      a= b= v= g= d= e= z% z= i= j= k= l= m= n= o= p=
      r= s= t= u= f= h= c= c% s% sc =' y= %' je ju ja
      N0 io d% g% ie ds ii yi j% lj nj ts kj SE v% dz

A set of utilities based on the library and used for different purposes, e.g.: (1) printing the recoding table from one charset to another in various formats (useful, for example, for including such a table as an array in C source code); (2) recoding from standard input (or any file) to standard output (this utility is used in our FTP server for recoding Russian text files on the fly; see ftp://ftp.urc.ac.ru/); and (3) a recoding filter that gives a user transparent recoding when working with text in one charset from a terminal that supports another. Our center's staff uses this filter during Telnet sessions from DOS or Windows machines (with the CP866 and CP1251 charsets, respectively) to a Unix server (with KOI8-R).
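The one-byte-to-one-byte recoding and two of the utilities listed above can be sketched in a few lines of Python. This is not the librecode API, which is not reproduced in this paper. The function names are hypothetical, and Python's built-in codecs stand in for the charset database, with Unicode code points playing the role that the shared character mnemonics play in the description tables.

```python
import sys


def build_recoding_table(src, dst, fallback=b"?"):
    """Build a 256-byte table mapping each byte of charset src to charset dst."""
    table = bytearray(256)
    for code in range(256):
        try:
            # Unicode stands in here for the shared character mnemonics.
            target = bytes([code]).decode(src).encode(dst)
        except UnicodeError:
            # Byte undefined in src, or character absent from dst.
            target = fallback
        table[code] = target[0]
    return bytes(table)


def format_c_array(table, name="recode_table"):
    """Render a recoding table as C source defining an unsigned char array."""
    rows = []
    for start in range(0, 256, 8):
        row = ", ".join(f"0x{b:02X}" for b in table[start:start + 8])
        rows.append("    " + row)
    header = f"static const unsigned char {name}[256] = {{\n"
    return header + ",\n".join(rows) + "\n};\n"


def recode_stream(table, infile, outfile, chunk_size=8192):
    """Copy a binary stream, translating every byte through the table."""
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        outfile.write(chunk.translate(table))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python recode_sketch.py koi8_r cp1251 < in > out
    recode_stream(build_recoding_table(sys.argv[1], sys.argv[2]),
                  sys.stdin.buffer, sys.stdout.buffer)
```

Because every charset is 8-bit and fixed-width, the whole conversion reduces to one table lookup per byte, which is what makes it easy to patch into an existing server. Characters that have no counterpart in the destination charset are replaced by a question mark in this sketch.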

Determination of the character set

An important task for the Web server is to determine which language and charset of an information resource are both preferred by the client and available on the server. Because very few Web browsers support the Accept-Language and Accept-Charset fields [7,8] in the HTTP header, the server has to try to obtain this information from other sources as well. The implemented methods are not part of the HTTP specifications (and some even rely on features that are illegal but used in current practice). Nevertheless, it is these methods that make the server work properly with such widespread and popular Web browsers as Netscape. In any case, these "roundabout" methods will become unnecessary when most browsers are HTTP/1.1-compatible.

For example, some browsers can be set up to send additional entries in the Accept field:

        Accept: x-language-ru, x-charset-cp1251

Unfortunately, this does not work with Netscape browsers: although the 2.0 version for Windows can be set up to send Accept-Language, a user cannot modify the Accept field sent over HTTP, so it is impossible for a user to declare the preferred charset. In this case the server analyzes the User-Agent field, attempting to detect the browser's operating system (at least this works for Netscape). It then consults a table of correspondences between the OS, the language, and the character set used. As a rule, there is one charset for a given OS and language.
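The header-based part of this detection might look like the following Python sketch. The table contents, the User-Agent markers, and all function names are hypothetical and chosen for illustration. Only the x-language-/x-charset- Accept entries and the idea of an OS/language/charset table come from the description above.

```python
# Hypothetical OS -> {language: charset} table; the charsets follow the
# pairings named in this paper, the layout is an assumption.
OS_CHARSETS = {
    "windows": {"ru": "cp1251"},
    "macintosh": {"ru": "mac-cyrillic"},
    "unix": {"ru": "koi8-r"},
}

# Hypothetical User-Agent substrings used to guess the operating system.
OS_MARKERS = [("Win", "windows"), ("Mac", "macintosh"), ("X11", "unix")]


def parse_accept_extensions(accept):
    """Extract (language, charset) from x-language-*/x-charset-* entries."""
    language = charset = None
    for entry in accept.split(","):
        entry = entry.split(";")[0].strip().lower()
        if entry.startswith("x-language-"):
            language = entry[len("x-language-"):]
        elif entry.startswith("x-charset-"):
            charset = entry[len("x-charset-"):]
    return language, charset


def guess_os(user_agent):
    """Return the first OS whose marker occurs in the User-Agent, or None."""
    for marker, os_name in OS_MARKERS:
        if marker in user_agent:
            return os_name
    return None


def detect(headers, default_language="en"):
    """Guess (language, charset) from the request headers alone."""
    language, charset = parse_accept_extensions(headers.get("Accept", ""))
    if language is None:
        # Fall back to the first tag of Accept-Language, if any.
        first = headers.get("Accept-Language", "").split(",")[0]
        language = first.split(";")[0].strip().lower() or default_language
    if charset is None:
        os_name = guess_os(headers.get("User-Agent", ""))
        charset = OS_CHARSETS.get(os_name, {}).get(language)
    return language, charset
```

A charset of None means that the headers gave no usable hint, in which case a server would presumably send the document in its stored charset.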

The server also maintains a database with information on the preferred language/charset of different client hosts. This database is accessible through the Web (each host can access only its own record) and has the highest priority in determining the client's language and charset. The database is intended for browsers with which none of the above-mentioned methods works. However, the server cannot determine the client's IP address or domain name if the client works through a cache/proxy. In that case (which is detected by the presence of the " via " substring in the User-Agent field; both CERN and Harvest caches write this), access to this database is disabled.
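The priority rule and the proxy check described above can be expressed as a short Python sketch. The dictionary used as the host database and the function names are hypothetical. Only the " via " test and the precedence of the database over other methods are taken from the text.

```python
def behind_proxy(user_agent):
    """Detect a cache/proxy by the " via " substring in User-Agent."""
    return " via " in user_agent


def lookup_host_preference(host_db, client_host, user_agent):
    """Return the stored (language, charset) for a host, or None."""
    if behind_proxy(user_agent):
        # The visible address belongs to the proxy, not to the real client.
        return None
    return host_db.get(client_host)


def resolve(host_db, client_host, user_agent, header_guess):
    """Prefer the per-host record; otherwise keep the header-based guess."""
    stored = lookup_host_preference(host_db, client_host, user_agent)
    return stored if stored is not None else header_guess
```

Here header_guess is whatever (language, charset) pair the other methods produced. It is used unchanged whenever the host has no record or the request arrived through a proxy.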

Documents in different languages are kept on the server as different files with similar names, but are referenced by the same language-independent URL. This allows a client to get a document in another language (one with lower priority) when the document is not available in the requested language. Charset recoding is performed on the fly, so only one copy of the document in each language is kept on the server.
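A minimal Python sketch of this selection and recoding step follows. The naming convention (the language tag appended to the file name as a suffix) and the function names are assumptions made for illustration, since the text does not say how the "similar names" are formed. Python's built-in codecs stand in for the recoding library.

```python
def select_variant(path, preferred_languages, available_files):
    """Pick the first existing language variant, in order of preference.

    Assumes variants are named "<path>.<language>", e.g. "index.html.ru".
    """
    for language in preferred_languages:
        candidate = f"{path}.{language}"
        if candidate in available_files:
            return candidate, language
    return None


def recode_on_the_fly(content, stored_charset, client_charset):
    """Recode the single stored copy into the client's charset."""
    if client_charset is None or client_charset == stored_charset:
        return content
    text = content.decode(stored_charset)
    # Characters missing from the client's charset become "?".
    return text.encode(client_charset, errors="replace")
```

For example, a client that prefers German, then English, then Russian would receive index.html.en from a server holding only the .en and .ru variants, recoded if its charset differs from the stored one.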

Special attention is paid to providing language/charset-independent search capabilities, either incorporated in the server (as in John Franks' WN) or operating through external CGI tools (W3C's or NCSA's httpd).

References

[1] Codepage 866 (MS DOS alternative Cyrillic/Russian charset). ftp://unicode.org/pub/MappingTables/Microsoft/pc/CP866.X.

[2] Codepage 1251 (MS Windows Cyrillic charset). ftp://unicode.org/pub/MappingTables/Microsoft/windows/cp1251.x.

[3] MacOS Cyrillic Charset. ftp://unicode.org/pub/MappingTables/Apple/MacOS_Cyrillic.txt.

[4] I18N: TERENA's Internationalisation of Network Services Working Group. C3: A System for Coded Character Set Conversion. http://www.nada.kth.se/i18n/c3.

[5] GNU. http://info.desy.de/gnu/www/GNU.html.

[6] The TERENA Technical Programme. http://www.terena.nl/terena/technical-programme.

[7] Berners-Lee, T.; Fielding, R.; Frystyk, H., Hypertext Transfer Protocol: HTTP/1.0. Internet Draft. February 1996. http://www.w3.org/pub/WWW/Protocols/HTTP/1.0/spec.

[8] Fielding, R.; Frystyk, H.; Berners-Lee, T., Hypertext Transfer Protocol: HTTP/1.1. Internet Draft. January 1996. http://www.w3.org/pub/WWW/Protocols/HTTP/1.1/spec.

[RFC822] Crocker, D., Standard for the Format of ARPA Internet Text Messages. August 1982. ftp://ftp.internic.net/rfc/rfc822.txt.

[RFC1345] Simonsen, K., Character Mnemonics and Character Sets. June 1992. ISO_8859-5:1989. ftp://ftp.internic.net/rfc/rfc1345.txt, ftp://unicode.org/pub/MappingTables/ISO8859Maps/8859-5.txt.

[RFC1489] Chernov, A., Registration of a Cyrillic Character Set. July 1993. ftp://ftp.internic.net/rfc/rfc1489.txt.

[RFC1521] Borenstein, N.; Freed, N., MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. September 1993. ftp://ftp.internic.net/rfc/rfc1521.txt.

[RFC1522] Moore, K., MIME (Multipurpose Internet Mail Extensions) Part Two: Message Header Extensions for Non-ASCII Text. September 1993. ftp://ftp.internic.net/rfc/rfc1522.txt.

[RFC1700] Reynolds, J.; Postel, J., Assigned Numbers. October 1994. ftp://ftp.internic.net/rfc/rfc1700.txt.