Multilanguage and Multicharset Web Server

Konstantin V. Chuguev <joy@urc.ac.ru>
Division of Wide-Area Networking Technology
Chelyabinsk Technical University
South Ural Regional Center of FREEnet
76, Prospekt Lenina, Technical University
454080, Chelyabinsk, Russia
Tel.: +7 (3512) 65-4992

Abstract

A problem with processing distributed text information in many non-English-speaking countries is that different operating systems use different character sets. There is therefore a great need for recoding between the charset of the information resource and that of the client's viewer, and the recoding should be as transparent to the user as possible. This is a complex problem for several reasons.

The World Wide Web is currently the most popular, widespread and powerful Internet service. Therefore, the most essential task associated with the common internationalization/localization problem is the creation of multilanguage/multicharset Web clients and servers.

This paper describes a system being developed at the South Ural Regional Center of FREEnet. The system consists of three relatively independent modules:

  1. A charset recoding library with some useful utilities (for the present supporting only 8-bit fixed-width character sets)
  2. A module for determining the client's and server's language and charset (valid for that language)
  3. A set of patches for some (probably the most popular) Web servers

Introduction

With the rapid development of the Russian wide-area network infrastructure, the creation and maintenance of information resources in all fields of activity has become a very important task. It is desirable to keep many of these resources in several languages, suitable for both Russian and foreign users.

A problem with processing distributed text information in many non-English-speaking countries is that different operating systems (OS) use different character sets. For example, at present there are five different charsets used in Russia (only letters of the Russian alphabet are shown):

Furthermore, there are many other countries where more than one charset is used.

The problem

There is therefore a great need for recoding between the charset of the information resource and that of the client's viewer, and the recoding should be as transparent to the user as possible. This is a complex problem for the following reasons:

Also, although TERENA's C3 System, currently a pre-alpha project, promises to be universal and flexible (see ftp://ftp.nada.kth.se/pub/i18n/c3/current-release/README-c3-ap45 for details), it has remained in the initial stage of development for more than a year. This is understandable, because the people involved in the TERENA Technical Programme are generally very busy, with most of them performing their TERENA work in spare moments from their regular jobs. However, there are occasionally opportunities to take things less seriously [6].

But users cannot wait.

The World Wide Web is currently the most popular, widespread and powerful Internet service. Therefore, the most useful task associated with the common internationalization/localization problem is the creation of multilanguage/multicharset Web clients and servers.

Solution

This paper describes a system being developed at the South Ural Regional Center of FREEnet (FREEnet is the Russian Network for Research, Education, and Engineering). The system consists of three relatively independent modules:

  1. A charset recoding library with some useful utilities (for the present supporting only 8-bit fixed-width character sets)
  2. A module for determining the client's and server's language and charset (valid for that language)
  3. A set of patches for some (probably the most popular) Web servers

Charset recoding library

This library consists of three parts:

   ISO_8859-5
   NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
   DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
   SP !  "  Nb DO %  &  '  (  )  *  +  ,  -  .  /
   0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
   At A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
   P  Q  R  S  T  U  V  W  X  Y  Z  <( // )> '> _
   '! a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
   p  q  r  s  t  u  v  w  x  y  z  (! !! !) '? DT
   PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
   DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
   NS IO D% G% IE DS II YI J% LJ NJ Ts KJ -- V% DZ
   A= B= V= G= D= E= Z% Z= I= J= K= L= M= N= O= P=
   R= S= T= U= F= H= C= C% S% Sc =" Y= %" JE JU JA
   a= b= v= g= d= e= z% z= i= j= k= l= m= n= o= p=
   r= s= t= u= f= h= c= c% s% sc =' y= %' je ju ja
   N0 io d% g% ie ds ii yi j% lj nj ts kj SE v% dz
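
For 8-bit fixed-width charsets such as the one tabulated above, recoding reduces to a 256-entry byte-to-byte lookup. The following Python sketch illustrates the idea only; it is not the library described in this paper. It assumes Python's built-in codecs (for example `cp1251` and `koi8_r`) as the source of the mappings, and the names `build_table` and `recode` as well as the `?` fallback byte are hypothetical.

```python
def build_table(src: str, dst: str, fallback: bytes = b"?") -> bytes:
    """Build a 256-byte table mapping each byte of charset `src` to `dst`.

    Bytes that are undefined in `src`, or whose character does not exist
    in `dst`, are mapped to `fallback` (an assumption of this sketch).
    """
    table = bytearray(256)
    for code in range(256):
        try:
            char = bytes([code]).decode(src)
            out = char.encode(dst)
        except (UnicodeDecodeError, UnicodeEncodeError):
            out = fallback
        table[code] = out[0]  # 8-bit charsets: exactly one byte per character
    return bytes(table)


def recode(data: bytes, table: bytes) -> bytes:
    """Recode `data` with a table produced by build_table()."""
    return data.translate(table)
```

A table would typically be built once per charset pair, for example `build_table("cp1251", "koi8_r")`, and then reused for every document served in that pair.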

Determination of the character set

An important task for the Web server is determining the language and charset of an information resource that the client prefers and the server can provide. Because very few Web browsers support the Accept-Language and Accept-Charset [7,8] fields in the HTTP header, the server is forced to obtain this information from other sources as well. The methods implemented are not part of the HTTP specifications (and some even rely on features that are illegal but used in current practice). Nevertheless, these are the methods that make such widespread and popular Web browsers as Netscape work properly. In any case, these "roundabout" methods will go away when most browsers become HTTP/1.1-compatible.

For example, some browsers can send additional entries in the Accept field; they can be set up to send the field:

        Accept: x-language-ru, x-charset-cp1251 

Unfortunately, this does not work with Netscape browsers: although version 2.0 for Windows allows the user to set Accept-Language, the user cannot modify the Accept field sent in the HTTP request, so it is impossible to declare the preferred charset. In this case the server analyzes the User-Agent field, attempting to detect the browser's operating system (at least this works for Netscape). It then consults a table of correspondences between operating system, language and character set. As a rule, there is one charset for a given OS and language.
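
The two fallbacks just described (extra entries in the Accept field, then the operating system guessed from User-Agent) can be sketched in Python as follows. This is an illustration under stated assumptions, not the server's actual code: the function names, the User-Agent markers, the contents of the OS/language/charset table and the default language `ru` are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical table: (operating system, language) -> usual charset.
CHARSET_BY_OS_AND_LANGUAGE = {
    ("windows", "ru"): "cp1251",
    ("macos", "ru"): "x-mac-cyrillic",
    ("unix", "ru"): "koi8-r",
}

# Hypothetical substrings searched for in the User-Agent field.
OS_MARKERS = (("win", "windows"), ("macintosh", "macos"), ("x11", "unix"))


def parse_accept_hints(accept: str) -> Tuple[Optional[str], Optional[str]]:
    """Extract x-language-* and x-charset-* entries from an Accept field."""
    language = charset = None
    for item in accept.split(","):
        token = item.split(";")[0].strip().lower()
        if token.startswith("x-language-"):
            language = token[len("x-language-"):]
        elif token.startswith("x-charset-"):
            charset = token[len("x-charset-"):]
    return language, charset


def guess_os(user_agent: str) -> Optional[str]:
    """Guess the client's operating system from the User-Agent field."""
    lowered = user_agent.lower()
    for marker, os_name in OS_MARKERS:
        if marker in lowered:
            return os_name
    return None


def negotiate(accept: str, user_agent: str,
              default_language: str = "ru") -> Tuple[str, Optional[str]]:
    """Return (language, charset); charset is None if nothing matched."""
    language, charset = parse_accept_hints(accept)
    language = language or default_language
    if charset is None:
        charset = CHARSET_BY_OS_AND_LANGUAGE.get((guess_os(user_agent), language))
    return language, charset
```

Explicit hints in the Accept field take precedence here, and the User-Agent table is consulted only when no charset was declared.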

The server also maintains a database with information on the preferred language/charset of different client hosts. This database is accessible through the Web (for each host, only its own record in the database is accessible) and has the highest priority in determining the client's language and charset. The database is intended for browsers where none of the above-mentioned methods works. But the server cannot determine the client's IP address or domain name if the client works through a cache/proxy. In that case (which is detected by the presence of the " via " substring in the User-Agent field; both CERN and Harvest caches write this), access to this database is disabled.
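
A minimal Python sketch of this lookup rule, assuming the database is represented as an in-memory dictionary keyed by host name (the real storage format is not described here, and all names below are hypothetical):

```python
from typing import Dict, Optional, Tuple

Preference = Tuple[str, str]  # (language, charset)


def behind_proxy(user_agent: str) -> bool:
    """Detect a cache/proxy by the " via " substring in User-Agent."""
    return " via " in user_agent


def host_preference(db: Dict[str, Preference], remote_host: str,
                    user_agent: str) -> Optional[Preference]:
    """Return the stored preference for a host, or None.

    The lookup is skipped for proxied requests, because the address seen
    by the server would then belong to the proxy rather than the client.
    """
    if behind_proxy(user_agent):
        return None
    return db.get(remote_host)
```

When a record is found, it would override whatever the header-based methods suggest, since the database has the highest priority.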

Documents in different languages are kept on the server as different files with similar names, but are referenced by the same language-independent URL. This allows clients to get a document in another language (which has lower priority) when the document in the requested language is absent. Charset recoding is performed on the fly, so only one copy of the document in each language is stored on the server.
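
The selection of a language variant and the on-the-fly recoding can be sketched in Python as follows. The file naming scheme (a language suffix appended to the URL name, such as `index.html.ru`) and both function names are assumptions made for this illustration; the paper states only that the files have similar names.

```python
from typing import Iterable, Optional, Set


def select_variant(name: str, available: Set[str],
                   preferences: Iterable[str]) -> Optional[str]:
    """Pick the stored file for `name` in the most preferred language.

    `preferences` lists languages from highest to lowest priority;
    `available` is the set of file names present on the server.
    """
    for language in preferences:
        candidate = f"{name}.{language}"
        if candidate in available:
            return candidate
    return None


def recode_on_the_fly(body: bytes, stored_charset: str,
                      client_charset: str) -> bytes:
    """Recode a document body from its stored charset to the client's."""
    if stored_charset == client_charset:
        return body
    text = body.decode(stored_charset)
    # Characters missing from the client's charset are replaced with "?".
    return text.encode(client_charset, errors="replace")
```

For example, a client preferring Russian and then English that requests `index.html` would receive `index.html.en` if only the English file exists.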

Special attention is paid to providing language/charset-independent search capabilities, either incorporated in the server (as in John Franks' WN) or operating through external CGI tools (W3C's or NCSA's httpd).

References

[1] Codepage 866 (MS DOS alternative Cyrillic/Russian charset). ftp://unicode.org/pub/MappingTables/Microsoft/pc/CP866.X.

[2] Codepage 1251 (MS Windows Cyrillic charset). ftp://unicode.org/pub/MappingTables/Microsoft/windows/cp1251.x.

[3] MacOS Cyrillic Charset. ftp://unicode.org/pub/MappingTables/Apple/MacOS_Cyrillic.txt.

[4] I18N: TERENA's Internationalisation of Network Services Working Group. C3: A System for Coded Character Set Conversion. http://www.nada.kth.se/i18n/c3.

[5] GNU. http://info.desy.de/gnu/www/GNU.html.

[6] The TERENA Technical Programme. http://www.terena.nl/terena/technical-programme.

[7] Berners-Lee, T.; Fielding, R.; Frystyk, H., Hypertext Transfer Protocol: HTTP/1.0. Internet Draft. February 1996. http://www.w3.org/pub/WWW/Protocols/HTTP/1.0/spec.

[8] Fielding, R.; Frystyk, H.; Berners-Lee, T., Hypertext Transfer Protocol: HTTP/1.1. Internet Draft. January 1996. http://www.w3.org/pub/WWW/Protocols/HTTP/1.1/spec.

[RFC822] Crocker, D., Standard for the Format of ARPA Internet Text Messages. August 1982. ftp://ftp.internic.net/rfc/rfc822.txt.

[RFC1345] Simonsen, K., Character Mnemonics and Character Sets. June 1992. ISO_8859-5:1989. ftp://ftp.internic.net/rfc/rfc1345.txt, ftp://unicode.org/pub/MappingTables/ISO8859Maps/8859-5.txt.

[RFC1489] Chernov, A., Registration of a Cyrillic Character Set. July 1993. ftp://ftp.internic.net/rfc/rfc1489.txt.

[RFC1521] Borenstein, N.; Freed, N., MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. September 1993. ftp://ftp.internic.net/rfc/rfc1521.txt.

[RFC1522] Moore, K., MIME (Multipurpose Internet Mail Extensions) Part Two: Message Header Extensions for Non-ASCII Text. September 1993. ftp://ftp.internic.net/rfc/rfc1522.txt.

[RFC1700] Reynolds, J.; Postel, J., Assigned Numbers. October 1994. ftp://ftp.internic.net/rfc/rfc1700.txt.