Languages on the Internet

Jean Bourbonnais <jbourbonnais@alis.com>
François Yergeau <fyergeau@alis.com>

Alis Technologies inc.
100, boul. Alexis-Nihon, bureau 600
Montréal QC H4M 2P2
Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561

Introduction

The written languages of the world, with the notable exception of English, are not very welcome on the Internet. One cannot but notice the predominance of Shakespeare's language, emphasized not only by the relative absence of other languages, but also by various technical difficulties encountered by non-anglophones. The Internet's roots lie in the United States, and its design, infrastructure, protocols, and standards bear the mark of a lack of attention to the needs of communication in a variety of natural languages. Serious shortcomings can be found in its very foundations: For instance, the mail transport protocol supports only one character set, sufficient only for English.

The Internet was designed as a highly redundant and fault-tolerant mesh. However, its actual structure today, on a global scale, is much more like a US-centered star. This fact, in addition to reducing the network's robustness, reinforces the dominance of English, the language found on the American sites where so many links lead. Moreover, these sites host most of the free or cheap software that has made the Internet what it is. All this software is in English only and often cannot deal with anything but English in the information that it is called upon to process, transmit, or receive.

A program's interface is not limited to the set of menus, commands, and buttons visible to the user. This interface also encompasses the capacity to present information, especially text. Internet software can also be considered as an interface between the user and the network.

This paper examines electronic mail software, Web clients and servers, and other programs as interfaces to the Internet, considering not only the user-interface in the usual sense but also the Internet protocols and standards. We will identify the linguistic limitations of these interfaces and study the localization of user-interfaces and the possibilities of natural language processing.

We will separate Internet services into three classes: basic services, which implement the network itself (IP, TCP, UDP, ICMP, DNS, and so on); messaging services, namely mail and Usenet (SMTP, ESMTP, NNTP, RFC 822, and so on); and information search and retrieval services, or information services (FTP, Telnet, Finger, Gopher, WAIS, HTTP, and so on).

Basic protocols

With respect to language, there is not much to say about the basic protocols that constitute the framework of the Internet. By and large they are transparent, that is, they transmit or process whatever bytes are given. IP assigns a dotted quad number, very much language independent, to each machine.

Trouble appears as soon as text appears, as in the Domain Name System (DNS): This system is designed so that humans can use meaningful machine names instead of non-mnemonic dotted quads. The mnemonic value, however, because it is limited to ASCII, offers the most to anglophones. Other Latin script users have to compromise (no accents), and other scripts are excluded. The same situation occurs with filenames, often used more or less directly in Internet transactions.
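
The ASCII restriction on machine names can be made concrete with a small check. The sketch below assumes the traditional "letters, digits, hyphen" rule for hostname labels; the function name and the sample hostnames are hypothetical and serve only as illustration.

```python
import re

# One hostname label under the traditional rule: ASCII letters, digits, and
# hyphens, with no hyphen at either end and at most 63 characters.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_valid_hostname(name: str) -> bool:
    """Return True when every dot-separated label is ASCII-only and well formed."""
    labels = name.rstrip(".").split(".")
    return all(LABEL.fullmatch(label) is not None for label in labels)

# Hypothetical names: the accented form is rejected, the stripped form passes.
print(is_valid_hostname("montreal.example"))
print(is_valid_hostname("montréal.example"))
```

A French speaker can still register the second name only by dropping the accent, which is exactly the compromise described above; users of non-Latin scripts have no comparable fallback.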

Messaging

Mail

Internet mail is based on a protocol, the Simple Mail Transfer Protocol (SMTP), and on a message format standard, RFC 822. Both specify the 7-bit ASCII character set exclusively, which allows only English text to be transmitted directly. Any other content has to be disguised as a set of short ASCII-only text lines.
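
As an illustration of what disguising a message as ASCII involves, here is a minimal sketch using quoted-printable, one common encoding for this purpose (the paragraph above does not name a specific method, so this choice is an assumption). The sample sentence and the helper is_seven_bit are hypothetical.

```python
import quopri

# Hypothetical French message body, stored as 8-bit Latin-1 bytes.
body_8bit = "Le café est déjà prêt à Montréal.".encode("latin-1")

# Quoted-printable rewrites each non-ASCII byte as "=XX" (two hex digits),
# so the encoded form contains only 7-bit characters.
body_7bit = quopri.encodestring(body_8bit)

def is_seven_bit(data: bytes) -> bool:
    """Return True when every byte fits in 7 bits."""
    return all(b < 0x80 for b in data)

print(is_seven_bit(body_8bit))   # the raw body needs the eighth bit
print(is_seven_bit(body_7bit))   # the encoded body does not
print(body_7bit.decode("ascii"))

# The recipient reverses the encoding to recover the original bytes.
restored = quopri.decodestring(body_7bit)
```

The encoded text survives a 7-bit transport, but it is readable only if the recipient's software knows that decoding is required and which character set the decoded bytes represent.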

A recent extension to SMTP (appropriately called ESMTP) allows for 8-bit transmission, but the actual benefit is very small: 8-bit transmission must be negotiated, and the negotiation takes place too late for the sending application to take advantage of it. ESMTP would be useful only if servers (more precisely, Mail Transport Agents) were able to encode mail upon failure of a request for 8-bit transmission, and if such servers were widespread enough that clients (Mail User Agents) could count on them. The current situation is darker: There are still many servers that enforce the 7-bit restriction by chopping off the eighth bit (at the expense of data integrity), so that 8-bit mail cannot be sent reliably. The situation is worse when gateways to other mail systems are involved.
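
The damage caused by a server that clears the eighth bit is easy to reproduce. The following sketch simulates such a relay; the function name strip_eighth_bit is hypothetical and stands in for whatever a real 7-bit server does internally.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Mimic a 7-bit-only relay by clearing the high bit of every byte."""
    return bytes(b & 0x7F for b in data)

original = "Montréal".encode("latin-1")   # 'é' is byte 0xE9 in Latin-1
damaged = strip_eighth_bit(original)

# 0xE9 & 0x7F == 0x69, which is the ASCII letter 'i'.
print(damaged.decode("ascii"))
```

The result is still plausible-looking text, so neither the relay nor the recipient gets any indication that the message was altered.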

MIME, an extension to RFC 822, is much more promising. It standardizes encoding methods, but more importantly formalizes labeling of character sets and encoding, allowing unambiguous decoding. The problem with MIME is the lack of universality: One can transmit text in any language, images, sounds, and so forth, but the recipient may not have a MIME decoder to read the message.
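
To show what labeling buys, the sketch below builds a MIME text part with Python's standard email package, serializes it to a 7-bit-safe form, and decodes it again using only the labels carried in the headers. The sample text is hypothetical, and the choice of ISO 8859-1 is an assumption made for the example.

```python
from email import message_from_string
from email.mime.text import MIMEText

text = "Montréal, été 1996"

# The sender declares the character set; the library picks a transfer
# encoding and records both in the message headers.
msg = MIMEText(text, "plain", "iso-8859-1")
wire = msg.as_string()
print(wire)

# The recipient needs no prior agreement: the headers say how to decode.
received = message_from_string(wire)
charset = received.get_content_charset()
decoded = received.get_payload(decode=True).decode(charset)
```

The decoding step depends entirely on the recipient having MIME-aware software, which is the universality problem noted above.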

Usenet

Usenet (Internet forums) fares a little better. Like mail, with which it shares the RFC 822 message format, it is officially limited to 7-bit ASCII. In practice, however, most Usenet software happily transmits 8-bit text, and this has been put to good use for a number of years. Yet 7-bit software (standards-conforming!) remains, and character set labeling is nonexistent, leading to a less than satisfactory situation: Only English is reliably supported on Usenet.

Information services

FTP

The File Transfer Protocol (FTP), as its very name suggests, should allow the transfer of arbitrary files transparently. This is mostly true, but there is a catch. One of the two transfer modes, called ASCII or TEXT, is meant to compensate for the various line-ending conventions of different platforms. Some implementations, however, take the ASCII name literally and destroy non-ASCII characters, negating the benefit of that mode for users transmitting non-ASCII text files. Such users have to transfer in BINARY mode and deal later with the line-ending problem.
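
The difference between the intended behavior of text mode and the faulty one can be sketched as follows. This is a simplified model, not an FTP implementation: it assumes CRLF as the line ending used in transit, and all function names are hypothetical.

```python
def to_network(data: bytes, local_eol: bytes) -> bytes:
    """Intended text mode: rewrite local line endings as CRLF, nothing else."""
    return data.replace(local_eol, b"\r\n")

def from_network(data: bytes, local_eol: bytes) -> bytes:
    """Receiving side: rewrite CRLF as the local line ending."""
    return data.replace(b"\r\n", local_eol)

def literal_ascii_transfer(data: bytes, local_eol: bytes) -> bytes:
    """Faulty variant that also clears the eighth bit of every byte."""
    return bytes(b & 0x7F for b in to_network(data, local_eol))

# A Latin-1 text file with LF line endings, sent to a system that uses CR.
unix_file = "Été à Montréal\nFin\n".encode("latin-1")
wire = to_network(unix_file, b"\n")
mac_file = from_network(wire, b"\r")
print(mac_file.decode("latin-1").split("\r"))

# The faulty variant converts line endings too, but mangles the accents.
print(literal_ascii_transfer(unix_file, b"\n"))
```

Only the first pair of functions delivers what a user of non-ASCII text expects: line endings adapted to the destination and every other byte left untouched.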

Archie

Archie doesn't care about character encoding issues, which is a boon as well as a problem. Archie databases can very well contain arbitrary non-ASCII filenames, but the lack of character set labeling means that looking them up is problematic: Matches occur only if the encoding of the request is the same as the (unknown) encoding of the indexed filename, and false positives are possible (one filename encoded one way accidentally matches a request encoded in another way).
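
Both failure modes, the missed match and the false positive, can be reproduced with a toy index of raw byte strings. The filenames, the choice of character sets, and the lookup function are all hypothetical; a real Archie database is of course organized differently.

```python
# A toy index holding filenames as raw bytes, with no record of the
# character set each site used when naming its files.
index = {
    "résumé.txt".encode("cp437"),    # named on a DOS machine (code page 437)
    "é.txt".encode("latin-1"),       # named on a Latin-1 system
}

def lookup(name: str, request_charset: str) -> bool:
    """Match a request by comparing its encoded bytes with the index."""
    return name.encode(request_charset) in index

# Missed match: same name, but the request is encoded differently.
print(lookup("résumé.txt", "latin-1"))
print(lookup("résumé.txt", "cp437"))

# False positive: a Cyrillic request in KOI8-R happens to produce the
# same bytes as the Latin-1 filename.
print(lookup("\u0418.txt", "koi8_r"))
```

With a character set label stored beside each entry, the indexer could convert both sides to a common representation before comparing, and both problems would disappear.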

Telnet

Telnet implements a remote terminal through an 8-bit channel, which is fine. It doesn't, however, allow for character set identification or negotiation (although developments are under way), limiting non-ASCII operation to "match by chance."

Gopher

Curiously, the Gopher protocol is specified as 8-bit, but mandates ASCII as the character set. This leads, of course, to the absence of character set identification, which spells trouble for those who use it with non-ASCII data. The authors know of a site that has to maintain three versions of its document, in three different encodings.

World Wide Web

The Web is based on three standards, each having an impact on language use: the Hypertext Transfer Protocol (HTTP), the Hypertext Markup Language (HTML) document format, and the uniform resource locator (URL) addressing scheme. Search engines are also an area of interest and intense development. Let's review these standards one by one.

HTTP

HTTP is an 8-bit protocol that allows for transmission of arbitrary data. Furthermore, its use of MIME-like headers permits character set identification for textual data. Unfortunately, this feature is almost universally unused, and interoperability in the face of multiple character sets is once again a matter of "match by chance." This is not the protocol's fault. Incomplete implementations are to blame.
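
A client that honors the character set label might proceed roughly as sketched below. The header values and the helper charset_of are hypothetical; the parsing relies on Python's standard email package, which understands the same parameter syntax.

```python
from email.message import Message

def charset_of(content_type, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)

# A labeled response can be decoded without guessing.
body = "Привет".encode("iso8859_5")
label = charset_of("text/html; charset=ISO-8859-5")
print(label, body.decode(label))

# An unlabeled response leaves the client with nothing to go on.
print(charset_of("text/html"))
```

When the second case occurs, the client can only fall back on a default and hope that it matches what the server actually sent.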

A long-proposed but not yet adopted feature of HTTP can be very useful for publishing content in multiple languages: language negotiation. Using this, the client requesting a document can provide, in addition to the document's address (a URL), a list of preferred languages; the server can then choose the most appropriate language version from those it holds and return it. Servers have begun to appear that implement this feature. Here is an illustration of the process.

[Figure: HTTP exchange]

Language negotiation is actually part of the larger scheme of content negotiation (encoding, file format, and so on), a rather complicated affair that has not yet been satisfactorily standardized. Advantages are that one address can specify all versions of a document, and that users do not have to wade through a "front" page they don't understand searching for a link to an acceptable version.
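
The server side of language negotiation can be sketched in a few lines. The example assumes the Accept-Language header syntax as later standardized in HTTP/1.1 (a comma-separated list of language tags with optional q weights); the function names and the set of available versions are hypothetical.

```python
def parse_accept_language(header: str):
    """Return (tag, weight) pairs, highest weight first."""
    prefs = []
    for item in header.split(","):
        tag, _, param = item.strip().partition(";")
        if not tag:
            continue
        weight = 1.0                      # a tag without q counts as 1.0
        param = param.strip()
        if param.startswith("q="):
            try:
                weight = float(param[2:])
            except ValueError:
                weight = 0.0              # unreadable weight: ignore the tag
        prefs.append((tag.strip().lower(), weight))
    # The sort is stable, so equal weights keep the order the client gave.
    prefs.sort(key=lambda pair: pair[1], reverse=True)
    return prefs

def choose_language(header: str, available: set, fallback: str) -> str:
    """Pick the best available version, or the fallback when none matches."""
    for tag, weight in parse_accept_language(header):
        if weight <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]       # "fr-ca" can be served by "fr"
        if primary in available:
            return primary
    return fallback

versions = {"en", "fr"}
print(choose_language("fr-CA, fr;q=0.8, en;q=0.5", versions, "en"))
print(choose_language("ja, de;q=0.7", versions, "en"))
```

The fallback argument reflects a design decision each site has to make: which version to return when none of the client's languages is available.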

URL

URLs are the addresses of Web (and other) documents and resources. They are actually more or less mnemonic names, hence text, but they suffer from the usual limitation: The only character set specified is ASCII. In fact, full 8-bit characters can be used through a special (and ugly) encoding, but there is no way to know which character set they refer to. "%E9" could as well refer to a Latin E ACUTE as to a Cyrillic IOU, or to numerous other characters. The other way around, a URL in printed form (e.g. in a magazine ad or on a business card) cannot be unambiguously converted to the byte sequence required by machines if it contains any non-ASCII character. The bottom line is that URLs are limited either to English words or to meaningless (to humans) ASCII character sequences.
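
The ambiguity runs in both directions, as the following sketch shows with Python's standard urllib.parse module. The list of character sets is an arbitrary selection made for the example.

```python
from urllib.parse import quote, unquote_to_bytes

# Decoding: the escape yields one byte, and nothing says how to read it.
raw = unquote_to_bytes("%E9")
for charset in ("latin-1", "koi8_r", "iso8859_5", "cp1251"):
    print(charset, raw.decode(charset))

# Encoding: the same printed character yields different escapes depending
# on the character set the software happens to assume.
as_latin1 = quote("é", encoding="latin-1")
as_utf8 = quote("é", encoding="utf-8")
print(as_latin1, as_utf8)
```

Each of the four decodings produces a different letter from the same byte, and the two encodings of the same letter produce URLs that a server will treat as distinct addresses.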

HTML

HTML is supposed to be the lingua franca of the Web. To this day, however, the only standardized version of the language (HTML 2.0) is limited to a smallish character set (Latin-1) barely adequate for Western languages. A far-reaching extension has been proposed and is nearing acceptance by the IETF. It introduces Unicode as the HTML document character set, a move that allows text in most of the world's languages without sacrificing compatibility with current practice. In effect, any character set that is a subset of Unicode becomes usable at the HTML level. This extension also adds to HTML a handful of features necessary for proper rendering of many non-Latin scripts, as well as features designed to compensate for the unfortunate lack of character set identification in actual HTTP practice.
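
One consequence of a Unicode document character set is that the encoding used for transfer and the meaning of the characters become separate questions. The sketch below is a simplified model of that idea, not of any actual browser: bytes are first converted to Unicode according to their label, and numeric character references are then resolved as Unicode code points. The function name parse_fragment is hypothetical.

```python
import html

def parse_fragment(raw: bytes, charset: str) -> str:
    """Decode transferred bytes to Unicode, then resolve character references."""
    return html.unescape(raw.decode(charset))

# The same text sent three ways: as Latin-1 bytes, as pure ASCII with a
# numeric character reference, and as UTF-8 bytes.
a = parse_fragment(b"Montr\xe9al", "latin-1")
b = parse_fragment(b"Montr&#233;al", "ascii")
c = parse_fragment("Montréal".encode("utf-8"), "utf-8")
print(a, b, c)

# A reference means the same character whatever encoding carries it.
d = parse_fragment(b"&#1048;", "latin-1")
print(d)
```

The last line shows a Cyrillic letter included in a document transferred as Latin-1, something the character set alone could not express.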

Search engines

Current search engines and sundry indexes are useful and efficient, but they deal less than gracefully with non-English documents. At best, they assume the Latin-1 character set and make no provision for multiple languages. Given the dominance of English documents, the result of the latter shortcoming is that users looking for documents in other languages have to wade through large numbers of matches to their queries before they find a real match, and this only if the search engine returns that much data. If the engine limits its answers to a small number, the chance that the ones interesting to a non-English user will be included is pretty small. The character set problem, of course, is the same as with other services: The indexer doesn't know what it is indexing, and the client doesn't know how to encode its request to match what the indexer has put in its database. The desired matches occur only by chance; undesirable matches also occur; and the benefits of case-insensitive or word-only searching are lost in many cases.

WAIS

The Wide Area Information Servers (WAIS) protocol is primarily a document search protocol, based on an indexer, a server and a client. Most implementations are limited to ASCII, hence to English. At least one version deals timidly with 8-bit characters, but does not even attempt to solve the problems of character set identification and conversion, case conversion (outside of ASCII), diacritic-independent searching and language-dependent "empty words" (the list of "small" words, prepositions, articles and such that the indexer should ignore for efficiency in both indexing and search).
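
Several of the missing features can be sketched once text has been converted to a known representation such as Unicode. The example below combines case folding, diacritic-independent matching, and language-dependent empty words; the word lists are short hypothetical samples, and the function names are illustrative only.

```python
import unicodedata

# Hypothetical, deliberately short lists of "empty words" per language.
EMPTY_WORDS = {
    "fr": {"le", "la", "les", "de", "du", "des", "et", "à", "un", "une"},
    "en": {"the", "a", "an", "of", "and", "to", "in"},
}

def fold(word: str) -> str:
    """Lowercase a word beyond ASCII and strip its diacritics."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_terms(text: str, lang: str) -> list:
    """Return the folded words worth indexing for a text in the given language."""
    empty = {fold(w) for w in EMPTY_WORDS.get(lang, set())}
    words = (fold(w.strip(".,;:!?\"'()")) for w in text.split())
    return [w for w in words if w and w not in empty]

print(index_terms("Le café de Montréal", "fr"))
print(fold("ÉTÉ") == fold("ete"))
```

None of this works unless the indexer knows both the character set and the language of each document, which is precisely the information the current services fail to carry.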

Conclusion

From the preceding survey, it appears that all the Internet interfaces (in a wide sense) have serious shortcomings when it comes to using them in any language but English. Not only are the user-interfaces almost always in English only (a situation that is slowly improving), but the programs and protocols often have problems dealing with content in other languages. These difficulties vary with each service, the more modern ones being generally better (at least 8-bit clean), but we are nowhere near equality of languages on the global information highway.