Europe and the International Character Sets: Strategy of Implementation and Development of the Networked Services

Borka Jerman-Blazic <borka@e5.ijs.si>
Laboratory for Open Systems and Networks
Jozef Stefan Institute, University of Ljubljana, Slovenia

Abstract

This paper presents a summary of the report prepared under mandate M/037 from the European Commission (EC) to develop a European strategy for Character Set Technology (CST) and internationalization. The problems related to internationalization are not Europe-specific, but due to the level of technical development and the large number of languages, they are of particular concern to Europeans. Europe has therefore worked out some solutions which will provide the necessary level of internationalization for the European languages. These solutions are direct outcomes of the strategy developed under mandate M/037, issued to CEN (the European Committee for Standardization) by the EC.

1. Introduction: The European dimension

Europe comprises some 45 states and around 700 million inhabitants. In Europe, some 160 different indigenous languages (of which maybe 70% have written forms [1]) are spoken, each with its particular socio-cultural characteristics. In addition, a large number of non-indigenous languages are spoken by substantial immigrant communities.

The development of education, business, communication, and leisure together with many other factors has led to the situation where most Europeans are able to understand--and in many cases speak--more than one language.

Europe today thus finds itself in a unique situation in the developed world. In a relatively small, relatively densely populated area, many different languages and cultures are mixed. Different languages are used in the swiftly growing communication between different people and areas, a phenomenon which puts obvious demands on the means of communication.

Intra-European trade is increasing, as is business in general. As a result, intra-European travel and administrative contact are growing as well. Intra-European tourism and other leisure activities involving communication by many different means are also growing, which has increased the demand for communication services.

The recent and ongoing upsurge in the development and use of multimedia is a striking example of culture-dependent, software-based communication. The need for the localization of such software is obvious; the need for the internationalization and localization of means of communication for professional purposes has long been established and becomes more urgent by the day. Europe has been aware of these needs and problems, which is reflected in many European initiatives.

2. European Union (EU) initiatives

As of January 1, 1995 the European Union acquired three new members, bringing in two new Union languages plus the Sami languages. Right now, discussions on the inclusion of the former Eastern Bloc European countries are progressing, and even if it will likely take five to seven years before the first extension of the Union is made in that direction, business contacts and administrative cooperation are growing rapidly.

Because of the need for a flow of administrative information among EU member states and between the member states and the Commission, the Commission began some time ago to initiate projects which could provide the requisite conditions. One of the most ambitious projects, the European Nervous System (ENS), also known as "Support for the Establishment of Trans-European Networks between Administrations", triggered additional acts in language support due to difficulties encountered in the field. Similar acts followed projects driven by private markets but with substantial support from the EU, e.g. TEDIS (Trade EDI Systems) [2].

The preservation and promotion of cultural and linguistic diversity is one of the guiding principles on which EU policy for the information society rests [3]. The Commission is preparing additional acts to address the European linguistic issues and the means to stimulate the emerging language-based industry. These activities address the stimulation, coordination, and regulatory initiatives to be undertaken in cooperation with the member states for the creation of a linguistic infrastructure of resources and services that improve language support in data communication networks. The proposed measures increase the use and efficiency of information and communication systems while contributing to the enrichment of Europe's linguistic diversity. The result is that the European Union is making considerable efforts to create an extensive, multi-purpose data communications network encompassing all member states and reaching toward prospective members. Clearly, these efforts are also oriented toward the provision of facilities for handling the alphabetical and cultural differences of the European nations [4].

3. The program

The Commission is funding a Linguistic Research and Engineering program--an effort to involve industry in linguistic engineering and to provide European users with the infrastructure that will enable some 360 million EU citizens to handle text information in a variety of languages. Some of the funded projects are: Multilingual Application Interface for Telematic Services; Multilingual Authoring of Business Letters; Multilingual Access to Yellow Pages; and Open Translation Environment for Localization, among others [5].

In the field of standardization and character sets, the work is driven through the definition of user requirements. An example of this is the project undertaken by ETSI [6], which specifies the European Culturally Specific Requirements in middleware. Another example, which addresses the character set requirements and specifies the EU strategy, is the project undertaken by CEN/TC304 [7]. The main objective of that project was to prepare acts and activities on the development and implementation of standards (where necessary) in a way that will facilitate computer-based communication between people in different languages and cultures. The major aim was the development of a methodology which will pull other regions along, even if the languages and cultural elements are locally specific.

As a part of the program, a workshop was set up by CEN/TC304 to give an extensive overview of user requirements concerning internationalization and character set technology. Its results [8] provided guidance to the project team that prepared the report for the EC.

4. Summary of the EU requirements for character sets technology and internationalization

The report prepared by the TC304 Project team [4] identifies several types of users, ranging from average end users through manufacturers, service providers, and developers of international standards. In general, the following requirements were identified to apply:

The user needs to be able to communicate (with computers and via computer networks) in a natural language of his own choice, i.e., to be able to use the full orthography of that language, and to be able to use a variety of scripts and characters as necessary.
Other identified requirements deal with cultural conventions. It was recognized that the cultural conventions in Europe, such as orthography, need to be properly documented and standardized so that they can be implemented appropriately in networked services and products. Examples of such services and products are:
- The X.500 directory services/yellow pages;
- E-mail services;
- World Wide Web services and related applications; and
- Programming languages.
As far as character sets are concerned, it was found that new objects have been introduced in the standards for Directory services which make internationalization and language specification possible, but no applications with these new features exist yet. MIME has long made it possible to identify the character set used in a body part and to use characters outside 7-bit ASCII in the header. The WWW service is starting to develop applications that provide some support for different languages, and the use of different character sets in Web browsers is possible too. The latest draft request for comments (RFC) on Hypertext Markup Language (HTML) Internationalization defines as the document character set Unicode, or the BMP of ISO 10 646 (referred to as UCS in the text that follows), which contains a huge number of characters that enable the representation of almost all known written languages.
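The MIME mechanisms mentioned above can be illustrated with a minimal sketch. It assumes Python's standard email package, which is not part of the report; the function names build_message and read_subject and the sample text are hypothetical and chosen only for illustration.

```python
from email import message_from_string
from email.header import Header, decode_header, make_header
from email.mime.text import MIMEText


def build_message(subject: str, body: str, charset: str) -> str:
    """Return a MIME message whose body and Subject both declare `charset`."""
    msg = MIMEText(body, "plain", charset)     # charset goes into Content-Type
    msg["Subject"] = Header(subject, charset)  # non-ASCII header as encoded-word
    return msg.as_string()


def read_subject(raw: str) -> str:
    """Decode the Subject header of a raw message back into text."""
    parsed = message_from_string(raw)
    return str(make_header(decode_header(parsed["Subject"])))


# Hypothetical sample: Slovenian text labelled as ISO 8859-2 (Latin-2).
raw = build_message("Žiga", "Čas je za kavo.", "iso-8859-2")
declared = message_from_string(raw).get_content_charset()
```

The receiving side learns the coded character set from the message itself (here `declared`) instead of guessing it, which is the precondition for any later conversion.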
Language-related requirements go well beyond character set requirements; they also include specific elements that depend on the particular culture, such as date and time formatting, number and currency formatting, etc. [9]. The requirements also include:
- Translation, which includes:
  - mechanisms for identifying existing translation documents in the network services;
  - mechanisms for submitting requests for translation over the Web;
  - ways of supplying definition of terms used that are language-sensitive;
  - transliteration and transcription; and
  - rules of applications (Europe has within its bounds six scripts: Latin, Greek, Cyrillic, Armenian, and Georgian).
- Mechanisms for linking illustrations to language-dependent overlays to allow pictures to be labeled in a linguistically suitable way.
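To make the culture-dependent elements named above (date and number formatting) more concrete, the following sketch shows how a small table of conventions might drive formatting. The table, the class Conventions, and the three sample entries are assumptions made for illustration; they are not taken from the report or from any registered specification.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Conventions:
    date_pattern: str  # strftime pattern for short dates
    decimal_sep: str   # separator between integer and fraction
    group_sep: str     # separator between groups of three digits


# Illustrative entries only; a real system would load a complete specification.
CONVENTIONS = {
    "en_GB": Conventions("%d/%m/%Y", ".", ","),
    "de_DE": Conventions("%d.%m.%Y", ",", "."),
    "sv_SE": Conventions("%Y-%m-%d", ",", " "),
}


def format_number(value: float, conv: Conventions) -> str:
    """Format with two decimals using the given separators."""
    text = f"{value:,.2f}"  # e.g. "1,234,567.89"
    # Swap separators through a placeholder so they do not clobber each other.
    return (text.replace(",", "\0")
                .replace(".", conv.decimal_sep)
                .replace("\0", conv.group_sep))


def format_date(day: date, conv: Conventions) -> str:
    """Format a date according to the convention's short-date pattern."""
    return day.strftime(conv.date_pattern)
```

For example, `format_number(1234567.891, CONVENTIONS["de_DE"])` gives "1.234.567,89", while the same value under "en_GB" gives "1,234,567.89".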

5. Character sets: EU strategy and implementation

Europe and the rest of the world seem to still have mixed opinions regarding the use of character sets in network services. The TC304 report sees the future in UCS use but problems related to this vision are still on the table.

Users still use equipment based on 8-bit coding and network services do not implement UCS. Users are usually forced to accept what the suppliers offer, and the selection of character sets for any one product is usually narrow. Over the years, a base of installed hardware, software and data has been built up in Europe which uses a range of different character sets and codes (more than 50 coded character sets, 7-bit or 8-bit based). Therefore, although UCS is able to cater to all current and projected needs, there will exist during the foreseeable future a number of different coding schemes.

Nevertheless, the user should ideally perceive that the character set interface of his choice is the same all over the world, regardless of region or country where the data originated, and without intrusion of any underlying technical complexity of communication service. As a consequence, conversion capabilities between all existing coded character sets will be required. UCS is the primary building block for this, since it encompasses the majority of coded character sets used in the world and all of those used in Europe. Its implementation will provide the required integrity of characters as well as the required support for multi-linguality in advanced networked services.
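The role of UCS as a building block for conversion can be sketched as follows. With UCS as the pivot, each coded character set needs only a mapping to and from UCS; for 50 coded character sets that is 100 mappings instead of 50 x 49 = 2450 direct converters. The sketch assumes Python's built-in codecs (here "koi8_r" and "iso8859_5") as stand-ins for such mapping tables; the function name convert is hypothetical.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert between two coded character sets using UCS as the pivot."""
    text = data.decode(source)  # legacy code -> UCS characters
    return text.encode(target)  # UCS characters -> other legacy code
                                # (raises UnicodeEncodeError if the target
                                # repertoire lacks a character)


# The same Cyrillic word in two different 8-bit codes.
koi8 = "Привет".encode("koi8_r")
iso = convert(koi8, "koi8_r", "iso8859_5")
```

When the target set cannot represent a character, the conversion fails rather than silently corrupting the text; deciding what to do in that case is the subject of fallback conversion.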

Obstacles for the wide use of UCS are still present. Some of them identified in the report are:

- Mechanisms for the easy and default recognition of a stream of data coded according to the UCS scheme;
- Easy conversion tools for the conversion of different UCS forms (e.g. UTF-8, UTF-16, UTF-7) and their inclusion in applications;
- Easy and friendly input methods;
- Representation and rendition problems;
- The belief among users that 8-bit coded character sets will be present in Europe for a long time yet; and
- The unknown cost of wide implementation of UCS.
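The first two obstacles can be illustrated with a short sketch: recognizing a UCS-coded stream by its leading signature (byte order mark) and converting between UCS transformation forms. It assumes Python's standard codecs; the names sniff_ucs_form and transcode are hypothetical. A signature is optional in practice, so its absence proves nothing, and UTF-32 is deliberately left out of this sketch.

```python
import codecs
from typing import Optional

# Signatures checked in order; the value is the Python codec that consumes them.
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]


def sniff_ucs_form(data: bytes) -> Optional[str]:
    """Return a codec name if the stream starts with a known signature."""
    for signature, codec_name in SIGNATURES:
        if data.startswith(signature):
            return codec_name
    return None  # no signature: the form must be known from elsewhere


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert a stream from one UCS form to another."""
    return data.decode(source).encode(target)


sample = "Šiška Ω Ж"
as_utf8 = sample.encode("utf-8")
as_utf7 = transcode(as_utf8, "utf-8", "utf-7")  # 7-bit safe form
```

Because all three forms encode the same characters, such conversions are lossless; the difficulty named in the report lies in recognition and in getting such tools into applications, not in the arithmetic.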

6. A step forward in the EU: The European subsets of UCS

Some of the implementation problems presented above could be solved by the provision of subsets of UCS. It was estimated that full UCS implementation will be costly in the first stage of UCS use and that suppliers will implement only a subset or subsets. To ensure that a common subset usable by the vast majority of European users is available at a reasonable price, and as a guide to manufacturers, a standard was developed and adopted (the European subsets of UCS) that encompasses all characters in European languages. In addition, frequently used characters provided in the equipment of different suppliers were added to the subsets.

The European standard ENV 1973 consists of two subsets: the Mandatory European Subset (MES) and the Extended European Subset (EES). The MES encompasses the three major European scripts--Latin, Greek, Cyrillic--and the special characters identified through the most popular proprietary coded character sets used in Europe, as well as registered character sets used in telematic services like ISO 6937 (T.61). MES coding supports level 1 of ISO 10 646. With the character repertoire of the MES, 27 official languages using Latin script, 5 official languages using Cyrillic script, Greek (polytonic and monotonic), and 16 regional languages using Latin script can be written. (See Appendix 1.) The number of characters in the subset is 926.

The EES is another selected subset of UCS and is a superset of the MES. EES differs from MES mainly by basing its selection of characters on script and function, rather than on use in particular languages. EES includes, exhaustively, the collection of UCS characters containing Latin, Greek, Cyrillic, Armenian, and Georgian scripts, together with those collections of symbols used academically, commercially, and scientifically in Europe. These scripts comprise a set of historically related alphabetic scripts of singular cultural importance to Europe. The number of characters in the EES is 3109.
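What it means for a product to support a subset of UCS can be sketched as a repertoire check: a text is acceptable if every character falls inside the subset. The ranges below are NOT the MES or the EES; they are a hypothetical stand-in built from a few UCS code ranges (Basic Latin, Latin-1 Supplement, Latin Extended-A, Greek, Cyrillic), and the function name is likewise an assumption.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample repertoire, given as inclusive code point ranges.
SAMPLE_REPERTOIRE: Sequence[Tuple[int, int]] = (
    (0x0020, 0x007E),  # Basic Latin (printable part)
    (0x00A0, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0370, 0x03FF),  # Greek
    (0x0400, 0x04FF),  # Cyrillic
)


def outside_repertoire(text: str,
                       ranges: Sequence[Tuple[int, int]]) -> List[str]:
    """Return the characters of `text` not covered by any range, in order."""
    return [ch for ch in text
            if not any(low <= ord(ch) <= high for low, high in ranges)]


missing = outside_repertoire("Žiga Ωμέγα Москва Հայ", SAMPLE_REPERTOIRE)
```

Here the Latin, Greek, and Cyrillic words pass, while the Armenian letters are reported as missing, which mirrors the difference between a three-script subset and one that also includes Armenian and Georgian.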

The other step undertaken by CEN/TC304 was the adoption of the standard for the registration of cultural elements. The European standard ENV 12005 is a helpful source of information for software developers. It provides worldwide visibility and unique references to cultural conventions. The data provided through the register are accessible over the network. The data and information are specified in a formal way (similar to POSIX locales) and in a narrative way. Both European subsets are registered under ENV 12005 as repertoire maps.

7. Ongoing work

Users may have data represented in different forms at various stages of computing for as long as can be foreseen; most likely this will eventually be in various forms of UCS, but data encoded by other methods will also be in common use for a long period of time. To presume a situation where only one character set exists is not realistic. This is true for Europe, even though UCS is capable of covering all European needs. For various reasons, some specialized equipment (e.g. bar code or OCR-B recognition) will not be able to comply with ENV 1973 because it will not be able to present to the user the correct glyphs and required coded characters. It is thus essential that tools are developed for the coexistence of UCS and other character sets. A standard is therefore being developed in Europe which specifies a way in which those inadequate systems can approximate compliance with MES by providing a fallback conversion which is optimized for legibility.

Users of current 7-bit systems that strip down 8-bit encoded texts, for example when receiving a text from Russia originally written in Cyrillic, are likely to find the text represented as "quoted printable", which is for the most part unreadable. If a conversion must be performed in order to make the text readable, the bare minimum should be a conversion into legible Latin characters. A conversion from 8-bit to 7-bit coding that replaces each character by a single substitute is by definition irreversible; the standard being developed by CEN will therefore specify the preferred Latin characters to be used for such conversion. The invariant part of ISO 646 (7-bit ASCII) is taken as the base for fallback conversion. The repertoire for which fallback is provided is that of MES/EES.
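The contrast between "quoted printable" and a legible fallback can be sketched as follows. The transliteration table is a hypothetical fragment invented for this example and is not the mapping of the CEN standard under development; the sketch also allows all of 7-bit ASCII rather than only the invariant part of ISO 646, and uses Python's standard quopri and unicodedata modules.

```python
import quopri
import unicodedata

# Hypothetical fragment of a Cyrillic-to-Latin fallback table.
CYRILLIC_FALLBACK = {
    "М": "M", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a",
}


def fallback(text: str) -> str:
    """Reduce text to 7-bit ASCII, preferring legibility over reversibility."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                      # already representable
        elif ch in CYRILLIC_FALLBACK:
            out.append(CYRILLIC_FALLBACK[ch])   # table-driven transliteration
        else:
            # Drop diacritics: decompose, keep only the ASCII base letters.
            base = "".join(c for c in unicodedata.normalize("NFKD", ch)
                           if ord(c) < 128)
            out.append(base or "?")             # nothing legible available
    return "".join(out)


raw = "Москва".encode("koi8_r")
as_quoted_printable = quopri.encodestring(raw)  # lossless but unreadable
as_fallback = fallback("Москва")                # lossy but legible
```

The quoted-printable form can be decoded back to the original bytes, whereas the fallback cannot: both "Ž" and "Z" end up as "Z", which is why the choice of substitute characters needs to be standardized.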

8. Conclusion

The problems related to the internationalization of network services are not specific to Europe, but, due to the level of technical development and the large number of languages, they are of particular concern to Europeans. Europe has therefore worked out some solutions which will provide the necessary level of internationalization for the European languages. The respective bodies in Europe (EWOS, CEN) have proposed the BMP of ISO 10 646 as a European solution, but as the number of characters in this standard is huge, subsets were designed and promoted as European standards. In spite of this European solution, the need of Europeans to communicate with organizations and people outside Europe cannot be neglected. This calls for closer cooperation on the principles of internationalization with non-European bodies and organizations.

Furthermore, network services using 8-bit coding methods will remain in use for a long time to come, alongside UCS-based services. Smooth communication between these two environments will clearly be needed. A practical way of handling this coexistence is transformation, conversion, and fallback representation. Proper methods for transcription/transliteration are therefore being developed in Europe.

9. References

[1] CEN/TC 304 N379+N439, Draft for P11: Repertoires of letters used for writing the indigenous languages of Europe.
[2] http://www.ispo.cec.be/infosoc/legreg/actionla.html - Europe's way to the information society: an action plan (Updated version, May 1995).
[3] Cordis Focus Supplement 6, 17 February 1995: Europe's way to the Information Society.
[4] EC-DG.XIII Linguistic Research and Engineering (LRE): An overview, June 1994.
[5] http://www.echo.lu - Multilingual Action Plan (MLAP): LRE--Overview of the actions launched in 1994 (October 1994).
[6] EPIISG Project 4.d: European Culturally Specific Requirements in middleware service, ETSI, 1996.
[7] Final Report, User requirements study and programming in the field of Character Sets Technology, CEN TC304, August 1995.
[8] Proceedings of the CST-workshop, held in Luxembourg 1-2 December 1994.
[9] STRÍ TS3: Nordic Cultural Requirements on Information Technology (INSTA Technical report, STRÍ TS3, 1992).

Appendix 1: Languages covered by the Mandatory European Subset

According to annex B of ENV 1973, at least the following languages are covered by the Mandatory European Subset of UCS:

Latin script:

Albanian, Basque, Breton, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Friulian, Gaelic, Galician, German, Greenlandic, Hungarian, Icelandic, Irish Gaelic, Italian, Ladin, Latin, Latvian, Lëtzebuergesh, Lithuanian, Livonian, Maltese, Manx Gaelic, Norwegian, Polish, Portuguese, Romanian, Rumantsch, Sámi languages, Spanish, Swedish, Turkish, Welsh.

Cyrillic script:

Bulgarian, Belorussian, Macedonian, Russian, Serbian, Ukrainian.

Greek script:

Greek.