Europe and the International Character Sets: Strategy of Implementation and Development of the Networked Services

Borka Jerman-Blazic <borka@e5.ijs.si>
Laboratory for Open Systems and Networks
Jozef Stefan Institute, University of Ljubljana, Slovenia

Abstract

This paper presents a summary of the report prepared under mandate M/037 from the European Commission (EC) to develop a European strategy for Character Set Technology (CST) and internationalization. The problems related to internationalization are not Europe-specific, but due to the level of technical development and the large number of languages, they are of particular concern to Europeans. Therefore, Europe has elaborated some solutions which will provide the necessary level of internationalization for the European languages. These solutions are direct outcomes of the strategy developed under mandate M/037, issued to CEN (the European Committee for Standardization) by the EC.

1. Introduction: The European dimension

Europe comprises some 45 states and around 700 million inhabitants. In Europe, some 160 different indigenous languages (of which perhaps 70% have written forms [1]) are spoken, each with its particular socio-cultural characteristics. In addition, a large number of non-indigenous languages are spoken by substantial immigrant communities.

The development of education, business, communication, and leisure together with many other factors has led to the situation where most Europeans are able to understand--and in many cases speak--more than one language.

Europe today thus finds itself in a unique situation in the developed world. In a relatively small, relatively densely populated area, many different languages and cultures are mixed. Different languages are used in the swiftly growing communication between different people and areas, a phenomenon which puts obvious demands on the means of communication.

Intra-European trade is increasing, as is business in general, and intra-European travel and administrative contact are growing as a result. Intra-European tourism and other leisure activities involving communication by many different means are also growing, which has increased the demand for communication services.

The recent and ongoing upsurge in the development and use of multimedia is a striking example of culture-dependent, software-based communication. The need for the localization of such software is obvious; the need for the internationalization and localization of means of communication for professional purposes has long been established and becomes more urgent by the day. Europe has been aware of these needs and problems, as is reflected in many European initiatives.

2. European Union (EU) initiatives

As of January 1, 1995, the European Union acquired three new members, bringing in two new Union languages plus the Sami languages. Discussions on the inclusion of the former Eastern Bloc European countries are now progressing, and even if it will likely take five to seven years before the first extension of the Union is made in that direction, business contacts and administrative cooperation are growing rapidly.

Because of the need for a flow of administrative information among EU member states and between the member states and the Commission, the Commission began some time ago to initiate projects which could provide the requisite conditions. One of the most ambitious projects, the European Nervous System (ENS), also known as "Support for the Establishment of Trans-European Networks between Administrations", triggered additional acts in language support due to difficulties encountered in the field. Similar acts followed projects driven by private markets but with substantial support from the EU, e.g. TEDIS (Trade EDI Systems) [2].

The preservation and promotion of cultural and linguistic diversity is one of the guiding principles on which EU policy for the information society rests [3]. The Commission is preparing additional acts to address European linguistic issues and the means to stimulate the emerging language-based industry. These activities address the stimulation, coordination, and regulatory initiatives to be undertaken in cooperation with the member states for the creation of a linguistic infrastructure of resources and services that improve language support in data communication networks. The proposed measures increase the use and efficiency of information and communication systems while contributing to the enrichment of Europe's linguistic diversity. As a result, the European Union is making large efforts to create an extensive, multi-purpose data communications network encompassing all member states and reaching toward prospective members. Clearly, these efforts are also oriented toward the provision of facilities for handling the alphabetical and cultural differences of the European nations [4].

3. The program

The Commission is funding a Linguistic Research and Engineering program--an effort to involve industry in linguistic engineering and to provide European users with the infrastructure that will enable some 360 million EU citizens to handle text information in a variety of languages. Some of the funded projects are: Multilingual Application Interface for Telematic Services; Multilingual Authoring of Business Letters; Multilingual Access to Yellow Pages; and Open Translation Environment for Localization, among others [5].

In the field of standardization and character sets, the work is driven through the definition of user requirements. An example of this is the project undertaken by ETSI [6], which specifies the European Culturally Specific Requirements in middleware. Another example, which addresses character set requirements and specifies EU strategy, is the project undertaken by CEN/TC304 [7]. The main objective of that project was to prepare acts and activities on the development and implementation of standards (where necessary) in a way that will facilitate computer-based communication between people in different languages and cultures. The major aim was the development of a methodology which will pull other regions along, even if the languages and cultural elements are locally specific.

As a part of the program, a workshop was set up by CEN/TC304 to give an extensive overview of user requirements concerning internationalization and character set technology. Its results [8] provided guidance to the project team that prepared the report for the EC.

4. Summary of the EU requirements for character sets technology and internationalization

The report prepared by the TC304 Project team [4] identifies several types of users, ranging from average end users through manufacturers, service providers, and developers of international standards. In general, the following requirements were identified to apply:

5. Character sets: EU strategy and implementation

Europe and the rest of the world still seem to have mixed opinions regarding the use of character sets in network services. The TC304 report sees the future in the use of UCS (the Universal Character Set of ISO 10 646), but problems related to this vision are still on the table.

Users still use equipment based on 8-bit coding and network services do not implement UCS. Users are usually forced to accept what the suppliers offer, and the selection of character sets for any one product is usually narrow. Over the years, a base of installed hardware, software and data has been built up in Europe which uses a range of different character sets and codes (more than 50 coded character sets, 7-bit or 8-bit based). Therefore, although UCS is able to cater to all current and projected needs, there will exist during the foreseeable future a number of different coding schemes.
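The difficulty with a mixed installed base of 8-bit codes can be illustrated with a short sketch. It relies only on Python's built-in codecs for four parts of ISO 8859; the choice of these four sets and of the byte value is an assumption made for illustration and does not come from the TC304 report.

```python
# One and the same byte value, interpreted under four 8-bit coded character sets.
# Without knowing which set the sender used, the receiver cannot tell which
# character was meant.
raw = bytes([0xE8])

CODECS = ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7")

decoded = {codec: raw.decode(codec) for codec in CODECS}

for codec, ch in decoded.items():
    # Print the UCS code position and the character for each interpretation.
    print(f"{codec}: U+{ord(ch):04X} {ch}")
```

Each interpretation lands on a different UCS code position, which is why data must carry (or be accompanied by) an indication of the coded character set in use.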

Nevertheless, the user should ideally perceive that the character set interface of his choice is the same all over the world, regardless of region or country where the data originated, and without intrusion of any underlying technical complexity of communication service. As a consequence, conversion capabilities between all existing coded character sets will be required. UCS is the primary building block for this, since it encompasses the majority of coded character sets used in the world and all of those used in Europe. Its implementation will provide the required integrity of characters as well as the required support for multi-linguality in advanced networked services.
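The role of UCS as a building block for conversion can be sketched as follows. This is a minimal illustration, assuming Python's built-in codecs and using the Python string type as the UCS pivot; the function name `convert` and the sample text are hypothetical and not taken from the report.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert between two coded character sets via UCS as the pivot."""
    text = data.decode(source)   # source code -> UCS characters
    return text.encode(target)   # UCS characters -> target code


# The same text in two 8-bit codes that both contain the needed letters.
latin2 = "Jožef Štefan".encode("iso8859_2")
cp1250 = convert(latin2, "iso8859_2", "cp1250")

# A target set that lacks some of the characters cannot represent the text.
try:
    convert(latin2, "iso8859_2", "iso8859_1")
    lossless = True
except UnicodeEncodeError:
    lossless = False
```

With a pivot, each coded character set needs only one mapping to and from UCS, rather than one table for every pair of sets. The failing case in the sketch shows where integrity of characters is lost: the target repertoire is simply too small.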

Obstacles for the wide use of UCS are still present. Some of them identified in the report are:

6. A step forward in the EU: The European subsets of UCS

Some of the implementation problems presented above could be solved by the provision of subsets of UCS. It was estimated that full UCS implementation will be costly in the first stage of UCS use and that suppliers will implement only a subset or subsets. To ensure that a common subset which can be used by the vast majority of European users is available for a reasonable price, and as a guide to manufacturers, a standard was developed and adopted (i.e. the European subsets of UCS) that encompasses all characters in European languages. In addition, frequently used characters provided in the equipment of different suppliers were added to the subsets.

The European standard ENV 1973 consists of two subsets: the Mandatory European Subset (MES) and the Extended European Subset (EES). The MES encompasses the three major European scripts--Latin, Greek, Cyrillic--and the special characters identified through the most popular proprietary coded character sets used in Europe, as well as registered character sets used in telematic services like ISO 6937 (T.61). MES coding supports level 1 of ISO 10 646. With the character repertoire of the MES, 27 official languages using Latin script, 5 official languages using Cyrillic script, Greek (politonico and monotonico), and 16 regional languages using Latin script can be written. (See Appendix 1.) The number of characters in the subset is 926.

The EES is another selected subset of UCS and is a superset of the MES. EES differs from MES mainly by basing its selection of characters on script and function, rather than on use in particular languages. EES includes, exhaustively, the collections of UCS characters containing the Latin, Greek, Cyrillic, Armenian, and Georgian scripts, together with those collections of symbols used academically, commercially, and scientifically in Europe. These scripts comprise a set of historically related alphabetic scripts of singular cultural importance to Europe. The number of characters in the EES is 3109.
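Selection by script, as described for the EES, can be approximated in a few lines. The sketch below is only an approximation under a stated assumption: it tests the prefix of each letter's Unicode character name, whereas the real EES is a fixed list of 3109 characters (including symbol collections) defined in the standard and not reproduced here. The function name `outside_scripts` is hypothetical.

```python
import unicodedata

# ASSUMPTION: character-name prefixes stand in for the EES repertoire.
# The actual subset is an enumerated list in the standard, not a name test.
SCRIPTS = ("LATIN", "GREEK", "CYRILLIC", "ARMENIAN", "GEORGIAN")


def outside_scripts(text: str) -> list[str]:
    """Return the letters in `text` that belong to none of SCRIPTS."""
    return [
        ch
        for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith(SCRIPTS)
    ]


covered = outside_scripts("Ljubljana Αθήνα Москва Հայ ქართული")
uncovered = outside_scripts("שלום")
```

A supplier implementing a subset could use a check of this kind, backed by the real repertoire list, to decide whether incoming text can be presented as is or needs fallback treatment.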

The other step undertaken by CEN/TC304 was the adoption of the standard for the registration of cultural elements. The European standard ENV12005 is a helpful source of information for software developers. It provides worldwide visibility and unique references to cultural conventions. The data provided through the register are accessible over the network. The data and information are specified in a formal way (similar to POSIX locales) and in a narrative way. Both European subsets are registered under ENV12005 as repertoire maps.

7. Ongoing work

Users may have data represented in different forms at various stages of computing for as long as can be foreseen; most likely this will eventually be in various forms of UCS, but data encoded by other methods will also be in common use for a long period of time. To presume a situation where only one character set exists is not realistic. This is true for Europe, even though UCS is capable of covering all European needs. For various reasons, some specialized equipment (e.g. bar code or OCR-B recognition) will not be able to comply with ENV 1973 because it will not be able to present to the user the correct glyphs and required coded characters. It is thus essential that tools are developed for the coexistence of UCS and other character sets. Therefore, a standard is being developed in Europe which specifies a way in which those inadequate systems can approximate compliance with MES by providing a fallback conversion which is optimized for legibility.

A current 7-bit system that receives an 8-bit encoded text, for example a text from Russia originally written in Cyrillic, is likely to find the text represented as "quoted printable", which is for the most part unreadable. If a conversion must be performed in order to make the text readable, the bare minimum is a conversion into legible Latin characters. A one-to-one conversion from 8-bit to 7-bit is by definition irreversible; therefore the standard being developed by CEN will specify the preferred Latin characters to be used for such conversion. The invariant part of ISO 646 (7-bit ASCII) is taken as the base for fallback conversion. The repertoire for which fallback is provided is that of MES/EES.
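The contrast between a quoted-printable rendering and a legible fallback can be sketched as follows. The mapping table and the function `fallback_7bit` are hypothetical illustrations; they are not the mappings of the CEN standard under development, which are not given in this paper. Only the Python standard library is used.

```python
import quopri
import unicodedata

# ASSUMPTION: a tiny illustrative table, not the CEN fallback mappings.
CYRILLIC_FALLBACK = {"М": "M", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a"}


def fallback_7bit(text: str) -> str:
    """Map each character to a legible 7-bit approximation (irreversible)."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)                      # already 7-bit
        elif ch in CYRILLIC_FALLBACK:
            out.append(CYRILLIC_FALLBACK[ch])   # table lookup
        else:
            # Decompose the character and keep only its 7-bit base letters,
            # dropping diacritical marks; "?" if nothing legible remains.
            parts = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in parts if ord(c) < 0x80)
            out.append(base or "?")
    return "".join(out)


word = "Москва"
# What a 7-bit recipient sees when 8-bit KOI8-R text is sent quoted-printable.
qp_view = quopri.encodestring(word.encode("koi8_r")).decode("ascii")
# What the same recipient sees after a legibility-oriented fallback.
legible = fallback_7bit(word)   # "Moskva"
```

The quoted-printable form preserves every byte but is unreadable to a person, while the fallback form is readable but cannot be converted back to the original text; this is the trade-off the text above describes.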

8. Conclusion

The problems related to the internationalization of network services are not specific to Europe, but, due to the level of technical development and the large number of languages, they are of particular concern to Europeans. Therefore, Europe has elaborated some solutions which will provide the necessary level of internationalization for the European languages. The respective bodies in Europe (EWOS, CEN) have proposed as a European solution the BMP of ISO 10 646, but as the number of characters in this standard is huge, subsets were designed and promoted as European standards. In spite of this European solution, the needs of Europeans to communicate with organizations and people outside Europe cannot be neglected. This calls for closer cooperation with non-European bodies and organizations regarding the principles of internationalization.

Furthermore, network services using 8-bit coding methods will remain in use for a long time to come, alongside UCS-based services. It is obvious that smooth communication between these two environments will be needed. An advantageous way of handling this coexistence is transformation, conversion, and fallback representation. Therefore, proper methods for transcription/transliteration are being developed in Europe.

9. References

Appendix 1: Languages covered by the Mandatory European Subset

According to annex B of ENV 1973, at least the following languages are covered by the Mandatory European Subset of UCS:

Latin script:

Albanian, Basque, Breton, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Friulian, Gaelic, Galician, German, Greenlandic, Hungarian, Icelandic, Irish Gaelic, Italian, Ladin, Latin, Latvian, Lëtzebuergesch, Lithuanian, Livonian, Maltese, Manx Gaelic, Norwegian, Polish, Portuguese, Romanian, Rumantsch, Sámi languages, Spanish, Swedish, Turkish, Welsh.

Cyrillic script:

Bulgarian, Belorussian, Macedonian, Russian, Serbian, Ukrainian.

Greek script:

Greek.