Multilingual Databases and Global Internet Integration

Michael G. McKenna <mgm@sybase.com>
Sybase, Inc.
USA

Abstract

With the blinding growth of the Internet, and subsequent adoption of its paradigms for the intranet for business communications and information management, the benefits as well as the inherent problems of status quo global cyberspace have been inherited by information managers.

Specifically, for global organizations, the problem of multilingual access and multinational management of data has surfaced. How can a Chinese business track its shipments to other Chinese locations on an Internet tool designed for an American-based overnight shipping company? How should a marketing, insurance, or online shopping company manage its mailing lists and customer information in a truly World Wide Web? With the Internet, traditionally local companies become global entities overnight. Global organizations, when implementing an intranet solution, must deal with cross-cultural business rules, heterogeneous data encodings, and multiple scripts for the representation and manipulation of data.

When databases are connected to the Web, they may use different character set encodings, and different database vendors use different, proprietary protocols to ascertain or communicate the encoding being used.

By keeping track of encodings and languages, a developer can normalize to a universal character set (Unicode) for increased data integrity in data transmissions. Unicode can now be used in operating systems from Microsoft, IBM, DEC, and Sun, is available for use in database systems from Sybase, Adabas, Informix, Taligent and Oracle, and is the internal process code for Java. A tagging and encoding scheme for Unicode has been agreed upon by the IETF which enables its use for MIME encoded e-mail and HTML.

Many database vendors are actively pursuing the Internet marketplace by providing database, connectivity, search engines, multimedia, and developer tools. After analyzing the problems presented above, many of them came to realize that an integral part of the global solution is to provide Unicode solutions.

This paper will present these issues, with an analysis, and a look at solutions being implemented by database and systems providers.

Keywords: Internet case studies, expanding access, commerce, applications, Sybase, databases, Unicode, character sets, locales.

Introduction
Generic international access problems
Multilingual database problems
Solutions
Current status
Future prospects
Conclusion
References

Introduction

The World Wide Web (WWW) has been growing at an exponential pace for several years now and shows no sign of slowing down soon[1], barring bandwidth and throughput problems. As they become as ubiquitous as the telephone, the Internet and WWW are being adopted as a medium of choice for businesses communicating with internal clients, external clients, and working partners.

Internally, corporations and organizations are using the Internet for e-mail, file-transfer, traditional client/server applications, and the WWW in intranets, usually behind a firewall, for communicating employee benefits, company news, sales information, and disseminating information about processes, organizational structures, etc.

Externally, corporations and organizations are using the WWW in an intranet paradigm primarily as a "marketing" tool. This word is used loosely in that the term itself is a bit vague. Marketing, among other things, entails the dissemination of information about a product or service in the hopes that the information will be used to encourage potential clients to purchase or use that product or service. With this as a definition, then everything from a city listing of parks and playgrounds to a manufacturer describing new products or offering a virtual reality test drive of a new car is marketing.

Another aspect of marketing is the collection, analysis, and maintenance of potential client lists and customer profiles. These may have been entered at an organization's Web page, or have come from a membership, customer, or mailing list. With such information, Web content can be dynamically customized to a particular client's profile, providing user-friendly and effective publishing of information.

This is common knowledge to anyone accessing the WWW today, but the underlying mechanisms to make it work involve more than just Web browsers and Web servers. Access to data management systems is involved. Some are built-in with a Web server, such as those offered by Open Market, Netscape, and Microsoft for managing commerce transactions; others are integrated through middle-ware to relational databases such as those from Sybase, Oracle, Informix, and Microsoft. In essence, the Internet and the Web have become, almost overnight, a preferred interface to information management systems.

Cultural arrogance, ignorance, or impotence?

One of the challenges of the WWW is that even though information is decentralized throughout the world, each server is a potential island of centralized information content. As a result, the cross-cultural needs of users are often ignored. At present, over 85% of the Web sites are in the United States and are targeted at English-only users[2]. Cruising the Web, it is apparent that many sites are aimed at local metropolitan or special-interest audiences. Others are general information sources. Another group is inherently suited to international or cross-cultural audiences.

An example of this latter group would be an organization that uses an intranet to reach its own employees, with the majority of content written at a headquarters location in the organization's official language (usually English for Western global organizations). But many employees may not be able to or may not wish to speak English. Sybase, Inc., for instance, has around 6,000 employees in over 50 countries. Several hundred employees work in Japan alone, and they expect to speak, read, and write Japanese in their day-to-day information needs. But the corporation's centralized Web-based employee information systems are primarily English-oriented. A larger example is the Xerox Corporation, with 56,000 employees worldwide. Larger still are Japanese holding companies such as Sanwa, Sumitomo, etc., with a truly global presence.

Whereas an internal intranet for employee usage may seem a potentially complex system if all the languages and cultures of an organization are considered, it is actually a well-controlled situation in that the organization usually knows who works with them and can easily know the language requirements of all users. The complexity starts when the floodgates are opened to the outside world. The first case is an external intranet to clients and partners. A medium-sized civil example would be a metropolitan community information system. In California's Silicon Valley, you would expect, if it was targeted at an audience of local users only, they would all be English speakers. But in California, English is expected to become a minority language sometime early in the 21st century. In the San Francisco Bay Area, where Silicon Valley is situated, over 156 languages are currently spoken. So the assumption of English-only users does not necessarily apply. For a global corporate intranet, the languages of the clients should be considered. Fortunately, in an intranet situation, it is usually known, or easily ascertained, what the majority languages of the users may be.

Figure: Gross Domestic Product by Software Internationalization (I18n) Region

As a Web-presence is turned loose to the Internet, it becomes a global entity overnight, and the language and culture of the individual accessing the site may come from almost anywhere on the planet. Even though English comprises over 85% of the current content on the Web, English-speaking regions only account for about a third of the world's economy, and less than 10% of the world's population[3]. The cultural proportions of the Internet are changing rapidly. The fastest growing areas with respect to number of new users are now in Europe and East Asia[2], and are expected to close the gap in terms of sheer numbers (at least on a logarithmic scale) in the next three years or so.

It is not expected that every content provider will publish their Web pages in every conceivable language, or even more than one language. But several content providers have a chartered duty to reach large portions of the world, or at least the world's economy. This usually means support for at least ten languages, and as many as 36 or so languages. It is an expensive endeavor to translate Web-content into many languages. Therefore there usually must be an economic or political justification for the translations being done. Even though translation can be expensive, data acquisition and dissemination is relatively cheap. If customers enter their name in a Web page, they should be entitled to enter it, and subsequently display it in their own language and script. For data output, there should be no reason why bibliographic, geographic, linguistic, or personal information can't be displayed in its original language and script.

Generic international access problems

The Internet can be considered to be merely a huge communications system with a large number of interconnected computers on a network. This sounds simple enough, but once the data content of human-readable information deviates from the safe haven of seven bits per character, a morass of problems seeps to the surface. Even though a number of standardization measures have been created through the Internet Engineering Task Force (IETF) to address these issues, vendors and providers have been slow to implement them.

The WWW is not just a virtual pool of homogeneous HTML pages, but contains access to FTP sites, databases, e-mail, and directory services such as X.500 and LDAP. The problems associated with international Web content can be generalized into a few categories.

International access

The first and primary problem, which will probably be with us as long as humans exist, is the lack of understanding of other cultures and international issues. Each person is able to absorb only so much information, including languages. A designer will design content based on their understanding of how the world works, and hand it off to developers who then implement the designer's ideals. Europe appears to be better than other regions at creating multilingual Web sites, owing to the many languages in close proximity to each other in the European Union. Many non-English sites throughout the world offer at least English as a choice in addition to the default language. This appears to work for now since a large percentage of multilingual people choose English as their second language (even a large population in America attempts to speak English with some regularity ;-).

For those information providers who truly desire a global presence, they must step out of their language-centric arrogance or ignorance in order to reach a larger population.

Search engines, for instance, search for exactly what a user types in, sometimes using fuzzy logic for stemming and extensions of the search. But suppose one is doing research on Tchaikovsky. A typical search engine will find only sites that use the English spelling. Tchaikovsky is spelled differently in other languages, based on the phonetic transliteration from the original, and of course is written in Cyrillic in Russian. Hence, the search results may be large, but may not return pertinent information closer to the source.

Name lists

An example of the need for global data manipulation is with common names: personal names, geographic names, organizational names, and so on. In each case, when a name is created, it is given in a particular language, with a particular writing system and script. To limit all storage, manipulation, and display of names to ASCII is analogous to limiting all English names to the Greek or Cyrillic alphabets. But many systems, e.g. e-mail, limit data to 7-bit ASCII. This has necessitated the creation of base64 MIME encoding for textual e-mail information that contains non-English content. MIME META tags can be used to indicate the encoding of Web content, with the current default being Latin-1.
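The effect of base64 on a non-ASCII name can be illustrated with a minimal sketch. It assumes Python and its standard library; the sample name and the choice of KOI8-R as the source encoding are illustrative only. The point is that the bytes on the wire are all 7-bit, but the receiver must still be told which character set the decoded bytes are in.

```python
import base64

# Illustrative sample: a Russian name stored in the KOI8-R encoding.
name = "Чайковский"
raw = name.encode("koi8_r")          # 8-bit bytes, not safe for 7-bit mail

# base64 maps arbitrary bytes onto a 7-bit ASCII alphabet.
wire = base64.b64encode(raw)

# The receiver reverses the transfer encoding, then needs the charset
# label ("koi8_r" here) to turn the bytes back into characters.
restored = base64.b64decode(wire).decode("koi8_r")
```

Decoding `wire` with the wrong character set label would succeed silently and yield unreadable text, which is why the encoding tag matters as much as the transfer encoding.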

Multiple scripts

A worldwide mailing list must take into consideration the many accented characters, writing systems, and scripts in use throughout the planet to be truly user-friendly. These writing systems commonly fall into the following categories:


European
Cyrillic
Greek
East Asian
Middle Eastern

Other examples where multiple scripts must be considered are with:

Geographic mapping
Books, publishing
Bibliographic data
Directory services

A final example is with personalized access. Once users log in, they should be able to automatically get information in their own language. For example, a global bank may have a Japanese customer, Mr. Kurazawa. No matter where he logs into the banking system from around the world, he should be able to be greeted with a "Konnichiwa, Kurazawa-san," in Japanese.

Data encodings

Even if a name is correctly entered in one system for a particular script, it may be in a certain encoding that is different from that used by other users of the same information. This requires a user to have knowledge as to what the encoding is to be able to view and use it. The Internet is more or less standardized on a few character encodings, with Latin-1 (ISO 8859-1 [7]) the implicit default, but the data itself may come from many different sources. Macintosh character sets are different from IBM code pages, which are different from Unix and Windows encodings.

There are many places where a user has to adjust or convert the data to be usable. Some Web browsers allow a user to select the encoding being used, and a very few attempt to do a dynamic determination and conversion on-the-fly for the user.

Therefore, international access between communicating entities requires transparent codeset conversion.
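As a minimal sketch of such a conversion, assuming Python and its built-in codecs (the function name `to_unicode` and the sample bytes are illustrative, not taken from any vendor product), the same word stored under three platform encodings can be decoded into one Unicode string:

```python
# The letter "é" has a different byte value in each platform encoding.
SAMPLES = {
    "mac_roman": b"caf\x8e",   # Macintosh character set
    "cp437": b"caf\x82",       # IBM PC code page 437
    "latin-1": b"caf\xe9",     # ISO 8859-1
}

def to_unicode(raw: bytes, encoding: str) -> str:
    """Convert bytes in a known source encoding to a Unicode string."""
    return raw.decode(encoding)

# Once the source encoding is known, all three decode to the same text.
decoded = {enc: to_unicode(raw, enc) for enc, raw in SAMPLES.items()}

# The common form can then be re-encoded for whatever the receiver expects.
utf8_bytes = decoded["latin-1"].encode("utf-8")
```

The conversion is only transparent if the source encoding travels with the data; the bytes alone do not say which of the three tables applies.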

Business and cultural rules

Once character sets, scripts, and languages are taken care of, the local business rules and cultural conventions for display of data need to be considered. Virtually no Web sites today address simple issues such as which character to use for a decimal point, formatting of dates, etc.

For instance, February 3, 1997 can be represented (generally) as:

Region      Format
Americas    2/3/97
Europe      3.2.97
Asia        97-2-3

These are not dynamically changed perhaps because they are considered to be part of the static content of the information. But with dynamic HTML, this must be considered. Did a package ship on February 3 or March 2?

In the same way, is 10.528 the same as 10,528? Between America and the rest of the world, there is a potential for three orders of magnitude of error. These cultural issues are being addressed in part by the European Standards body, CEN/TC304, which is responsible for the International Locales Registry[14].
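The two ambiguities above can be made concrete with a small sketch. It assumes Python; the names `DATE_FORMATS`, `format_date`, and `parse_number` are hypothetical, and the three conventions are simply the generalized ones listed in the text, not a real locale registry.

```python
from datetime import date

# Hypothetical per-region date conventions, as generalized in the text.
DATE_FORMATS = {
    "Americas": "{m}/{d}/{yy}",
    "Europe": "{d}.{m}.{yy}",
    "Asia": "{yy}-{m}-{d}",
}

def format_date(day: date, region: str) -> str:
    """Render a date using the short numeric convention of a region."""
    return DATE_FORMATS[region].format(
        d=day.day, m=day.month, yy=f"{day.year % 100:02d}"
    )

def parse_number(text: str, decimal_mark: str) -> float:
    """Parse a number given which character marks the decimal point."""
    group_mark = "," if decimal_mark == "." else "."
    return float(text.replace(group_mark, "").replace(decimal_mark, "."))

shipped = date(1997, 2, 3)
american_reading = parse_number("10.528", decimal_mark=".")
european_reading = parse_number("10.528", decimal_mark=",")
```

Here the same string "10.528" yields 10.528 under one convention and 10528 under the other, which is the factor-of-a-thousand error described above. A real system would draw these conventions from a locale facility rather than a hand-written table.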

Multilingual database problems

In the same way that generic systems, graphical interfaces, and Web content suffer from problems with respect to international access, relational database systems have their own set of problems.

Proprietary encodings

Relational databases have been around for 15 years or more now--veritable dinosaurs compared to the WWW. With legacy mainframe data, the age is even greater. International (non-English) data is generally a bit younger than English-language data, with legacy data in a wide range of proprietary encodings that evolved before the concept of the LAN. Sometimes this data is all jumbled into a single database, with baroque access controls to determine what encoding is in what row or what column of a data table. Usually, data is stored in only one encoding per table or per database. Oracle supports over 150 different character sets, any one of which can be used as the default encoding for data stored in a database.

The differences between encodings of stored data must be resolved before it can be shared on an intranet or the Internet.

Proprietary protocols

Each database vendor, when it started, embarked on creating its own proprietary protocol for communication between database server and client. There are Open-Client, SQL*Net, ODBC, and OLE DB. The standards bodies have come up with RDA (Remote Database Access), but ODBC has become a de facto standard for most applications. The trouble is, ODBC is oriented around a single manufacturer's needs, and does not address the issues of accessing legacy data.

To make heterogeneous configurations more manageable, "middle-ware" has been developed to provide a single common interface with access to multiple protocols and standards. Some examples of this are the Sybase OpenConnect and OmniServer products that allow interacting with a server in a single SQL language, with transparent access to many vendors' databases. But each vendor has chosen a different, or sometimes no, method of identifying the character encoding and cultural norms of the database being accessed.

The figure below shows a schematic of common database internal architectures that can be used amongst several database vendors if a global system based on Unicode is desired[6].

Figure: Conversion and SQL_TEXT

Note that this only shows the potential similarities between systems, if they are configured appropriately. Most are still using a wide range of mutually incompatible character sets. What it does show is that each vendor usually has some form of conversion mechanism for converting the data from one character set to another. This can be used in solving some of the international access problems encountered with databases and the Web.

Dynamic HTML generation

The power of databases becomes strikingly apparent when linked with application servers that can produce dynamic HTML, based on the data content and access profile. This is being used today to serve up tailor-made news, deliver stock quotes, bank information, catalog search results, phone book results, etc. One missing item this author was unable to find was machine-generated HTML originating from a database back-end that correctly identified its encoding if different from ISO Latin-1.
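What correctly identified machine-generated HTML might look like can be sketched as follows. This is an illustration only, assuming Python; the function `render_page` and the sample rows are hypothetical stand-ins for an application server reading from a database. The encoding is declared twice, in the HTTP header and in a META tag, and the body is encoded to match.

```python
from html import escape

def render_page(title: str, rows: list[str], charset: str = "utf-8"):
    """Build an HTML page from database rows and label its encoding."""
    items = "\n".join(f"<li>{escape(row)}</li>" for row in rows)
    html = (
        "<html><head>\n"
        f'<meta http-equiv="Content-Type" '
        f'content="text/html; charset={charset}">\n'
        f"<title>{escape(title)}</title>\n"
        "</head><body><ul>\n"
        f"{items}\n"
        "</ul></body></html>\n"
    )
    # The header and the META tag must agree with the actual byte encoding.
    headers = {"Content-Type": f"text/html; charset={charset}"}
    return headers, html.encode(charset)

# Hypothetical rows, as if fetched from a customer table.
headers, body = render_page("Customers", ["倉沢", "Чайковский", "Müller"])
```

A browser receiving this response does not have to guess or ask the user for the document encoding, since the label and the bytes are produced together.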

Solutions

By keeping track of encodings and languages, a developer can normalize to a universal character set (Unicode) for increased integrity in data transmissions. This encompasses three distinct stages: encapsulate, normalize, and standardize.

Encapsulate

In the encapsulation phase, the data within a database is isolated from the outside world via a "conversion envelope." The receiving application is also encapsulated and isolated from incoming data by way of a conversion envelope. Most Web browsers today use this concept to let the user choose the document encoding. Ideally, only one set of fonts and display drivers needs to be used to increase the flexibility of the system, and allow data from multiple sources to merge into one client application.

Normalize

Once encapsulation has taken place, the data can then be normalized into a single known character set. This is ideally Unicode, with different encodings of the Standard used for different solutions and conversions done as necessary at either end of the Net.

Standardize

With the data normalized, it can be easily transformed into a single standard format. For HTML that format is becoming Unicode in the UTF-8 or UCS-2 encodings. This multitiered approach allows the use of existing data, from multiple sources, to be used by clients simultaneously in a flexible manner.
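The three stages can be sketched end to end. This is a minimal illustration under stated assumptions, not a description of any vendor's implementation: it assumes Python, the `Envelope` class and the function names are hypothetical, and the three source encodings are chosen only as examples of legacy data.

```python
from dataclasses import dataclass

@dataclass
class Envelope:
    """Encapsulate: raw bytes travel with the encoding they are stored in."""
    payload: bytes
    encoding: str

def normalize(env: Envelope) -> str:
    """Normalize: convert the enveloped bytes to Unicode."""
    return env.payload.decode(env.encoding)

def standardize(text: str) -> bytes:
    """Standardize: emit one agreed format, here UTF-8."""
    return text.encode("utf-8")

# Hypothetical legacy sources, each in its own encoding.
sources = [
    Envelope("Müller".encode("latin-1"), "latin-1"),
    Envelope("Чайковский".encode("koi8_r"), "koi8_r"),
    Envelope("倉沢".encode("shift_jis"), "shift_jis"),
]

# After the pipeline, data from all sources can merge in one client.
merged = [standardize(normalize(env)) for env in sources]
```

The design choice is that only the envelope needs to know about legacy encodings; everything downstream of `normalize` handles a single character set.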

Current status

This section takes a quick look at the current state of affairs for the development, support, and deployment of multilingual SQL-based WWW applications. First, the availability of technology that makes such endeavors feasible will be looked at, followed by the known state of actual multilingual deployments around the world.

Available technology

Operating environments

Unicode can now be used in operating systems from Microsoft, IBM, DEC, Sun, and Apple. Microsoft Windows NT uses Unicode as the default character encoding in the OS; Windows 95 has most of the Microsoft Unicode API in the MFC foundation class libraries. IBM, DEC, and Sun offer Unicode in varying degrees of implementation, ranging from a full Unicode development library provided with IBM AIX, to UTF-8 multi-byte and UCS-4 wide-character support in Solaris, following the POSIX model.

There are third party add-ons which can be used to add a Unicode interface to an otherwise plain GUI. Systems from Star+Globe in Singapore, Gamma Productions in San Diego, California, Accent in Tel Aviv, and Plan9 from AT&T all add the ability to display and enter Unicode characters. With new fonts being created, Windows NT can be used for more regions than are usually available from Microsoft.

Web technology

HTML

A tagging and encoding scheme for Unicode has been agreed upon by the IETF which enables its use for MIME encoded e-mail and HTML. HTML 3.0 is proposing to establish Unicode as the reference character set for future Web pages[9, 10].

Browsers

At the time of this conference, Netscape 4.0 should be available, which can use the Unicode fonts and input methods of Windows NT. Alis Technologies has the Tango! browser which supports virtually all business languages worldwide for both display and input purposes. The Tango! browser uses the Union Way input method editor for Asian (Chinese, Japanese, and Korean), and can be expanded further with products from Union Way. Accent's Multilingual Mosaic is designed around Unicode. The Microsoft Front Page HTML editor supports Unicode HTML as well.

Java

The Java language was designed from its inception with Unicode in the UCS-2 encoding as the only type for character strings. Conversion functions are specified to get in and out of other character sets or encodings. With the completion of the internationalization specifications in 1996, it now provides a robust portable internationalized application language that can support dynamic cultural formatting of strings and data from any language[8].

VRML

VRML 2.0 uses UTF-8 as the primary encoding.

Databases

Many database vendors are actively pursuing the Internet marketplace by providing database connectivity, search engines, multimedia, and developer tools. After analyzing the problems presented above, many of them came to realize that an integral part of the global solution is to provide Unicode solutions.

Unicode is available for use in database systems from Sybase, Adabas, Informix, Interbase and Oracle.

The diagram below shows different ways in which database vendors have resolved or are resolving the aspect of normalizing data encodings for global access. The small vertical boxes represent conversion engines, either on the client or the server. With these scenarios, it is possible to get Unicode in and out of virtually any of the production databases available today. The italicized entries are due out in future releases, and refer to new Unicode datatypes that will allow legacy systems to be upgraded and enabled for Unicode data manipulation alongside existing encodings.

ODBC

The ODBC 3.0 standard interface provides a new Unicode datatype, allowing seamless interfacing between Unicode applications and Unicode data stores on a SQL RDBMS.

JDBC

The new JDBC standard provides a portable Java-based direct interface to database servers. Part of its specification is the use of Unicode strings with binding to Java character types, which are encoded in Unicode.

Actual use of global technology

With the enabling technologies of Java, JDBC, ODBC, Unicode datatypes, Unicode SQL Servers, and Unicode GUIs, it would seem that there would be a glut of global-ready applications. But the technology is still new and the incentives are relatively low for the typical information provider publishing in single languages.

Probably the most active area for global-ready content at present is in the bibliographic and government sectors. The European Commission is beginning to request Unicode-enabled application design tools and Unicode-enabled databases from the software providers who supply them.

As portable, globalized Java applications become more prevalent, Unicode in HTML and other Internet content will rise as well.

A major technical problem will be in the area of normalizing the data used by search engines. Without properly tagged content, the database used by the search engine will continue to be a mix of encodings. This will require some work to allow for universal searching for concepts and ideas.

Future prospects

Unicode in databases

The first step we will most likely see is the use of Unicode in online databases as used by global information content suppliers or government information systems such as the European Union. In order to make the data available to a reasonable minimum audience, conversions may take place from the encapsulated data to acceptable character set defaults. This step of encapsulation may itself be skipped if Java continues to speed ahead as it is doing now.

Unicode in middleware

Middleware will then be used to take the normalized data and provide connectivity to legacy data stores and older GUIs or browsers.

Unicode in browsers

In this next year, we will see Unicode become commonplace in the GUI environment for most Web Browsers. This will enable end-to-end portability of data without regard to the original encoding of the data itself.

Leveraged change in global applications

As Unicode-enabled, and therefore globally ready, WWW, Java, and database applications become available for general consumption, we should see a ground swell of additional support from smaller providers. Eventually, all Web content should be properly tagged so that it can cooperate with the next generation search engines, Web browsers, Java applets, and database integration schemes.

Conclusion

This paper has shown that there are several issues and problems associated with creating effective multilingual database connectivity with the Web. It has also shown that solutions exist today to solve these problems. Developers should not be discouraged from creating a global-ready application or Web site using Unicode. In fact, they should be encouraged to do so if they desire to be ahead of the curve.

References

[1] Erik Huizer, SURFnet ExpertiseCentrum, Some Internet growth data, 16 April 1996
[2] Mark K. Lottor, Network Wizards, Internet Domain Survey, 17 January 1997
[3] M. McKenna, "Converting a Multinational Software Company to Unicode: A Case History to Date," Proceedings of the Tenth International Unicode Conference, Mainz, Germany, 1997
[4] M. McKenna, "Business Solutions with Multi-Lingual Databases and Internet Services: World-Class Systems for the WWW," Proceedings of INET'96
[5] M. McKenna, "Business IT Infrastructure Issues in a Global Environment," Proceedings of the 1996 International Sybase User Group European Conference, Barcelona, 1996
[6] M. McKenna, "Unicode in World-Class Database Systems," Proceedings of the Tenth International Unicode Conference, Mainz, Germany, 1997
[7] Michael K. Gschwind, ISO 8859-1 National Character Set FAQ, Version: 2.9889, 3 February 1997
[8] JavaSoft, Sun Microsystems, JDK 1.1 Internationalization Specification, 4 December 1996
[9] Gavin Nicol, The Multilingual World Wide Web, 16 October 1996
[10] W3C, Internationalization/Localization, Non-Western Character Sets, Languages, and Writing Systems, 16 September 1996
[11] S. Buchta, "National Language Support and Unicode in Relational Databases and SQL," Ninth International Unicode Conference Proceedings, Part 2, Unicode Consortium, 1996, San Jose, California
[12] H. Yoshioka and J. Melton, Character Internationalization in Databases: A Case Study, Digital Technical Journal, vol. 5, no. 3 (Summer 1993): 80-96
[13] The JDBC(tm) Database Access API, Sun Microsystems, 1996
[14] CEN/TC304 Secretariat, CEN/TC304/WG2 Cultural Elements, 6 December 1996
