[INET'98] [ Up ][Prev][Next]

Web Internationalization and Java Keyboard Input Methods

LEONG Kok Yong
LIU Hai
Oliver P. WU
National University of Singapore


Internationalization (I18N) has gained much momentum in recent years. Even the latest HTML 4.0 public draft has taken great strides towards the internationalization of documents, with the goal of making the Web truly "World Wide."

This paper starts by describing the various developments in HTML standards that have made the Web more global. It then points out that the correct display and rendering of multilingual text is only half of the I18N story. Users not only wish to view I18N HTML documents, but they also want to create them! With this in mind, the paper goes on to explain why Java has not completely fulfilled its role as an I18N development platform, especially in the area of native keyboard input methods.

To meet this shortcoming, this paper explains how the development of a Java Input Method Engine (JIME) fills the gap. It continues with a description of the design issues and implementation of the framework -- an applet, a Netscape Composer plug-in, and a Unicode-based multilingual text editor. It ends with an account of the ongoing development of JIME.

In conclusion, it would be ideal if Java had included full support for native keyboard input methods in the core APIs. An early preview of JDK 1.2 introduces an input method framework, but perhaps only the next iteration of Java releases will offer full input method support regardless of the locale of the host platform.


1. Introduction

The integration of the Web and i18n is well under way, especially with the W3C's HTML 4.0 [2] working draft. The i18n features incorporated from RFC 2070 [3] into HTML 4.0 include the use of ISO 10646 (Unicode) as the document character set for HTML, the LANG attribute for specifying the language of content, the DIR attribute for specifying the direction of text, and the META tag for specifying the charset in the HTML header. With such enhancements, the Web will be able to broaden its reach to more corners of the Internet.

Although the Web used to be dominated by English HTML documents using the ISO-8859-1 character set, HTML documents are increasingly written in many other native languages and encodings as well. Riding on this trend, major browsers like Netscape Navigator and Microsoft Internet Explorer have included support for viewing i18n HTML documents, provided that the appropriate fonts are installed. The creation of such documents is also possible, although not without some difficulty, as we will explain later.

1.1. Towards a more internationalized HTML

The Hypertext Markup Language (HTML) is a markup language used to create platform-independent hypertext documents. In the beginning, the use of HTML on the World Wide Web was largely confined to the ISO-8859-1 character set, which works well only for Western European languages. Nevertheless, HTML is also widely used with other languages, using other character sets and encodings, at the expense of interoperability.

Prior to HTML 4.0, internationalization features were evidently missing: HTML 2.0 and HTML 3.2 [1] had no tags to specify the character set of the document (the default charset is ISO-8859-1). Neither were there tags to indicate the text direction, which is especially important for right-to-left scripts like Arabic and Hebrew.

In the days of HTML 2.0 and 3.2, users encountering multilingual HTML documents while browsing the Web were often forced to guess the character set on which a particular HTML document was based. The most intuitive guess would be based on the top-level domain of the Web site. For example, if the domain is .jp, the HTML documents would most likely be in one of the Japanese encodings -- EUC-JP, SJIS or JIS. However, the HTML document could also be in another double-byte encoding like Chinese GB or Korean KSC, or in some other character set altogether. Alternatively, the HTML document could contain an English statement informing the user which character set is used. The user would, upon reading it, switch the browser to the appropriate character set viewing mode.

Along the way, some browsers like Netscape Navigator and Microsoft Internet Explorer began to support a FACE attribute on the FONT tag. With this, HTML authors can specify a particular font for viewing a certain part of the text within an HTML document. This proves especially useful for 8-bit character sets. However, the use of FONT FACE [4] is considered harmful in many instances, since a document should not dictate which font is used to display it. Instead, it should specify the character set on which it is based.

With the HTML 4.0 public draft, the W3C introduced international capability into the Web. Browsers, including Netscape Navigator and IE, can recognize the character set specified in the HTML document header. If the user has provided the appropriate font settings for each language encoding supported, the browser will be able to automatically display the HTML document in the specified language encoding.

2. I18N and Java

Since its inception, Java has been promoted as a cross-platform development language. It was also designed from the start to support i18n applications. Version 1.0 already supported Unicode strings, though it could not really do much else for i18n.

The subsequent release, version 1.1 [5], added more support for the development of localizable applets and applications in Java. Enhancements include the display of Unicode characters, a locale mechanism, localized message support, locale-sensitive date and time, time zone and number handling, collation services, character set converters, parameter formatting, and support for finding character/word/sentence boundaries. This is a large step towards i18n in Java. The ability to add fonts to the Java runtime environment makes it possible to display Unicode characters. Locales and related services also make it possible to write an application once and port it later to other language contexts through the use of resource bundles. The character set conversion utilities (to and from Unicode) make interchange between widely used encodings and Unicode quite effortless.
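As a minimal sketch of the character set conversion mentioned above, the following program decodes bytes from one encoding into a Unicode string and re-encodes the string in another encoding. The class and method names are illustrative only. ISO-8859-1 and UTF-8 are used here because every Java runtime must support them; whether converters for CJK encodings are present depends on the runtime.

```java
import java.io.UnsupportedEncodingException;

// Illustrative sketch (not part of JIME): convert text between
// two encodings by going through Java's Unicode String type.
final class EncodingDemo {

    // Decode the input bytes to a Unicode String, then encode that
    // String in the target encoding.
    static byte[] convert(byte[] input, String fromEncoding, String toEncoding) {
        try {
            String unicode = new String(input, fromEncoding);
            return unicode.getBytes(toEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };   // e-acute in ISO-8859-1
        byte[] utf8 = convert(latin1, "ISO-8859-1", "UTF-8");
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < utf8.length; i++) {
            hex.append(Integer.toHexString(utf8[i] & 0xFF)).append(' ');
        }
        System.out.println(hex.toString().trim());   // prints "c3 a9"
    }
}
```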

However, one very important missing feature in Java version 1.1 is the provision for native keyboard input methods.

3. Java and keyboard input methods

3.1. Why do we need Java input methods?

Without native input methods, localized applications cannot accept non-Roman keyboard input from users. A 'true' localized Java application should not only be able to display localized text; it should be able to accept localized character input from users.

Languages that do not use the Roman alphabet require special keyboard mappings for character input. If you are familiar with Chinese, Japanese or Korean (abbreviated as CJK), you will understand that inputting these characters is not easy using an English keyboard layout. You need a keyboard manager to trap your keystroke sequence and transform it into valid characters; some methods require the user to choose from a list of candidates. This applies to other languages as well. For example, some Indian languages have a phonetic keyboard layout that is much more complex than direct keyboard mapping. Thai is another language that needs a remapped keyboard in order to be typed.

In many instances, a complete GUI application written in Java will need to accept character input from users through widgets like text fields. Currently, such text entry is largely limited to Roman characters (English keyboards). The initial core Java API did not support keyboard input methods for other languages and scripts, so applications cannot accept non-Roman keyboard input from users.

3.2. JDK 1.2 and input method framework

In the beta release of the JDK version 1.2, an input method framework [8] is built into the core Java platform. Based on information extracted from the JDK 1.2 documentation, "the only input methods supported are native input methods integrated with the host input method managers. These are - the Input Method Manager on Win32, the Text Services Manager on Mac OS, and XIM on Solaris. The host input method adapter plays the role of an input method within the Input Method Framework, and translates events and requests between the data models used by AWT and the Input Method Framework on one side and the host's input method manager on the other side."

Because all AWT widgets are peer components, they rely on the functionality of the host platform's widgets. As such, if the Java application runs on Japanese Windows 95, there will be Japanese input methods but not Chinese or Korean input methods, and vice versa. A Java application running on an English platform will not enjoy any input method support other than the U.S. English keyboard. In short, Java applications will receive only partial input method support from the host platform in JDK 1.2. JFC (Java Foundation Classes) may overcome this restriction, but most probably only in the next Java release.

4. JIME: Java Input Method Engine

To this end, we focus on developing a Java Input Method Engine (abbreviated as JIME) to allow Java applets and applications to accept non-Roman character input.

JIME is being developed on many different fronts. When we started our implementation, due to limited support by browsers for JDK 1.1, we started with 'jInput', a Java applet using JDK 1.0. To extend its usefulness, we ported it to a Java plug-in for Netscape Composer. Both applet and plug-in support various input methods for Chinese, Japanese and Korean.

Along the way, Netscape announced the finalization of JDK 1.1 support for Netscape 4.0 with a patch. As JDK 1.1 support improves, we are re-focusing our development effort toward JDK 1.1 and re-deploying our framework to make use of its better i18n support. Support for more languages like Thai, French and German is being added.

In the sections that follow, we first describe the design and implementation of our JDK 1.0 model, and then move on to the design issues and implementation of the subsequent JDK 1.1 model.

5. Design and implementation (based on JDK 1.0)

5.1. Design issues

5.1.1. Input methods

The input method mapping from user keystrokes to the corresponding character codes is implemented as a very simple 2-dimensional lookup array of strings. Example 1 below shows the source of the mapping for the PinYin input method based on GB encoding.

Example 1: Extract of source for PinYin method in GB encoding

static private String[] keys = {
"ao",........};
static private String[] mappings = {

For example, when the user keys in the character "a," the candidates displayed for the user to select are 0xb0a2, 0xb0a1, 0xbac7, and so on. If the next key pressed is "n," the selection changes to another set of candidates -- 0xb0b2, 0xb0b8, 0xb0b4, 0xb0b5, 0xb0b6, and so on.
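The table lookup described above might be sketched as follows. This is not the jInput source: the class name, the two sample keys, and the use of an int array for the GB codes (rather than strings) are assumptions made for readability. Only the candidate codes themselves are taken from the example in the text.

```java
// Illustrative sketch of a table-lookup input method; not the jInput source.
final class LookupInputMethod {

    // Keystroke sequences; keys[i] maps to the candidates in mappings[i].
    private static final String[] keys = { "a", "an" };

    // Candidate characters as GB codes, taken from the example in the text.
    private static final int[][] mappings = {
        { 0xb0a2, 0xb0a1, 0xbac7 },
        { 0xb0b2, 0xb0b8, 0xb0b4, 0xb0b5, 0xb0b6 },
    };

    // Linear search over the key table. Returns an empty array when the
    // keystrokes typed so far match no entry.
    static int[] candidates(String keystrokes) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(keystrokes)) {
                return mappings[i];
            }
        }
        return new int[0];
    }

    public static void main(String[] args) {
        // Simulate the user typing "a" and then "n".
        StringBuilder typed = new StringBuilder();
        char[] strokes = { 'a', 'n' };
        for (int i = 0; i < strokes.length; i++) {
            typed.append(strokes[i]);
            int[] found = candidates(typed.toString());
            System.out.print(typed + " ->");
            for (int j = 0; j < found.length; j++) {
                System.out.print(" 0x" + Integer.toHexString(found[j]));
            }
            System.out.println();
        }
    }
}
```

Each additional keystroke simply triggers a fresh lookup with the longer key, which is why the candidate set changes from the "a" row to the "an" row.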

5.1.2. Java Bitmap Font

Although JDK 1.1 supports the use of host fonts, JDK 1.0 does not. To maintain backward compatibility with Netscape 3.0, we designed a bitmap internal font for the applet. It's quite efficient and compact, and very similar to HBF (Hanzi bitmap format) [10]. Example 2 below shows the Java bitmap font for simplified Chinese based on GB encoding.

Example 2: Extract of source for Java bitmap font for simplified Chinese in GB encoding

static private final String[] bitmap = { 

We use a 16-by-16 bitmap for each CJK character. So the first 16 characters in the string above are the hex bitmap for the first character, 0xA1A1, in the GB charset. The subsequent 16 characters are for 0xA1A2, and so on.
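One way to read this layout is that each 16-bit Java character holds one row of 16 pixels, so 16 consecutive characters make up one glyph. The sketch below works under that assumption, which the text does not confirm. It also assumes a single font string, glyphs in GB code order with 94 codes per row (second byte 0xA1 to 0xFE), and the leftmost pixel in the most significant bit. The class name and the two-glyph sample font are hypothetical.

```java
// Illustrative sketch of a packed 16-by-16 bitmap font; the layout is an assumption.
final class BitmapFontDemo {

    static final int ROWS = 16;            // 16 pixel rows per glyph, one char each
    static final int CODES_PER_ROW = 94;   // GB second byte runs from 0xA1 to 0xFE

    // Position of a GB code in the font, counting from 0xA1A1.
    static int glyphIndex(int gbCode) {
        int hi = (gbCode >> 8) & 0xFF;
        int lo = gbCode & 0xFF;
        return (hi - 0xA1) * CODES_PER_ROW + (lo - 0xA1);
    }

    // True when the pixel at column x, row y of the glyph is set.
    static boolean pixel(String font, int gbCode, int x, int y) {
        char row = font.charAt(glyphIndex(gbCode) * ROWS + y);
        return ((row >> (15 - x)) & 1) == 1;
    }

    // Renders a glyph as 16 lines of text, '#' for set pixels.
    static String render(String font, int gbCode) {
        StringBuilder out = new StringBuilder();
        for (int y = 0; y < ROWS; y++) {
            for (int x = 0; x < 16; x++) {
                out.append(pixel(font, gbCode, x, y) ? '#' : '.');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical two-glyph font: a blank glyph, then a box outline.
        StringBuilder font = new StringBuilder();
        for (int y = 0; y < ROWS; y++) {
            font.append((char) 0x0000);
        }
        for (int y = 0; y < ROWS; y++) {
            font.append(y == 0 || y == ROWS - 1 ? (char) 0xFFFF : (char) 0x8001);
        }
        System.out.print(render(font.toString(), 0xA1A2));
    }
}
```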

5.2. Implementations

5.2.1. jInput: Java applet

Many search engines around the world are capable of indexing and searching for keywords in languages that use double-byte character sets, for instance Chinese, Japanese and Korean (CJK). To cite just a few, they include GoYoYo, Yahoo and AnySearch.

Usually, users are expected to find their own way to enter CJK characters into the text field of the HTML form to search for keywords. This does not pose a problem for users on a native platform, but users on an English platform or other locale platforms need to install third-party applications to input CJK text. Many such applications are available for Windows, usually running as a keyboard manager, but few are available for the Macintosh. Moreover, such third-party applications for Windows and Macintosh are usually either commercial software or available on a trial basis with expiration dates. There are various IME servers for UNIX systems, but these are not easy for novice users to set up. Users thus experience much inconvenience just to input a few characters for searching.

To assist those users on non-CJK locale platforms, we developed a Java applet - jInput - allowing the user to input CJK text without having to install any third-party applications. Web sites adopting this applet do not need elaborate instructions for users on installing third-party applications in order to input CJK text.

The advantage of this approach is that the user is not expected to install any keyboard manager on his or her system. The user simply waits for the applet to be downloaded and enters the text into the Java applet. Before the form is submitted, JavaScript calls a public method in the Java applet to retrieve the text content from the applet. Netscape LiveConnect [6] technology allows JavaScript to call methods in Java classes. In this way, the applet works seamlessly with the HTML form as if it were a plain-text field. Both Netscape 3.0/4.0 and IE 4.0 currently allow such JavaScript-to-Java communication.

The applet currently supports the following language encodings and input methods.

  • Chinese GB2312 with PinYin and CangJie methods.
  • Chinese Big5 with PinYin, CangJie and Simplex methods.
  • Japanese EUC-JP with RomanKana and TCode methods.
  • Korean KSC with Hangul and Hanja methods.

A demo of the jInput applet is available online. See below for a screen shot of the applet.

Figure 1: Screen shot of jInput applet

5.2.2. JIMEPlug: Netscape Composer plug-in

Netscape Composer 4.0 allows developers to write plug-ins (in Java) [7] to extend the functionality of the HTML editor. Composer currently allows users to view multilingual text in its editing window, given that the required fonts are installed and appropriate settings configured. Unfortunately, it does not provide a localized keyboard to the user to edit the multilingual text being displayed; the host platform is expected to provide the input method manager.

The above-mentioned applet has been ported to run as a Netscape Composer plug-in named "JIMEPlug". This is especially useful for users who are on English or other non-CJK locale platforms but wish to have the i18n capability. If you set Netscape Messenger to send HTML e-mail messages, you can even type your e-mail message in CJK with the help of the plug-in.

As with any plug-in, installing this plug-in simply involves downloading a ZIP file to the plug-ins directory where the Communicator application is installed. The plug-in is a self-contained unit with its own fonts and keyboard input methods, supporting various Chinese, Japanese and Korean input methods and encodings. Because Netscape Composer uses Unicode for its internal representation of characters within an HTML document, authoring of CJK documents in Unicode is also possible.

A beta prototype of the plug-in is available online.

6. Design and implementation (based on JDK 1.1)

6.1. Design Issues

Our early prototypes described above were built with ease of use and compact file size in mind. We focused on making the Java bitmap fonts and input method classes compact to minimize download time. However, a few shortcomings are inherent in this approach. One is that the input methods and bitmap fonts are based on individual native encodings. This lets each applet operate well and efficiently when standing alone, but the input methods and bitmap fonts do not combine well into a single entity because they are based on different encodings. As such, in the next version of our development, we are realigning our effort and making the following improvements.

  1. We make use of Java 1.1, which offers several advantages over 1.0. The new event-handling model in JDK 1.1 is more flexible and gives us an easier porting path to turn our work into JavaBeans. Making use of host fonts to display Unicode characters is now possible with Java 1.1. If the target platform has the appropriate Unicode font installed, rendering multilingual text in different sizes is much easier.
  2. All input methods' mapping definitions are now based on Unicode, instead of individual native encoding (e.g. GB, Big5, JIS, etc.). The characters mapped from a user's keystrokes are all in Unicode. Now, when working with multiple languages, we do not need to perform redundant conversion between different character sets unless we need to export the text content in a particular native encoding.
  3. The simple 'table lookup array' implementation of the keyboard mapping is also replaced with a more efficient and compact 'tree' implementation. On average, the various input methods mapping classes for CJK benefit from a 40-60 percent decrease in file size with the 'tree' implementation.
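The 'tree' implementation in item 3 is not shown in the paper. A minimal sketch of one common form, a prefix tree (trie) keyed on keystrokes, is given below as an assumption about what such a structure could look like. The class name and the sample entries are illustrative, and the sketch uses java.util.HashMap for brevity even though that class is not available in JDK 1.0 or 1.1. Candidates are stored as Unicode strings, in line with item 2.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a tree-based keystroke mapping; not the JIME source.
final class KeyTrie {

    private final Map<Character, KeyTrie> children = new HashMap<Character, KeyTrie>();
    private String candidates = "";   // Unicode candidates for the path to this node

    // Stores the candidate characters reached by typing the given keys.
    void put(String keys, String unicodeCandidates) {
        KeyTrie node = this;
        for (int i = 0; i < keys.length(); i++) {
            Character c = Character.valueOf(keys.charAt(i));
            KeyTrie next = node.children.get(c);
            if (next == null) {
                next = new KeyTrie();
                node.children.put(c, next);
            }
            node = next;
        }
        node.candidates = unicodeCandidates;
    }

    // Returns the candidates for the given keys, or "" when no such path exists.
    String lookup(String keys) {
        KeyTrie node = this;
        for (int i = 0; i < keys.length() && node != null; i++) {
            node = node.children.get(Character.valueOf(keys.charAt(i)));
        }
        return node == null ? "" : node.candidates;
    }

    public static void main(String[] args) {
        KeyTrie pinyin = new KeyTrie();
        pinyin.put("a", "\u963F\u554A");    // sample entries only
        pinyin.put("an", "\u5B89\u6848");
        System.out.println(pinyin.lookup("an").length());   // prints 2
    }
}
```

Keys that share a prefix, such as "a" and "an", share the nodes for that prefix, which is one plausible source of the size reduction reported above.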

Although JDK 1.1 offers significant advantages over 1.0, it is deficient in other ways, and we designed our JIME framework to try to address these inconveniences.

  1. A single consistent font interface and convenient font utilities for multiple languages. Java 1.1 does not yet allow you to choose a font of a given encoding, or to find out what range of the encoding a font is capable of rendering.
  2. Because different Java virtual machines may be shipped with or without the Sun packages in JDK 1.1, we have resorted to writing our own converter class instead of relying on those classes to convert between different character sets and Unicode.
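The paper does not show the converter class mentioned in item 2. As a hedged illustration of how a self-contained converter can avoid depending on the runtime's own converters, the sketch below uses a sorted lookup table and binary search. The class name, the two-entry table, and the byte-stream convention (bytes below 0x80 are ASCII, bytes from 0x80 up begin a two-byte character) are assumptions for the example. A real table would cover the whole character set.

```java
import java.util.Arrays;

// Illustrative sketch of a table-driven converter from a native
// double-byte encoding to Unicode; the two-entry table is a sample only.
final class TableConverter {

    // Native (GB) codes, sorted ascending so that binary search can be used.
    private static final int[] NATIVE = { 0xB0A1, 0xB0A2 };
    // Unicode values at the same positions as the NATIVE entries.
    private static final char[] UNICODE = { (char) 0x554A, (char) 0x963F };

    static final char REPLACEMENT = (char) 0xFFFD;

    // Maps one native code to Unicode, or to U+FFFD when it is not in the table.
    static char toUnicode(int nativeCode) {
        int i = Arrays.binarySearch(NATIVE, nativeCode);
        return i >= 0 ? UNICODE[i] : REPLACEMENT;
    }

    // Decodes a mixed ASCII / double-byte stream into a Unicode String.
    static String decode(byte[] input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length) {
            int b = input[i] & 0xFF;
            if (b < 0x80) {
                out.append((char) b);            // single-byte ASCII
                i++;
            } else if (i + 1 < input.length) {
                int code = (b << 8) | (input[i + 1] & 0xFF);
                out.append(toUnicode(code));     // two-byte character
                i += 2;
            } else {
                out.append(REPLACEMENT);         // truncated trailing byte
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] gb = { (byte) 0x41, (byte) 0xB0, (byte) 0xA1 };
        System.out.println(decode(gb).length());   // prints 2
    }
}
```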

In addition, the JIME design tries to overcome the limitation that JDK 1.2 initially supports only input methods from the host platform. JIME provides various input methods and keyboard mappings for languages other than U.S. English, regardless of the host platform on which the Java application is running. For instance, your Java applications will still get Japanese input methods with JIME even when they are running on a Chinese locale host platform.

JIME consists of five packages:

  1. jime.font - it contains typeface implementation to make use of both the Java host system font and the bitmap font we designed for JDK 1.0 (compiled as Java classes), and provides one consistent interface for the user to make use of all kinds of typefaces.
  2. jime.fontlib - this package holds all the glyphs of the bitmap fonts.
  3. jime.ime - this package deals with keyboard mappings and input methods. Generally, the input methods are classified into two classes: direct input and over-the-spot input. Direct input covers keyboards like Thai and most Western European languages. Over-the-spot input covers Chinese, Japanese, and Korean keyboard input methods that require a pop-up window to let the user select the characters.
  4. jime.imelib - this package holds the mapping tables of the various input methods.
  5. jime.widget - this package, as the name implies, contains the components needed to draw strings and text, as well as layout controllers to lay out components in a clean and flexible way. It also provides auxiliary widgets, such as a button, a pull-down menu, and an over-the-spot window.

Figures 2, 3 and 4 below give a structural overview of the JIME packages. The JIME architecture focuses on enabling input method support in JDK 1.0 and 1.1. The jime.widget components are written to make use of the jime.ime and jime.font libraries.

Figure 2: jime.font and jime.fontlib structure

Figure 3: jime.ime and jime.imelib structure

Figure 4: jime.widget structure

6.2. Implementation

The Java applet and the Netscape Composer plug-in are re-deployed using a JIME code base built on Java 1.1. In addition, to further illustrate JIME's flexible multilingual framework, a multilingual text editor - JIMEWord - is implemented. Its basic multilingual features include:

  1. saving and loading of files in Unicode UTF-8 or UTF-7 encoding, since Unicode is used for internal representation and processing. Saving/loading of other native encodings is also supported via code conversion routines between Unicode and the target encoding.
  2. support for display and input methods of Chinese, Japanese, Korean, Thai, French, German and many more.
  3. user-friendly graphical keyboard for ease of typing. This helps if a user is using a U.S.-English keyboard device and wishes to input French, for example. He/she can use the mouse to click on the keypad on the graphical keyboard for typing French.

A screen shot of JIMEWord with the floating graphical keyboard window displaying the Thai keyboard mapping is shown in Figure 5 below.

Figure 5: Screen shot of JIMEWord with the floating graphical keyboard

7. Problems and limitations

While carrying out tests of the applets (Java 1.0), we noticed some differences in the Netscape Navigator implementation of the Java Virtual Machine on different platforms and versions. Navigator 4 for UNIX seems to cause a hidden applet (i.e., with zero width and height) to obscure the entire page; so we need to devise a workaround using a table to prevent it from covering up the visible applet. Netscape 3.0 for Windows95/NT also suffers from a similar problem, but resizing the window after the applet finishes loading causes a correct refresh. Internet Explorer 4.0 does not suffer from such problems.

Because of the complex nature of internationalization, it is not easy to get a perfect design. JIME is a reasonable attempt: it meets its objective and has an extensible structure. However, some limitations remain. Currently, the JIME API leaves room for bi-directional horizontal text layout and editing, because these are left to individual StringViews to handle. No major changes are required beyond adding a bi-directional StringView type to the BlockView. Some aspects that matter for bi-directional layout may nevertheless have been left out, and we will not know until a bi-directional StringView is under development.

The classes in the jime.widget package do not aim to rival the Java2D API and the advanced text layout features of JDK 1.2 [9]. The JIME framework design is focused on providing full input method support to JDK 1.0 and JDK 1.1 applications by means of the jime.imelib and jime.fontlib packages.

8. Ongoing/future developments

Java support from browsers is not consistent. Old browsers like Netscape 3.0 and IE 3.0 support only Java 1.0. Netscape 4.0 with a JDK patch supports Java 1.1. On the other hand, IE 4.0 has many proprietary extensions and modifications in its Java implementation. Because of this inconsistency, the newer features in our development work based on JDK 1.1 cannot be shown on old browsers. To work around this and provide backward compatibility, some wrapper code is required.

We plan to make a JDK 1.0 applet that runs on all Java-enabled browsers (whether they contain a 1.0 or a 1.1 Java VM). Assuming the host system has the appropriate native fonts installed and Netscape is configured to make use of them, the applet will try to use these fonts if the browser is JDK 1.1-enabled. If either the native fonts are missing or a JDK 1.1 VM is not present, the applet will fall back to the bitmap font classes in our jime.fontlib package. The wrapper applet should be able to dynamically load the correct Java code base according to the situation described above.
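A minimal sketch of the VM-detection half of this fallback is shown below, assuming the wrapper probes for a class that exists only from JDK 1.1 onwards and chooses a code base accordingly. The class name FontBackendChooser and the backend names HostFontBackend and BitmapFontBackend are hypothetical, and the check for installed native fonts is left out.

```java
// Illustrative sketch of runtime feature detection with a fallback;
// the backend names are hypothetical and not part of JIME.
final class FontBackendChooser {

    // True when the named class can be loaded by this VM.
    static boolean classAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            return false;
        }
    }

    // Returns the preferred implementation name when the probe class is
    // present, and the fallback name otherwise.
    static String choose(String probeClass, String preferred, String fallback) {
        return classAvailable(probeClass) ? preferred : fallback;
    }

    public static void main(String[] args) {
        // java.awt.event.ActionEvent belongs to the event model added in JDK 1.1,
        // so its presence is used here as a sign of a 1.1 (or later) VM.
        String backend = choose("java.awt.event.ActionEvent",
                                "HostFontBackend", "BitmapFontBackend");
        System.out.println(backend);
    }
}
```

Probing by class name keeps the wrapper itself free of direct references to JDK 1.1 classes, so it can still load on a 1.0 VM.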

To make the JIME code base and framework reusable, we are in the process of porting it to JavaBeans. With JavaBeans, software developers can easily reuse JIME components and build native keyboard input methods into their Java 1.0/1.1 applications regardless of the locale of the host platform they will run on.

To broaden JIME's support base, we need to add support for more languages to its portfolio. We are extending the framework to include more European languages, Indian languages (like Hindi and Tamil) and maybe even bi-directional scripts like Arabic and Hebrew.

9. Conclusion

The Java 1.2 input method framework is a step in the right direction. Unfortunately, only input methods supported by the host platform's native input method managers are available to Java applications. However, with Java 1.2's support for the Java Foundation Classes (JFC), AWT peer widgets are being complemented by peerless JFC components. Because JFC widgets are lightweight stand-alone components, they do not rely on the functionality of the host platform's widgets. As such, it is expected (according to the Java 1.2 input method framework documentation) that future releases of Java and JFC may provide full input method support regardless of the host platform the Java application is running on.

In the meantime, JIME serves as a good transitional component for JDK 1.0 and 1.1 (or even 1.2) developers who need native input method support for their Java applications, especially since Web browser support for the latest Java VM does not keep pace with JavaSoft's JDK releases.

In conclusion, the Web is moving towards a more 'World-Wide' reach and so is Java. With Java, we are close to realizing true internationalization of cross-platform applications. Java Input Methods will make your localized applications more complete.

10. References

  1. Dave Raggett, HTML 3.2 Reference Specification
  2. Dave Raggett, Arnaud Le Hors, Ian Jacobs, HTML 4.0 Specification - W3C Working Draft
  3. F. Yergeau, G. Nicol, G. Adams, M. Duerst, RFC 2070 - Internationalization of the Hypertext Markup Language
  4. <FONT FACE> considered harmful
  5. JDK 1.1 Internationalization Specification
  6. Netscape LiveConnect
  7. Netscape Composer Plug-in Guide
  8. JDK 1.2 Beta2 Documentation - Input Method Framework
  9. IBM's Java Education - "International Text in JDK 1.2"
  10. Hanzi Bitmap Format (HBF)


Leong Kok Yong
Internet R&D Unit, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260

Leong Kok Yong is the principal researcher in the i18n group of the Internet R&D Unit (IRDU) of the National University of Singapore. He has worked on multilingual development work since 1995, with focus on the World Wide Web and Java.

Oliver P. Wu
Institute of Systems Science / Kent Ridge Digital Laboratories, Heng Mui Keng Terrace, Singapore 119597

Oliver WU was formerly attached to IRDU as a student researcher, working on the very early design and implementation phase of JIME. He has since joined the BioKleisli research group of the BioInformatics Centre (BIC) of the National University of Singapore. He is currently a senior software engineer at the Kent Ridge Digital Laboratories.

Liu Hai
Department of Information Systems and Computer Science, Faculty of Science, National University of Singapore.

LIU Hai is a student researcher working with IRDU i18n group. He is completing his undergraduate degree course on Information Systems and Computer Science in the National University of Singapore. After working on JIME, he was given an opportunity to be attached to Netscape Communications Corp. for a summer internship program for 3 months in May 1997.
