/td>

Character Integrity Issues in NLS Environment

This note describes a character data translation issue in an NLS environment.

Abstract

Oracle JDBC drivers perform character set conversion as appropriate when character data is inserted into or retrieved from the database, i.e. the drivers convert Unicode characters used by Java clients to Oracle database character set characters, and vise versa (described in "Using NLS" in Chapter 5, "Advanced Topics", in the JDBC Developers Guide). Java runtime systems provided by various vendors (Sun, Netscape, Microsoft) convert operating system characterset characters into the Unicode characters used by Java and vise versa. Due to differences in the character set conversions by JDBC and the Java runtime systems, character data making a round trip from the operating system character set to the Java Unicode character set to the database character set and back to Java and then back to the operating system can suffer some loss of information. This is not a bug in either Oracle JDBC or the Java runtime, but rather is an unfortunate side effect of the differences in the specifications of these two systems. Fortunately this problem affects only a small number of multibyte characters. The workaround is to avoid making a full round trip with these characters. Additionally, even when the round trip between the operating system and the database works perfectly, accessing character data via some other path may yield different results than when the data is accessed via Java and JDBC. This problem also affects only a small number of multibyte characters. The workaround is to avoid accessing those characters through both Java and other paths.

The Problem

Character data translation errors can occur when the character data is converted multiple times based on multiple mapping tables which are not identical. This can happen wherever such multiple translations occur, regardless of configuration or software in use. For example, a piece of text data input into a Java application to be stored into database will go through character set conversion twice. The first one will happen when the text goes from the operating system to the Java runtime to convert to Unicode from the character set used by the front end layer. This conversion could be performed by your browser, underlying operating system, a GUI widget or virtual machine depending on the architecture of system. The second will happen when storing the text into database as the text is converted to the database character set.

There are two problem scenarios:

1) Consider the following character mappings:
 

from os to Java from Java to os
o1 j1 j1 o1
o2 j2 j2 o2
o3 j3 j3 o3
from Java to DB from DB to Java
j1 d1 d1 j1
j2 d1
j3 d2 d2 j3

The user enters character o1 into the operating system. The Java runtime translates that into Java character j1. If the user then reads the character the Java runtime will translate j1 back to o1. If the user stores the character into the database, JDBC will translate j1 into d1. If the user reads the character JDBC will translate d1 back to j1 and then the Java runtime will translate j1 back to o1, the character the user entered initially. This is the case for the vast majority of characters.

But consider what happens if the user enters a problematic character, o2. The Java runtime will translate o2 into j2. JDBC will translate j2 into d1. On the return trip, JDBC will translate d1 into j1 and the Java runtime will translate j1 into o1. The user entered o2 and got back o1. Both translations worked exactly as specified. It is only the unfortunate interaction between the two translations that caused the problem. Changing either translation table would break a large number of existing applications.

2) In addition to the above character mappings, consider the following:
 

from os to DB from DB to os
o1 d2 d2 o1

Using SQL*Plus or some other mechanism to access the database other than Java the user enters the character o1. That character is translated to d2 for storage in the database. Then the user accesses the database via Java. JDBC translates d2 into j3 and the Java runtime translates j3 into o3. In this case the user entered o1 and got back o3 even though o1 makes a correct round trip via Java/JDBC or via SQL*Plus. It is only by combining Java with another mechanism that the translation problem occurs. Again, each of the translation tables is working as specified, it is only the interactions of the tables that cause the problem.

The only solution to these problems is to avoid the problematic characters or to avoid making the kinds of round trips that cause the problem. Changing the mappings is not a possible fix as the the mappings are working as specified and many existing applications depend upon those specifications.

There are many definitions of character sets and mappings provided by various organizations. Consistency in data exchange is maintained by equivalent character sets and mappings, but in the real world there are a few exceptions. In our growing network of heterogeneous systems, the absence of truly standard definition could lead to the possibility of translation errors becoming ubiquitous. In practice it is very difficult to resolve this issue as changing the present mismatched translation definitions breaks existing working applications. Oracle is working with other vendors and the appropriate standards bodies to address these issues. Due to the complexity of the problem and the large number of products and systems that rely upon the existing character sets and mappings, solutions will not be easy to find.

Problem Characters

This section summarizes by character set the result of a character integrity test using Oracle JDBC drivers. Japanese Shift-JIS and EUC character sets have been tested. A few Japanese characters(out of thousands) were found to be corrupted, while all the other characters retain integrity. Please note that the information herein does not intend to make up a part of the specification. It is provided for informational purpose only and the findings are subject to change.

Shift-JIS (JA16SJIS)

  • For those 5 characters whose Oracle mapping to SJIS is 8148, successful round trip conversion can be achieved only between Java and the execution environment, and not all the way to the database and back. If you insert them to Oracle you will only get question marks when you fetch them back.
  • In addition if you are going to use IE or Microsoft virtual machine 5 additional characters(that have questions in MSFT column) are turned into question mark in MS Java virtual machine (Tested on build 2924/SDK for Java 3.1.)
  • Double byte ampersand cannot be used with the Microsoft Java virtual machine due to round trip conversion failure to and from Windows.
  • The same set of characters fail against EUC and JA16EUC databases.

  • Table: Conversion Mapping Inconsistency for SJIS

    Unicode : Java  ORCL  MSFT  Character Name
       ---- : ----  ----  ----  ---------------------------
       00A2 : 8191  8148   -    CENT SIGN
       00A3 : 8192  8148   -    POUND SIGN
       00AC : 8198  8148   -    NOT SIGN
       2016 : 8161  8148   -    DOUBLE VERTICAL LINE
       2212 : 817C  8148   -    MINUS SIGN
       301C : 8160  8160   ?    WAVE DASH
       309B : 814A  814A   ?    VOICED SOUND MARK
       309C : 814B  814B   ?    SEMI-VOICED SOUND MARK
       FF06 : 8195  8195   X    AMPERSAND
       FF9E :   DE    DE   ?    HALFWIDTH VOICED SOUND MARK
       FF9F :   DF    DF   ?    HALFWIDTH SEMI-VOICED SOUND MARK
  • 8148 is a replacement character(question mark) which appears when there is no mapping for the character.
  • EUC (JA16EUC)

  • For the listed 16 characters whose Oracle mapping to EUC is FF1F, successful round trip conversion can be achieved only within VM, and not between VM and database. If you insert them to Oracle you will only get question marks when you fetch them back.
  • G3 characters cannot be used against Shift-JIS database. If used, you will see question marks after server round trip just like the listed 16 characters. This means that if you are going to use the full set of EUC including G3 set on clients you must use EUC database. This is due to the limitation of the number of characters which can fit into the code range of Shift-JIS character set.
  • Cent sign, pound sign, not sign, double vertical line and minus sign failed the test for Shift-JIS, too.

  • G3 characters means those EUC characters whose leading byte is hex 8F and are 3 bytes long. Those characters are categorized in G3 set in the standard.

    Table: Conversion Mapping Inconsistency for EUC

    Unicode : Java    ORCL  Solaris  Character Name
       ---- : ----    ----  -------  ------------------------
       00A2 : A1F1    FF1F     -     CENT SIGN
       00A3 : A1F2    FF1F     -     POUND SIGN
       00AC : A2CC    FF1F     -     NOT SIGN
       00AF : 8FA2B4  FF1F     -     MACRON
       00B8 : 8FA2B1  FF1F     -     CEDILLA
       0111 : 8FA9C2  FF1F     -     LATIN SMALL LETTER D WITH STROKE
       02C7 : 8FA2B0  FF1F     -     CARON
       02D8 : 8FA2AF  FF1F     -     BREVE
       02D9 : 8FA2B2  FF1F     -     DOT ABOVE
       02DA : 8FA2B6  FF1F     -     RING ABOVE
       02DB : 8FA2B5  FF1F     -     OGONEK
       02DD : 8FA2B3  FF1F     -     DOUBLE ACUTE ACCENT
       0384 : 8FA2B8  FF1F     -     GREEK TONOS
       0385 : 8FA2B9  FF1F     -     GREEK DIALYTIKA TONOS
       2016 : A1C2    FF1F     -     DOUBLE VERTICAL LINE
       2212 : A1DD    FF1F     -     MINUS SIGN

    • FF1F is a replacement character.


    E-mail this page
    Printer View Printer View
    Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy