Character Integrity Issues in NLS Environment
This note describes a character data translation issue in an NLS environment.
Abstract
Oracle JDBC drivers perform character set conversion as appropriate when
character data is inserted into or retrieved from the database, i.e. the
drivers convert Unicode characters used by Java clients to Oracle database
character set characters, and vise versa (described in "Using NLS" in Chapter
5, "Advanced Topics", in the JDBC Developers Guide). Java runtime systems
provided by various vendors (Sun, Netscape, Microsoft) convert operating
system characterset characters into the Unicode characters used by Java
and vise versa. Due to differences in the character set conversions by
JDBC and the Java runtime systems, character data making a round trip from
the operating system character set to the Java Unicode character set to
the database character set and back to Java and then back to the operating
system can suffer some loss of information. This is not a bug in either
Oracle JDBC or the Java runtime, but rather is an unfortunate side effect
of the differences in the specifications of these two systems. Fortunately
this problem affects only a small number of multibyte characters. The workaround
is to avoid making a full round trip with these characters. Additionally,
even when the round trip between the operating system and the database
works perfectly, accessing character data via some other path may yield
different results than when the data is accessed via Java and JDBC. This
problem also affects only a small number of multibyte characters. The workaround
is to avoid accessing those characters through both Java and other paths.
The Problem
Character data translation errors can occur when the character data is
converted multiple times based on multiple mapping tables which are not
identical. This can happen wherever such multiple translations occur, regardless
of configuration or software in use. For example, a piece of text data
input into a Java application to be stored into database will go through
character set conversion twice. The first one will happen when the text
goes from the operating system to the Java runtime to convert to Unicode
from the character set used by the front end layer. This conversion could
be performed by your browser, underlying operating system, a GUI widget
or virtual machine depending on the architecture of system. The second
will happen when storing the text into database as the text is converted
to the database character set.
There are two problem scenarios:
1) Consider the following character mappings:
| from os |
to Java |
from Java |
to os |
| o1 |
j1 |
j1 |
o1 |
| o2 |
j2 |
j2 |
o2 |
| o3 |
j3 |
j3 |
o3 |
| from Java |
to DB |
from DB |
to Java |
| j1 |
d1 |
d1 |
j1 |
| j2 |
d1 |
|
|
| j3 |
d2 |
d2 |
j3 |
The user enters character o1 into the operating system. The Java runtime
translates that into Java character j1. If the user then reads the character
the Java runtime will translate j1 back to o1. If the user stores the character
into the database, JDBC will translate j1 into d1. If the user reads the
character JDBC will translate d1 back to j1 and then the Java runtime will
translate j1 back to o1, the character the user entered initially. This
is the case for the vast majority of characters.
But consider what happens if the user enters a problematic character,
o2. The Java runtime will translate o2 into j2. JDBC will translate j2
into d1. On the return trip, JDBC will translate d1 into j1 and the Java
runtime will translate j1 into o1. The user entered o2 and got back o1.
Both translations worked exactly as specified. It is only the unfortunate
interaction between the two translations that caused the problem. Changing
either translation table would break a large number of existing applications.
2) In addition to the above character mappings, consider the following:
| from os |
to DB |
from DB |
to os |
| o1 |
d2 |
d2 |
o1 |
Using SQL*Plus or some other mechanism to access the database other
than Java the user enters the character o1. That character is translated
to d2 for storage in the database. Then the user accesses the database
via Java. JDBC translates d2 into j3 and the Java runtime translates j3
into o3. In this case the user entered o1 and got back o3 even though o1
makes a correct round trip via Java/JDBC or via SQL*Plus. It is only by
combining Java with another mechanism that the translation problem occurs.
Again, each of the translation tables is working as specified, it is only
the interactions of the tables that cause the problem.
The only solution to these problems is to avoid the problematic characters
or to avoid making the kinds of round trips that cause the problem. Changing
the mappings is not a possible fix as the the mappings are working as specified
and many existing applications depend upon those specifications.
There are many definitions of character sets and mappings provided by
various organizations. Consistency in data exchange is maintained by equivalent
character sets and mappings, but in the real world there are a few exceptions.
In our growing network of heterogeneous systems, the absence of truly standard
definition could lead to the possibility of translation errors becoming
ubiquitous. In practice it is very difficult to resolve this issue as changing
the present mismatched translation definitions breaks existing working
applications. Oracle is working with other vendors and the appropriate
standards bodies to address these issues. Due to the complexity of the
problem and the large number of products and systems that rely upon the
existing character sets and mappings, solutions will not be easy to find.
Problem Characters
This section summarizes by character set the result of a character integrity
test using Oracle JDBC drivers. Japanese Shift-JIS and EUC character sets
have been tested. A few Japanese characters(out of thousands) were found
to be corrupted, while all the other characters retain integrity. Please
note that the information herein does not intend to make up a part of the
specification. It is provided for informational purpose only and the findings
are subject to change.
Shift-JIS (JA16SJIS)
For those 5 characters whose Oracle mapping to SJIS is 8148, successful
round trip conversion can be achieved only between Java and the execution
environment, and not all the way to the database and back. If you insert
them to Oracle you will only get question marks when you fetch them back.
In addition if you are going to use IE or Microsoft virtual machine 5 additional
characters(that have questions in MSFT column) are turned into question
mark in MS Java virtual machine (Tested on build 2924/SDK for Java 3.1.)
Double byte ampersand cannot be used with the Microsoft Java virtual machine
due to round trip conversion failure to and from Windows.
The same set of characters fail against EUC and JA16EUC databases.
Table: Conversion Mapping Inconsistency for SJIS
Unicode : Java ORCL MSFT Character Name
---- : ---- ---- ---- ---------------------------
00A2 : 8191 8148 -
CENT SIGN
00A3 : 8192 8148 -
POUND SIGN
00AC : 8198 8148 -
NOT SIGN
2016 : 8161 8148 -
DOUBLE VERTICAL LINE
2212 : 817C 8148 -
MINUS SIGN
301C : 8160 8160 ?
WAVE DASH
309B : 814A 814A ?
VOICED SOUND MARK
309C : 814B 814B ?
SEMI-VOICED SOUND MARK
FF06 : 8195 8195 X
AMPERSAND
FF9E : DE DE
? HALFWIDTH VOICED SOUND MARK
FF9F : DF DF
? HALFWIDTH SEMI-VOICED SOUND MARK
8148 is a replacement character(question mark) which appears when there
is no mapping for the character.
EUC (JA16EUC)
For the listed 16 characters whose Oracle mapping to EUC is FF1F, successful
round trip conversion can be achieved only within VM, and not between VM
and database. If you insert them to Oracle you will only get question marks
when you fetch them back.
G3 characters cannot be used against Shift-JIS database. If used, you will
see question marks after server round trip just like the listed 16 characters.
This means that if you are going to use the full set of EUC including G3
set on clients you must use EUC database. This is due to the limitation
of the number of characters which can fit into the code range of Shift-JIS
character set.
Cent sign, pound sign, not sign, double vertical line and minus sign failed
the test for Shift-JIS, too.
G3 characters means those EUC characters whose leading
byte is hex 8F and are 3 bytes long. Those characters are categorized in
G3 set in the standard.
Table: Conversion Mapping Inconsistency for EUC
Unicode : Java ORCL Solaris Character
Name
---- : ---- ---- -------
------------------------
00A2 : A1F1 FF1F
- CENT SIGN
00A3 : A1F2 FF1F
- POUND SIGN
00AC : A2CC FF1F
- NOT SIGN
00AF : 8FA2B4 FF1F -
MACRON
00B8 : 8FA2B1 FF1F -
CEDILLA
0111 : 8FA9C2 FF1F -
LATIN SMALL LETTER D WITH STROKE
02C7 : 8FA2B0 FF1F -
CARON
02D8 : 8FA2AF FF1F -
BREVE
02D9 : 8FA2B2 FF1F -
DOT ABOVE
02DA : 8FA2B6 FF1F -
RING ABOVE
02DB : 8FA2B5 FF1F -
OGONEK
02DD : 8FA2B3 FF1F -
DOUBLE ACUTE ACCENT
0384 : 8FA2B8 FF1F -
GREEK TONOS
0385 : 8FA2B9 FF1F -
GREEK DIALYTIKA TONOS
2016 : A1C2 FF1F
- DOUBLE VERTICAL LINE
2212 : A1DD FF1F
- MINUS SIGN
-
FF1F is a replacement character.
|