[omniORB] OmniOrb and CP1252 (Windows Latin 1) vs. ISO-8859-1

Duncan Grisby duncan at grisby.org
Tue Jul 29 12:49:50 BST 2008


On Monday 28 July, Steven Sauder wrote:

[...]
> After a bit of investigation, it seems that OmniOrb by default uses
> ISO-8859-1 as the “native” codeset

Yes. That is required by the CORBA spec.

> , which I had thought would mean that the Euro symbol
> (and a couple of other “special” characters such as the trademark symbol, and
> the “curly” printers quotes), which are represented in CP1252, but not in
> ISO-8859-1, could not be handled by OmniOrb using its default codeset.
>  However, digging into cs-8859-1.cc a little more, it looks like the
> translation tables ARE passing 0x80 through to UCS as 0x0080, so unless I’m
> reading this wrong, any OmniOrb-to-OmniOrb communications (on Windows) should
> pass the (Windows-specific) Euro code point 0x80 through without problem.  Am
> I reading this right?

Yes, but 0x0080 in Unicode is not the Euro symbol; it is a control
character, and the Euro symbol is 0x20AC. ISO 8859-1 0x80 maps to
Unicode 0x0080, which maps back to ISO 8859-1 0x80, so if you're
pretending to use ISO 8859-1 while actually using CP1252 at both ends it
will appear to work. It's only when someone tries to interpret the
Unicode as some other code set that you notice the error.

The same is true if you are using any other string codeset while
claiming to use ISO 8859-1 -- it's just that things will be more
obviously wrong when conversions to other code sets occur.
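
To make the round trip concrete, here is a minimal sketch in Python. It
uses Python's built-in codecs as a stand-in for the ORB's converters (an
assumption for illustration only; this is not omniORB code), and the
variable names are invented for the example.

```python
# Illustration only: Python codecs stand in for the ORB's code set converters.
euro_cp1252 = b"\x80"  # the Euro sign as a CP1252 application stores it

# The sender claims ISO 8859-1, so byte 0x80 is converted to U+0080.
on_the_wire = euro_cp1252.decode("iso-8859-1")

# A receiver that also claims ISO 8859-1 gets byte 0x80 back: it appears to work.
round_trip = on_the_wire.encode("iso-8859-1")

# What the sender actually meant: CP1252 0x80 is U+20AC.
intended = euro_cp1252.decode("cp1252")

print(hex(ord(on_the_wire)), round_trip, hex(ord(intended)))
```

The bytes survive unchanged, but the Unicode value in the middle is not
the character the sender meant, which is what a third party using a
different code set would see.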

> However, the difficulty arises because we have several CORBA
> components which are written using the standard Java ORB, which (it
> appears) is not providing the same amount of leeway with this symbol,
> and insists on transmitting the Euro symbol in its “true” UCS16
> representation (0x20AC), which OmniOrb’s codeset converters end up
> turning into a “?” when we receive it on the Windows end.

Actually, I'd expect you to get a CORBA::DATA_CONVERSION exception since
0x20AC can't be mapped to ISO 8859-1. The Java ORB must be substituting
the character rather than throwing the exception the CORBA spec says it
should.
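
The difference between the two behaviours can be sketched in Python,
again using its codecs purely as an analogy (the ORBs' real converters
are not shown, and UnicodeEncodeError here only plays the role of
CORBA::DATA_CONVERSION):

```python
euro = "\u20ac"  # the Euro symbol as the Java side sends it

# Strict conversion: the character has no ISO 8859-1 mapping, so this fails.
try:
    euro.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Lenient conversion: the unmappable character is silently replaced.
substituted = euro.encode("iso-8859-1", errors="replace")

print(strict_failed, substituted)
```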

> Has anyone had any experience with this?  From what I’ve read so far,
> it seems the only viable solution would be to write our own NCS-C
> implementation that handled the CP1252 Euro symbol (0x80) to Unicode
> (0x20AC) and back-again conversion through the translation tables as
> is currently happening in cs-8859-1.cc, is this correct?

Yes, that's the right thing to do. There are quite a few other 8 bit
code sets that it would be sensible to add too, including ISO 8859-15
which is the same as ISO 8859-1 except for eight code points, one of
which is the Euro symbol at code point 0xA4.
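
As a quick check of where the Euro symbol sits in each of these code
sets, the following sketch uses Python's standard codecs (not omniORB's
tables) and invented variable names:

```python
euro = "\u20ac"

# ISO 8859-15 puts the Euro symbol at 0xA4.
as_8859_15 = euro.encode("iso-8859-15")

# The same byte read as ISO 8859-1 is U+00A4, the generic currency sign.
as_read_by_8859_1 = as_8859_15.decode("iso-8859-1")

# CP1252 puts the Euro symbol at 0x80 instead.
as_cp1252 = euro.encode("cp1252")

print(as_8859_15, hex(ord(as_read_by_8859_1)), as_cp1252)
```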

Another alternative would be to use UTF-8 and manually convert your
strings to that before passing them into the CORBA layer.
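
A minimal sketch of that manual conversion, written in Python for
brevity. The helper names are hypothetical, and it assumes the ORB at
each end is configured to treat strings as UTF-8 rather than
ISO 8859-1:

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Convert application bytes in CP1252 to UTF-8 before handing them to the ORB."""
    return raw.decode("cp1252").encode("utf-8")


def utf8_to_cp1252(raw: bytes) -> bytes:
    """Convert UTF-8 bytes received from the ORB back to CP1252 for the application."""
    return raw.decode("utf-8").encode("cp1252")


outgoing = cp1252_to_utf8(b"Price: \x80 10")
incoming = utf8_to_cp1252(outgoing)
print(outgoing, incoming)
```

With this approach the Euro symbol travels as the three UTF-8 bytes for
U+20AC, so a Java peer sees the real character rather than U+0080.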

If you want to make a CP1252 codeset for omniORB, you can automatically
generate the tables using bin/scripts/make8bitcs.py, giving it a mapping
file from here as input:

  http://www.unicode.org/Public/MAPPINGS/

The DCE codeset ids come from here:

  ftp://ftp.opengroup.org/pub/code_set_registry/code_set_registry1.2g.txt
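
For a sense of what goes into such tables, here is a rough Python sketch
that reads a few lines in the style of the Unicode mapping files and
builds lookups in both directions. It is not make8bitcs.py and does not
reproduce its interface or output; the sample text, the function name
and the dictionary layout are all assumptions made for illustration.

```python
# Hypothetical excerpt in the style of an 8-bit Unicode mapping file:
# byte value, Unicode value, then a "#" comment, separated by tabs.
SAMPLE = """\
# Excerpt for illustration only
0x7F\t0x007F\t#DELETE
0x80\t0x20AC\t#EURO SIGN
0x81\t      \t#UNDEFINED
0x99\t0x2122\t#TRADE MARK SIGN
"""


def parse_mapping(text):
    """Return {byte value: Unicode code point}, skipping undefined entries."""
    to_ucs = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # comment line, blank line, or byte with no mapping
        to_ucs[int(fields[0], 16)] = int(fields[1], 16)
    return to_ucs


to_ucs = parse_mapping(SAMPLE)
# Reverse table for converting Unicode back to the 8-bit code set.
from_ucs = {ucs: byte for byte, ucs in to_ucs.items()}

print(hex(to_ucs[0x80]), hex(from_ucs[0x2122]))
```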


Any volunteers to make a patch containing all the tables for the
additional ISO 8859 and Windows codesets in it?

Cheers,

Duncan.

-- 
 -- Duncan Grisby         --
  -- duncan at grisby.org     --
   -- http://www.grisby.org --


