[omniORB] omniORBpy and Python 2.0 Unicode

Duncan Grisby dgrisby@uk.research.att.com
Wed, 13 Dec 2000 12:35:35 +0000


On Tuesday 12 December, uche.ogbuji@fourthought.com wrote:

> > Even if you do modify the marshal functions to do the right thing, it
> > is wrong to transmit UTF-8. All strings in GIOP 1.0, which omniORB 3
> > uses, should be ISO 8859-1.
> 
> I know, but I was desperate.  My first question is: are there any
> plans to support GIOP 1.2?  My second is, can Python Unicode be
> supported by mapping to wstring and CORBA::WStringValue, as in the
> Java binding?

Yes, the combination of omniORB 4 and omniORBpy 2 will support GIOP
1.2 and wstring. In fact, the development trees in CVS already do. As
I said before, feel free to play with them to see what's coming up.
You most definitely should not use them for serious work.

The Python mapping for wstring is the Python Unicode type.

> Speaking of Java, JavaIDL has some interesting marshalling behavior
> that I couldn't find in the binding.  Basically, you can pass a
> java.lang.String in for an IDL string as long as it can be mapped to
> ISO-8859-1 without any encoding.  As soon as you introduce, say
> "\u0660", it dies with an org.omg.CORBA.DATA_CONVERSION exception.
> This means, I'm guessing, that our 4Suite Server IDLs (which use
> "string") have a nasty time bomb for Java compatibility.

That is what is required by the CORBA spec. The way code sets work in
CORBA is that each end-point has a "native" code set, which the
application uses to talk to the ORB. To transmit from one process to
another, the two ORBs negotiate a "transmission" code set which they
both understand. The string can therefore be translated between code
sets once or twice as it is transmitted. If either translation meets a
character which cannot be represented in the target code set, a
DATA_CONVERSION exception is raised.
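
As a rough illustration of those two translation steps, here is a
minimal Python sketch. It is not how any ORB implements code set
conversion: the DataConversion class and the translate() and
transmit() functions are hypothetical names, Python's built-in codecs
stand in for the ORB's converters, and it assumes current Python 3,
where bytes plays the role of the 8-bit string.

```python
class DataConversion(Exception):
    """Hypothetical stand-in for the CORBA DATA_CONVERSION exception."""


def translate(data, source_code_set, target_code_set):
    """Re-encode bytes from one code set into another."""
    try:
        return data.decode(source_code_set).encode(target_code_set)
    except UnicodeError as exc:
        # A character has no representation in the target code set
        # (or the input was not valid in the source code set).
        raise DataConversion(str(exc)) from exc


def transmit(data, sender_native, transmission, receiver_native):
    """Model the two possible translations on the way between ORBs."""
    on_the_wire = translate(data, sender_native, transmission)
    return translate(on_the_wire, transmission, receiver_native)


# ISO 8859-1 sender, UTF-8 on the wire, ISO 8859-1 receiver.
latin1_text = "caf\u00e9".encode("iso-8859-1")
received = transmit(latin1_text, "iso-8859-1", "utf-8", "iso-8859-1")
```

In this model, calling transmit() with "\u0660" encoded as UTF-8 and
ISO 8859-1 as the transmission code set raises DataConversion at the
first translation.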

Now, with GIOP 1.0, the transmission code set is _always_ ISO 8859-1.
JavaIDL therefore raises DATA_CONVERSION when the application tries to
send a string containing non-8859-1 characters. With GIOP 1.2, JavaIDL
prefers to use UTF-8, meaning it can represent the whole Unicode set
(in fact, the whole UCS-4 set). If the receiving ORB has UTF-8 as its
native code set, any Java string can be received successfully. If the
receiving ORB uses a different native code set, the string might be
successfully transmitted as UTF-8, only for the receiving ORB to raise
DATA_CONVERSION because it can't map to the native code set.
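
The difference between the two transmission code sets can be checked
with Python's own codecs. This is only a sketch, assuming current
Python 3; representable() is a hypothetical helper, and the sample
character is the "\u0660" from the question.

```python
def representable(text, code_set):
    """Return True if every character of text exists in code_set."""
    try:
        text.encode(code_set)
    except UnicodeEncodeError:
        return False
    return True


sample = "\u0660"  # ARABIC-INDIC DIGIT ZERO

# GIOP 1.0 always transmits ISO 8859-1, so this character cannot be sent.
fits_iso_8859_1 = representable(sample, "iso-8859-1")

# With UTF-8 as the transmission code set the character can be sent,
# but a receiver whose native code set is ISO 8859-1 still cannot map it.
fits_utf_8 = representable(sample, "utf-8")
```

Here fits_iso_8859_1 is False and fits_utf_8 is True.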

> Basically, we're faced with a CORBA nightmare.  If we use wstring
> throughout the IDL, it will break our Python code.  If we leave it
> as "string", Java will blow up as soon as someone from East Asia or
> Israel gives it a try.  Am I missing some way out?

I think the best approach is to stick with string in your IDL. For
now, that will limit you to only using ISO 8859-1 when you talk to
Python, but two Java processes can use the whole Unicode set. When
omniORBpy 2 (or some other code set aware Python ORB) comes along, the
Python code will be able to use UTF-8 as its native string code set,
and thus handle any Unicode character.

Note that IDL string will still map to Python string; it's just that
the contents of the string can be UTF-8 rather than ISO 8859-1. It
will still be necessary to manually convert the Python strings to and
from Python Unicode objects.
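
The manual conversion could look something like the sketch below. It
assumes UTF-8 is the native string code set and is written for
current Python 3, where bytes stands in for the 8-bit Python string
and str for the Unicode object. The helper names are hypothetical and
not part of omniORBpy.

```python
def to_idl_string(text):
    """Encode a Unicode object before passing it as an IDL string."""
    return text.encode("utf-8")


def from_idl_string(data):
    """Decode an IDL string received from the ORB into a Unicode object."""
    return data.decode("utf-8")


wire_value = to_idl_string("\u0660")        # bytes to hand to the ORB
round_tripped = from_idl_string(wire_value)  # back to a Unicode object
```

In Python 2.0 the same two steps would be u.encode("utf-8") on the
way out and unicode(s, "utf-8") on the way in.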

> > I really think it's best to convert the Python unicode objects to
> > strings in the Python code, rather than modifying the marshalling
> > code.
> 
> What about people calling from Java, innocently using Unicode?
> Also, isn't this quite a burden to impose on the caller, to require
> an encoding every time he calls a remote object?  It rather does in
> the idea of language and platform transparency, no?

With the code set negotiation in GIOP 1.2, this is all as transparent
as it can possibly be. Unfortunately, the application always has to
understand something about code sets, and the fact that different sets
may be incompatible. The support is very flexible, though -- one
application using Chinese Big5 encoding can transparently talk to
another using Unicode, for example, without either application having
to know anything about the other.

> Och, do I have a headache now.

I recommend that you read section 13.9 of the CORBA specification. It
might not help your headache (in fact, it will probably make it worse
:-) ), but it will help you understand the way CORBA deals with
different code sets. You can get chapter 13 from

  ftp://ftp.omg.org/pub/docs/formal/00-10-17.pdf


Cheers,

Duncan.

-- 
 -- Duncan Grisby  \  Research Engineer  --
  -- AT&T Laboratories Cambridge          --
   -- http://www.uk.research.att.com/~dpg1 --