[omniORB] omniORBpy and Python 2.0 Unicode

Tue, 12 Dec 2000 16:34:32 -0700

> On Tuesday 12 December, uche.ogbuji@fourthought.com wrote:
> 
> > Well, I didn't get any help on my unicode problem, and I needed to
> > get around it, so I worked up the following patch to allow omniORBpy
> > to marshal Python 2.0 Unicode objects by converting them to UTF-8
> > strings.
> 
> Eek!  That doesn't do what you think it does. Or at least if it does,
> it's only working because of a side-effect. In omniORBpy, the
> marshalling works in two phases. In the first phase, the values being
> marshalled are checked for validity, and their marshalled size is
> calculated. In the second phase, the data is actually sent over the
> network. Your patch only modifies the first phase. This means that
> when the second phase, in the two marshalPyObject functions, does:

<snip>

OK.  I saw the other code, but I must have missed the exchange of state.
Oops.  Just looked again.  Duh!  All I did was mask the BAD_PARAM.  If I'd
been using strings that truly needed to be escaped, it would have exploded.

I'm pretty sure I could get around this as well, but, of course, you bring up 
more important issues below (which is part of what I was hoping to provoke).

> Even if you do modify the marshal functions to do the right thing, it
> is wrong to transmit UTF-8. All strings in GIOP 1.0, which omniORB 3
> uses, should be ISO 8859-1.

I know, but I was desperate.  My first question is: are there any plans to
support GIOP 1.2?  My second is: can Python Unicode be supported by mapping
to wstring and CORBA::WStringValue, as in the Java binding?

Speaking of Java, JavaIDL has some interesting marshalling behavior that I
couldn't find in the binding.  Basically, you can pass a java.lang.String in
for an IDL string as long as it can be mapped to ISO-8859-1 without any
encoding.  As soon as you introduce, say, "\u0660", it dies with an
org.omg.CORBA.DATA_CONVERSION exception.  This means, I'm guessing, that our
4Suite Server IDLs (which use "string") have a nasty time bomb for Java
compatibility.
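
The same boundary is easy to see from the Python side.  This is only a rough
sketch, written for a current Python 3 interpreter rather than 2.0, and
fits_idl_string is a name I made up for illustration; it is not part of
omniORBpy or JavaIDL.  It just asks whether every character has an
ISO 8859-1 code point:

```python
def fits_idl_string(text):
    """Return True if every character of text maps to ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # At least one character lies outside U+0000..U+00FF.
        return False
    return True


print(fits_idl_string("caf\u00e9"))  # U+00E9 is inside Latin-1
print(fits_idl_string("\u0660"))     # ARABIC-INDIC DIGIT ZERO is not
```

Anything that fails this check is presumably what trips the Java side.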

Basically, we're faced with a CORBA nightmare.  If we use wstring throughout
the IDL, it will break our Python code.  If we leave it as "string", Java will
blow up as soon as someone from East Asia or Israel gives it a try.  Am I
missing some way out?

> Of course, if both ends of the
> communication agree that they are using UTF-8, it works anyway, but
> it's asking for trouble. If you sent the string to a Java ORB, for
> example, it would build a Java unicode string assuming the transmitted
> string was 8859-1.

Understood.  There is no quick fix.  *sigh*
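
For the archives, here is roughly what that failure looks like.  It is a
plain-Python sketch (Python 3 syntax, no ORB involved) that stands in for a
receiver decoding UTF-8 bytes as if they were 8859-1; the variable names are
mine:

```python
original = "caf\u00e9"

# Sender encodes as UTF-8: the single character U+00E9 becomes two bytes.
wire_bytes = original.encode("utf-8")

# Receiver assumes ISO 8859-1, so each byte becomes one character.
received = wire_bytes.decode("iso-8859-1")

print(len(original), len(wire_bytes), len(received))
print(received == original)
print(repr(received))
```

No error is raised anywhere, which is the nasty part: the receiver just ends
up with a longer, wrong string.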

> I really think it's best to convert the Python unicode objects to
> strings in the Python code, rather than modifying the marshalling
> code.

What about people calling from Java, innocently using Unicode?  Also, isn't 
this quite a burden to impose on the caller, to require an encoding every time 
he calls a remote object?  It rather does in the idea of language and platform 
transparency, no?
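
To make the burden concrete, this is about what every call site would have to
do.  It is a sketch only, in Python 3 syntax: to_idl_string, call_remote and
fake_remote_echo are hypothetical names standing in for real stubs, and
nothing here is omniORBpy API.

```python
def to_idl_string(text):
    """Encode text for an IDL 'string' parameter, assuming ISO 8859-1."""
    # Raises UnicodeEncodeError if a character is outside Latin-1.
    return text.encode("iso-8859-1")


def call_remote(method, *args):
    """Convert every text argument before handing it to the stub."""
    converted = [to_idl_string(a) if isinstance(a, str) else a for a in args]
    return method(*converted)


def fake_remote_echo(name, count):
    """Hypothetical stand-in for a remote operation."""
    return (name, count)


print(call_remote(fake_remote_echo, "caf\u00e9", 2))
```

Wrapping helps a little, but the caller still has to deal with the exception
whenever the text does not fit.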

Och, do I have a headache now.

-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python