[DO-SIG] Re: [omniORB] omniORBpy, Python 2.0 and unicode

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Mon, 11 Dec 2000 19:01:55 +0100


> > The Python/CORBA binding doesn't say anything about unicode.

On careful reading, you'll notice that the description of
wchar/wstring in the language mapping spec does in fact talk about
Unicode :-)

The mapping provides a wrapper in the form of the CORBA.wstr function,
although I'd expect applications to rely on the output of CORBA.wstr
really being a Unicode object. Once there is implementation
experience with the Python Unicode type in the IDL mapping, it might
be reasonable to formally codify that wstring *is* the Unicode type.

> At present, parameters described as string in IDL must be Python
> strings.

I think that should stay that way, at least from the mapping point of
view.

> The next major releases of omniORB and omniORBpy (4.0 and 2.0
> respectively) will fully support CORBA's code set negotiation, and the
> wstring type. What I've implemented at the moment is that wstring maps
> to Python unicode, but string still only maps to Python string.

That is very reasonable.

> Strings can, however, be in any supported code set, not just ISO
> 8859-1. That includes UTF-8, so the whole of the Unicode space (and
> more) can be supported.

How does the application express which encoding a string is in?

> It would not be much effort to extend omniORBpy so it accepted
> unicode objects when it was expecting strings, but I'm not sure it's
> a good idea. Following the general Python mantra of "explicit is
> better than implicit", I'd lean towards forcing the programmer to
> convert their unicode objects to strings in their chosen encoding,
> rather than having the ORB do it.

In general, I'd advise against using the CORBA string type unless the
data is known to be ASCII only. Use wstring for true text data, and
pass Unicode objects in Python.
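
As a rough illustration of the explicit conversion discussed above,
the sketch below encodes text by hand and checks whether it is safe
for an ASCII-only string parameter. It is written for a current
Python 3 interpreter, where str plays the role of the Python 2 unicode
type and bytes the role of the Python 2 string type. The helper names
is_ascii_only and to_wire_string are made up for this example; they
are not part of omniORBpy or of the CORBA mapping.

```python
def is_ascii_only(text):
    """Return True if every character of text is in the ASCII range."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def to_wire_string(text, encoding="utf-8"):
    """Explicitly encode text to bytes in the caller's chosen encoding.

    Hypothetical helper: the application, not the ORB, picks the encoding.
    """
    return text.encode(encoding)


ascii_name = "Martin"
full_name = "Martin v. L\u00f6wis"

# ASCII-only data is a candidate for the IDL string type.
print(is_ascii_only(ascii_name))
# Anything else is better passed as text through wstring.
print(is_ascii_only(full_name))

# The same text yields different bytes under different encodings, which
# is why the receiver has to know which encoding was used.
print(to_wire_string(full_name, "utf-8"))
print(to_wire_string(full_name, "iso-8859-1"))
```

The last two lines show the underlying problem: the byte sequences
differ, and nothing in the bytes themselves says which encoding
produced them.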

The problem with using different encodings for the string type is that
this assumes the notion of a "process character set", which is not
available in Python. If it were, the ORB should negotiate that
process character set (or a conversion from/to it), and the
application should encode its strings in the process character set.

Python 1.6 introduced sys.getdefaultencoding(), but Python 2.0
restricts it to always be 'ascii'. If that restriction is ever
lifted, then it would be reasonable to assume that string values are
encoded in the default encoding.
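
A minimal sketch of what "assume the default encoding" would look
like, again for a current Python 3 interpreter. The function
decode_with_default is a hypothetical name for this example. Note that
the value returned by sys.getdefaultencoding() depends on the
interpreter: 'ascii' on Python 2.0 as described above, 'utf-8' on
Python 3.

```python
import sys


def decode_with_default(raw):
    """Decode raw bytes using the interpreter-wide default encoding.

    Hypothetical helper: it treats the default encoding as if it were
    the "process character set" for incoming string values.
    """
    return raw.decode(sys.getdefaultencoding())


# Report which encoding this interpreter considers the default.
print(sys.getdefaultencoding())

# Pure ASCII bytes decode the same way under 'ascii' and 'utf-8'.
print(decode_with_default(b"plain ASCII text"))
```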

Regards,
Martin