[omniORB] omniORBpy and Python 2.0 Unicode

Duncan Grisby dgrisby@uk.research.att.com
Tue, 12 Dec 2000 10:32:18 +0000


On Tuesday 12 December, uche.ogbuji@fourthought.com wrote:

> Well, I didn't get any help on my unicode problem, and I needed to
> get around it, so I worked up the following patch to allow omniORBpy
> to marshal Python 2.0 Unicode objects by converting them to UTF-8
> strings.

Eek!  That doesn't do what you think it does. Or at least if it does,
it's only working because of a side-effect. In omniORBpy, the
marshalling works in two phases. In the first phase, the values being
marshalled are checked for validity, and their marshalled size is
calculated. In the second phase, the data is actually sent over the
network. Your patch only modifies the first phase. This means that
when the second phase, in the two marshalPyObject functions, does:

  case CORBA::tk_string: // max_length
    {
      CORBA::ULong slen = PyString_GET_SIZE(a_o) + 1;

      slen >>= stream;

      if (slen > 1) {
        char* str = PyString_AS_STRING(a_o);
        stream.put_char_array((const CORBA::Char*)((const char*)str),
                              slen);
      }
      else {
        CORBA::Char('\0') >>= stream;
      }
    }
    break;

you're in trouble, because the code assumes that the first phase has
made sure the object is of type PyString. Since it uses the
non-checking accessor macros for speed, it blindly accesses the
PyUnicode object as if it were a PyString. If you are getting away with
it, it's just because the PyUnicode object happens to have the right
layout in memory. Looking at the Python headers, I don't think it
would work, but you never know.
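
To make the two-phase problem concrete, here is a toy model in Python.
It is only a sketch, not omniORBpy's code: the function names, the
big-endian length prefix and the use of bytes for the plain string
type are all assumptions made for illustration. The point is that a
conversion done inside the first phase never reaches the second.

```python
import struct

def check_and_size(value):
    """Phase 1: validate the value and return its marshalled size."""
    if not isinstance(value, bytes):
        raise TypeError("expected a byte string")
    return 4 + len(value) + 1        # length word, data, terminating NUL

def patched_check_and_size(value):
    """Phase 1 with a patch of the kind described: also accept text."""
    if isinstance(value, str):
        value = value.encode("utf-8")   # the converted copy is local to phase 1
    return check_and_size(value)

def write(value, out):
    """Phase 2: trusts phase 1 and does no checking of its own."""
    out += struct.pack(">I", len(value) + 1)   # byte order is an assumption
    out += value                               # still the ORIGINAL object
    out += b"\x00"

def marshal(value, size_phase):
    size = size_phase(value)      # phase 1
    out = bytearray()
    write(value, out)             # phase 2
    assert len(out) == size
    return bytes(out)
```

With this model, marshal(b"abc", check_and_size) succeeds, while
marshal("abc", patched_check_and_size) passes the first phase and then
fails in write() with a TypeError. Python at least fails loudly here;
the C macros have no such check and simply read whatever memory is
there.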

Even if you do modify the marshal functions to do the right thing, it
is wrong to transmit UTF-8. All strings in GIOP 1.0, which omniORB 3
uses, should be ISO 8859-1. Of course, if both ends of the
communication agree that they are using UTF-8, it works anyway, but
it's asking for trouble. If you sent the string to a Java ORB, for
example, it would build a Java Unicode string assuming the transmitted
string was ISO 8859-1, so any non-ASCII characters would come out
garbled.
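
The mismatch is easy to reproduce with Python's own codecs. This is
only an illustration of the encoding problem, with a made-up sample
string; it does not involve any ORB.

```python
text = u"caf\u00e9"                 # 'café', four characters

# What a sender that encodes as UTF-8 puts on the wire.
wire = text.encode("utf-8")         # b'caf\xc3\xa9', five bytes

# What a receiver that assumes ISO 8859-1 reconstructs from those bytes.
seen = wire.decode("iso-8859-1")    # five characters, ending in U+00C3 U+00A9
```

The receiver ends up with five characters instead of four, and the
last two are not what was sent. Pure ASCII text survives unchanged,
which is why the problem can go unnoticed for a while.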

I really think it's best to convert the Python unicode objects to
strings in the Python code, rather than modifying the marshalling
code.
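
A minimal sketch of doing the conversion on the Python side follows.
The helper name to_latin1 is hypothetical, not part of omniORBpy, and
choosing ISO 8859-1 as the target encoding is my reading of the GIOP
1.0 point above. Under Python 2.0 the result of encode() is an
ordinary string object; under current Python it is a bytes object.

```python
def to_latin1(text):
    """Encode a Unicode object as ISO 8859-1 before handing it to a stub.

    Characters outside ISO 8859-1 make encode() raise an error instead
    of being sent in some other encoding by accident.
    """
    return text.encode("iso-8859-1")

encoded = to_latin1(u"caf\u00e9")   # one byte per character
```

The encoded value, rather than the original Unicode object, is what
would be passed as the string argument of the CORBA call. Failing
early on characters that do not fit seems safer than letting them
through, but that is a design choice you may want to make differently.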

Cheers,

Duncan.

-- 
 -- Duncan Grisby  \  Research Engineer  --
  -- AT&T Laboratories Cambridge          --
   -- http://www.uk.research.att.com/~dpg1 --