Endian problems (was Re: [omniORB] Client hangs while trying to narrow reference specified by corbaloc)

Wed Aug 2 14:58:23 BST 2006

It appears that I did not diagnose this problem correctly. On further
testing, I determined that the machine-to-machine connection works
between two little-endian architectures but fails between one
big-endian and one little-endian machine. Removing the use of the
omniORB INS POA did not change that, so the INS POA is no longer a
suspect (thankfully). As an aside, my previous test of the
Linux-to-Windows case failed because the Windows Firewall was getting
in the way. That is a completely separate issue that I still need to
resolve.

If the client tries to connect to the server using a stringified
object reference, I get a SequenceIsTooLong marshaling exception from
within IOP::IOR::unmarshaltype_id(). Tracing into it, I find that the
sequence length is supposed to be 32, but its bytes are not swapped
correctly, which makes omniORB think that the sequence is much, much
longer. This happens when the client is big endian and the server is
little endian, and vice versa. If the endianness of the two machines
matches, there are no problems.
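
To make the failure mode concrete, here is a minimal Python sketch (not
omniORB code) of how a 4-byte CDR length of 32 reads when the byte order
is misjudged, and how the byte-order flag in a stringified IOR is meant
to be used. It assumes the standard layout of an "IOR:" string: hex
digits encoding a CDR encapsulation whose first octet is the byte-order
flag (0 = big endian, 1 = little endian), followed by padding to a
4-byte boundary and then the type_id length as an unsigned long. The
helper names and the sample repository ID are made up for illustration,
and the sample IOR contains only the leading type_id, no profiles.

```python
import struct


def read_ulong(data, offset, little_endian):
    """Decode a 4-byte CDR unsigned long at offset using the given byte order."""
    fmt = "<I" if little_endian else ">I"
    return struct.unpack_from(fmt, data, offset)[0]


def type_id_length(stringified_ior):
    """Return (little_endian, length) for the type_id of an 'IOR:...' string."""
    if not stringified_ior.upper().startswith("IOR:"):
        raise ValueError("not a stringified IOR")
    raw = bytes.fromhex(stringified_ior[4:])
    little_endian = bool(raw[0] & 1)  # octet 0 is the byte-order flag
    # Octets 1-3 are alignment padding; the length starts at offset 4.
    return little_endian, read_ulong(raw, 4, little_endian)


def make_sample_ior(type_id, little_endian):
    """Build only the head of an IOR (flag, padding, type_id) for this demo."""
    body = type_id.encode("ascii") + b"\x00"  # CDR string lengths count the NUL
    fmt = "<I" if little_endian else ">I"
    flag = bytes([1 if little_endian else 0, 0, 0, 0])
    return "IOR:" + (flag + struct.pack(fmt, len(body)) + body).hex()


# Hypothetical repository ID: 31 characters plus the NUL gives a length of 32.
SAMPLE_TYPE_ID = "IDL:example/BootstrapObject:1.0"

if __name__ == "__main__":
    wire = struct.pack("<I", 32)  # length 32 as a little-endian sender writes it
    print("read as little endian:", read_ulong(wire, 0, True))
    print("read as big endian:   ", read_ulong(wire, 0, False))
    for little in (True, False):
        print(type_id_length(make_sample_ior(SAMPLE_TYPE_ID, little)))
```

A reader that honors the flag recovers 32 for either byte order, while
a reader that ignores it sees 0x20000000, which is the kind of value
that would trip a sequence-length sanity check.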

This did not happen before with stringified references. My primary way
of testing CORBA used to be running between a big-endian machine and a
little-endian machine. Recently, I have not been using that approach,
so I cannot pinpoint exactly when things went wrong. My current guess
is that my switch from omniORB 4.0.6 to 4.0.7 is the most likely
cause, but I have not yet tested this theory. I know that I switched
to omniORB 4.0.7 between the last phase of my project and the current
phase, and that I was using the big-endian/little-endian testing model
throughout the last phase.

My next step will be to try backing off to omniORB 4.0.6. If that
works, I will probably stick with omniORB 4.0.6.

 -Patrick

On 7/27/06, Patrick Hartling <patrick.hartling at gmail.com> wrote:
> On 7/27/06, Duncan Grisby <duncan at grisby.org> wrote:
> > On Tuesday 25 July, "Patrick Hartling" wrote:
> >
> > > Everything works very well if I have the client and server running on
> > > the same machine. If I run on two separate machines, however, the
> > > client hangs while trying to narrow the reference to the bootstrap
> > > object that it gets back from CORBA::ORB::string_to_object(). It
> > > reports the following error:
> >
> > Does it hang?  From the trace you sent, it looks like it got an
> > exception.
>
> Well, I say it hangs because the _narrow() call never returns. It just
> keeps going through a cycle of sending a message, waiting for a while,
> getting the WaitingForReply exception, and then trying again. I should
> have been more clear in my message about when the client and server
> pause.
>
> > > omniORB: Client attempt to connect to giop:tcp:192.168.1.199:42000
> > > omniORB: AsyncInvoker: thread id = 1 has started. Total threads = 2
> > > omniORB: giopRendezvouser task execute for giop:tcp:192.168.1.183:37128
> > > omniORB: AsyncInvoker: thread id = 2 has started. Total threads = 2
> > > omniORB: Scavenger task execute.
> > > omniORB: Client opened connection to giop:tcp:192.168.1.199:42000
> > > omniORB: sendChunk: to giop:tcp:192.168.1.199:42000 98 bytes
>
> There is a long pause here, which I take to mean that the client is
> waiting on the server.
>
> > > omniORB: inputMessage: from giop:tcp:192.168.1.199:42000 12 bytes
> >
> > This message is a CloseConnection message...
> >
> > > omniORB: throw giopStream::CommFailure from
> > > giopImpl10.cc:298(1,NO,COMM_FAILURE_WaitingForReply)
> >
> > ...so the client says communication failed.
> >
> > > The server reports a communication failure while trying to un-marshal
> > > arguments:
> > >
> > > omniORB: Server accepted connection from giop:tcp:192.168.1.183:37129
> > > omniORB: AsyncInvoker: thread id = 5 has started. Total threads = 3
> > > omniORB: Scavenger task execute.
> > > omniORB: AsyncInvoker: thread id = 6 has started. Total threads = 3
> > > omniORB: giopWorker task execute.
> > > omniORB: Accepted connection from giop:tcp:192.168.1.183:37129 because
> > > of this rule: "* bidir,tcp"
> > > omniORB: inputMessage: from giop:tcp:192.168.1.183:37129 98 bytes
>
> There is a long pause here before the next line of output is printed.
>
> > > omniORB: sendCloseConnection: to giop:tcp:192.168.1.183:37129 12 bytes
> >
> > The server immediately closed the connection when it got the call. I
> > can't explain why that happened. Try running with traceLevel 40 to see
> > the GIOP messages. That might give some idea of what's going on.

[trace output removed]
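
For anyone reading a GIOP dump by hand, the 12-byte message discussed
above is just a bare GIOP header: CloseConnection carries no body. The
following is a rough Python sketch of decoding such a header, assuming
the standard layout (4-byte "GIOP" magic, major and minor version, a
byte-order/flags octet whose low bit is 1 for little endian, a message
type octet, and a 4-byte message size). The sample bytes are
hand-built for illustration and are not taken from the removed trace;
the function name is my own.

```python
import struct

# Message type codes as defined by the GIOP specification.
GIOP_MESSAGE_TYPES = [
    "Request", "Reply", "CancelRequest", "LocateRequest",
    "LocateReply", "CloseConnection", "MessageError", "Fragment",
]


def parse_giop_header(header):
    """Decode the fixed 12-byte GIOP message header into a dict."""
    if len(header) < 12 or header[:4] != b"GIOP":
        raise ValueError("not a GIOP message header")
    major, minor, flags, msg_type = header[4], header[5], header[6], header[7]
    # GIOP 1.0 stores a boolean here, 1.1+ a flags octet; bit 0 works for both.
    little_endian = bool(flags & 1)
    fmt = "<I" if little_endian else ">I"
    size = struct.unpack(fmt, header[8:12])[0]
    if msg_type < len(GIOP_MESSAGE_TYPES):
        name = GIOP_MESSAGE_TYPES[msg_type]
    else:
        name = "Unknown"
    return {
        "version": (major, minor),
        "little_endian": little_endian,
        "type": name,
        "size": size,  # body length, not counting these 12 bytes
    }


# Hand-built example: GIOP 1.0, little endian, CloseConnection, empty body.
SAMPLE_HEADER = b"GIOP" + bytes([1, 0, 1, 5]) + struct.pack("<I", 0)

if __name__ == "__main__":
    print(parse_giop_header(SAMPLE_HEADER))
```

The byte-order octet in the header is what the receiver is supposed to
consult before decoding any multi-byte field in the message body.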

-- 
Patrick L. Hartling
http://www.137.org/patrick/