[omniORB] LocateRequest gets dropped on RHEL 6.1 system

Fri Sep 9 18:12:28 BST 2011

On Thu, 2011-09-08 at 14:08 -0500, Brad Fawcett wrote:

> Yes,  the Client & Server are on the same machine.
> Attached are the client & server traces with traceLevel 40 traceTime 1
> traceThreadId 1, and the tcpdump as well.

The mailing list ate your big attachments, but you CCed me so I got
them...

> server log (line 577 - time = 13:36:43.433633)  input Message
> received    //   <<< this is the suspect packet, that I think started
> out as the locateRequest.  if I'm reading it right, the GIOP internal
> message size (261 bytes) is larger then the tcp packet size (38).

Something very odd is going on. The client sends this LocateRequest:

omniORB: (5) 2011-09-08 13:36:43.433568: sendChunk: to giop:tcp:10.6.36.36:59790 38 bytes
omniORB: (5) 2011-09-08 13:36:43.433575: 
4749 4f50 0102 0103 1a00 0000 1400 0000 GIOP............
0000 0000 0e00 0000 fe19 fd68 4e00 0061 ...........hN..a
da00 0000 0000                          ......

but the server sees this

omniORB: (6) 2011-09-08 13:36:43.433633: inputMessage: from giop:tcp:[::ffff:10.6.36.36]:33014 38 bytes
omniORB: (6) 2011-09-08 13:36:43.433657: 
4749 4f50 0102 0100 0501 0000 1200 0000 GIOP............
0000 0000 0000 0000 0e00 0000 fe19 fd68 ...............h
4e00 0061 da00                          N..a..

As I mentioned before, that's the start of a valid oneway request
message, rather than the LocateRequest. It's got the right length
though. Where did the data come from?  Well, if we look at the previous
message the server got on that connection, we see...

omniORB: (6) 2011-09-08 13:36:42.792230: inputMessage: from giop:tcp:[::ffff:10.6.36.36]:33014 273 bytes
omniORB: (6) 2011-09-08 13:36:42.792246: 
4749 4f50 0102 0100 0501 0000 1200 0000 GIOP............
0000 0000 0000 0000 0e00 0000 fe19 fd68 ...............h
4e00 0061 da00 0000 0000 0000 0d00 0000 N..a............
6c6f 675f 6163 7469 7669 7479 0000 0000 log_activity....
0000 0000 3130 2e36 8100 0000 3030 3732 ....10.6....0072
3033 6434 6236 3234 6261 3061 3731 3862 03d4b624ba0a718b
6364 3232 3535 3935 3634 6436 3932 3934 cd22559564d69294
3234 3737 3637 3530 3833 3139 3661 3563 2477675083196a5c
3938 6536 3031 3333 3336 3865 3635 3461 98e60133368e654a
3639 3464 3864 6135 3333 3431 3664 3064 694d8da533416d0d
3838 3839 3763 3637 3734 3037 6334 3462 88897c677407c44b
3333 3734 3864 6430 3931 3761 3139 3164 33748dd0917a191d
3733 3064 6431 6363 3963 3763 0064 3064 730dd1cc9c7c.d0d
0500 0000 6a6f 6273 0034 3037 1300 0000 ....jobs.407....
7265 6769 7374 6572 4a6f 6257 6974 6841 registerJobWithA
4153 0064 1900 0000 6a6f 6220 3020 6973 AS.d....job 0 is
2062 6569 6e67 2073 7562 6d69 7474 6564  being submitted
00                                      .

That's a valid and complete oneway request. Its first 38 bytes are the
data that was seen in the next message.

Somehow, when omniORB tries to receive a message, it gets the correct
length for the message, but has the old data. Now, omniORB gets the data
by calling recv() on the socket, giving it a buffer. The buffer is
reused between messages, so it contains the old data at the time recv()
is called. recv() returns the correct message length, but either it
fills the buffer with the old data, or it doesn't fill the buffer at
all, leaving the old data. Either way, it looks to me like an OS /
platform bug rather than an omniORB bug.

It might be illuminating to edit src/lib/omniORB/orbcore/giopStream.cc
in inputMessage so it uses memset to clear the data just before calling
Recv. That will show if recv is returning old data or not returning data
at all.

Cheers,

Duncan.

-- 
 -- Duncan Grisby         --
  -- duncan at grisby.org     --
   -- http://www.grisby.org --