[omniORB] Sporadic CommFailure from giopStream.cc:808(0, MAYBE, COMM_FAILURE_WaitingForReply)

Thu Sep 1 11:11:56 BST 2016

On Wed, 2016-08-31 at 15:35 +0000, Goettlicher, Martin via omniORB-list
wrote:

[...]
>  The log
> output shows a call "openPsCnt" (no arguments and an int return value)
> which succeeded, followed by a call which failed, followed by a call
> which succeeded again. When the call fails at the client side, the
> application code in the server isn't called at all.

It's an interaction between connection closure on the server side and
the client only learning about it after it has sent its request.

Looking at the server (skipping lots of the irrelevant data) we have...

> omniORB: (5) 2016-08-31 16:38:26.293984: inputMessage: from giop:tcp:XXX.XXX.XXX.XXX:42299 68 bytes
[...]
> omniORB: (5) 2016-08-31 16:38:26.294132: sendChunk: to giop:tcp:XXX.XXX.XXX.XXX:42299 28 bytes
> omniORB: (5) 2016-08-31 16:38:26.294160: 
> 4749 4f50 0102 0101 1000 0000 0400 0000 GIOP............
> 0000 0000 0000 0000 0000 0000           ............

That's the client's request and the server's reply.

> omniORB: (3) 2016-08-31 16:38:34.479282: Scavenger reduce idle count for strand 0x7fcb480009f0 to 0
> omniORB: (3) 2016-08-31 16:38:34.479298: Scavenger close connection from giop:tcp:XXX.XXX.XXX.XXX:42299
> omniORB: (3) 2016-08-31 16:38:34.479310: sendCloseConnection: to giop:tcp:XXX.XXX.XXX.XXX:42299 12 bytes
> omniORB: (3) 2016-08-31 16:38:34.479322: 
> 4749 4f50 0102 0105 0000 0000           GIOP........

The server has decided that the connection is idle, so it has closed it.

> omniORB: (5) 2016-08-31 16:38:34.479474: Error in network receive (start of message): giop:tcp:XXX.XXX.XXX.XXX:42299
[...]
> omniORB: (5) 2016-08-31 16:38:34.479653: Server close connection from giop:tcp:XXX.XXX.XXX.XXX:42299
> omniORB: (5) 2016-08-31 16:38:34.479670: AsyncInvoker: thread id 5 finished immediate server task.

The dedicated thread for the connection notices that the scavenger has
closed it and tidies up after itself.

Now, on the client, just after that...

> omniORB: (0) 2016-08-31 16:38:34.488485: sendChunk: to giop:tcp:XXX.XXX.XXX.XXX:60030 68 bytes
> omniORB: (0) 2016-08-31 16:38:34.488500: 
> 4749 4f50 0102 0100 3800 0000 0600 0000 GIOP....8.......
> 0300 0000 0000 3266 0e00 0000 fe5a e9c6 ......2f.....Z..
> 5700 0063 b700 0000 0001 3631 0d00 0000 W..c......61....
> 6765 744f 7065 6e50 7343 6e74 0066 3530 getOpenPsCnt.f50
> 0000 0000                               ....

The client tries to send a message on the connection, unaware that the
server has closed it. Due to the way TCP works (even though the server
is on the same host), the TCP send succeeds!

> omniORB: (0) 2016-08-31 16:38:34.488618: Error in network receive (start of message): giop:tcp:XXX.XXX.XXX.XXX:60030

The client tries to read the reply for its request, and now receives a
socket error.

> omniORB: (0) 2016-08-31 16:38:34.488638: throw giopStream::CommFailure from giopStream.cc:808(0,MAYBE,COMM_FAILURE_WaitingForReply)
[...]
> omniORB: (0) 2016-08-31 16:38:34.488769: throw COMM_FAILURE from omniObjRef.cc:706 (MAYBE,COMM_FAILURE_WaitingForReply)

The communication has failed. Notice that the exception has the
completion status COMPLETED_MAYBE. The client knows it finished sending
its request, and that the connection broke before it got the reply. It
can't know whether the server got the request or not. Perhaps the server
got the request but the connection broke before it could send its reply.
Since omniORB can't know whether your operation is idempotent (i.e. can
safely be retried), it has to throw the exception to your code to handle
it.

If the socket error had happened when omniORB was sending the request
(e.g. if the request parameters were big enough that the TCP layer
noticed the closed connection while sending), the exception would have
had the status COMPLETED_NO, and omniORB would have automatically
retried.

By far the easiest way to avoid this kind of situation is to make sure
that in normal circumstances it is clients that close connections,
rather than the server. You have set both inConScanPeriod and
outConScanPeriod to 10, meaning the client and server race with each
other to close connections. You should set inConScanPeriod a fair bit
larger than outConScanPeriod. That way the server won't close
connections when the client isn't expecting it. If you want
outConScanPeriod 10, I'd set inConScanPeriod to 20. (The default values
are 120 and 180.)

If your operations are idempotent, or you're willing to take the
(actually very small) risk of a repeated call, you can use a
COMM_FAILURE exception handler to automatically retry calls even if they
have status COMPLETED_MAYBE:

http://omniorb.sourceforge.net/omni42/omniORB/omniORB004.html#toc26

Regards,

Duncan.

-- 
 -- Duncan Grisby         --
  -- duncan at grisby.org     --
   -- http://www.grisby.org --