[omniORB] Bidir connection problem

Wed Mar 12 15:05:19 GMT 2014

Hello.

We've got a problem with bidir connection I can't get to solve.

This is our setup:

_ server side we've got an application based on omniORB 4.1.6;

_ clients use JacORB and connect to the server with a bidir policy;

_ the server periodically makes oneway calls to clients; in order to 
avoid blocking on these, ClientCallTimeout was set to five seconds;

_ since we need to transfer huge amounts of data, we raised 
giopMaxMsgSize. We tried several values and found out that the bigger 
this parameter is, the higher is the chance of seeing the problem, 
especially on slow lines. However using the default value does not mean 
the problem never happens, it's just more unlikely to happen (and harder 
to reproduce).

The sequence of events that lead to the problem is the following:

_ the client connects to the server: bidir policy works and the two ends 
are able to exchange messages;

_ then the client makes a long call to the server: this seems to block 
the only active connection for other uses;

_ meanwhile the server makes a call to the client, which times out; so 
the server closes the connection. In the logs I see:

> 2014-Mar-12 14:10:11: omniORB: Timed out waiting for write lock: giop:tcp:10.1.2.26:61583
> 2014-Mar-12 14:10:11: omniORB: throw giopStream::CommFailure from giopStream.cc:331(0,NO,TRANSIENT_CallTimedout)
> 2014-Mar-12 14:10:11: omniORB: Unexpected error encountered in talking to the server giop:tcp:10.1.2.26:61583. The connection is closed immediately.  GIOP_C state 2, strand state 0
> 2014-Mar-12 14:10:11: omniORB: throw TRANSIENT from omniObjRef.cc:809 (NO,TRANSIENT_CallTimedout)

Notice that while the notification is lost, the client side call completes.

_ Client side, JacORB sees the connection is gone and opens a new one:
> [jacorb.orb.iiop] DEBUG : Transport to 10.1.2.15:60606: stream closed on read  < 0
> [jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): getMessage() -- COMM_FAILURE
> [jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): streamClosed()
> [jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): closeAllowReopen()
> [jacorb.orb.iiop] INFO : Client-side TCP transport to 10.1.2.15:60606 closed.
> [jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): will wait until connected
> 524589
> [jacorb.giop.conn] DEBUG : ClientGIOPConnection to 10.1.2.15:60606 (5f064398): sendMessage() -- opening transport
> [jacorb.orb.iiop] DEBUG : Trying to connect to 10.1.2.15:60606 with timeout=90000.
> [jacorb.orb.iiop] INFO : Connected to 10.1.2.15:60606 from local port 47203

_ from now on, while the client can call the server fine, any call from 
the server to the client will fail with:
> 2014-Mar-12 14:10:21: 2014-Mar-12 14:10:21: omniORB: throw TRANSIENT from giopBiDir.cc:447 (NO,TRANSIENT_BiDirConnIsGone)
> 2014-Mar-12 14:10:21: omniORB: throw TRANSIENT from omniObjRef.cc:822 (NO,TRANSIENT_BiDirConnIsGone)

What we'd like to achieve:

_ ideally the "notification" should overlap with the client side call; I 
know this is probably hard, if not impossible, to achive, especially 
when limited bandwidth is the reason behind the timeout, so we could 
live with a lost call;

_ what I really don't understand is why any subsequent call fails, even 
if the client side traffic has dropped.

Should the old connection really close on a failed oneway call, even if 
data is incoming fine on this connection?

Then again, why isn't the second connection used?

Here is omniORB's log when it's accepted:
> 2014-Mar-12 14:10:20: omniORB: Server accepted connection from giop:tcp:10.1.2.26:47203
> 2014-Mar-12 14:10:20: omniORB: giopWorker task execute.
> 2014-Mar-12 14:10:20: omniORB: Accepted connection from giop:tcp:10.1.2.26:47203 because of this rule: "* tcp,ssl,bidir"
> 2014-Mar-12 14:10:20: omniORB: inputMessage: from giop:tcp:10.1.2.26:47203 179 bytes
> 2014-Mar-12 14:10:20: omniORB:
> 2014-Mar-12 14:10:20: omniORB: Receive codeset service context and set TCS to (UTF-8,UTF-16)
> 2014-Mar-12 14:10:20: omniORB: Receive bidir IIOP service context: ( 10.1.2.26:39278 10.1.2.26:44795 )
> 2014-Mar-12 14:10:20: omniORB: Accepted request from giop:tcp:10.1.2.26:47203 to switch to bidirectional because of this rule: "* tcp,ssl,bidir"

Bidir is again established, so why is it not used?

Any help is appreciated.

  Thanks
	av.