[omniORB] Why do oneway requests hang on disconnected network?

Thu Nov 16 13:15:31 GMT 2006

On Wednesday 15 November, Tuyen Chau wrote:

> Why do "oneway" requests hang, instead of returning a COMM_FAILURE
> exception, when the network is disconnected?  As part of testing our
> product, we unplugged the network cable.  We were surprised to find
> that these oneway requests executed without errors for a good 5-10
> minutes or so, then they blocked indefinitely.  If we replaced the
> network cable, the calls eventually unblocked and everything worked
> again.  Our best guess at the moment is that there is a data buffer
> for outgoing requests and the oneway requests block when the buffer is
> full.

omniORB doesn't buffer requests at all. In a oneway request, it simply
sends the data through the TCP socket and carries on its way. The
buffering you are seeing is in the TCP stack. If the server isn't
acknowledging data (because the cable's not there), the stack's send
buffer eventually fills, and the next send from omniORB blocks.

If the OS doesn't notice that the connection is broken, it won't tell
omniORB when omniORB tries to send, which is why you see that you can
send lots of oneway requests before anything untoward happens. The way
TCP works, there's no way to tell that a cable has been unplugged and
quickly close the connection.
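The effect is easy to reproduce locally. The following is a minimal
sketch in Python, not omniORB code: it assumes a loopback TCP connection
whose peer never reads (standing in for the unreachable server), and the
function name fill_send_buffer is made up for illustration. Sends keep
succeeding until the kernel refuses more data; only then does the
sender notice anything.

```python
import socket


def fill_send_buffer(chunk_size=65536, max_bytes=1 << 30):
    """Send to a peer that never reads.

    Returns (bytes_accepted, would_block): how many bytes the kernel took
    before a non-blocking send reported that it would have to block.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()  # accepted, but never read from

    client.setblocking(False)  # report "would block" instead of hanging
    chunk = b"x" * chunk_size
    sent = 0
    would_block = False
    try:
        while sent < max_bytes:  # safety bound for the sketch
            try:
                sent += client.send(chunk)
            except BlockingIOError:
                # The kernel buffers are full; a blocking socket would
                # hang at this point.
                would_block = True
                break
    finally:
        client.close()
        server.close()
        listener.close()
    return sent, would_block


if __name__ == "__main__":
    accepted, blocked = fill_send_buffer()
    print("kernel accepted", accepted, "bytes; would block:", blocked)
```

The byte count depends on the OS and its buffer settings, which is why
the number of oneway calls that "succeed" varies from system to system.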

> Is there any way to alter this behavior and receive a COMM_FAILURE
> exception instead?

If you set a timeout on the calls, they will time out if the send call
blocks, leading to a COMM_FAILURE exception. That won't make it fail any
quicker, though, because the send won't block, and therefore won't time
out, until the TCP buffers are full.
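To illustrate why a timeout does not speed up detection, here is a
rough sketch using plain Python sockets as a stand-in. It does not use
omniORB's timeout API; the function name send_until_timeout and the
0.2 second timeout are assumptions for the example. The peer never
reads, so the early sends complete immediately and the timeout only
fires once a send actually has to wait.

```python
import socket


def send_until_timeout(timeout_s=0.2, chunk_size=16384, max_sends=1000000):
    """Blocking sends, each with a timeout, to a peer that never reads.

    Returns (completed_sends, timed_out).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()  # accepted, but never read from

    client.settimeout(timeout_s)  # applies to each sendall() call
    chunk = b"x" * chunk_size
    completed = 0
    timed_out = False
    try:
        for _ in range(max_sends):  # safety bound for the sketch
            try:
                client.sendall(chunk)  # returns at once while buffers have room
                completed += 1
            except socket.timeout:
                # Only reached after the buffers filled and a send waited
                # for the whole timeout without making progress.
                timed_out = True
                break
    finally:
        client.close()
        server.close()
        listener.close()
    return completed, timed_out


if __name__ == "__main__":
    done, expired = send_until_timeout()
    print(done, "sends completed before the timeout fired:", expired)
```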

The only other alternative is to modify omniORB so it sets the
SO_KEEPALIVE socket option on its TCP sockets. That way the OS will send
keepalive packets, and tear down the connection if the keepalives are
lost. But with that, you're at the mercy of the OS as to when it starts
sending keepalives, and once it does, how often it sends them and how
many must go missing before it gives up. See this from the Linux tcp(7)
manpage, for example:

SYSCTLS
       These variables can be accessed by the /proc/sys/net/ipv4/*  files  or
       with the sysctl(2) interface.  In addition, most IP sysctls also apply
       to TCP; see ip(7).
...
       tcp_keepalive_intvl
              The  number  of  seconds  between  TCP  keep-alive probes.  The
              default value is 75 seconds.

       tcp_keepalive_probes
              The maximum number of TCP keep-alive probes to send before
              giving up and killing the connection if no response is
              obtained from the other end.  The default value is 9.

       tcp_keepalive_time
              The number of seconds a connection needs to be idle before  TCP
              begins  sending  out  keep-alive  probes.  Keep-alives are only
              sent when the  SO_KEEPALIVE  socket  option  is  enabled.   The
              default value is 7200 seconds (2 hours).  An idle connection is
              terminated after approximately  an  additional  11  minutes  (9
              probes  an  interval  of  75  seconds apart) when keep-alive is
              enabled.

The default times mean that SO_KEEPALIVE is basically useless for your
situation unless you radically reduce the times, and these sysctls apply
to the whole machine, not just your process.
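For reference, the arithmetic behind those defaults, plus what enabling
keepalive looks like on a socket you control, can be sketched in Python
as below. This is not omniORB code, and applying it to omniORB's
connections would still mean modifying omniORB. The helper names and the
30/5/3 timings are illustrative assumptions. The Linux tcp(7) manpage
also lists per-socket options (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT)
that override the sysctls for a single socket; they are Linux-specific,
so the sketch sets them only where Python exposes them.

```python
import socket


def detection_time(idle_s, interval_s, probes):
    """Approximate seconds before an idle dead connection is torn down."""
    return idle_s + interval_s * probes


def enable_keepalive(sock, idle_s=30, interval_s=5, probes=3):
    """Enable SO_KEEPALIVE and, where available, per-socket timings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the sysctl values; skipped on platforms
    # where the socket module does not define them.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    # Linux defaults quoted above: 7200 s idle, 9 probes, 75 s apart.
    print(detection_time(7200, 75, 9), "seconds with the defaults")
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive on:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With the defaults that is 7875 seconds, a little over two hours, which
matches the "2 hours plus approximately 11 minutes" in the manpage text.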

Cheers,

Duncan.

-- 
 -- Duncan Grisby         --
  -- duncan at grisby.org     --
   -- http://www.grisby.org --