[omniORB] Bug report: omniORB manifests timeouts with TRANSIENT_ConnectFailed exception

Serguei Kolos Serguei.Kolos at cern.ch
Thu Jun 16 13:23:11 BST 2011


Hi Duncan

This happens rarely and is due to the known inaccuracy of timeouts
in the "poll" system call, which varies from 1 to 10 milliseconds
depending on the platform. The issue is that when the doConnect function
in tcpAddress.cc detects a timeout (lines 356-358), it returns 0, just
as it does for a plain connection failure.

Then the giopStream::errorOnSend function tries to recover the true
reason for the failure by comparing the timeout deadline with the
current time. On modern computers this calculation may take less
time than the "poll" timeout inaccuracy. For example, if the timeout
was set to expire at 03:28:50.900000, the poll system call may
already return zero (which indicates a timeout) at 03:28:50.899000,
i.e. 1 millisecond before the deadline. The code after that
can execute so quickly that the timeout check in giopStream.cc
(lines 1293-1294) still finds the deadline not yet passed, and
the failure is finally reported as TRANSIENT_ConnectFailed.
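
To make the race concrete, here is a minimal standalone sketch of a strict
deadline comparison of the kind described above, run on the example
timestamps. The function names are hypothetical and this is not the omniORB
source; it only assumes the comparison is "now strictly later than deadline".

```cpp
#include <cassert>

// Hypothetical stand-in for a strict deadline test; not the omniORB source.
// Returns true only when the current time (s, ns) is strictly past the deadline.
bool deadlinePassedStrict(unsigned long s, unsigned long ns,
                          unsigned long deadline_s, unsigned long deadline_ns) {
  return s > deadline_s || (s == deadline_s && ns > deadline_ns);
}

// Replays the example from the message with times taken as seconds since midnight.
void strictCheckExample() {
  const unsigned long sec         = 12530;      // 03:28:50
  const unsigned long deadline_ns = 900000000;  // .900000
  const unsigned long now_ns      = 899000000;  // .899000, poll returned 1 ms early

  // poll has already reported a timeout, yet the strict test says the
  // deadline has not passed, so the failure is not classified as a timeout.
  assert(!deadlinePassedStrict(sec, now_ns, sec, deadline_ns));
}
```

Calling strictCheckExample() passes silently, which is exactly the problem:
the timeout reported by poll is invisible to the later comparison.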

I have attached a patch which provides a simple workaround for
the issue. It adds a 10 ms range check to the giopStream::errorOnSend
function, but I consider it an ugly and improper hack.
A proper fix would most likely require a change to the communication
protocol interface (for example, the giopAddress::Connect abstract
function could be redefined to return an explicit indication of a timeout).
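
The two options can be sketched side by side as standalone code. Everything
below is hypothetical and for illustration only: deadlinePassedWithTolerance,
ConnectOutcome and minorNameFor are invented names, not omniORB interfaces.
The first helper expresses the 10 ms range check as a signed
remaining-time computation; the enum shows what an explicit timeout
indication from a connect call might look like.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: treats "now" as timed out when it is past the
// deadline or less than tolerance_ns before it (10 ms by default).
bool deadlinePassedWithTolerance(unsigned long s, unsigned long ns,
                                 unsigned long deadline_s, unsigned long deadline_ns,
                                 std::int64_t tolerance_ns = 10000000) {
  const std::int64_t kNsPerSec = 1000000000;
  // Signed time left until the deadline; negative once it has passed.
  const std::int64_t remaining =
      (static_cast<std::int64_t>(deadline_s) - static_cast<std::int64_t>(s)) * kNsPerSec +
      (static_cast<std::int64_t>(deadline_ns) - static_cast<std::int64_t>(ns));
  return remaining < tolerance_ns;
}

// Hypothetical explicit result of a connect attempt, so the caller does not
// have to infer a timeout from the clock afterwards.
enum class ConnectOutcome { Connected, Failed, TimedOut };

// Maps an outcome to the name of the minor code that would be reported.
const char* minorNameFor(ConnectOutcome outcome) {
  switch (outcome) {
    case ConnectOutcome::TimedOut: return "TRANSIENT_CallTimedout";
    case ConnectOutcome::Failed:   return "TRANSIENT_ConnectFailed";
    case ConnectOutcome::Connected: break;
  }
  return "";  // no error to report
}

void toleranceExample() {
  // 1 ms before the deadline now counts as a timeout.
  assert(deadlinePassedWithTolerance(12530, 899000000, 12530, 900000000));
  // 20 ms before the deadline still does not.
  assert(!deadlinePassedWithTolerance(12530, 880000000, 12530, 900000000));
  // An explicit outcome needs no clock comparison at all.
  assert(std::strcmp(minorNameFor(ConnectOutcome::TimedOut),
                     "TRANSIENT_CallTimedout") == 0);
}
```

The tolerance helper also covers the case where the deadline falls just
after a second boundary, which the range check has to handle separately
when seconds and nanoseconds are compared field by field.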

Can you please confirm that the analysis is correct and tell me whether the
proposed fix is reasonable? Do you have a better idea for a possible fix?

Cheers,
Sergei

PS: The patch is made against omniORB 4.1.5.
-------------- next part --------------
--- giopStream.cc	2009-05-06 18:16:18.000000000 +0200
+++ /afs/cern.ch/user/k/kolos/patches/omni/src/lib/omniORB/orbcore/giopStream.cc	2011-06-16 11:59:45.000000000 +0200
@@ -1291,7 +1291,8 @@
       unsigned long s, ns;
       omni_thread::get_time(&s, &ns);
       if (s > pd_deadline_secs ||
-          (s == pd_deadline_secs && ns > pd_deadline_nanosecs)) {
+          (s == pd_deadline_secs && (ns > pd_deadline_nanosecs || (pd_deadline_nanosecs-ns) < 10000000)) ||
+          (pd_deadline_secs-s == 1 && (pd_deadline_nanosecs+1000000000-ns) < 10000000)) {
         
         retry = 0;
         minor = TRANSIENT_CallTimedout;

