[omniORB] Windows timeout during ORB shutdown

Tue Jan 20 09:48:49 UTC 2026

Hello,

To speed up testing of Tango, I am trying to shut down and restart 
"Tango" (and thus the ORB) in a single process.  I believe I have got 
this (at least mostly) working on Linux.  On Windows, however, I run into 
an assertion failure inside omniORB when I go to restart the ORB after 
it has been destroyed.  The assertion failure reports that the 
`giopServer` singleton is in `ZOMBIE` state:

     omniORB: (11) 2026-01-19 12:02:41.819000: Assertion failed. This indicates a bug in the application
     using omniORB, or maybe in omniORB itself.
       file: giopServer.cc
       line: 154
       info: pd_state != ZOMBIE

When I set `ORBtraceLevel=25`, I can clearly see that we are timing out 
during shutdown after the initial log line below has been emitted.  This 
marks where some thread (10) starts waiting for the dedicated threads 
(6, 9), which themselves seem to run into `CommFailure` issues ~5 
seconds later:

     omniORB: (10) 2026-01-19 12:40:21.734000: Wait for 2 dedicated threads to finish...
     omniORB: (9) 2026-01-19 12:40:26.156000: Error in network receive (start of message): giop:tcp:[::ffff:10.0.2.15]:50344
     omniORB: (9) 2026-01-19 12:40:26.156000: throw giopStream::CommFailure from giopStream.cc:857(0,NO,COMM_FAILURE_UnMarshalArguments)
     omniORB: (9) 2026-01-19 12:40:26.156000: Server connection giop:tcp:[::ffff:10.0.2.15]:50344 refcount = 1
     omniORB: (9) 2026-01-19 12:40:26.156000: removeConnectionAndWorker for dedicated worker. 1 remaining.
     omniORB: (9) 2026-01-19 12:40:26.156000: Server connection giop:tcp:[::ffff:10.0.2.15]:50344 refcount = 0
     omniORB: (9) 2026-01-19 12:40:26.156000: Server close connection from giop:tcp:[::ffff:10.0.2.15]:50344
     omniORB: (9) 2026-01-19 12:40:26.156000: AsyncInvoker: thread id 9 finished immediate server task.
     omniORB: (10) 2026-01-19 12:40:26.734000: Timed out. 1 connection and 1 dedicated worker remaining.
     omniORB: (10) 2026-01-19 12:40:26.734000: Terminate rendezvousers...
     omniORB: (10) 2026-01-19 12:40:26.734000: giopRendezvouser for giop:tcp:10.0.2.15:54321 terminate...
     omniORB: (6) 2026-01-19 12:40:26.734000: Error in network receive (start of message): giop:tcp:[::1]:50343
     omniORB: (6) 2026-01-19 12:40:26.734000: throw giopStream::CommFailure from giopStream.cc:857(0,NO,COMM_FAILURE_UnMarshalArguments)
     omniORB: (6) 2026-01-19 12:40:26.734000: Server connection giop:tcp:[::1]:50343 refcount = 1
     omniORB: (6) 2026-01-19 12:40:26.734000: removeConnectionAndWorker for dedicated worker. 0 remaining.
     omniORB: (6) 2026-01-19 12:40:26.734000: Server connection giop:tcp:[::1]:50343 refcount = 0
     omniORB: (6) 2026-01-19 12:40:26.734000: Server close connection from giop:tcp:[::1]:50343
     omniORB: (3) 2026-01-19 12:40:26.734000: giopServer terminate connection from giop:tcp:[::ffff:10.0.2.15]:50348.
     omniORB: (6) 2026-01-19 12:40:26.734000: AsyncInvoker: thread id 6 finished immediate server task.
     omniORB: (3) 2026-01-19 12:40:26.734000: giopRendezvouser for endpoint giop:tcp:10.0.2.15:54321 exit.
     omniORB: (3) 2026-01-19 12:40:26.734000: No remaining rendezvousers.
     omniORB: (3) 2026-01-19 12:40:26.734000: AsyncInvoker: thread id 3 finished immediate general task.
     omniORB: (10) 2026-01-19 12:40:26.734000: Rendezvousers terminated.
     omniORB: (10) 2026-01-19 12:40:26.734000: giopServer deactivated.
     omniORB: (10) 2026-01-19 12:40:26.734000: Destruction of POA(RootPOA) complete.

From looking at giopServer.cc, I believe that timing out while waiting for 
the dedicated threads is what puts us into the `ZOMBIE` state.

Unfortunately, I don't have the CORBA chops to easily create a simple 
reproduction of this for you.  The assertion failure occurs during a 
call to `orb->resolve_initial_references("RootPOA")`, after we have 
previously shut down and destroyed the ORB with a call to `orb->destroy()`.

I have done some debugging with Visual Studio and I believe the issue is 
that, unlike on Linux, the winsock2 socket implementation does not mark 
the file descriptor as ready to read when `shutdown()` is called[1].  
This means that the worker threads time out of their `select()` call, 
rather than being woken as they are on Linux, and then the thread 
waiting for the workers also times out (as both timeouts come from 
`ORBscanGranularity`).
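
To make the socket behaviour concrete, here is a minimal POSIX-only sketch of the check I have in mind.  It is not the attached sock_timeout.cpp, and it is simpler than what omniORB does: it calls `shutdown()` first and only then asks `select()` whether the descriptor is readable, rather than interrupting a `select()` that is already blocked in another thread.  The helper names `make_loopback_pair` and `readable_after_shutdown` are made up for illustration.  A Windows build would need the Winsock equivalents (`WSAStartup`, `closesocket`, `SD_BOTH`), which I have left out.

```cpp
// POSIX-only sketch (Linux/macOS). Helper names are hypothetical.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>

// Builds a connected TCP pair over 127.0.0.1.
// Returns 0 on success and stores {server side, client side} in fds.
int make_loopback_pair(int fds[2]) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port
    socklen_t len = sizeof(addr);

    int client = -1;
    int server = -1;
    bool ok =
        bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0 &&
        listen(listener, 1) == 0 &&
        getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &len) == 0 &&
        (client = socket(AF_INET, SOCK_STREAM, 0)) >= 0 &&
        connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0 &&
        (server = accept(listener, nullptr, nullptr)) >= 0;
    close(listener);
    if (!ok) {
        if (client >= 0) close(client);
        return -1;
    }
    fds[0] = server;
    fds[1] = client;
    return 0;
}

// Shuts the socket down locally, then asks select() whether it is readable.
// Returns 1 if marked ready, 0 if select() timed out, -1 on error.
int readable_after_shutdown(int fd, int timeout_ms) {
    assert(timeout_ms >= 0);
    if (shutdown(fd, SHUT_RDWR) != 0) return -1;

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int n = select(fd + 1, &readfds, nullptr, nullptr, &tv);
    if (n < 0) return -1;
    return (n > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

Calling `make_loopback_pair(fds)` and then `readable_after_shutdown(fds[0], 2000)` should report the descriptor as ready straight away on Linux.  The behaviour I describe above for Windows would correspond to the call returning 0 only after the full timeout has elapsed.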

I suspect that the reason we are hitting the `ZOMBIE` state is related to 
the fact that I am starting and stopping the server so quickly.  The 
worker thread starts its `select()` call at around the same time as the 
shutdown occurs and we start trying to join the worker threads.  This 
means if the worker thread times out, then the "main" thread also times 
out.  In a more realistic scenario, you would expect these two events to 
be more staggered, and the main thread would not time out (I think).
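
The timing argument above can be written down as a toy model.  This is only a sketch of my reasoning, not omniORB's actual shutdown logic.  It assumes that a worker which nothing wakes leaves `select()` exactly one timeout after entering it, and that the main thread gives up exactly one timeout after it starts waiting.  The struct and function names are invented for illustration.

```cpp
#include <cassert>
#include <vector>

// All times are milliseconds on one shared clock.
struct ShutdownModel {
    long timeout_ms;       // shared timeout (ORBscanGranularity in this post)
    long main_wait_start;  // when the shutting-down thread starts waiting
};

// Assumption: a worker that nothing wakes leaves select() exactly
// timeout_ms after entering it.
long worker_exit_time(const ShutdownModel& m, long select_start) {
    return select_start + m.timeout_ms;
}

// Counts the workers still running when the main thread's wait expires.
int workers_remaining(const ShutdownModel& m,
                      const std::vector<long>& select_starts) {
    assert(m.timeout_ms > 0);
    const long deadline = m.main_wait_start + m.timeout_ms;
    int remaining = 0;
    for (long start : select_starts) {
        if (worker_exit_time(m, start) >= deadline) ++remaining;
    }
    return remaining;
}
```

Plugging in the trace above, with times relative to 12:40:21.000: the main wait starts at 734 ms with a 5000 ms timeout.  Thread 9 reports its error at 26.156, so I infer its `select()` began at about 156 ms.  Thread 6 reports at 26.734, so about 734 ms.  `workers_remaining(ShutdownModel{5000, 734}, {156, 734})` then gives 1, matching the "1 dedicated worker remaining" line.  If thread 6 had entered `select()` even slightly earlier, the model gives 0.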

However, as I am interested in speeding up tests, I would also like to 
avoid the timeout on the worker threads altogether if possible, as 
well as avoiding the `ZOMBIE` state.

As per the assertion failure message, is there something I am doing 
wrong here when cleaning up the ORB on Windows, or is this a bug in 
omniORB itself?

Thanks for your help,
Thomas Ives

[1] I have attached a small C++ program (sock_timeout.cpp) that 
approximates the situation I find when restarting Tango and demonstrates 
the difference in socket behavior on Windows and Linux.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sock_timeout.cpp
Type: text/x-c++src
Size: 4636 bytes
Desc: not available
URL: <https://www.omniorb-support.com/pipermail/omniorb-list/attachments/20260120/84617890/attachment.bin>