[omniORB] Windows timeout during ORB shutdown
Thomas Ives
tri at observatorysciences.co.uk
Tue Jan 20 09:48:49 UTC 2026
Hello,
To speed up testing of Tango, I am trying to shut down and restart
"Tango" (and thus the ORB) in a single process. I believe I have got
this (at least mostly) working on Linux; however, on Windows I run into
an assertion failure inside omniORB when I go to restart the ORB after
it has been destroyed. The assertion failure reports that the
`giopServer` singleton is in `ZOMBIE` state:
omniORB: (11) 2026-01-19 12:02:41.819000: Assertion failed. This indicates a bug in the application
using omniORB, or maybe in omniORB itself.
file: giopServer.cc
line: 154
info: pd_state != ZOMBIE
When I set `ORBtraceLevel=25`, I can clearly see that we are timing out
during shutdown after the initial log line below has been emitted. This
marks where some thread (10) starts waiting for the dedicated threads
(6, 9), which themselves seem to run into `CommFailure` issues ~5
seconds later:
omniORB: (10) 2026-01-19 12:40:21.734000: Wait for 2 dedicated threads to finish...
omniORB: (9) 2026-01-19 12:40:26.156000: Error in network receive (start of message): giop:tcp:[::ffff:10.0.2.15]:50344
omniORB: (9) 2026-01-19 12:40:26.156000: throw giopStream::CommFailure from giopStream.cc:857(0,NO,COMM_FAILURE_UnMarshalArguments)
omniORB: (9) 2026-01-19 12:40:26.156000: Server connection giop:tcp:[::ffff:10.0.2.15]:50344 refcount = 1
omniORB: (9) 2026-01-19 12:40:26.156000: removeConnectionAndWorker for dedicated worker. 1 remaining.
omniORB: (9) 2026-01-19 12:40:26.156000: Server connection giop:tcp:[::ffff:10.0.2.15]:50344 refcount = 0
omniORB: (9) 2026-01-19 12:40:26.156000: Server close connection from giop:tcp:[::ffff:10.0.2.15]:50344
omniORB: (9) 2026-01-19 12:40:26.156000: AsyncInvoker: thread id 9 finished immediate server task.
omniORB: (10) 2026-01-19 12:40:26.734000: Timed out. 1 connection and 1 dedicated worker remaining.
omniORB: (10) 2026-01-19 12:40:26.734000: Terminate rendezvousers...
omniORB: (10) 2026-01-19 12:40:26.734000: giopRendezvouser for giop:tcp:10.0.2.15:54321 terminate...
omniORB: (6) 2026-01-19 12:40:26.734000: Error in network receive (start of message): giop:tcp:[::1]:50343
omniORB: (6) 2026-01-19 12:40:26.734000: throw giopStream::CommFailure from giopStream.cc:857(0,NO,COMM_FAILURE_UnMarshalArguments)
omniORB: (6) 2026-01-19 12:40:26.734000: Server connection giop:tcp:[::1]:50343 refcount = 1
omniORB: (6) 2026-01-19 12:40:26.734000: removeConnectionAndWorker for dedicated worker. 0 remaining.
omniORB: (6) 2026-01-19 12:40:26.734000: Server connection giop:tcp:[::1]:50343 refcount = 0
omniORB: (6) 2026-01-19 12:40:26.734000: Server close connection from giop:tcp:[::1]:50343
omniORB: (3) 2026-01-19 12:40:26.734000: giopServer terminate connection from giop:tcp:[::ffff:10.0.2.15]:50348.
omniORB: (6) 2026-01-19 12:40:26.734000: AsyncInvoker: thread id 6 finished immediate server task.
omniORB: (3) 2026-01-19 12:40:26.734000: giopRendezvouser for endpoint giop:tcp:10.0.2.15:54321 exit.
omniORB: (3) 2026-01-19 12:40:26.734000: No remaining rendezvousers.
omniORB: (3) 2026-01-19 12:40:26.734000: AsyncInvoker: thread id 3 finished immediate general task.
omniORB: (10) 2026-01-19 12:40:26.734000: Rendezvousers terminated.
omniORB: (10) 2026-01-19 12:40:26.734000: giopServer deactivated.
omniORB: (10) 2026-01-19 12:40:26.734000: Destruction of POA(RootPOA) complete.
From looking at giopServer.cc, I believe that timing out while we wait
for the dedicated threads is what puts us into the `ZOMBIE` state.
Unfortunately, I don't have the CORBA chops to easily create a simple
reproduction of this for you. The assertion failure occurs during a
call to `orb->resolve_initial_references("RootPOA")`, after we have
shut down and destroyed the ORB with an earlier call to `orb->destroy()`.
I have done some debugging with Visual Studio and I believe the issue is
that, unlike on Linux, the winsock2 socket implementation does not mark
the file descriptor as ready to read when `shutdown()` is called[1].
This means that the worker threads time out of their `select()` call,
rather than being woken as they are on Linux, and then the thread
waiting for the workers also times out (as both timeouts come from
ORBscanGranularity).
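
The socket behaviour described above can be sketched roughly as follows. This is not the attached sock_timeout.cpp; it is a minimal illustration using POSIX sockets only, and the names `WaitResult`, `make_tcp_pair` and `wait_while_other_thread_shuts_down` are made up for this example. One thread blocks in `select()` on a connected loopback socket while a second thread calls `shutdown()` on the same socket. On Linux the `select()` call is expected to return early with the descriptor marked readable; the finding reported above is that the equivalent Winsock code instead runs until the timeout expires.

```cpp
// Sketch only: POSIX sockets, hypothetical helper names.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <chrono>
#include <thread>

// What select() returned and how long it blocked.
struct WaitResult {
    int select_rc;         // 1 = fd reported readable, 0 = timed out, -1 = error
    long long elapsed_ms;  // time spent inside select()
};

// Create a connected TCP pair over loopback; returns false on failure.
static bool make_tcp_pair(int& client, int& server) {
    client = server = -1;
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) return false;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port
    socklen_t len = sizeof(addr);
    bool ok = bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0
           && listen(listener, 1) == 0
           && getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &len) == 0;
    if (ok) client = socket(AF_INET, SOCK_STREAM, 0);
    ok = ok && client >= 0
            && connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
    if (ok) server = accept(listener, nullptr, nullptr);
    close(listener);
    if (server < 0) {
        if (client >= 0) close(client);
        client = -1;
        return false;
    }
    return true;
}

// The calling thread plays the "worker" blocked in select(); a second thread
// plays the "main" thread and calls shutdown() on the same socket after delay_ms.
WaitResult wait_while_other_thread_shuts_down(int timeout_s, int delay_ms) {
    assert(timeout_s >= 0 && delay_ms >= 0);
    WaitResult r{-1, 0};
    int client = -1, server = -1;
    if (!make_tcp_pair(client, server)) return r;

    std::thread closer([server, delay_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
        shutdown(server, SHUT_RDWR);
    });

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(server, &readfds);
    timeval tv{};
    tv.tv_sec = timeout_s;

    auto start = std::chrono::steady_clock::now();
    r.select_rc = select(server + 1, &readfds, nullptr, nullptr, &tv);
    auto end = std::chrono::steady_clock::now();
    r.elapsed_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();

    closer.join();
    close(server);
    close(client);
    return r;
}
```

Calling `wait_while_other_thread_shuts_down(3, 100)` and comparing `elapsed_ms` with the 3 second timeout shows whether the waiter was woken by `shutdown()` or had to wait for the timeout. A Windows port would need the Winsock equivalents (`WSAStartup`, `closesocket`, `SD_BOTH`), which are omitted here.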
I suspect that the reason we are hitting the `ZOMBIE` state is related to
the fact that I am starting and stopping the server so quickly. The
worker thread starts its `select()` call at around the same time as the
shutdown occurs and we start trying to join the worker threads. This
means if the worker thread times out, then the "main" thread also times
out. In a more realistic scenario, you would expect these two events to
be more staggered, and the main thread would not time out (I think).
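
The timing argument in the previous paragraph can be checked with a little arithmetic. The sketch below is only a model, not omniORB code: `ShutdownTiming` and `model_shutdown` are hypothetical names, and it assumes both the worker's `select()` and the joining thread use the same timeout, taken here to be 5000 ms to match the "~5 seconds" seen in the trace. Under that assumption the joining thread times out exactly when the worker entered `select()` no earlier than the join started.

```cpp
// Sketch only: a toy model of the two timeouts, not omniORB code.
#include <cassert>

// All times are in milliseconds since an arbitrary origin.
struct ShutdownTiming {
    long long worker_wake_ms;    // when the worker's select() times out
    long long main_deadline_ms;  // when the joining thread gives up waiting
    bool main_times_out;         // true if the worker wakes too late
};

ShutdownTiming model_shutdown(long long select_start_ms,
                              long long wait_start_ms,
                              long long granularity_ms) {
    assert(granularity_ms > 0);
    ShutdownTiming t{};
    t.worker_wake_ms = select_start_ms + granularity_ms;
    t.main_deadline_ms = wait_start_ms + granularity_ms;
    // With equal timeouts this reduces to select_start_ms >= wait_start_ms.
    t.main_times_out = t.worker_wake_ms >= t.main_deadline_ms;
    return t;
}
```

Plugging in the seconds-within-the-minute from the trace: the wait starts at 21.734 and thread 9 fails its receive at 26.156, which under the 5000 ms assumption means it entered `select()` at about 21.156, so `model_shutdown(21156, 21734, 5000)` reports no timeout for that worker. Thread 6 fails its receive at 26.734, the same instant the wait gives up, which is the case `model_shutdown(21734, 21734, 5000)` flags as a timeout. The `select()` start times are inferred from the log, not observed directly.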
However, as I am interested in speeding up tests, I would also like to
avoid the timeout on the worker threads altogether if possible, as
well as avoiding the `ZOMBIE` state.
As per the assertion failure message, is there something I am doing
wrong here when cleaning up the ORB on Windows, or is this a bug in
omniORB itself?
Thanks for your help,
Thomas Ives
[1] I have attached a small C++ program (sock_timeout.cpp) that
approximates the situation I find when restarting Tango and demonstrates
the difference in socket behavior on Windows and Linux.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sock_timeout.cpp
Type: text/x-c++src
Size: 4636 bytes
Desc: not available
URL: <https://www.omniorb-support.com/pipermail/omniorb-list/attachments/20260120/84617890/attachment.bin>