[omniORB] Windows timeout during ORB shutdown
Duncan Grisby
duncan at grisby.org
Fri Jan 30 16:58:08 UTC 2026
On Mon, 2026-01-26 at 21:48 +0000, Thomas Ives via omniORB-list wrote:
> I have managed to reproduce the issue by modifying the echo/eg1.cc
> example.
> This test reliably reproduces the issue for me (whereas with my Tango
> test I occasionally do not see the problem). I have attached the
> modified eg1.cc program and a log of the output it produces
> (eg1.out).
> As far as I can tell, the assertion failure is due to a bug in
> omniORB and not the program.
Thank you for spending the time to investigate that and provide a
simple reproducer.
> omniORB is assuming that waiting for scanGranularity seconds is
> enough time for all the dedicated threads to finish, however, on
> Windows each of those threads can wait for up to scanGranularity
> seconds itself, so if you have many threads the main thread times out
> waiting.
Yes, that is indeed what is happening. It is a race condition between
the timeout of the thread trying to stop everything and the threads
that are blocked waiting for incoming calls. The mistake is to use the
same scanGranularity value for both.
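To make the race concrete, here is a minimal model in standard C++ threads. It is a hypothetical sketch, not omniORB code: the WorkerPool class, its shutdown() method and the granularity parameter are illustrative names. Each worker stands in for a thread blocked in select() on Windows. Closing the socket does not wake it, so it only re-checks the stop flag after a full granularity interval, modelled here with std::this_thread::sleep_for. The shutting-down thread waits on a condition variable with its own timeout. Where omniORB hit an assertion failure when that timeout expired, this model simply reports false.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical model of the shutdown race (not omniORB's implementation).
class WorkerPool {
public:
    WorkerPool(int nWorkers, std::chrono::milliseconds granularity)
        : granularity_(granularity), running_(nWorkers) {
        assert(nWorkers > 0);
        for (int i = 0; i < nWorkers; ++i)
            threads_.emplace_back([this] { workerLoop(); });
        // Wait until every worker is "blocked" before returning.
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return started_ == nWorkers; });
    }

    ~WorkerPool() {
        stop_.store(true);
        for (auto& t : threads_)
            if (t.joinable()) t.join();
    }

    // Request shutdown and wait at most `timeout` for all workers to exit.
    // Returns true only if every worker exited within the timeout.
    bool shutdown(std::chrono::milliseconds timeout) {
        stop_.store(true);
        bool allStopped;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            allStopped = cv_.wait_for(lock, timeout,
                                      [this] { return running_ == 0; });
        }
        // Join regardless, so no std::thread is destroyed while joinable.
        for (auto& t : threads_)
            if (t.joinable()) t.join();
        return allStopped;
    }

private:
    void workerLoop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++started_;
        }
        cv_.notify_all();
        // Stand-in for select() with a timeout: nothing interrupts it, so
        // the stop flag is only seen after a full granularity interval.
        do {
            std::this_thread::sleep_for(granularity_);
        } while (!stop_.load());
        {
            std::lock_guard<std::mutex> lock(mutex_);
            --running_;
        }
        cv_.notify_all();
    }

    std::chrono::milliseconds granularity_;
    std::atomic<bool> stop_{false};
    std::mutex mutex_;
    std::condition_variable cv_;
    int started_ = 0;
    int running_;
    std::vector<std::thread> threads_;
};
```

In this model, a shutdown timeout far below the granularity reports failure even though every worker does eventually exit, while a timeout comfortably above the granularity succeeds. That matches the direction of the fix proposed here: make the shutdown timeout longer than scanGranularity rather than equal to it.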
> As you say, Windows makes life much harder than Linux (and Unix
> platforms) because shutting down a socket does not wake up threads
> waiting on it, hence all the issues with timeouts.
>
> I have done some digging into the winsock2 documentation and it
> appears that we can wake up the selecting thread using Win32's
> QueueUserAPC function. The winsock2 documentation for select[1] says
> the following in a note towards the end of the remarks:
That is interesting, but I think it is too fragile to use here. In your
test, all the connection worker threads are blocked in select, but in
general there is no way to know if the worker threads are in select at
the time, or whether they are busy doing some other work that might be
damaged by the asynchronous call. As you say in your code comment,
there are also situations in which there are multiple threads handling
a single connection, which would make it much more complex.
I think a much safer fix is simply to make sure that the shutdown
timeout is longer than the scanGranularity, to give all the threads
time to stop if they are idle. Obviously that does not make it
completely impossible for the situation to occur, but I think it will
always work in practice.
Real servers, as opposed to test code, are not expected to start and
stop rapidly and repeatedly, so I don't think the significant additional
complexity needed to make shutdown faster would be worthwhile.
I'm going to commit the attached small change. I suggest that for your
test code you set scanGranularity to 1. That will mean it takes a
second to stop, which is still quite long, but faster than the default
5 seconds.
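For a test program like eg1.cc, one way to apply that setting in code is through the omniORB 4 extension to ORB_init that accepts an array of option name/value pairs. This is a sketch, assuming omniORB 4 headers and libraries are available. Passing -ORBscanGranularity 1 on the command line should have the same effect. The surrounding program structure is illustrative only.

```cpp
#include <omniORB4/CORBA.h>

int main(int argc, char** argv)
{
    // omniORB-specific options, terminated by a {0, 0} pair.
    // scanGranularity is in seconds (default 5); per this thread it also
    // bounds how long idle worker threads on Windows take to notice shutdown.
    const char* options[][2] = {
        { "scanGranularity", "1" },
        { 0, 0 }
    };
    CORBA::ORB_var orb = CORBA::ORB_init(argc, argv, "omniORB4", options);

    // ... activate objects and run the test, as in eg1.cc ...

    orb->destroy();
    return 0;
}
```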
Thanks,
Duncan.
--
Duncan Grisby <duncan at grisby.org>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shutdown_timeout.patch
Type: text/x-patch
Size: 475 bytes
Desc: not available
URL: <https://www.omniorb-support.com/pipermail/omniorb-list/attachments/20260130/70a5718a/attachment.bin>