[omniORB] omniORB4: deadlock in giopServer / SocketCollection

Bastiaan Bakker Bastiaan.Bakker@lifeline.nl
Tue, 5 Feb 2002 14:01:51 +0100


Hi,

Unfortunately the EBADF patch did not solve all my stability problems:
On the positive side: the eg2_impl test now runs 5 to 10 times longer and doesn't crash.
On the negative side: instead of crashing it deadlocks and stops servicing requests.

The deadlock I analysed is caused by the following situation. We have two threads:
thread 1: a giopWorker thread (i.e. doing omni::giopWorker::execute()).
thread 2: a giopRendezvouser thread (i.e. doing omni::giopRendezvouser::execute()).

thread 1 holds lock giopServer::pd_lock, acquired in giopServer::removeConnectionAndWorker():796
thread 1 wants lock SocketCollection::pd_fdset_lock in SocketCollection::removeSocket():445
thread 2 holds lock SocketCollection::pd_fdset_lock, acquired in SocketCollection::Select():299
thread 2 wants lock giopServer::pd_lock in giopServer::notifyRzReadable():737
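
The cycle above is a classic lock-order inversion. Below is a minimal sketch of the pattern, not omniORB code: the two std::timed_mutex members are hypothetical stand-ins for pd_lock and pd_fdset_lock, and a timed try-lock replaces the blocking lock so the demo reports the problem instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for giopServer::pd_lock and
// SocketCollection::pd_fdset_lock.
struct LockPairSketch {
  std::timed_mutex server_lock;
  std::timed_mutex fdset_lock;
};

// Spin until `expected` threads have arrived at this point.
inline void arriveAndWait(std::atomic<int>& counter, int expected) {
  counter.fetch_add(1);
  while (counter.load() < expected) std::this_thread::yield();
}

// Runs two threads that take the two locks in opposite order and returns
// how many of them could not obtain their second lock.
inline int countBlockedThreads() {
  LockPairSketch locks;
  std::atomic<int> holding{0}, finished{0}, blocked{0};

  auto run = [&](std::timed_mutex& first, std::timed_mutex& second) {
    std::lock_guard<std::timed_mutex> hold(first);
    arriveAndWait(holding, 2);   // both threads now hold their first lock
    if (second.try_lock_for(std::chrono::milliseconds(50))) {
      second.unlock();
    } else {
      blocked.fetch_add(1);      // a blocking lock() would hang here
    }
    arriveAndWait(finished, 2);  // keep the first lock until both have tried
  };

  // "worker" mirrors thread 1, "rendezvouser" mirrors thread 2.
  std::thread worker(run, std::ref(locks.server_lock),
                     std::ref(locks.fdset_lock));
  std::thread rendezvouser(run, std::ref(locks.fdset_lock),
                           std::ref(locks.server_lock));
  worker.join();
  rendezvouser.join();
  return blocked.load();
}
```

Calling countBlockedThreads() reports that both threads fail to get their second lock, which is the state the two stack traces show.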

A partial stack trace for thread 1 (line numbers may differ due to addition of log statements):
#5 0xff25fc10 in omni::SocketCollection::removeSocket (this=0x2ac68, sock=7) at SocketCollection.cc:445
#6 0xff284f78 in omni::unixConnection::~unixConnection (this=0x2e288, __in_chrg=3) at ./unix/unixConnection.cc:272
#7 0xff232ff4 in omni::giopConnection::decrRefCount (this=0x2e288, forced=false) at giopEndpoint.cc:259
#8 0xff236b60 in omni::giopStrand::deleteStrandAndConnection (this=0x2e2c0, forced=false) at giopStrand.cc:265
#9 0xff23fa14 in omni::giopServer::connectionState::~connectionState (this=0x28600, __in_chrg=3) at giopServer.cc:502
#10 0xff241278 in omni::giopServer::removeConnectionAndWorker (this=0x2a9f0, w=0x2e760) at giopServer.cc:807
#11 0xff2413dc in omni::giopServer::notifyWkDone (this=0x2a9f0, w=0x2e760, exit_on_error=true) at giopServer.cc:832
#12 0xff242d84 in omni::giopWorker::execute (this=0x2e760) at giopWorker.cc:173

And for thread 2:
#8 0xff240dd8 in omni::giopServer::notifyRzReadable (this=0x2a9f0, conn=0x2da80, force_create=false) at giopServer.cc:737
#9 0xff24313c in omni::giopRendezvouser::notifyReadable (this_=0x2b278, conn=0x2da80) at giopRendezvouser.cc:56
#10 0xff28622c in omni::unixEndpoint::notifyReadable (this=0x2ac60, fd=11) at ./unix/unixEndpoint.cc:226
#11 0xff25f0d4 in omni::SocketCollection::Select (this=0x2ac68) at SocketCollection.cc:317
#12 0xff285f30 in omni::unixEndpoint::AcceptAndMonitor (this=0x2ac60, func=0xff243110 <omni::giopRendezvouser::notifyReadable(void *, omni::giopConnection *)>, cookie=0x2b278) at ./unix/unixEndpoint.cc:200
#13 0xff2431a8 in omni::giopRendezvouser::execute (this=0x2b278)

Proposed solutions:

Approach 1:
I don't see a good reason why thread 1 needs to hold lock giopServer::pd_lock at deletion of the connectionState in line 807. So we could delay deletion until the end of the method, where we temporarily unlock.
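
A minimal sketch of approach 1, assuming the connection state can be unlinked from the server's bookkeeping under the lock and destroyed afterwards. All type and member names here are hypothetical stand-ins, not the real omniORB classes, and the two bool flags exist only so the demo can check where the destructor ran.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for SocketCollection and its pd_fdset_lock.
struct FdSetSketch {
  std::mutex fdset_lock;
  std::vector<int> fds;

  void addSocket(int fd) {
    std::lock_guard<std::mutex> guard(fdset_lock);
    fds.push_back(fd);
  }
  void removeSocket(int fd) {
    std::lock_guard<std::mutex> guard(fdset_lock);
    for (auto it = fds.begin(); it != fds.end(); ++it) {
      if (*it == fd) { fds.erase(it); break; }
    }
  }
};

struct ServerSketch;

// Hypothetical stand-in for connectionState: destroying it ends up in
// removeSocket(), as in the thread 1 stack trace.
struct ConnStateSketch {
  ServerSketch& server;
  FdSetSketch& sockets;
  int fd;
  ConnStateSketch(ServerSketch& s, FdSetSketch& c, int f)
      : server(s), sockets(c), fd(f) { sockets.addSocket(fd); }
  ~ConnStateSketch();
};

// Hypothetical stand-in for giopServer and its pd_lock.
struct ServerSketch {
  std::mutex lock;
  bool in_locked_region = false;      // demo-only bookkeeping
  bool destroyed_under_lock = false;  // demo-only bookkeeping
  std::vector<std::unique_ptr<ConnStateSketch>> connections;

  void removeConnection(std::size_t index) {
    std::unique_ptr<ConnStateSketch> doomed;
    {
      std::lock_guard<std::mutex> guard(lock);
      in_locked_region = true;
      // Only unlink the state while the server lock is held.
      doomed = std::move(connections[index]);
      connections.erase(connections.begin() +
                        static_cast<std::ptrdiff_t>(index));
      in_locked_region = false;
    }
    // The server lock is released; the destructor now takes fdset_lock alone.
    doomed.reset();
  }
};

inline ConnStateSketch::~ConnStateSketch() {
  if (server.in_locked_region) server.destroyed_under_lock = true;
  sockets.removeSocket(fd);
}
```

With this ordering the worker thread never holds both locks at once, so it cannot take part in the cycle. Whether the real removeConnectionAndWorker() relies on pd_lock during deletion for some other reason is something this sketch does not address.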

Approach 2:
Alternatively, I don't see a reason why thread 2 needs to hold pd_fdset_lock while calling notifyReadable() in Select() line 317 either. Wouldn't it be better to first do the fd_set cleanup and then (after release of the lock) call notifyReadable() on all file descriptors with pending data?
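
A minimal sketch of approach 2, assuming the readable descriptors can be copied out while the lock is held and the callbacks run afterwards. The struct, its members and selectAndNotify() are hypothetical; the real Select() works on an fd_set and does more than this.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical stand-in for SocketCollection.
struct SelectSketch {
  std::mutex fdset_lock;
  std::vector<int> monitored;  // descriptors being watched
  std::vector<int> readable;   // descriptors the last select() reported

  void removeSocket(int fd) {
    std::lock_guard<std::mutex> guard(fdset_lock);
    monitored.erase(std::remove(monitored.begin(), monitored.end(), fd),
                    monitored.end());
  }

  // Collects the pending descriptors under the lock, then notifies with the
  // lock released. Returns the number of callbacks made.
  std::size_t selectAndNotify(const std::function<void(int)>& notify) {
    std::vector<int> pending;
    {
      std::lock_guard<std::mutex> guard(fdset_lock);
      pending.swap(readable);  // the "fd_set cleanup" step, under the lock
    }
    // No lock held here, so a callback may take other locks, or even call
    // removeSocket(), without deadlocking against this thread.
    for (int fd : pending) notify(fd);
    return pending.size();
  }
};
```

One consequence of this design is that a descriptor may be removed by another thread between the copy and the callback, so the callback has to tolerate a stale descriptor.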

As I'm rather new to omniORB's internals, any feedback on these suggestions is very welcome.

Cheers,

Bastiaan Bakker
LifeLine Networks bv