[omniORB] omniORB4: deadlock in giopServer / SocketCollection: patch

Bastiaan Bakker Bastiaan.Bakker@lifeline.nl
Tue, 5 Feb 2002 14:56:13 +0100


Hi,

Here's a small patch using approach 1. It appears to resolve the =
deadlock issue. However I have the bad feeling that for a proper =
solution some larger modifications are necessary.
Some comments:
I've moved the deletion to the end of the method, because I wasn't sure =
whether I could release the lock in the middle of the method.
A question for the developers: why doesn't removeConnectionAndWorker =
handle setting the lock itself, but leaves it to the caller =
(notfiyWkDone) instead? It is confusing, and together with the patch =
results in an unnecessary extra lock()/unlock().
Earlier in the removeConnectionAndWorker method clearSelectable() is =
called, surrounded by unlock() and lock(). Was this meant to avoid the =
deadlock I try to patch now?=20

Good luck,

Bastiaan Bakker
LifeLine Networks bv


> -----Original Message-----
> From: Bastiaan Bakker=20
> Sent: Tuesday, February 05, 2002 2:02 PM
> To: omniORB-list (E-mail) (E-mail)
> Subject: [omniORB] omniORB4: deadlock in giopServer / SocketCollection
>=20
>=20
> Hi,
>=20
> Unfortunately the EBADF patch did not solve all my stability=20
> problems:=20
> On the postive side: the eg2_impl test now runs 5 to 10 times=20
> longer and doesn't crash.
> On the negative side: instead of crashing it deadlocks and=20
> stops servicing requests.
>=20
> The deadlock I analysed is caused by the following situation:=20
> we have two threads
> thread 1: a giopWorker thread (i.e. doing=20
> omni::giopWorker::execute()).
> thread 2: a giopRendezvouser thread (i.e. doing=20
> omni:giopRendezvouser::execute()).
>=20
> thread 1 holds lock giopServer::pd_lock, acquired in=20
> giopServer::removeConnectionAndWorker():796=20
> thread 1 wants lock SocketCollection::pd_fdset_lock in=20
> SocketCollection::removeSocket():445
> thread 2 holds lock SocketColleciton::pd_fdset_lock, acquired=20
> in SocketCollectionSelect():299
> thread 2 wants lock giopServer::pd_lock in=20
> giopServer:notifyRzReadable():737
>=20
> A partial stack trace for thread 1 (line numbers may differ=20
> due to addition of log statements):
> #5 0xff25fc10 in omni::SocketCollection::removeSocket=20
> (this=3D0x2ac68, sock=3D7) at SocketCollection.cc:445
> #6 0xff284f78 in omni::unixConnection::~unixConnection=20
> (this=3D0x2e288, __in_chrg=3D3) at ./unix/unixConnection.cc:272
> #7 0xff232ff4 in omni::giopConnection::decrRefCount=20
> (this=3D0x2e288, forced=3Dfalse) at giopEndpoint.cc:259
> #8 0xff236b60 in omni::giopStrand::deleteStrandAndConnection=20
> (this=3D0x2e2c0, forced=3Dfalse) at giopStrand.cc:265
> #9 0xff23fa14 in=20
> omni::giopServer::connectionState::~connectionState=20
> (this=3D0x28600, __in_chrg=3D3) at giopServer.cc:502
> #10 0xff241278 in omni::giopServer::removeConnectionAndWorker=20
> (this=3D0x2a9f0, w=3D0x2e760) at giopServer.cc:807
> #11 0xff2413dc in omni::giopServer::notifyWkDone=20
> (this=3D0x2a9f0, w=3D0x2e760, exit_on_error=3Dtrue) at =
giopServer.cc:832
> #12 0xff242d84 in omni::giopWorker::execute (this=3D0x2e760) at=20
> giopWorker.cc:173
>=20
> And for thread 2:
> #8 0xff240dd8 in omni::giopServer::notifyRzReadable=20
> (this=3D0x2a9f0, conn=3D0x2da80, force_create=3Dfalse) at =
giopServer.cc:737
> #9 0xff24313c in omni::giopRendezvouser::notifyReadable=20
> (this_=3D0x2b278, conn=3D0x2da80) at giopRendezvouser.cc:56
> #10 0xff28622c in omni::unixEndpoint::notifyReadable=20
> (this=3D0x2ac60, fd=3D11) at ./unix/unixEndpoint.cc:226
> #11 0xff25f0d4 in omni::SocketCollection::Select=20
> (this=3D0x2ac68) at SocketCollection.cc:317
> #12 0xff285f30 in omni::unixEndpoint::AcceptAndMonitor=20
> (this=3D0x2ac60, func=3D0xff243110=20
> <omni::giopRendezvouser::notifyReadable(void *,=20
> omni::giopConnection *)>, cookie=3D0x2b278) at=20
> ./unix/unixEndpoint.cc:200
> #13 0xff2431a8 in omni::giopRendezvouser::execute (this=3D0x2b278)
>=20
> Prosed solutions.
>=20
> Approach 1:
> I don't see a good reason why thread 1 needs to hold lock=20
> giopServer::pd_lock at deletion of the ConnectionState in=20
> line 807. So we could delay deletion until the end of the=20
> method, where we temporarily unlock.
>=20
> Approach 2:
> Alternatively, I don't see a reason why thread 2 needs to=20
> hold pd_fdset_lock while calling notifyReadable() in Select()=20
> line 317 either. Wouldn't it be better to first do the fd_set=20
> cleanup and then (after release of the lock) call=20
> notifyReadable() on all file descriptors with pending data?
>=20
> As I'm rather new to omniORBs internals any feedback on these=20
> suggestions is very welcome.
>=20
> Cheers,
>=20
> Bastiaan Bakker
> LifeLine Networks bv
> =20
>=20
>=20