[omniORB] deadlock

Michael Teske subscribe at teskor.de
Fri Sep 16 08:48:43 BST 2016


Hi,

we recently experienced a deadlock in omniORB 4.2.1. Below are the stack traces of the
two relevant threads:


Thread 14 (Thread 0x7f4c77bfd700 (LWP 157649)):
#0  0x000000326220e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003262209508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00000032622093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f4c8dd4d29a in lock (this=0x14c57b0) at ../../../../include/omnithread.h:257
#4  omni_mutex_lock (m=..., this=<synthetic pointer>) at ../../../../include/omnithread.h:301
#5  omni::omniOrbPOA::activate_object (this=0x14c7140, p_servant=0x7f4c68005840) at poa.cc:910
#6  0x00000000004227fd in marketData::Server_impl::createSession (this=0x14d7350, LoginSessionId=200, cb=0x7f4c680050c0) at /home/build/Builds/Trader-build1010/Trader/src/MarketDataServer/MarketData_impl.cpp:1218
#7  0x00000000004c8382 in _0RL_lcfn_3eb5b2c318c3678b_80000000 (cd=0x7f4c77bfca30, svnt=<optimized out>) at libidl/MarketData_skel.cpp:966
#8  0x00007f4c8dd5e0b1 in doLocalCall (servant=0x14d7898, this=0x7f4c77bfca30) at ../../../../include/omniORB4/callDescriptor.h:145
#9  omniCallHandle::upcall (this=this at entry=0x7f4c77bfcbd0, servant=0x14d7898, desc=...) at callHandle.cc:123
#10 0x00000000004c868c in marketData::_impl_Server::_dispatch (this=0x14d7350, _handle=...) at libidl/MarketData_skel.cpp:1127
#11 0x00007f4c8dd568bd in omni::omniOrbPOA::dispatch (this=<optimized out>, handle=..., id=0x14d8080) at poa.cc:1712
#12 0x00007f4c8dd30298 in omniLocalIdentity::dispatch (this=0x14d8080, handle=...) at localIdentity.cc:182
#13 0x00007f4c8dd7dafa in omni::GIOP_S::handleRequest (this=this at entry=0x7f4c680008c0) at GIOP_S.cc:245
#14 0x00007f4c8dd7eb18 in omni::GIOP_S::dispatcher (this=0x7f4c680008c0) at GIOP_S.cc:155
#15 0x00007f4c8dd7ac25 in omni::giopWorker::execute (this=0x7f4c78001030) at giopWorker.cc:79
#16 0x00007f4c8dd21b56 in omniAsyncWorker::real_run (this=0x7f4c78001090) at invoker.cc:580
#17 0x00007f4c8dd22d2f in omniAsyncPoolServer::workerRun (this=<optimized out>, worker=<optimized out>) at invoker.cc:330
#18 0x00007f4c8dd2179c in omniAsyncWorker::mid_run (this=0x7f4c78001090) at invoker.cc:513
#19 0x00007f4c8dd22cca in omniAsyncWorker::run (this=0x7f4c78001090) at invoker.cc:128
#20 0x00007f4c8e0245e3 in omni_thread_wrapper (ptr=0x7f4c78001090) at posix.cc:457
#21 0x00000032622079d1 in start_thread () from /lib64/libpthread.so.0
#22 0x0000003261ee89dd in clone () from /lib64/libc.so.6

(gdb) frame 3
(gdb) p *this
$2 = {posix_mutex = {__data = {__lock = 2, __count = 0, __owner = 157985, __nusers = 14, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000!i\002\000\016", '\000' <repeats 26 times>, __align = 2}}

From omniOrbPOA::activate_object (poa.cc:910 in the trace above):
  omni_tracedmutex_lock sync(pd_lock);
  CHECK_NOT_DYING();
  omni_tracedmutex_lock sync2(*omni::internalLock);

Here we hold pd_lock and want *omni::internalLock.



Thread 96 (Thread 0x7f4bf03e1700 (LWP 157985)):
#0  0x000000326220e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003262209508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00000032622093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f4c8dd4e7ea in lock (this=0x14c7248) at ../../../../include/omnithread.h:257
#4  omni::omniOrbPOA::synchronise_request (this=this at entry=0x14c7140, lid=lid at entry=0x14d8080) at poa.cc:2906
#5  0x00007f4c8dd567db in omni::omniOrbPOA::dispatch (this=0x14c7140, handle=..., id=0x14d8080) at poa.cc:1671
#6  0x00007f4c8dd30298 in omniLocalIdentity::dispatch (this=0x14d8080, handle=...) at localIdentity.cc:182
#7  0x00007f4c8dd7dafa in omni::GIOP_S::handleRequest (this=this at entry=0x7f4b780008c0) at GIOP_S.cc:245
#8  0x00007f4c8dd7eb18 in omni::GIOP_S::dispatcher (this=0x7f4b780008c0) at GIOP_S.cc:155
#9  0x00007f4c8dd7ac25 in omni::giopWorker::execute (this=0x7f4c78013730) at giopWorker.cc:79
#10 0x00007f4c8dd21b56 in omniAsyncWorker::real_run (this=0x7f4c78013790) at invoker.cc:580
#11 0x00007f4c8dd22d2f in omniAsyncPoolServer::workerRun (this=<optimized out>, worker=<optimized out>) at invoker.cc:330
#12 0x00007f4c8dd2179c in omniAsyncWorker::mid_run (this=0x7f4c78013790) at invoker.cc:513
#13 0x00007f4c8dd22cca in omniAsyncWorker::run (this=0x7f4c78013790) at invoker.cc:128
#14 0x00007f4c8e0245e3 in omni_thread_wrapper (ptr=0x7f4c78013790) at posix.cc:457
#15 0x00000032622079d1 in start_thread () from /lib64/libpthread.so.0
#16 0x0000003261ee89dd in clone () from /lib64/libc.so.6

(gdb) frame 3
(gdb) p *this
$1 = {posix_mutex = {__data = {__lock = 2, __count = 0, __owner = 157649, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = "\002\000\000\000\000\000\000\000\321g\002\000\001", '\000' <repeats 26 times>, __align = 2}}

Here we hold *omni::internalLock and want pd_lock.
(gdb) up
(gdb) p pd_rq_state
$5 = 1 (i.e. PortableServer::POAManager::ACTIVE), so internalLock is not released.


This leads to Thread 14 holding pd_lock and wanting *omni::internalLock,
          and Thread 96 holding *omni::internalLock and wanting pd_lock.

Mutexes should always be locked in the same order.
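
For illustration, here is a minimal standalone sketch of the same inversion using
plain std::mutex (not omniORB code); thread_a plays the role of the activate_object
path and thread_b the synchronise_request path, and with the sleeps in place the
program reliably deadlocks:

  // deadlock_sketch.cc - generic illustration of the lock-order inversion,
  // not omniORB code. Compile with: g++ -std=c++11 -pthread deadlock_sketch.cc
  #include <chrono>
  #include <mutex>
  #include <thread>

  std::mutex poa_lock;      // stands in for the POA's pd_lock
  std::mutex internal_lock; // stands in for *omni::internalLock

  void thread_a()           // activate_object path: pd_lock, then internalLock
  {
    std::lock_guard<std::mutex> first(poa_lock);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::lock_guard<std::mutex> second(internal_lock);   // blocks forever
  }

  void thread_b()           // synchronise_request path: internalLock, then pd_lock
  {
    std::lock_guard<std::mutex> first(internal_lock);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::lock_guard<std::mutex> second(poa_lock);        // blocks forever
  }

  int main()
  {
    std::thread a(thread_a), b(thread_b);
    a.join();               // never returns once both threads are blocked
    b.join();
  }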

A possible fix could be for omniOrbPOA::synchronise_request to always release
internalLock first and re-acquire it after taking pd_lock, but I am not sure what the
consequences of releasing internalLock for a short time would be.
Perhaps all of these locks should be audited for a consistent locking order?
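
To make the suggestion concrete, here is a rough sketch of the ordering change I have
in mind, again with plain std::mutex rather than the real omni_tracedmutex classes;
the names and the state re-check are only illustrative, and the real
synchronise_request would have to re-validate pd_rq_state after re-acquiring
internalLock, since the POAManager state may have changed while the lock was released:

  // ordering_fix_sketch.cc - generic pattern only, not a patch against poa.cc.
  #include <mutex>

  std::mutex poa_lock;      // stands in for pd_lock
  std::mutex internal_lock; // stands in for *omni::internalLock
  int rq_state = 1;         // stands in for pd_rq_state, 1 == ACTIVE here

  // Called with internal_lock already held, as synchronise_request is.
  void synchronise_request_like()
  {
    // Release the global lock before taking the POA lock, so both call
    // paths end up acquiring in the same order: pd_lock before internalLock.
    internal_lock.unlock();
    poa_lock.lock();
    internal_lock.lock();

    // Re-check anything that was decided under the global lock before it
    // was released, e.g. whether the POAManager is still ACTIVE.
    if (rq_state != 1) {
      // would have to handle HOLDING / DISCARDING / INACTIVE again here
    }

    // ... dispatch work that needs both locks ...

    internal_lock.unlock();
    poa_lock.unlock();
  }

  int main()
  {
    internal_lock.lock();   // caller enters with the global lock held
    synchronise_request_like();
  }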

Kind regards,
  Michael



