[omniORB] deadlock in omniORB

Duncan Grisby duncan at grisby.org
Wed Jul 7 13:02:32 UTC 2021


Yes, the error you have found was fixed in 4.2.3.

The other thing identified by helgrind is a false positive. The
connection objects in question are only owned by one thread at a time,
so it is perfectly fine that different threads access their non-
blocking state at different times while holding no locks. Only the
owning thread touches that data, and changes of ownership are properly
handled with suitable locks.
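
As a rough illustration of that pattern (a minimal sketch with
hypothetical names, not the actual omniORB connection classes):

  #include <omnithread.h>

  class Connection {
  public:
      Connection() : pd_nonblocking(0), pd_owner(0) {}

      // Called only by the thread that currently owns the connection,
      // so the flag itself needs no lock. Helgrind cannot see this
      // convention, hence the report.
      void setNonBlocking() { pd_nonblocking = 1; }
      void setBlocking()    { pd_nonblocking = 0; }

      // Ownership hand-over is serialised by a mutex; acquiring and
      // releasing it also orders the old owner's accesses to the flag
      // before the new owner's.
      void transferOwnership(omni_thread* new_owner)
      {
          omni_mutex_lock sync(pd_lock);
          pd_owner = new_owner;
      }

  private:
      int          pd_nonblocking;  // touched only by the owning thread
      omni_thread* pd_owner;        // protected by pd_lock
      omni_mutex   pd_lock;
  };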

Duncan.


On Wed, 2021-07-07 at 14:55 +0200, Damijan Skvarc via omniORB-list
wrote:
> Hi guys,
> 
> I have continued to work on the deadlock issue by compiling the
> omniORB library from source, downloaded from 
> http://sourceforge.net/projects/omniorb/files/omniORB/omniORB-4.2.2/omniORB-4.2.2.tar.bz2
> .
> 
> After compiling it with debug information and optimization disabled
> (./configure CXXFLAGS="-g -O0"), and after running into a deadlock
> situation again, I was able to get some more information about the
> cause of the deadlock. I believe the problem is that the same thread
> is trying to lock the same mutex twice. The following gdb backtrace
> depicts the problematic thread:
> 
> (gdb) bt
> #0  __lll_lock_wait () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x0000000004e4e025 in __GI___pthread_mutex_lock (mutex=0x66646f0)
> at ../nptl/pthread_mutex_lock.c:80
> #2  0x0000000004c35fd7 in ?? () from
> /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so
> #3  0x000000000085357d in omni_mutex::lock (this=0x66646f0) at
> /usr/include/omnithread.h:255
> #4  0x0000000000a212e4 in omni_mutex_lock::omni_mutex_lock
> (this=0xcc28af0, m=...) at ../../../../include/omnithread.h:299
> #5  0x0000000000b0c009 in omni::giopStream::errorOnReceive
> (this=0x67cbea8, rc=0, filename=0xbcd661 "giopStream.cc", lineno=404,
> buf=0x0, heldlock=true, 
>     message=0xbcd718 "Timed out sleeping on read lock (waiting for
> receiver thread)") at giopStream.cc:718
> #6  0x0000000000b0b3b6 in omni::giopStream::sleepOnRdLockAlways
> (this=0x67cbea8) at giopStream.cc:404
> #7  0x0000000000b27e9b in omni::giopImpl12::inputReplyBegin
> (g=0x67cbea8, unmarshalHeader=0xb28a6e
> <omni::giopImpl12::unmarshalLocateReply(omni::giopStream*)>) at
> giopImpl12.cc:557
> #8  0x0000000000b28194 in omni::giopImpl12::inputMessageBegin
> (g=0x67cbea8, unmarshalHeader=0xb28a6e
> <omni::giopImpl12::unmarshalLocateReply(omni::giopStream*)>) at
> giopImpl12.cc:626
> #9  0x0000000000b152bb in omni::GIOP_C::IssueLocateRequest
> (this=0x67cbea0) at GIOP_C.cc:167
> #10 0x0000000000af0f9e in omniRemoteIdentity::locateRequest
> (this=0xa350a00, call_desc=...) at remoteIdentity.cc:206
> #11 0x0000000000af076a in omniRemoteIdentity::dispatch
> (this=0xa350a00, call_desc=...) at remoteIdentity.cc:82
> #12 0x0000000000ad2465 in omniObjRef::_invoke (this=0xa350aa0,
> call_desc=..., do_assert=false) at omniObjRef.cc:675
> ......
> 
> In omni::giopImpl12::inputReplyBegin(), omniTransportLock is locked
> (line 538), and later on, in the omni::giopStream::errorOnReceive()
> function (line 718), the same mutex is locked again.
> 
> After this finding I checked the code of the 4.2.3 version and found
> that the problem has already been fixed: errorOnReceive() was modified
> to lock the mutex conditionally rather than unconditionally.
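> 
> To make the shape of the bug concrete, here is a minimal sketch of
> the pattern and of the 4.2.3-style fix (simplified, with a
> hypothetical helper class; not the verbatim giopStream code):
> 
>   #include <omnithread.h>
> 
>   static omni_mutex transport_lock;   // stands in for omniTransportLock
> 
>   // Lock only if the caller does not already hold the mutex.
>   class OptionalLock {
>   public:
>       OptionalLock(omni_mutex& m, bool already_held)
>           : pd_m(m), pd_locked(!already_held)
>       { if (pd_locked) pd_m.lock(); }
>       ~OptionalLock() { if (pd_locked) pd_m.unlock(); }
>   private:
>       omni_mutex& pd_m;
>       bool        pd_locked;
>   };
> 
>   // 4.2.3 shape: the lock is taken only when the caller does not
>   // already hold it. In 4.2.2 it was taken unconditionally, which
>   // self-deadlocked when called from a path that already held the
>   // lock, because omni_mutex is not recursive.
>   void errorOnReceive(bool heldlock)
>   {
>       OptionalLock sync(transport_lock, heldlock);
>       // ... mark the strand as dying, raise CommFailure ...
>   }
> 
>   void inputReplyBegin()
>   {
>       omni_mutex_lock sync(transport_lock);  // lock taken here (cf. line 538)
>       // ... sleepOnRdLockAlways() times out waiting for the read lock
>       // and, still holding the lock, calls:
>       errorOnReceive(/* heldlock = */ true); // 4.2.2 locked again here
>   }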
> 
> I believe this thread can be closed; with the 4.2.3 version our
> deadlock issue should be resolved.
> 
> 
> Thanks,
> Damijan
> 
> On Mon, Jul 5, 2021 at 3:09 PM Damijan Skvarc
> <damjan.skvarc at gmail.com> wrote:
> > Hi.
> > 
> > With the omniORB 4.2.2 version (delivered by Ubuntu 18.04) we have
> > started facing a deadlock issue. On one occasion I was able to
> > attach gdb to the process in which such a deadlock had happened and
> > found the following situation. There were 21 threads, one of which
> > was stuck in:
> > #0 __lll_lock_wait () at
> > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> > #1 0x00007f691f213025 in __GI___pthread_mutex_lock
> > (mutex=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
> > #2 0x00007f691fc1d6c0 in omni::giopStream::errorOnReceive(int, char
> > const*, unsigned int, omni::giopStream_Buffer*, bool, char const*)
> > () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> > #3 0x00007f691fc1da4b in omni::giopStream::sleepOnRdLockAlways() ()
> > from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> > #4 0x00007f691fc369ac in
> > omni::giopImpl12::inputReplyBegin(omni::giopStream*, void
> > (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-
> > gnu/libomniORB4.so.2
> > #5 0x00007f691fc37c17 in
> > omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
> > (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-
> > gnu/libomniORB4.so.2
> > #6 0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from
> > /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> > #7 0x00007f691fc047e8 in
> > omniRemoteIdentity::dispatch(omniCallDescriptor&) () from
> > /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> > 
> > while the other 20 threads were stuck in:
> > 
> > #0 __lll_lock_wait () at
> > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> > #1 0x00007f691f214fb3 in __pthread_mutex_cond_lock
> > (mutex=mutex at entry=0x55c31d6f4e00) at
> > ../nptl/pthread_mutex_lock.c:80
> > #2 0x00007f691f216e7d in __pthread_cond_wait_common
> > (abstime=0x7f68c5613eb0, mutex=0x55c31d6f4e00, cond=0x55c31d82b9d8)
> > at pthread_cond_wait.c:645
> > #3 __pthread_cond_timedwait (cond=0x55c31d82b9d8,
> > mutex=0x55c31d6f4e00, abstime=0x7f68c5613eb0) at
> > pthread_cond_wait.c:667
> > #4 0x00007f691f90ca84 in omni_condition::timedwait(unsigned long,
> > unsigned long) () from /usr/lib/x86_64-linux-gnu/libomnithread.so.4
> > #5 0x00007f691fc1d9f1 in omni::giopStream::sleepOnRdLockAlways() ()
> > from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> > #6 0x00007f691fc369ac in
> > omni::giopImpl12::inputReplyBegin(omni::giopStream*, void
> > (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-
> > gnu/libomniORB4.so.2
> > #7 0x00007f691fc37c17 in
> > omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
> > (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-
> > gnu/libomniORB4.so.2
> > #8 0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from
> > /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> > #9 0x00007f691fc047e8 in
> > omniRemoteIdentity::dispatch(omniCallDescriptor&) () from
> > /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> > #10 0x00007f691fbe7b3c in omniObjRef::_invoke(omniCallDescriptor&,
> > bool) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> > 
> > By inspecting the pthread mutex at address 0x55c31d6f4e00 I found
> > that the first thread already owned this mutex, yet the call stack
> > shows the thread trying to lock it again. It looks as if the thread
> > for some reason "forgot" to unlock it in one of the previous steps.
> > 
> > Since I couldn't get any additional data, I started to instrument
> > the code with helgrind. Unfortunately, due to different timing
> > conditions, I was not able to reproduce the same problem again. I
> > have instrumented all versions from 4.2.0 up to 4.2.4, and in all of
> > them I found another race condition, which happened in the following
> > case:
> > 
> > ==9765== Possible data race during read of size 1 at 0x67C9275 by
> > thread #1
> > ==9765== Locks held: none
> > ==9765==    at 0xB5ACBA: omni::SocketHolder::setNonBlocking()
> > (SocketCollection.h:108)
> > ==9765==    by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned
> > long, omni_time_t const&) (tcpConnection.cc:71)
> > ==9765==    by 0xB16D19:
> > omni::giopStream::sendChunk(omni::giopStream_Buffer*)
> > (giopStream.cc:1129)
> > ==9765==    by 0xB3391A:
> > omni::giopImpl12::outputMessageEnd(omni::giopStream*)
> > (giopImpl12.cc:1320)
> > ==9765==    by 0xB1E881: omni::GIOP_C::InitialiseRequest()
> > (GIOP_C.cc:109)
> > ==9765==    by 0xAF9363:
> > omniRemoteIdentity::dispatch(omniCallDescriptor&)
> > (remoteIdentity.cc:97)
> > ==9765==    by 0xADAF14: omniObjRef::_invoke(omniCallDescriptor&,
> > bool) (omniObjRef.cc:675)
> > ==9765==    by 0x873DCD:
> > i_NET::_objref_RemoteNode::GetFlags(i_NET::NodeNames const&)
> > (i_remote_node.cc:1418)
> > ==9765==    by 0x82B49C:
> > mci::RemoteNode::OnGetFlags(std::vector<std::__cxx11::basic_string<
> > char, std::char_traits<char>, std::allocator<char> >,
> > std::allocator<std::__cxx11::basic_string<char, std::char_
> > traits<char>, std::allocator<char> > > > const&) const
> > (remote_node.cpp:333)
> > ==9765==    by 0x822E00:
> > mci::TreeNode::GetFlags(std::vector<std::__cxx11::basic_string<char
> > , std::char_traits<char>, std::allocator<char> >,
> > std::allocator<std::__cxx11::basic_string<char, std::char_trai
> > ts<char>, std::allocator<char> > > > const&) const
> > (tree_node.cpp:94)
> > ==9765==    by 0x7A6190: mci::Node::GetFlags() const (node.cpp:312)
> > ==9765==    by 0x68C0B2: MciProxy::Connect(mci::Root, unsigned int)
> > (mciProxy.cpp:56)
> > ==9765== 
> > ==9765== This conflicts with a previous write of size 1 by thread
> > #8
> > ==9765== Locks held: none
> > ==9765==    at 0xB5ACA2: omni::SocketHolder::setBlocking()
> > (SocketCollection.h:101)
> > ==9765==    by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned
> > long, omni_time_t const&) (tcpConnection.cc:155)
> > ==9765==    by 0xB15CA6: omni::giopStream::inputMessage()
> > (giopStream.cc:844)
> > ==9765==    by 0xB31149:
> > omni::giopImpl12::inputNewServerMessage(omni::giopStream*)
> > (giopImpl12.cc:438)
> > ==9765==    by 0xB319CB:
> > omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
> > (*)(omni::giopStream*)) (giopImpl12.cc:646)
> > ==9765==    by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
> > ==9765==    by 0xB1DE1E: omni::giopWorker::execute()
> > (giopWorker.cc:77)
> > ==9765==    by 0xAC43A1: omniAsyncWorker::real_run()
> > (invoker.cc:578)
> > 
> > This race condition happens between a write and a read operation;
> > however, there are also cases where the race happens between two
> > write operations, which is more problematic. For example:
> > 
> > ==9765== Possible data race during write of size 1 at 0x67C9275 by
> > thread #7
> > ==9765== Locks held: none
> > ==9765==    at 0xB5ACD7: omni::SocketHolder::setNonBlocking()
> > (SocketCollection.h:110)
> > ==9765==    by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned
> > long, omni_time_t const&) (tcpConnection.cc:71)
> > ==9765==    by 0xB10FC6:
> > omni::sendCloseConnection(omni::giopStrand*) (giopStrand.cc:121)
> > ==9765==    by 0xB12AE4: omni::Scavenger::execute()
> > (giopStrand.cc:698)
> > ==9765==    by 0xAC43A1: omniAsyncWorker::real_run()
> > (invoker.cc:578)
> > ==9765==    by 0xAC5885:
> > omniAsyncPoolGeneral::workerRun(omniAsyncWorker*) (invoker.cc:297)
> > ==9765==    by 0xAC3F45: omniAsyncWorker::mid_run()
> > (invoker.cc:511)
> > ==9765==    by 0xAC4614: omniAsyncWorkerInfo::run()
> > (invoker.cc:662)
> > ==9765==    by 0xAC55C1: omniAsyncWorker::run(void*)
> > (invoker.cc:126)
> > ==9765==    by 0xB67F33: omni_thread_wrapper (posix.cc:459)
> > ==9765==    by 0x4C38C26: ??? (in
> > /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
> > ==9765==    by 0x4E4B6DA: start_thread (pthread_create.c:463)
> > ==9765== 
> > ==9765== This conflicts with a previous write of size 1 by thread
> > #8
> > ==9765== Locks held: none
> > ==9765==    at 0xB5ACA2: omni::SocketHolder::setBlocking()
> > (SocketCollection.h:101)
> > ==9765==    by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned
> > long, omni_time_t const&) (tcpConnection.cc:155)
> > ==9765==    by 0xB15CA6: omni::giopStream::inputMessage()
> > (giopStream.cc:844)
> > ==9765==    by 0xB31149:
> > omni::giopImpl12::inputNewServerMessage(omni::giopStream*)
> > (giopImpl12.cc:438)
> > ==9765==    by 0xB319CB:
> > omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
> > (*)(omni::giopStream*)) (giopImpl12.cc:646)
> > ==9765==    by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
> > ==9765==    by 0xB1DE1E: omni::giopWorker::execute()
> > (giopWorker.cc:77)
> > ==9765==    by 0xAC43A1: omniAsyncWorker::real_run()
> > (invoker.cc:578)
> > 
> > Obviously, the tcpConnection entity is not thread safe. 
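> > 
> > The pattern helgrind is flagging looks roughly like this (a sketch
> > of the flag handling only, with hypothetical member names, not the
> > verbatim SocketCollection.h code):
> > 
> >   class SocketHolderSketch {
> >   public:
> >       SocketHolderSketch() : pd_socket(-1), pd_nonblocking(0) {}
> > 
> >       // Send() path: switch the socket to non-blocking mode.
> >       void setNonBlocking() {
> >           if (!pd_nonblocking) {      // plain read of the flag
> >               // fcntl(pd_socket, F_SETFL, O_NONBLOCK) or equivalent
> >               pd_nonblocking = 1;     // plain write, no lock held
> >           }
> >       }
> >       // Recv() path: switch it back to blocking mode.
> >       void setBlocking() {
> >           if (pd_nonblocking) {
> >               pd_nonblocking = 0;     // plain write, no lock held
> >           }
> >       }
> >   private:
> >       int pd_socket;
> >       int pd_nonblocking;
> >   };
> > 
> > Helgrind sees two threads touching pd_nonblocking with no lock held
> > and reports a race; whether that is actually unsafe depends on which
> > thread is allowed to use the connection at any given time.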
> > 
> > I don't know whether this problem leads to the aforementioned
> > deadlock issue, but I am a bit confused, since I don't know whether
> > the omniORB library is responsible for this kind of locking or
> > whether I, as a user of the omniORB library, should handle it
> > manually somehow.
> > 
> > Would you mind giving me some advice?
> > 
> > Thanks, 
> > Damijan Skvarc,
> > Senior Software Engineer
> > Instrumentation Technologies, Slovenia

-- 
-- Duncan Grisby --
-- duncan at grisby.org --
-- http://www.grisby.org --
