[omniORB] deadlock in omniORB

Damijan Skvarc damjan.skvarc at gmail.com
Wed Jul 7 13:05:14 UTC 2021


Thanks for your reply.
regards,
Damijan

On Wed, Jul 7, 2021 at 3:02 PM Duncan Grisby <duncan at grisby.org> wrote:

> Yes, the error you have found was fixed in 4.2.3.
>
> The other thing identified by helgrind is a false positive. The connection
> objects in question are only owned by one thread at a time, so it is
> perfectly fine that different threads access their non-blocking state at
> different times while holding no locks. Only the owning thread touches that
> data, and changes of ownership are properly handled with suitable locks.
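>
> As a minimal sketch of that pattern (not the actual omniORB code; the names
> here are made up for illustration): ownership is handed over under a lock,
> and only the current owner touches the per-connection flag, so the flag
> itself needs no lock:
>
>   #include <mutex>
>   #include <thread>
>
>   struct Connection {
>     bool non_blocking = false;          // touched only by the current owner
>   };
>
>   struct Strand {
>     std::mutex      ownership_lock;     // guards ownership changes only
>     std::thread::id owner;
>     Connection      conn;
>
>     void acquire() {                    // hand-off is serialised by the lock,
>       std::lock_guard<std::mutex> g(ownership_lock);
>       owner = std::this_thread::get_id();   // which also publishes earlier
>     }                                       // writes made by the old owner
>
>     void release() {
>       std::lock_guard<std::mutex> g(ownership_lock);
>       owner = std::thread::id();
>     }
>
>     void send() {                       // called only while owning the strand
>       conn.non_blocking = true;         // helgrind flags this, but it is safe
>     }
>   };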
>
> Duncan.
>
>
> On Wed, 2021-07-07 at 14:55 +0200, Damijan Skvarc via omniORB-list wrote:
>
> HI guys,
>
> I have continued to work on the deadlock issue by compiling the omniORB
> library from the sources at
> http://sourceforge.net/projects/omniorb/files/omniORB/omniORB-4.2.2/omniORB-4.2.2.tar.bz2
>
> After compiling it with debug information and optimization disabled
> (./configure CXXFLAGS="-g -O0"), and after running into a deadlock
> situation again, I was able to get some more information about the reason
> for the deadlock. I believe the problem is that the same thread is trying
> to lock the same mutex again. The following gdb backtrace depicts the
> problematic thread:
>
> (gdb) bt
> #0  __lll_lock_wait () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x0000000004e4e025 in __GI___pthread_mutex_lock (mutex=0x66646f0) at
> ../nptl/pthread_mutex_lock.c:80
> #2  0x0000000004c35fd7 in ?? () from
> /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so
> #3  0x000000000085357d in omni_mutex::lock (this=0x66646f0) at
> /usr/include/omnithread.h:255
> #4  0x0000000000a212e4 in omni_mutex_lock::omni_mutex_lock
> (this=0xcc28af0, m=...) at ../../../../include/omnithread.h:299
> #5  0x0000000000b0c009 in omni::giopStream::errorOnReceive
> (this=0x67cbea8, rc=0, filename=0xbcd661 "giopStream.cc", lineno=404,
> buf=0x0, heldlock=true,
>     message=0xbcd718 "Timed out sleeping on read lock (waiting for
> receiver thread)") at giopStream.cc:718
> #6  0x0000000000b0b3b6 in omni::giopStream::sleepOnRdLockAlways
> (this=0x67cbea8) at giopStream.cc:404
> #7  0x0000000000b27e9b in omni::giopImpl12::inputReplyBegin (g=0x67cbea8,
> unmarshalHeader=0xb28a6e
> <omni::giopImpl12::unmarshalLocateReply(omni::giopStream*)>) at
> giopImpl12.cc:557
> #8  0x0000000000b28194 in omni::giopImpl12::inputMessageBegin
> (g=0x67cbea8, unmarshalHeader=0xb28a6e
> <omni::giopImpl12::unmarshalLocateReply(omni::giopStream*)>) at
> giopImpl12.cc:626
> #9  0x0000000000b152bb in omni::GIOP_C::IssueLocateRequest
> (this=0x67cbea0) at GIOP_C.cc:167
> #10 0x0000000000af0f9e in omniRemoteIdentity::locateRequest
> (this=0xa350a00, call_desc=...) at remoteIdentity.cc:206
> #11 0x0000000000af076a in omniRemoteIdentity::dispatch (this=0xa350a00,
> call_desc=...) at remoteIdentity.cc:82
> #12 0x0000000000ad2465 in omniObjRef::_invoke (this=0xa350aa0,
> call_desc=..., do_assert=false) at omniObjRef.cc:675
> ......
>
> In omni::giopImpl12::inputReplyBegin() omniTransportLock is locked (line
> 538), and later on, in the omni::giopStream::errorOnReceive() function
> (line 718), the same thread attempts to lock the same mutex again.
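>
> The pattern is easy to reproduce in isolation (a minimal sketch, unrelated
> to the omniORB sources): a default, non-recursive pthread mutex blocks
> forever when the thread that already owns it locks it a second time, which
> is exactly the __lll_lock_wait() frame shown above:
>
>   #include <pthread.h>
>
>   static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
>
>   int main()
>   {
>     pthread_mutex_lock(&mtx);   // first lock succeeds
>     pthread_mutex_lock(&mtx);   // same thread again: hangs in __lll_lock_wait()
>     return 0;                   // never reached
>   }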
>
> After this finding I checked the code of the 4.2.3 version and found that
> the problem has already been fixed by modifying the errorOnReceive()
> function so that it takes the mutex only when needed, rather than
> unconditionally.
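>
> I have not copied the actual 4.2.3 change here, but the shape of the fix is
> roughly the following (a hypothetical sketch using std::mutex instead of the
> omnithread types; errorOnReceive() already receives a heldlock flag, as can
> be seen in frame #5 of the backtrace above):
>
>   #include <mutex>
>
>   std::mutex transport_lock;            // stands in for omniTransportLock
>
>   void error_on_receive(bool heldlock)  // hypothetical, simplified signature
>   {
>     if (heldlock) {
>       // The caller (here sleepOnRdLockAlways) already holds the lock;
>       // locking it again from the same thread would deadlock.
>       /* handle the error */
>     }
>     else {
>       std::lock_guard<std::mutex> guard(transport_lock);
>       /* handle the error */
>     }
>   }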
>
> I believe this thread can be closed; with the 4.2.3 version our deadlock
> issue should be resolved.
>
>
> Thanks,
> Damijan
>
> On Mon, Jul 5, 2021 at 3:09 PM Damijan Skvarc <damjan.skvarc at gmail.com>
> wrote:
>
> Hi.
>
> With the omniORB 4.2.2 version (delivered with Ubuntu 18.04) we have
> started facing a deadlock issue. Once I was able to attach gdb to a
> process where such a deadlock had happened and found the following
> situation. There were 21 threads, one of which was found in:
>
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f691f213025 in __GI___pthread_mutex_lock (mutex=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f691fc1d6c0 in omni::giopStream::errorOnReceive(int, char const*, unsigned int, omni::giopStream_Buffer*, bool, char const*) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #3  0x00007f691fc1da4b in omni::giopStream::sleepOnRdLockAlways() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #4  0x00007f691fc369ac in omni::giopImpl12::inputReplyBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #5  0x00007f691fc37c17 in omni::giopImpl12::inputMessageBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #6  0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #7  0x00007f691fc047e8 in omniRemoteIdentity::dispatch(omniCallDescriptor&) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
>
> while the other 20 threads were stuck in
>
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f691f214fb3 in __pthread_mutex_cond_lock (mutex=mutex at entry=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f691f216e7d in __pthread_cond_wait_common (abstime=0x7f68c5613eb0, mutex=0x55c31d6f4e00, cond=0x55c31d82b9d8) at pthread_cond_wait.c:645
> #3  __pthread_cond_timedwait (cond=0x55c31d82b9d8, mutex=0x55c31d6f4e00, abstime=0x7f68c5613eb0) at pthread_cond_wait.c:667
> #4  0x00007f691f90ca84 in omni_condition::timedwait(unsigned long, unsigned long) () from /usr/lib/x86_64-linux-gnu/libomnithread.so.4
> #5  0x00007f691fc1d9f1 in omni::giopStream::sleepOnRdLockAlways() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #6  0x00007f691fc369ac in omni::giopImpl12::inputReplyBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #7  0x00007f691fc37c17 in omni::giopImpl12::inputMessageBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #8  0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #9  0x00007f691fc047e8 in omniRemoteIdentity::dispatch(omniCallDescriptor&) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #10 0x00007f691fbe7b3c in omniObjRef::_invoke(omniCallDescriptor&, bool) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
>
> By inspecting the pthread mutex at address 0x55c31d6f4e00 I found that the
> first thread already owned this mutex, yet its call stack shows it is
> trying to lock it again. It looks like the thread for some reason "forgot"
> to unlock it in one of the previous steps.
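>
> For reference, the owner can be read out in gdb with glibc's NPTL (assuming
> the pthread_mutex_t definition is available in the debug info): the
> __data.__owner field holds the kernel LWP id of the thread that currently
> holds the mutex, which can then be matched against the LWP ids listed by
> "info threads":
>
> (gdb) print ((pthread_mutex_t *) 0x55c31d6f4e00)->__data.__owner
> (gdb) info threads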
>
> Since I couldn't get any additional data I started to instrument the code
> with helgrind. Unfortunately, due to different timing conditions, I was not
> able to run into the same problem again. I have instrumented all versions
> from 4.2.0 up to 4.2.4, and in all of them I've found another race
> condition, which happened in the following case:
>
> Possible data race during read of size 1 at 0x67C9275 by thread #1
> ==9765== Locks held: none
> ==9765==    at 0xB5ACBA: omni::SocketHolder::setNonBlocking()
> (SocketCollection.h:108)
> ==9765==    by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned long,
> omni_time_t const&) (tcpConnection.cc:71)
> ==9765==    by 0xB16D19:
> omni::giopStream::sendChunk(omni::giopStream_Buffer*) (giopStream.cc:1129)
> ==9765==    by 0xB3391A:
> omni::giopImpl12::outputMessageEnd(omni::giopStream*) (giopImpl12.cc:1320)
> ==9765==    by 0xB1E881: omni::GIOP_C::InitialiseRequest() (GIOP_C.cc:109)
> ==9765==    by 0xAF9363: omniRemoteIdentity::dispatch(omniCallDescriptor&)
> (remoteIdentity.cc:97)
> ==9765==    by 0xADAF14: omniObjRef::_invoke(omniCallDescriptor&, bool)
> (omniObjRef.cc:675)
> ==9765==    by 0x873DCD:
> i_NET::_objref_RemoteNode::GetFlags(i_NET::NodeNames const&)
> (i_remote_node.cc:1418)
> ==9765==    by 0x82B49C:
> mci::RemoteNode::OnGetFlags(std::vector<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >,
> std::allocator<std::__cxx11::basic_string<char, std::char_
> traits<char>, std::allocator<char> > > > const&) const
> (remote_node.cpp:333)
> ==9765==    by 0x822E00:
> mci::TreeNode::GetFlags(std::vector<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >,
> std::allocator<std::__cxx11::basic_string<char, std::char_trai
> ts<char>, std::allocator<char> > > > const&) const (tree_node.cpp:94)
> ==9765==    by 0x7A6190: mci::Node::GetFlags() const (node.cpp:312)
> ==9765==    by 0x68C0B2: MciProxy::Connect(mci::Root, unsigned int)
> (mciProxy.cpp:56)
> ==9765==
> ==9765== This conflicts with a previous write of size 1 by thread #8
> ==9765== Locks held: none
> ==9765==    at 0xB5ACA2: omni::SocketHolder::setBlocking()
> (SocketCollection.h:101)
> ==9765==    by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned long,
> omni_time_t const&) (tcpConnection.cc:155)
> ==9765==    by 0xB15CA6: omni::giopStream::inputMessage()
> (giopStream.cc:844)
> ==9765==    by 0xB31149:
> omni::giopImpl12::inputNewServerMessage(omni::giopStream*)
> (giopImpl12.cc:438)
> ==9765==    by 0xB319CB:
> omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
> (*)(omni::giopStream*)) (giopImpl12.cc:646)
> ==9765==    by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
> ==9765==    by 0xB1DE1E: omni::giopWorker::execute() (giopWorker.cc:77)
> ==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)
>
> This race condition happens between a write and a read operation; however,
> there are also cases where the race happens between two write operations,
> which is more problematic. For example:
>
> ==9765== Possible data race during write of size 1 at 0x67C9275 by thread
> #7
> ==9765== Locks held: none
> ==9765==    at 0xB5ACD7: omni::SocketHolder::setNonBlocking()
> (SocketCollection.h:110)
> ==9765==    by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned long,
> omni_time_t const&) (tcpConnection.cc:71)
> ==9765==    by 0xB10FC6: omni::sendCloseConnection(omni::giopStrand*)
> (giopStrand.cc:121)
> ==9765==    by 0xB12AE4: omni::Scavenger::execute() (giopStrand.cc:698)
> ==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)
> ==9765==    by 0xAC5885: omniAsyncPoolGeneral::workerRun(omniAsyncWorker*)
> (invoker.cc:297)
> ==9765==    by 0xAC3F45: omniAsyncWorker::mid_run() (invoker.cc:511)
> ==9765==    by 0xAC4614: omniAsyncWorkerInfo::run() (invoker.cc:662)
> ==9765==    by 0xAC55C1: omniAsyncWorker::run(void*) (invoker.cc:126)
> ==9765==    by 0xB67F33: omni_thread_wrapper (posix.cc:459)
> ==9765==    by 0x4C38C26: ??? (in
> /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
> ==9765==    by 0x4E4B6DA: start_thread (pthread_create.c:463)
> ==9765==
> ==9765== This conflicts with a previous write of size 1 by thread #8
> ==9765== Locks held: none
> ==9765==    at 0xB5ACA2: omni::SocketHolder::setBlocking()
> (SocketCollection.h:101)
> ==9765==    by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned long,
> omni_time_t const&) (tcpConnection.cc:155)
> ==9765==    by 0xB15CA6: omni::giopStream::inputMessage()
> (giopStream.cc:844)
> ==9765==    by 0xB31149:
> omni::giopImpl12::inputNewServerMessage(omni::giopStream*)
> (giopImpl12.cc:438)
> ==9765==    by 0xB319CB:
> omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
> (*)(omni::giopStream*)) (giopImpl12.cc:646)
> ==9765==    by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
> ==9765==    by 0xB1DE1E: omni::giopWorker::execute() (giopWorker.cc:77)
> ==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)
>
> Obviously, the tcpConnection entity is not thread-safe.
>
> I don't know if this problem leads to the aforementioned deadlock issue;
> however, I am a bit confused, since I don't know whether the omniORB
> library is responsible for this kind of locking or whether I, as a user of
> the omniORB library, should handle it manually somehow.
>
> Would you mind giving me some advice?
>
> Thanks,
> Damijan Skvarc,
> Senior Software Engineer
> Instrumentation Technologies, Slovenia
>
> _______________________________________________
> omniORB-list mailing list
> omniORB-list at omniorb-support.com
> https://www.omniorb-support.com/mailman/listinfo/omniorb-list
>
>
> --
>
>  -- Duncan Grisby --
>   -- duncan at grisby.org --
>    -- http://www.grisby.org --
>
>
>