[omniORB] deadlock in omniORB

Damijan Skvarc damjan.skvarc at gmail.com
Wed Jul 7 12:55:07 UTC 2021


Hi guys,

I have continued to work on the deadlock issue by compiling the omniORB
library from the sources at
http://sourceforge.net/projects/omniorb/files/omniORB/omniORB-4.2.2/omniORB-4.2.2.tar.bz2.

After compiling it with debug information and with optimization disabled
(./configure CXXFLAGS="-g -O0"), and after running into the deadlock
situation again, I was able to get some more information about its cause.
I believe the problem is that the same thread tries to lock the same mutex
twice. The following gdb backtrace depicts the problematic thread:

(gdb) bt
#0  __lll_lock_wait () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x0000000004e4e025 in __GI___pthread_mutex_lock (mutex=0x66646f0) at
../nptl/pthread_mutex_lock.c:80
#2  0x0000000004c35fd7 in ?? () from
/usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so
#3  0x000000000085357d in omni_mutex::lock (this=0x66646f0) at
/usr/include/omnithread.h:255
#4  0x0000000000a212e4 in omni_mutex_lock::omni_mutex_lock (this=0xcc28af0,
m=...) at ../../../../include/omnithread.h:299
#5  0x0000000000b0c009 in omni::giopStream::errorOnReceive (this=0x67cbea8,
rc=0, filename=0xbcd661 "giopStream.cc", lineno=404, buf=0x0,
heldlock=true,
    message=0xbcd718 "Timed out sleeping on read lock (waiting for receiver
thread)") at giopStream.cc:718
#6  0x0000000000b0b3b6 in omni::giopStream::sleepOnRdLockAlways
(this=0x67cbea8) at giopStream.cc:404
#7  0x0000000000b27e9b in omni::giopImpl12::inputReplyBegin (g=0x67cbea8,
unmarshalHeader=0xb28a6e
<omni::giopImpl12::unmarshalLocateReply(omni::giopStream*)>) at
giopImpl12.cc:557
#8  0x0000000000b28194 in omni::giopImpl12::inputMessageBegin (g=0x67cbea8,
unmarshalHeader=0xb28a6e
<omni::giopImpl12::unmarshalLocateReply(omni::giopStream*)>) at
giopImpl12.cc:626
#9  0x0000000000b152bb in omni::GIOP_C::IssueLocateRequest (this=0x67cbea0)
at GIOP_C.cc:167
#10 0x0000000000af0f9e in omniRemoteIdentity::locateRequest
(this=0xa350a00, call_desc=...) at remoteIdentity.cc:206
#11 0x0000000000af076a in omniRemoteIdentity::dispatch (this=0xa350a00,
call_desc=...) at remoteIdentity.cc:82
#12 0x0000000000ad2465 in omniObjRef::_invoke (this=0xa350aa0,
call_desc=..., do_assert=false) at omniObjRef.cc:675
......

In omni::giopImpl12::inputReplyBegin() omniTransportLock is locked (line
538), and later, in omni::giopStream::errorOnReceive() (line 718), the same
thread tries to lock the same mutex again.
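
To make the failure mode concrete, here is a minimal, self-contained sketch
of that pattern, using std::mutex and made-up function names rather than the
real omniORB code:

    #include <mutex>

    std::mutex transport_lock;   // stands in for omniTransportLock (non-recursive)

    void error_on_receive()      // plays the role of giopStream::errorOnReceive()
    {
        // second lock attempt by the same thread: blocks forever
        std::lock_guard<std::mutex> g(transport_lock);
        // ... report the error ...
    }

    void input_reply_begin()     // plays the role of giopImpl12::inputReplyBegin()
    {
        std::lock_guard<std::mutex> g(transport_lock);  // first lock (cf. giopImpl12.cc:538)
        // ... the read times out waiting for the receiver thread ...
        error_on_receive();      // re-locking a non-recursive mutex from the same thread hangs
    }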

After this finding I checked the code of version 4.2.3 and found that the
problem has already been fixed: errorOnReceive() was modified so that it
locks the mutex conditionally rather than unconditionally.
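
For reference, the shape of that fix is roughly the following. This is only
a sketch of the "lock only if not already held" pattern, not the actual
4.2.3 code; heldlock mirrors the parameter visible in the backtrace above:

    #include <mutex>

    std::mutex transport_lock;              // stands in for omniTransportLock

    void error_on_receive(bool heldlock)    // caller states whether it already holds the lock
    {
        std::unique_lock<std::mutex> g(transport_lock, std::defer_lock);
        if (!heldlock)
            g.lock();                       // acquire only when the caller does not hold it
        // ... report the error while the lock is held ...
    }                                       // released here only if it was acquired here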

I believe this thread can be closed; with version 4.2.3 our deadlock
issue should be resolved.


Thanks,
Damijan

On Mon, Jul 5, 2021 at 3:09 PM Damijan Skvarc <damjan.skvarc at gmail.com>
wrote:

> Hi.
>
> With omniORB 4.2.2 (as delivered by Ubuntu 18.04) we have started
> facing a deadlock issue. On one occasion I was able to attach gdb to the
> process in which such a deadlock had happened and found the following
> situation. There were 21 threads, and one of them was stuck in:
>
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f691f213025 in __GI___pthread_mutex_lock (mutex=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f691fc1d6c0 in omni::giopStream::errorOnReceive(int, char const*, unsigned int, omni::giopStream_Buffer*, bool, char const*) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #3  0x00007f691fc1da4b in omni::giopStream::sleepOnRdLockAlways() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #4  0x00007f691fc369ac in omni::giopImpl12::inputReplyBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #5  0x00007f691fc37c17 in omni::giopImpl12::inputMessageBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #6  0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #7  0x00007f691fc047e8 in omniRemoteIdentity::dispatch(omniCallDescriptor&) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
>
> while the other 20 threads were stuck in
>
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f691f214fb3 in __pthread_mutex_cond_lock (mutex=mutex at entry=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f691f216e7d in __pthread_cond_wait_common (abstime=0x7f68c5613eb0, mutex=0x55c31d6f4e00, cond=0x55c31d82b9d8) at pthread_cond_wait.c:645
> #3  __pthread_cond_timedwait (cond=0x55c31d82b9d8, mutex=0x55c31d6f4e00, abstime=0x7f68c5613eb0) at pthread_cond_wait.c:667
> #4  0x00007f691f90ca84 in omni_condition::timedwait(unsigned long, unsigned long) () from /usr/lib/x86_64-linux-gnu/libomnithread.so.4
> #5  0x00007f691fc1d9f1 in omni::giopStream::sleepOnRdLockAlways() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #6  0x00007f691fc369ac in omni::giopImpl12::inputReplyBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #7  0x00007f691fc37c17 in omni::giopImpl12::inputMessageBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #8  0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #9  0x00007f691fc047e8 in omniRemoteIdentity::dispatch(omniCallDescriptor&) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
> #10 0x00007f691fbe7b3c in omniObjRef::_invoke(omniCallDescriptor&, bool) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
>
> By inspecting the pthread mutex at address 0x55c31d6f4e00 I found that the
> first thread already owned this mutex, yet the call stack shows the same
> thread trying to lock it again. It looks like the thread for some reason
> "forgot" to unlock it in one of the previous steps.
>
> Since I couldn't get any additional data, I started instrumenting the
> code with helgrind. Unfortunately, due to different timing conditions, I
> was not able to hit the same problem again. I have instrumented all
> versions from 4.2.0 up to 4.2.4, and in all of them I found another
> race condition, which happens in the following case:
>
> ==9765== Possible data race during read of size 1 at 0x67C9275 by thread #1
> ==9765== Locks held: none
> ==9765==    at 0xB5ACBA: omni::SocketHolder::setNonBlocking()
> (SocketCollection.h:108)
> ==9765==    by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned long,
> omni_time_t const&) (tcpConnection.cc:71)
> ==9765==    by 0xB16D19:
> omni::giopStream::sendChunk(omni::giopStream_Buffer*) (giopStream.cc:1129)
> ==9765==    by 0xB3391A:
> omni::giopImpl12::outputMessageEnd(omni::giopStream*) (giopImpl12.cc:1320)
> ==9765==    by 0xB1E881: omni::GIOP_C::InitialiseRequest() (GIOP_C.cc:109)
> ==9765==    by 0xAF9363: omniRemoteIdentity::dispatch(omniCallDescriptor&)
> (remoteIdentity.cc:97)
> ==9765==    by 0xADAF14: omniObjRef::_invoke(omniCallDescriptor&, bool)
> (omniObjRef.cc:675)
> ==9765==    by 0x873DCD:
> i_NET::_objref_RemoteNode::GetFlags(i_NET::NodeNames const&)
> (i_remote_node.cc:1418)
> ==9765==    by 0x82B49C:
> mci::RemoteNode::OnGetFlags(std::vector<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >,
> std::allocator<std::__cxx11::basic_string<char, std::char_
> traits<char>, std::allocator<char> > > > const&) const
> (remote_node.cpp:333)
> ==9765==    by 0x822E00:
> mci::TreeNode::GetFlags(std::vector<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> >,
> std::allocator<std::__cxx11::basic_string<char, std::char_trai
> ts<char>, std::allocator<char> > > > const&) const (tree_node.cpp:94)
> ==9765==    by 0x7A6190: mci::Node::GetFlags() const (node.cpp:312)
> ==9765==    by 0x68C0B2: MciProxy::Connect(mci::Root, unsigned int)
> (mciProxy.cpp:56)
> ==9765==
> ==9765== This conflicts with a previous write of size 1 by thread #8
> ==9765== Locks held: none
> ==9765==    at 0xB5ACA2: omni::SocketHolder::setBlocking()
> (SocketCollection.h:101)
> ==9765==    by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned long,
> omni_time_t const&) (tcpConnection.cc:155)
> ==9765==    by 0xB15CA6: omni::giopStream::inputMessage()
> (giopStream.cc:844)
> ==9765==    by 0xB31149:
> omni::giopImpl12::inputNewServerMessage(omni::giopStream*)
> (giopImpl12.cc:438)
> ==9765==    by 0xB319CB:
> omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
> (*)(omni::giopStream*)) (giopImpl12.cc:646)
> ==9765==    by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
> ==9765==    by 0xB1DE1E: omni::giopWorker::execute() (giopWorker.cc:77)
> ==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)
>
> This race condition happens between a write and a read operation; however,
> there are also cases where the race happens between two write operations,
> which is more problematic. For example:
>
> ==9765== Possible data race during write of size 1 at 0x67C9275 by thread
> #7
> ==9765== Locks held: none
> ==9765==    at 0xB5ACD7: omni::SocketHolder::setNonBlocking()
> (SocketCollection.h:110)
> ==9765==    by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned long,
> omni_time_t const&) (tcpConnection.cc:71)
> ==9765==    by 0xB10FC6: omni::sendCloseConnection(omni::giopStrand*)
> (giopStrand.cc:121)
> ==9765==    by 0xB12AE4: omni::Scavenger::execute() (giopStrand.cc:698)
> ==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)
> ==9765==    by 0xAC5885: omniAsyncPoolGeneral::workerRun(omniAsyncWorker*)
> (invoker.cc:297)
> ==9765==    by 0xAC3F45: omniAsyncWorker::mid_run() (invoker.cc:511)
> ==9765==    by 0xAC4614: omniAsyncWorkerInfo::run() (invoker.cc:662)
> ==9765==    by 0xAC55C1: omniAsyncWorker::run(void*) (invoker.cc:126)
> ==9765==    by 0xB67F33: omni_thread_wrapper (posix.cc:459)
> ==9765==    by 0x4C38C26: ??? (in
> /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
> ==9765==    by 0x4E4B6DA: start_thread (pthread_create.c:463)
> ==9765==
> ==9765== This conflicts with a previous write of size 1 by thread #8
> ==9765== Locks held: none
> ==9765==    at 0xB5ACA2: omni::SocketHolder::setBlocking()
> (SocketCollection.h:101)
> ==9765==    by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned long,
> omni_time_t const&) (tcpConnection.cc:155)
> ==9765==    by 0xB15CA6: omni::giopStream::inputMessage()
> (giopStream.cc:844)
> ==9765==    by 0xB31149:
> omni::giopImpl12::inputNewServerMessage(omni::giopStream*)
> (giopImpl12.cc:438)
> ==9765==    by 0xB319CB:
> omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
> (*)(omni::giopStream*)) (giopImpl12.cc:646)
> ==9765==    by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
> ==9765==    by 0xB1DE1E: omni::giopWorker::execute() (giopWorker.cc:77)
> ==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)
>
> Obviously, the tcpConnection entity is not thread safe.
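>
> To illustrate what helgrind is complaining about, here is a stripped-down
> sketch (hypothetical names, not the real omniORB classes): the blocking-mode
> flag of a connection is a plain field that the sending thread and the
> receiving worker thread both write without any synchronisation:
>
>     struct Connection {                   // plays the role of tcpConnection / SocketHolder
>         bool nonblocking = false;         // plain flag: no lock, no std::atomic
>
>         void send_path() { nonblocking = true;  }   // Send() switches to non-blocking mode
>         void recv_path() { nonblocking = false; }   // Recv() switches back to blocking mode
>     };
>
>     // If the calling thread runs send_path() while a worker runs recv_path()
>     // on the same Connection, both write the same byte concurrently, which is
>     // the write/write conflict helgrind reports.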
>
> I don't know whether this problem leads to the aforementioned deadlock
> issue, but I am a bit confused: I don't know whether the omniORB library is
> responsible for this kind of locking, or whether I, as a user of the omniORB
> library, should handle it manually somehow.
>
> Would you mind giving me some advice?
>
> Thanks,
> Damijan Skvarc,
> Senior Software Engineer
> Instrumentation Technologies, Slovenia
>
>

