[omniORB] Fwd: deadlock in omniORB

Damijan Skvarc damjan.skvarc at gmail.com
Mon Jul 5 14:49:05 UTC 2021


---------- Forwarded message ---------
From: Damijan Skvarc <damjan.skvarc at gmail.com>
Date: Mon, 5 Jul 2021 15:09
Subject: deadlock in omniORB
To: <omniorb-list at omniorb-support.com>, damjan.skvarc <
Damjan.Skvarc at gmail.com>


Hi.

With omniORB 4.2.2 (as shipped by Ubuntu 18.04) we have started
hitting a deadlock. On one occasion I was able to attach gdb to a
process in which the deadlock had occurred and found the following
situation. There were 21 threads; one of them was stuck in:

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f691f213025 in __GI___pthread_mutex_lock
(mutex=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f691fc1d6c0 in omni::giopStream::errorOnReceive(int, char
const*, unsigned int, omni::giopStream_Buffer*, bool, char const*) ()
from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#3  0x00007f691fc1da4b in omni::giopStream::sleepOnRdLockAlways() ()
from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#4  0x00007f691fc369ac in
omni::giopImpl12::inputReplyBegin(omni::giopStream*, void
(*)(omni::giopStream*)) () from
/usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#5  0x00007f691fc37c17 in
omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
(*)(omni::giopStream*)) () from
/usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#6  0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from
/usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#7  0x00007f691fc047e8 in
omniRemoteIdentity::dispatch(omniCallDescriptor&) () from
/usr/lib/x86_64-linux-gnu/libomniORB4.so.2

while the other 20 threads were stuck in:

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f691f214fb3 in __pthread_mutex_cond_lock
(mutex=mutex at entry=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f691f216e7d in __pthread_cond_wait_common
(abstime=0x7f68c5613eb0, mutex=0x55c31d6f4e00, cond=0x55c31d82b9d8) at
pthread_cond_wait.c:645
#3  __pthread_cond_timedwait (cond=0x55c31d82b9d8,
mutex=0x55c31d6f4e00, abstime=0x7f68c5613eb0) at
pthread_cond_wait.c:667
#4  0x00007f691f90ca84 in omni_condition::timedwait(unsigned long,
unsigned long) () from /usr/lib/x86_64-linux-gnu/libomnithread.so.4
#5  0x00007f691fc1d9f1 in omni::giopStream::sleepOnRdLockAlways() ()
from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#6  0x00007f691fc369ac in
omni::giopImpl12::inputReplyBegin(omni::giopStream*, void
(*)(omni::giopStream*)) () from
/usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#7  0x00007f691fc37c17 in
omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
(*)(omni::giopStream*)) () from
/usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#8  0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from
/usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#9  0x00007f691fc047e8 in
omniRemoteIdentity::dispatch(omniCallDescriptor&) () from
/usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#10 0x00007f691fbe7b3c in omniObjRef::_invoke(omniCallDescriptor&,
bool) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2

Inspecting the pthread mutex at address 0x55c31d6f4e00, I found that the
first thread already owned it, yet its call stack shows it trying to lock
the same mutex again. It looks as if the thread for some reason "forgot"
to unlock it in one of the previous steps.

Since I couldn't get any additional data, I started instrumenting the
code with helgrind. Unfortunately, because the timing conditions change
under instrumentation, I was not able to hit the same deadlock again. I
did instrument all versions from 4.2.0 up to 4.2.4, and in every one of
them helgrind found another race condition, in the following case:

==9765== Possible data race during read of size 1 at 0x67C9275 by thread #1
==9765== Locks held: none
==9765==    at 0xB5ACBA: omni::SocketHolder::setNonBlocking()
(SocketCollection.h:108)
==9765==    by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned long,
omni_time_t const&) (tcpConnection.cc:71)
==9765==    by 0xB16D19:
omni::giopStream::sendChunk(omni::giopStream_Buffer*) (giopStream.cc:1129)
==9765==    by 0xB3391A:
omni::giopImpl12::outputMessageEnd(omni::giopStream*) (giopImpl12.cc:1320)
==9765==    by 0xB1E881: omni::GIOP_C::InitialiseRequest() (GIOP_C.cc:109)
==9765==    by 0xAF9363: omniRemoteIdentity::dispatch(omniCallDescriptor&)
(remoteIdentity.cc:97)
==9765==    by 0xADAF14: omniObjRef::_invoke(omniCallDescriptor&, bool)
(omniObjRef.cc:675)
==9765==    by 0x873DCD:
i_NET::_objref_RemoteNode::GetFlags(i_NET::NodeNames const&)
(i_remote_node.cc:1418)
==9765==    by 0x82B49C:
mci::RemoteNode::OnGetFlags(std::vector<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >,
std::allocator<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > const&) const (remote_node.cpp:333)
==9765==    by 0x822E00:
mci::TreeNode::GetFlags(std::vector<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >,
std::allocator<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > const&) const (tree_node.cpp:94)
==9765==    by 0x7A6190: mci::Node::GetFlags() const (node.cpp:312)
==9765==    by 0x68C0B2: MciProxy::Connect(mci::Root, unsigned int)
(mciProxy.cpp:56)
==9765==
==9765== This conflicts with a previous write of size 1 by thread #8
==9765== Locks held: none
==9765==    at 0xB5ACA2: omni::SocketHolder::setBlocking()
(SocketCollection.h:101)
==9765==    by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned long,
omni_time_t const&) (tcpConnection.cc:155)
==9765==    by 0xB15CA6: omni::giopStream::inputMessage()
(giopStream.cc:844)
==9765==    by 0xB31149:
omni::giopImpl12::inputNewServerMessage(omni::giopStream*)
(giopImpl12.cc:438)
==9765==    by 0xB319CB:
omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
(*)(omni::giopStream*)) (giopImpl12.cc:646)
==9765==    by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
==9765==    by 0xB1DE1E: omni::giopWorker::execute() (giopWorker.cc:77)
==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)

This race condition happens between a write and a read; however, there
are also cases where the race is between two writes, which is more
problematic. For example:

==9765== Possible data race during write of size 1 at 0x67C9275 by thread #7
==9765== Locks held: none
==9765==    at 0xB5ACD7: omni::SocketHolder::setNonBlocking()
(SocketCollection.h:110)
==9765==    by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned long,
omni_time_t const&) (tcpConnection.cc:71)
==9765==    by 0xB10FC6: omni::sendCloseConnection(omni::giopStrand*)
(giopStrand.cc:121)
==9765==    by 0xB12AE4: omni::Scavenger::execute() (giopStrand.cc:698)
==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)
==9765==    by 0xAC5885: omniAsyncPoolGeneral::workerRun(omniAsyncWorker*)
(invoker.cc:297)
==9765==    by 0xAC3F45: omniAsyncWorker::mid_run() (invoker.cc:511)
==9765==    by 0xAC4614: omniAsyncWorkerInfo::run() (invoker.cc:662)
==9765==    by 0xAC55C1: omniAsyncWorker::run(void*) (invoker.cc:126)
==9765==    by 0xB67F33: omni_thread_wrapper (posix.cc:459)
==9765==    by 0x4C38C26: ??? (in
/usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==9765==    by 0x4E4B6DA: start_thread (pthread_create.c:463)
==9765==
==9765== This conflicts with a previous write of size 1 by thread #8
==9765== Locks held: none
==9765==    at 0xB5ACA2: omni::SocketHolder::setBlocking()
(SocketCollection.h:101)
==9765==    by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned long,
omni_time_t const&) (tcpConnection.cc:155)
==9765==    by 0xB15CA6: omni::giopStream::inputMessage()
(giopStream.cc:844)
==9765==    by 0xB31149:
omni::giopImpl12::inputNewServerMessage(omni::giopStream*)
(giopImpl12.cc:438)
==9765==    by 0xB319CB:
omni::giopImpl12::inputMessageBegin(omni::giopStream*, void
(*)(omni::giopStream*)) (giopImpl12.cc:646)
==9765==    by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
==9765==    by 0xB1DE1E: omni::giopWorker::execute() (giopWorker.cc:77)
==9765==    by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)

Evidently, the tcpConnection object is not thread-safe.

I don't know whether this race leads to the aforementioned deadlock, but
I am a bit confused about responsibility: is the omniORB library supposed
to do this kind of locking internally, or should I, as a user of the
library, handle it manually somehow?

Would you mind giving me some advice?

Thanks,
Damijan Skvarc,
Senior Software Engineer
Instrumentation Technologies, Slovenia

