[omniORB] Endless loop in omniORB - need help!

Wernke zur Borg Wernke.zur.Borg at vega.de
Tue Sep 20 10:47:10 BST 2005


Dear omniORB list,

I need your help to solve a problem that has recently occurred with one of
my omniORB applications. 

Environment:
- Solaris 8 on SunBlade 1500
- omniORB 4.0.2

Symptoms:
After hours or days of smooth operation the program ends up in an endless
loop with 100% CPU usage; top shows 45% user and 55% kernel time. We
currently have 20-30 instances of this program running 24 hours/7 days on
different machines in an operational environment. The problem is reported
1-3 times per week.

I used gcore to obtain a core file of the running process and loaded it into
a debugger. The program was running with 28 threads, of which 6 were active.
Of the active ones, 2 were not blocked on a semaphore or CV. My assumption
is that these 2 threads were competing for the CPU. Both were threads started
by omniORB, and they had the following stacks:

================================================================================
current thread: t@13
  [1] _so_recv(0x14, 0x391468, 0x2000, 0x0, 0x1, 0x8000), at 0xfeb9c23c
=>[2] omni::tcpConnection::Recv(this = 0x3525a8, buf = 0x391468, sz = 8192U,
deadline_secs = 0, deadline_nanosecs = 0), line 284 in "tcpConnection.cc"
  [3] omni::giopStream::inputMessage(this = 0x3526c4), line 816 in
"giopStream.cc"
  [4] omni::giopImpl12::inputNewServerMessage(g = 0x3526c4), line 459 in
"giopImpl12.cc"
  [5] omni::giopImpl12::inputMessageBegin(g = 0x3526c4, unmarshalHeader =
0xff239d18 =
&omni::giopImpl12::unmarshalWildCardRequestHeader(omni::giopStream*)), line
664 in "giopImpl12.cc"
  [6] omni::GIOP_S::dispatcher(this = 0x3526c0), line 214 in "GIOP_S.cc"
  [7] omni::giopWorker::real_execute(this = 0x35c938), line 212 in
"giopWorker.cc"
  [8] omni::giopWorkerInfo::run(this = 0xfd909b10), line 100 in
"giopWorker.cc"
  [9] omni::giopWorker::execute(this = 0x35c938), line 114 in
"giopWorker.cc"
  [10] omniAsyncWorker::real_run(this = 0x352650), line 182 in "invoker.cc"
  [11] omniAsyncWorkerInfo::run(this = 0xfd909ca8), line 229 in "invoker.cc"
  [12] omniAsyncWorker::run(this = 0x352650, _ARG2 = (nil)), line 133 in
"invoker.cc"
  [13] omni_thread_wrapper(ptr = 0x352650), line 422 in "posix.cc"
================================================================================

current thread: t@1626
  [1] _so_recv(0x16, 0x370598, 0x2000, 0x0, 0x1, 0x8000), at 0xfeb9c23c
=>[2] omni::tcpConnection::Recv(this = 0x35dcd8, buf = 0x370598, sz = 8192U,
deadline_secs = 0, deadline_nanosecs = 0), line 284 in "tcpConnection.cc"
  [3] omni::giopStream::inputMessage(this = 0x372b8c), line 816 in
"giopStream.cc"
  [4] omni::giopImpl12::inputNewServerMessage(g = 0x372b8c), line 459 in
"giopImpl12.cc"
  [5] omni::giopImpl12::inputMessageBegin(g = 0x372b8c, unmarshalHeader =
0xff239d18 =
&omni::giopImpl12::unmarshalWildCardRequestHeader(omni::giopStream*)), line
664 in "giopImpl12.cc"
  [6] omni::GIOP_S::dispatcher(this = 0x372b88), line 214 in "GIOP_S.cc"
  [7] omni::giopWorker::real_execute(this = 0x384db8), line 212 in
"giopWorker.cc"
  [8] omni::giopWorkerInfo::run(this = 0xfbe03b10), line 100 in
"giopWorker.cc"
  [9] omni::giopWorker::execute(this = 0x384db8), line 114 in
"giopWorker.cc"
  [10] omniAsyncWorker::real_run(this = 0x35d118), line 182 in "invoker.cc"
  [11] omniAsyncWorkerInfo::run(this = 0xfbe03ca8), line 229 in "invoker.cc"
  [12] omniAsyncWorker::run(this = 0x35d118, _ARG2 = (nil)), line 133 in
"invoker.cc"
  [13] omni_thread_wrapper(ptr = 0x35d118), line 422 in "posix.cc"
================================================================================

There is a do-loop in tcpConnection::Recv(), there is a while-loop in
giopStream::inputMessage(), and there are further loops higher up the stack,
but it is currently unclear to me where the looping actually happens. I guess
it is not the do-loop in tcpConnection::Recv(), because the local variable rx
still shows up as uninitialised in the debugger.
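
For what it is worth, here is a minimal sketch of the kind of failure pattern
I suspect. This is my own illustration, not the actual omniORB code, and the
helper name read_some is made up: if the peer closes the connection, recv()
returns 0 immediately, and a loop that simply retries whenever it did not get
any data will call recv() forever without ever blocking, which would match
the roughly even user/kernel CPU split seen in top.

#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper, only to illustrate the failure pattern. */
ssize_t read_some(int fd, char* buf, size_t sz)
{
  ssize_t rx;
  do {
    rx = recv(fd, buf, sz, 0);
    /* Correct code would treat rx == 0 (orderly shutdown by the peer)
       and rx == -1 with errno != EINTR as fatal and give up on the
       connection instead of retrying. */
  } while (rx <= 0);   /* buggy: retries on EOF and on every error,
                          spinning without ever blocking */
  return rx;
}

Whether anything like this actually applies to the 4.0.2 sources I cannot
tell; it is only meant to show the pattern I am looking for.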

I am not too familiar with the internals of the omniORB core, so perhaps
somebody could help me with the analysis. I have also printed the contents of
both giopStream objects, in case this is of any help:

== thread t@13 ==
*g = {
    pd_strand                     = 0x3748b0
    pd_rdlocked                   = true
    pd_wrlocked                   = false
    pd_impl                       = 0x351060
    pd_deadline_secs              = 0
    pd_deadline_nanosecs          = 0
    pd_currentInputBuffer         = (nil)
    pd_input                      = (nil)
    pd_inputFullyBuffered         = false
    pd_inputMatchedId             = true
    pd_inputExpectAnotherFragment = false
    pd_inputFragmentToCome        = 0
    pd_inputMessageSize           = 5336U
    pd_currentOutputBuffer        = 0x3958a0
    pd_outputFragmentSize         = 0
    pd_outputMessageSize          = 0
    pd_request_id                 = 8100U
    _classid                      = 0
    directSendCutOff              = 16384U
    directReceiveCutOff           = 1024U
    bufferSize                    = 8192U
}

== thread t@1626 ==
*g = {
    pd_strand                     = 0x3a41e8
    pd_rdlocked                   = true
    pd_wrlocked                   = false
    pd_impl                       = 0x351060
    pd_deadline_secs              = 0
    pd_deadline_nanosecs          = 0
    pd_currentInputBuffer         = (nil)
    pd_input                      = (nil)
    pd_inputFullyBuffered         = false
    pd_inputMatchedId             = true
    pd_inputExpectAnotherFragment = false
    pd_inputFragmentToCome        = 0
    pd_inputMessageSize           = 5336U
    pd_currentOutputBuffer        = 0x3aa5f0
    pd_outputFragmentSize         = 0
    pd_outputMessageSize          = 0
    pd_request_id                 = 2052U
    _classid                      = 0
    directSendCutOff              = 16384U
    directReceiveCutOff           = 1024U
    bufferSize                    = 8192U
}

Isn't it strange that both streams show the same message size? Is it
possible that both threads are trying to read the same TCP message?

It would be very difficult to enable the omniORB traces in the operational
environment, but if nothing else helps I will have to consider that as well.
I have not been able to reproduce the problem in our development environment
yet, because we cannot currently simulate the operational message load.
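
If it comes to that, my current plan would be something like the sketch
below. It assumes the usual omniORB 4.0.x configuration hooks
(omniORB::traceLevel, the -ORBtraceLevel option, the ORBtraceLevel
environment variable) and would still need to be checked against the 4.0.2
headers and documentation:

#include <omniORB4/CORBA.h>

int main(int argc, char** argv)
{
  CORBA::ORB_var orb = CORBA::ORB_init(argc, argv);

  // Raise the trace level programmatically; alternatively pass
  // -ORBtraceLevel 25 in argv or set the ORBtraceLevel environment
  // variable before starting the process.  (Level 25 should include
  // GIOP message traffic -- please verify against the 4.0.2 docs.)
  omniORB::traceLevel = 25;

  // ... the rest of the server as before ...

  orb->destroy();
  return 0;
}

That would at least allow me to turn the traces on only on the machines
where the problem shows up.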

I would be extremely thankful for any hint or comment that could help in
solving the problem. Thanks to everybody in advance.

---
Wernke zur Borg
VEGA Informations-Technologien GmbH
Robert-Bosch-Str. 7
64293 Darmstadt / Germany
Tel: +49-(0)6151-8257-128



