[omniORB] Zombie threads and connections

Tue Jun 1 12:39:26 BST 2010

Hello.

An app of mine is written in C++ with omniORB server-side and in Java 
with JacORB client-side; the server is running on Ubuntu 9.04 32-bit.
We use BiDirectionalGIOP so that the client contacts the server and that 
connection is reused, and we also provide a keep-alive mechanism.

We are having a problem with dead clients: although I detect them and 
stop trying to communicate with them after they have been lost for some 
time, their TCP connections never seem to go away, and neither do their 
dedicated threads.

We expect the maximum number of clients to be around 30 or 40, but a 
graph of these resources shows the number of threads and ESTABLISHED TCP 
connections constantly growing over time.
Yesterday we had to restart the application, because it had run out of 
resources, having nearly 400 threads and as many open connections and 
thus not being able to open more.

Examining the core with gdb, nearly all omniORB threads show a stack 
trace like the following:

#0  0xb804a424 in __kernel_vsyscall ()
#1  0xb7683418 in recv () from /lib/tls/i686/cmov/libpthread.so.0
#2  0xb75d0bc7 in omni::tcpConnection::Recv (this=0xb0cd7450, 
buf=0x40e90be0, sz=8192, deadline_secs=0, deadline_nanosecs=0) at 
/usr/include/bits/socket2.h:45
#3  0xb758bd7c in omni::giopStream::inputMessage (this=0x40e68d44) at 
../../../../../src/lib/omniORB/orbcore/giopStream.cc:869
#4  0xb75a48ec in omni::giopImpl12::inputNewServerMessage (g=0x40e68d44) 
at ../../../../../src/lib/omniORB/orbcore/giopImpl12.cc:555
#5  0xb75a4a45 in omni::giopImpl12::inputMessageBegin (g=0x40e68d44, 
unmarshalHeader=0xb75a4c70 
<omni::giopImpl12::unmarshalWildCardRequestHeader(omni::giopStream*)>)
     at ../../../../../src/lib/omniORB/orbcore/giopImpl12.cc:768
#6  0xb7594b35 in omni::GIOP_S::dispatcher (this=0x40e68d40) at 
../../../../../src/lib/omniORB/orbcore/GIOP_S.cc:248
#7  0xb7591868 in omni::giopWorker::real_execute (this=0xb63904a8) at 
../../../../../src/lib/omniORB/orbcore/giopWorker.cc:215
#8  0xb7591ecb in omni::giopWorkerInfo::run (this=0xba86d060) at 
../../../../../src/lib/omniORB/orbcore/giopWorker.cc:103
#9  0xb7591f1a in omni::giopWorker::execute (this=0xb63904a8) at 
../../../../../src/lib/omniORB/orbcore/giopWorker.cc:117
#10 0xb7534d30 in omniAsyncWorker::real_run (this=0xb63b6c70) at 
../../../../../src/lib/omniORB/orbcore/invoker.cc:232
#11 0xb75347eb in omniAsyncWorkerInfo::run (this=0xba86d060) at 
../../../../../src/lib/omniORB/orbcore/invoker.cc:280
#12 0xb7534fea in omniAsyncWorker::run (this=0xb63b6c70) at 
../../../../../src/lib/omniORB/orbcore/invoker.cc:159
#13 0xb8038937 in omni_thread_wrapper (ptr=0xb63b6c70) at 
../../../../src/lib/omnithread/posix.cc:456
#14 0xb767c4ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#15 0xb72f449e in clone () from /lib/tls/i686/cmov/libc.so.6

Only one was in poll().

My guess is that a thread blocked in recv() never gets interrupted, 
so it is never reaped and the connection is never closed.
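
To make that guess concrete, here is a minimal sketch (plain POSIX 
sockets, not omniORB code) of what I think is going on: a thread blocked 
in recv() with no deadline only returns once something shuts the socket 
down. The helper name recv_result_after_shutdown and the use of a 
socketpair in place of a real TCP connection are my own assumptions, 
for illustration only.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <thread>

// Hypothetical helper, not omniORB code: block in recv() on one end of a
// socketpair, then shut that socket down from a second thread.
// Returns what recv() returned once it was woken (0 means end-of-stream),
// or -2 if the socketpair could not be created.
long recv_result_after_shutdown() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -2;

    long result = -1;
    std::thread worker([&] {
        char buf[8192];
        // No deadline: this blocks until data arrives, the peer closes,
        // or the socket is shut down.
        result = static_cast<long>(recv(fds[0], buf, sizeof buf, 0));
    });

    usleep(100 * 1000);           // give the worker time to block in recv()
    shutdown(fds[0], SHUT_RDWR);  // what a reaper would have to do
    worker.join();                // completes only because recv() was woken

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

If nothing ever calls shutdown() on the socket of a dead client, I would 
expect the worker to sit in recv() indefinitely, which is what the 
backtrace looks like to me.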

We are using a pretty stock configuration file, the only option we set 
being setClientCallTimeout(5000), in order not to block for too long on 
dead clients when we do oneway calls.
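
For contrast, this is the kind of bounded wait I assumed a timeout would 
give us. It is only a sketch using poll() on a socketpair; wait_readable 
and deadline_demo are hypothetical names, and I am not claiming this is 
how omniORB implements its timeouts. As far as I understand, 
setClientCallTimeout only bounds the calls we make as a client, and the 
deadline_secs=0 in frame #2 suggests to me that the server-side read has 
no deadline at all.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: wait up to timeout_ms for fd to become readable.
// Returns 1 if readable, 0 if the deadline expired, -1 on error.
int wait_readable(int fd, int timeout_ms) {
    struct pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    return poll(&p, 1, timeout_ms);
}

// Returns true if waiting on a silent peer timed out and waiting after
// the peer wrote one byte reported the socket as readable.
bool deadline_demo() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return false;

    int silent = wait_readable(fds[0], 50);  // peer has sent nothing

    char c = 'x';
    bool wrote = (write(fds[1], &c, 1) == 1);
    int ready = wait_readable(fds[0], 50);   // one byte is now pending

    close(fds[0]);
    close(fds[1]);
    return wrote && silent == 0 && ready == 1;
}
```

With a wait like this the worker would get control back periodically 
and could notice that its client has been declared dead.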

What should I look into? Where should I start examining this?

  bye & Thanks
	av.