[omniORB] Problem with omniORB 303 and HP_UX 11

Wed May 29 18:00:06 2002

I've been investigating this problem and I think I've found a deadlock bug.
Anyone care to review my logic?

File:
  omni/src/lib/omniORB2/orbcore/tcpSocketMTfactory.cc

On line 1448 we find...
>>  omni_mutex_lock sync(pd_factory->pd_shutdown_lock);

A few lines further down (line 1471) we find...
>>    omni_mutex_lock sync(pd_factory->pd_shutdown_lock);

The first 'sync' variable hasn't yet gone out of scope, so the second 'sync'
tries to lock a mutex that the same thread already holds. Assuming omni_mutex
is non-recursive, that will inevitably deadlock. Evidence that this is
actually happening is provided by the traces supplied by Huw (below). The last
thing we see is...
>  omniORB: tcpSocketStrand::real_shutdown() fd no. 9 Done

"real_shutdown()" is called on the line immediately before the second mutex
lock.
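
To show the scoping argument in isolation, here is a minimal single-threaded
sketch. ToyMutex, ToyScopedLock and nested_guard_self_deadlocks are
hypothetical names, not omniORB classes. The toy mutex throws where a real
non-recursive mutex would simply block forever, so the problem becomes visible
instead of hanging the program.

```cpp
#include <stdexcept>

// Toy stand-in for a non-recursive mutex (hypothetical, not omni_mutex).
// A real one would block forever on re-acquisition; this one throws instead.
class ToyMutex {
public:
    void lock() {
        if (held_) throw std::logic_error("self-deadlock: mutex already held");
        held_ = true;
    }
    void unlock() { held_ = false; }
private:
    bool held_ = false;
};

// Scoped guard in the style of omni_mutex_lock: lock in the constructor,
// unlock in the destructor.
class ToyScopedLock {
public:
    explicit ToyScopedLock(ToyMutex& m) : m_(m) { m_.lock(); }
    ~ToyScopedLock() { m_.unlock(); }
    ToyScopedLock(const ToyScopedLock&) = delete;
    ToyScopedLock& operator=(const ToyScopedLock&) = delete;
private:
    ToyMutex& m_;
};

// Same shape as the loop body under discussion. Returns true if the inner
// guard tried to take a mutex that the outer guard was still holding.
bool nested_guard_self_deadlocks(bool worker_creation_failed) {
    ToyMutex shutdown_lock;
    try {
        ToyScopedLock sync(shutdown_lock);        // outer guard (cf. line 1448)
        if (worker_creation_failed) {
            // The outer 'sync' is still alive here, so this second
            // acquisition hits a mutex that is already held.
            ToyScopedLock sync(shutdown_lock);    // inner guard (cf. line 1471)
        }
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

Calling nested_guard_self_deadlocks(true) takes the error branch and reports
the double acquisition; nested_guard_self_deadlocks(false) never reaches the
inner guard, which matches the observation that the server only misbehaves
when worker creation fails.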

Here's the relevant code:

>  while (r->pd_shutdown == tcpSocketIncomingRope::ACTIVE  && !die) {
>    try {

       .
       .
       .

>      PTRACE("Rendezvouser","accept new strand.");
>
>>     omni_mutex_lock sync(pd_factory->pd_shutdown_lock);
>      if (pd_factory->pd_shutdown_nthreads >= 0) {
>        pd_factory->pd_shutdown_nthreads++;
>      }
>      else {
>        pd_factory->pd_shutdown_nthreads--;
>      }
>
>      try {
>        newthr = new tcpSocketWorker(newSt,pd_factory);
>      }
>      catch(...) {
>        newthr = 0;
>      }
>      if (!newthr) {
>        // Cannot create a new thread to serve the strand
>        // We have no choice but to shutdown the strand.
>        // The long term solutions are:  start multiplexing the new strand
>        // and the rendezvous; close down idle connections; reasign
>        // threads to strands; etc.
>        newSt->decrRefCount();
>        newSt->real_shutdown();
>
>>       omni_mutex_lock sync(pd_factory->pd_shutdown_lock);
>        OMNIORB_ASSERT(pd_factory->pd_shutdown_nthreads != 0);
>        if (pd_factory->pd_shutdown_nthreads > 0) {
>          pd_factory->pd_shutdown_nthreads--;
>        }
>        else {
>          pd_factory->pd_shutdown_nthreads++;
>          pd_factory->pd_shutdown_cond.signal();
>        }
>
>      }
>    }
>    catch(

This is a bug or I'll eat my hat! I found the above code in omniORB v3.0.4.

If it is a bug, then how should we fix it? The least intrusive change would
be simply to remove the second lock. But what about placing the first one
inside a {...} code block, so that it is released earlier?
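
As a rough sketch of the second option, using std::mutex and std::lock_guard
as stand-ins for omni_mutex and omni_mutex_lock (the Factory struct and the
accept_strand function are hypothetical, and the worker creation is reduced to
a boolean flag):

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the factory's shutdown bookkeeping.
struct Factory {
    std::mutex shutdown_lock;
    int shutdown_nthreads = 0;
};

// Sketch of the loop body with each lock confined to its own block.
// Returns the counter value after the attempt.
int accept_strand(Factory& f, bool worker_creation_failed) {
    {
        std::lock_guard<std::mutex> sync(f.shutdown_lock);
        if (f.shutdown_nthreads >= 0) {
            f.shutdown_nthreads++;
        } else {
            f.shutdown_nthreads--;
        }
    }   // first lock is released here, before the worker is created

    if (worker_creation_failed) {
        // The strand would be shut down here, with no lock held.
        std::lock_guard<std::mutex> sync(f.shutdown_lock);
        assert(f.shutdown_nthreads != 0);
        if (f.shutdown_nthreads > 0) {
            f.shutdown_nthreads--;
        } else {
            f.shutdown_nthreads++;   // the original also signals pd_shutdown_cond here
        }
    }

    std::lock_guard<std::mutex> sync(f.shutdown_lock);
    return f.shutdown_nthreads;
}
```

With this shape, a failed worker creation undoes the increment and returns
normally instead of hanging. The trade-off is that the worker is no longer
created while the lock is held; whether that matters depends on code outside
the excerpt above. Removing the second lock instead keeps the original locking
extent and relies on the first 'sync' still being in scope.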

-Alex

--
"Apple has two performance headaches right now: the processor and the OS." -
The Register.
 Alex Tingle
 Technical Architect
 Aspentech Watford
 +44 7901 552763

> -----Original Message-----
> From: Huw Edwards 
> Sent: 27 May 2002 09:53
> To: 'omniorb-list@realvnc.com'
> Subject: [omniORB] Problem with omniORB 303 and HP_UX 11
> 
>> Hi 
>> 
>> We are getting an intermittent problem using omniORB303 on an 
>> HP-UX 11.00 machine - we have a server that 3 different clients attach 
>> to via omniNames. Sometimes the 3rd application will fail to attach - it 
>> receives the CORBA::Object from the naming service and narrows 
>> it to be a server object but fails (COMM_FAILURE) when it tries to 
>> call a method on the server. If this happens any new client trying to 
>> attach will also fail but clients already attached continue 
>> communicating with the server as normal.
>> 
>> Given that clients already attached are OK and new clients fail we 
>> thought the IOR entry in omniNames was somehow being corrupted,
>> but it does not appear to change.
>> 
>> With ORBtraceLevel set to 20 in the server, the output is - 
>> 
>> <<When it works>>
>> omniORB: scavenger : scanning connections
>> omniORB: tcpSocketMTfactory Rendezvouser: unblock from accept()
>> omniORB: tcpSocketMTfactory Rendezvouser: accept new strand.
>> omniORB: tcpSocketMTfactory Worker: start.
>> omniORB: tcpSocketMTfactory Rendezvouser: block on accept()
>> omniORB: Handling a GIOP LOCATE_REQUEST.
>> 
>> <<When it fails>>
>> omniORB: scavenger : scanning connections
>> omniORB: tcpSocketMTfactory Rendezvouser: unblock from accept()
>> omniORB: tcpSocketMTfactory Rendezvouser: accept new strand.
>> omniORB: tcpSocketStrand::real_shutdown() fd no. 9 Done
>> omniORB: scavenger : scanning connections
>> 
>> 
>> which I guess suggests that 
>> 	newthr = new tcpSocketWorker(newSt,pd_factory);
>> fails - is that guess correct and what could cause it to fail ?
>> 
>> The same application works consistently on DEC-Alpha/OSF1 machines.
>> 
>> In the search archives Sai-Lo (sp?) was reporting problems with
>> the behaviour of threads on HP-UX (unfortunately I can't get
>> into the archives to give a reference) - has there been any progress
>> on the threads problem like patches from HP and would it be useful 
>> going to 3.0.4?
>> 
>> Regards 
>> Huw
>> 
>> ps obviously there are a load of messages before the 2 snippets 
>> I've included which I can supply if necessary but I wanted to
>> keep this short.