[omniORB] omniNames crashes on OpenVMS Alpha with multiple kernel threads enabled

06 Jun 2000 15:22:29 +0100

Bruce,

I looked at the code for a good few hours and could not find any race
condition. But then I wrote it, so I may well be blind to its defects.

Perhaps you could try this patch:

Index: strand.cc
===================================================================
RCS file: /project/omni/cvsroot/omni/src/lib/omniORB2/orbcore/strand.cc,v
retrieving revision 1.10.2.3
diff -r1.10.2.3 strand.cc
95a96
> #define CHECK_FOR_RACE_CONDITION 0x5246494E
166a168,179
> 
> #ifdef CHECK_FOR_RACE_CONDITION
>   if (pd_refcount == 0 && pd_ripper_next != this && 
>       (omni::ptr_arith_t) pd_ripper_next != CHECK_FOR_RACE_CONDITION) {
> 
>     // This strand has been handed to the ripper thread. The ref count
>     // should not go to 0 before the ripper thread has processed it.
>     LOGMESSAGE(0,"Strand::decrRefCount: unexpected race condition. Abort program");
>     abort();
>   }
> #endif
> 
705a719,721
> #ifdef CHECK_FOR_RACE_CONDITION
> 	p->pd_ripper_next = (Strand*) (omni::ptr_arith_t) CHECK_FOR_RACE_CONDITION;
> #endif	

The patch traps the case where a strand has been handed to the ripper
thread and, before the ripper gets to it, another thread calls
decrRefCount and drives the ref count to 0. This seems to be what
happened in your most recent crash. It shouldn't happen, but if it does,
at least we will have a core dump showing who was doing what at the time.
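
To make the idea behind the check concrete, here is a minimal standalone sketch of the same sentinel technique. It is not the omniORB code: FakeStrand and the three helper functions are hypothetical names, the sketch assumes (as the patch's test implies) that an idle strand has its ripper link pointing at itself, and it assumes the third hunk runs in the ripper once it has dealt with the strand. It returns a flag where the patch calls abort(), so the cases can be exercised in isolation.

```cpp
#include <cassert>
#include <cstdint>

// Same magic value as CHECK_FOR_RACE_CONDITION in the patch.
static const std::uintptr_t kProcessedMark = 0x5246494E;

// Hypothetical stand-in for Strand, reduced to the two fields the check reads.
struct FakeStrand {
  int refcount;
  FakeStrand* ripper_next;  // == this while idle (assumption inferred from the patch)

  explicit FakeStrand(int rc) : refcount(rc), ripper_next(this) {}
};

// Queue the strand for the ripper; 'next' may be null at the end of the list.
void hand_to_ripper(FakeStrand& s, FakeStrand* next) { s.ripper_next = next; }

// Assumed ripper-side step: stamp the link once the strand has been processed.
void ripper_processed(FakeStrand& s) {
  s.ripper_next = reinterpret_cast<FakeStrand*>(kProcessedMark);
}

// Drops one reference. Returns true in the case where the patch would abort():
// the count reached 0 while the strand was queued but not yet processed.
bool decr_ref_hits_trap(FakeStrand& s) {
  --s.refcount;
  return s.refcount == 0 &&
         s.ripper_next != &s &&
         reinterpret_cast<std::uintptr_t>(s.ripper_next) != kProcessedMark;
}
```

A strand that was never queued, or one the ripper has already stamped, passes through quietly; only the queued-but-unprocessed case trips the trap.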

Also, Duncan points out that there is a potential problem with
tcpSocketRendezvouser::run_detached() under a very rare exception condition.
I'm not sure whether such an exception condition can actually arise, but
it's better to plug the hole anyway:

Index: tcpSocketMTfactory.cc
===================================================================
RCS file: /project/omni/cvsroot/omni/src/lib/omniORB2/orbcore/tcpSocketMTfactory.cc,v
retrieving revision 1.22.2.7
diff -r1.22.2.7 tcpSocketMTfactory.cc
1145c1145
<   tcpSocketStrand *newSt = 0;
---
>   tcpSocketStrand *newSt;
1149a1150,1151
> 
>     newSt = 0;
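
The kind of hazard this change guards against can be sketched roughly as follows. This is a simplified illustration, not the tcpSocketMTfactory.cc code: FakeStrand, handed_off and count_stale_cleanups are hypothetical, and the sketch assumes the problem is a pointer left over from an earlier loop iteration reaching the exception handler when a later iteration throws before assigning it.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for tcpSocketStrand.
struct FakeStrand {
  bool handed_off = false;  // true once another thread owns the strand
};

// Runs two iterations of an accept-style loop. The first succeeds and hands
// its strand off; the second throws before a new strand is created.
// Returns how many times the handler saw a strand it no longer owns.
int count_stale_cleanups(bool reset_each_iteration) {
  FakeStrand first;
  FakeStrand* newSt = nullptr;
  int stale = 0;

  for (int i = 0; i < 2; ++i) {
    if (reset_each_iteration) newSt = nullptr;  // the effect of the patch
    try {
      if (i == 1) throw std::runtime_error("failure before strand creation");
      newSt = &first;           // strand created on this iteration...
      first.handed_off = true;  // ...and passed on to a worker
    } catch (const std::exception&) {
      // A real handler would clean up newSt here; count instead of deleting.
      if (newSt && newSt->handed_off) ++stale;
    }
  }
  return stale;
}
```

Without the per-iteration reset the handler sees the first iteration's strand again; with it, the pointer is null and there is nothing to clean up.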

Keep me posted if you have any new evidence.

Sai-Lai

>>>>> Bruce Visscher writes:

> I have been experiencing some extremely rare problems with strands being
> deleted early in omniNames 2.8.0 on OpenVMS Alpha 7.1-2 (with DECthreads
> ECOs) if multiple kernel threads is enabled.

> I had reported this problem earlier, but stated that I still had some
> VMS "issues".  I have since resolved these issues and they had nothing
> to do with the crashes I have seen in omniNames (we uncovered a bug in
> the std iostreams library that caused it to be thread unsafe).

> Specifically, the problems always seem to occur in Strand::decrRefCount.

> I have seen this assertion fail:

>   assert(pd_refcount >= 0);

> I have also had access violations (address=0x00000000) in the statement:

>   pd_rope->pd_lock.lock();

> This occurs in the call to pthread_mutex_lock.  I believe this indicates
> that the pd_rope member is null which indicates that the strand has
> already been destroyed (there's a pd_rope=0 in the destructor, perhaps
> because of MSC double destruction bugs).

> The most recent example of this occurred when
> omniORB_Ripper::run_undetached invoked p->decrRefCount() just after the
> p->real_shutdown().

> The problem doesn't seem to occur if multiple kernel threads is
> disabled.  However, my experience has been that enabling this option
> seems to expose thread safety issues.

> Unfortunately, this problem has proven to be extremely difficult to
> reproduce.  I have to run omniNames with multiple dedicated clients
> beating on it for hours to get it to crash.  No pattern has emerged.

> Does anyone have any clues?

-- 
Sai-Lai Lo                                   S.Lo@uk.research.att.com
AT&T Laboratories Cambridge           WWW:   http://www.uk.research.att.com 
24a Trumpington Street                Tel:   +44 1223 343000
Cambridge CB2 1QA                     Fax:   +44 1223 313542
ENGLAND