[omniORB] Assertion failed

Sai-Lai Lo S.Lo@uk.research.att.com
27 Jul 1999 11:08:44 +0100


--=-=-=

Helmut,

Bruce Visscher discovered another race condition in the scavenger a while
ago.  Together we've worked out a solution and hopefully is the last one to
fix (finger crossed).

The change is in omniORB-2.8.0pre1. I've attached below a patch that you
can apply to 2.7.1 Please test this and report if your problem is fixed.

As the fix involves changes to the public interfaces, you have to
recompile the whole runtime <top>/lib/* and all your stubs and
applications, just to be sure.

Sai-Lai

>>>>> Helmut Swaczinna writes:

> my client process (outmgr) crashed with this message:

> outmgr: ../strand.cc:280: void Strand::Sync::RdUnlock(bool = 0): Assertion
> `pd_strand->pd_rd_nwaiting < 0' failed.

> What does this mean? What might I've done wrong? 

> The process works almost all the time, but sometimes it chrashes. 
> This process is one of 12 omni processes running on one machine. Most of them
> are combined client-servers and multi-threaded. Some of the others had
> crashed 
> too in the past, but not as often as this one. I've seen the error message
> shown 
> above for the first time now. Maybe it can lead me to the cause of these
> crashes.

> I'm using omniORB 2.7.1 unter Linux 2.0.36 with egcs 1.1.1.


--=-=-=
Content-Type: text/x-patch
Content-Disposition: attachment; filename=strand.patch

diff -cr omniORB_2.7.1/include/omniORB2/rope.h omniORB_2.7.1_patch/include/omniORB2/rope.h
*** omniORB_2.7.1/include/omniORB2/rope.h	Mon Jan 25 18:06:24 1999
--- omniORB_2.7.1_patch/include/omniORB2/rope.h	Tue Jul 27 10:24:14 1999
***************
*** 3,9 ****
  // rope.h                     Created on: 6/2/96
  //                            Author    : Sai Lai Lo (sll)
  //
! //    Copyright (C) 1996, 1997 Olivetti & Oracle Research Laboratory
  //
  //    This file is part of the omniORB library
  //
--- 3,9 ----
  // rope.h                     Created on: 6/2/96
  //                            Author    : Sai Lai Lo (sll)
  //
! //    Copyright (C) 1996-1999 AT&T Laboratories Cambridge
  //
  //    This file is part of the omniORB library
  //
***************
*** 29,34 ****
--- 29,49 ----
  
  /*
    $Log: rope.h,v $
+   Revision 1.9  1999/06/18 21:17:31  sll
+   Updated copyright notice.
+ 
+   Revision 1.8  1999/05/26 11:46:30  sll
+   Replaced WrTimedLock with WrTestLock.
+   Changed the operator() of Strand_iterator to increment the ref count
+   of the strand it returns.
+   Added new member is_unused() to class Strand. This is used to check
+   if no Sync object is parking on the strand. This member is necessary
+   because given the change in the Strand_iterator, is_idle() is no
+   longer appropriate to test if no Sync object is using the strand.
+ 
+   Revision 1.7  1999/03/17 12:43:40  djr
+   Corrected error in Rope_var copy constructor.
+ 
    Revision 1.6  1999/01/25 18:06:24  sll
    Added comment on the side-effect of WrTimedLock and WrUnlock(Strand*).
  
***************
*** 333,338 ****
--- 348,366 ----
    //      Restore <MUTEX> to the same state as indicated by held_rope_mutex
  
    _CORBA_Boolean is_idle(_CORBA_Boolean held_rope_mutex = 0);
+   // Return TRUE(1) if the reference count is zero.
+   //
+   // Concurrency Control:
+   //      MUTEX = pd_rope->pd_lock
+   // Pre-condition:
+   //      Does not hold <MUTEX> on enter if held_rope_mutex == FALSE
+   //      Hold <MUTEX> on enter if held_rope_mutex == TRUE              
+   // Post-condition:
+   //      Restore <MUTEX> to the same state as indicated by held_rope_mutex
+ 
+   _CORBA_Boolean is_unused(_CORBA_Boolean held_rope_mutex = 0);
+   // Return TRUE(1) if  !is_idle() && no Sync object is parking on this strand
+   //
    // Concurrency Control:
    //      MUTEX = pd_rope->pd_lock
    // Pre-condition:
***************
*** 408,419 ****
      // returns false, then the connection to the remote end is really
      // broken and may be considered as a hard failure.
  
  
-     static _CORBA_Boolean WrTimedLock(Strand* s,
- 				      _CORBA_Boolean& heartbeat,
- 				      unsigned long secs,
- 				      unsigned long nanosecs);
-     static void WrUnlock(Strand* s);
    protected:
      void RdLock(_CORBA_Boolean held_rope_mutex=0);
      void WrLock(_CORBA_Boolean held_rope_mutex=0,
--- 436,443 ----
      // returns false, then the connection to the remote end is really
      // broken and may be considered as a hard failure.
  
+     static _CORBA_Boolean WrTestLock(Strand*s,_CORBA_Boolean& heartbeat);
  
    protected:
      void RdLock(_CORBA_Boolean held_rope_mutex=0);
      void WrLock(_CORBA_Boolean held_rope_mutex=0,
***************
*** 430,478 ****
      //      For RdLock(), WrLock(), RdUnlock(), WrUnlock():
      //          Does not hold <MUTEX> on enter if held_rope_mutex == FALSE
      //          Hold <MUTEX> on enter if held_rope_mutex == TRUE
!     //      For WrTimedLock(), WrUnlock(Strand*):
      //          Must hold <MUTEX> on enter
      // Post-condition:
      //      For RdLock(), WrLock(), RdUnlock(), WrUnlock():
      //        Restore <MUTEX> to the same state as indicated by held_rope_mutex
!     //      For WrTimedLock(), WrUnlock(Strand*):
      //        Still held <MUTEX> on exit
-     // XXX  For WrTimedLock(), if the write lock has been acquired, the
-     //                         reference count of the strand is increment by 1.
-     // XXX  For WrUnlock(Strand*), the reference count of the strand is
-     //                         decrement by 1.
-     //
-     // There are two ways to acquire a Write Lock, 
-     //    i.e. WrLock and WrTimedLock.
      //
      // WrLock blocks until it has acquired a write lock on the strand.
!     // WrTimedLock tries to acquire a write lock on the strand until the
!     // current time is later than the absolute time given in the arguments
!     // <secs> and <nanosecs>. These time arguments are interpreted in the
!     // same way as omni_condition::timedwait(). If WrTimedLock returns 1,
!     // the write lock has been acquired, otherwise it has timeout and no
!     // lock has been acquired.
      //
      // A strand has a status boolean known as the 'heartbeat', this boolean
      // value may be set/unset and read as a side-effect of WrLock and 
!     // WrTimeLock. The initial value of the boolean is 0.
      //
      // If <clear_heartbeat> is 1 (the default), WrLock will set the heartbeat
      // boolean to 0 after it has acquired the write lock. If <clear_heartbeat>
      // is 0, WrLock will leave the heartbeat boolean unchanged.
      // 
!     // The heartbeat boolean will be updated by WrTimedLock with the value
!     // of <heartbeat> after it has acquired the write lock. The original value
!     // of the boolean is returned in <heartbeat>.
      //
      // The value of the heartbeat boolean *does not* affect the internal
      // functions of the strand. However, it is used as a way to detect
      // whether a strand is idle and can be closed down. The algorithm is as
      // follows:
!     //      A scavenger thread periodically scans all the strands. It acquires
!     //      the write lock on the strand using WrTimedLock. It sets the
!     //      heartbeat boolean to 1 and examines the original value returned by 
!     //      WrTimedLock. If the original value is also 1, it knows the strand 
      //      has been idle for at least one scan period. It may then shutdown
      //      the strand.
      // 
--- 454,491 ----
      //      For RdLock(), WrLock(), RdUnlock(), WrUnlock():
      //          Does not hold <MUTEX> on enter if held_rope_mutex == FALSE
      //          Hold <MUTEX> on enter if held_rope_mutex == TRUE
!     //      For WrTestLock()
      //          Must hold <MUTEX> on enter
      // Post-condition:
      //      For RdLock(), WrLock(), RdUnlock(), WrUnlock():
      //        Restore <MUTEX> to the same state as indicated by held_rope_mutex
!     //      For WrTestLock()
      //        Still held <MUTEX> on exit
      //
      // WrLock blocks until it has acquired a write lock on the strand.
!     // WrTestLock checks if a write lock is currently held on the strand.
!     // It returns 1 if no write lock is held or 0 otherwise.
      //
      // A strand has a status boolean known as the 'heartbeat', this boolean
      // value may be set/unset and read as a side-effect of WrLock and 
!     // WrTestLock. The initial value of the boolean is 0.
      //
      // If <clear_heartbeat> is 1 (the default), WrLock will set the heartbeat
      // boolean to 0 after it has acquired the write lock. If <clear_heartbeat>
      // is 0, WrLock will leave the heartbeat boolean unchanged.
      // 
!     // The heartbeat boolean will be updated by WrTestLock with the value
!     // of <heartbeat> if no write lock is held on the strand. The original 
!     // value of the boolean is returned in <heartbeat>.
      //
      // The value of the heartbeat boolean *does not* affect the internal
      // functions of the strand. However, it is used as a way to detect
      // whether a strand is idle and can be closed down. The algorithm is as
      // follows:
!     //      A scavenger thread periodically scans all the strands. It tests
!     //      the write lock on the strand using WrTestLock. It sets the
!     //     heartbeat boolean to 1 and examines the original value returned by 
!     //      WrTestLock. If the original value is also 1, it knows the strand 
      //      has been idle for at least one scan period. It may then shutdown
      //      the strand.
      // 
***************
*** 623,632 ****
    //	  Hold <MUTEX> on exit if pd_leave_mutex == TRUE
  
    Strand *operator() ();
  
  private:
!   const Rope   *pd_rope;
    _CORBA_Boolean pd_leave_mutex;
    Strand *pd_s;
    Strand_iterator();
  };
--- 636,649 ----
    //	  Hold <MUTEX> on exit if pd_leave_mutex == TRUE
  
    Strand *operator() ();
+   // Return the next Strand. The reference count of the returned strand
+   // has been incremented by 1. In the next call to operator(), or in the
+   // dtor of Strand_iterator, the reference count will be decremented.
  
  private:
!   const Rope    *pd_rope;
    _CORBA_Boolean pd_leave_mutex;
+   _CORBA_Boolean pd_initialised;
    Strand *pd_s;
    Strand_iterator();
  };
***************
*** 816,825 ****
    }
  
    inline Rope_var(const Rope_var& p) {
-     if (_ptr) {
-       _ptr->decrRefCount();
-       _ptr = 0;
-     }
      if (p._ptr) {
        p._ptr->incrRefCount();
      }
--- 833,838 ----
diff -cr omniORB_2.7.1/src/lib/omniORB2/orbcore/scavenger.cc omniORB_2.7.1_patch/src/lib/omniORB2/orbcore/scavenger.cc
*** omniORB_2.7.1/src/lib/omniORB2/orbcore/scavenger.cc	Thu Mar 11 16:31:30 1999
--- omniORB_2.7.1_patch/src/lib/omniORB2/orbcore/scavenger.cc	Tue Jul 27 10:24:15 1999
***************
*** 28,33 ****
--- 28,36 ----
   
  /*
    $Log: scavenger.cc,v $
+   Revision 1.7  1999/05/26 11:55:33  sll
+   Use WrTestLock instead of the obsoluted WrTimedLock.
+ 
    Revision 1.6  1999/03/11 16:25:55  djr
    Updated copyright notice
  
***************
*** 165,171 ****
--- 168,177 ----
    if (outScavenger) {
      outScavenger->setIsDying();
      outScavenger->poke();
+ #if defined(__sunos__) && defined(__sparc__) && __OSVERSION__ >= 5
+ #else
      outScavenger->join(0);
+ #endif
      outScavenger = 0;
    }
  }
***************
*** 311,318 ****
  	  // Set the heartbeat boolean to 1 and get back its value
  	  // prior to this update.
  	  CORBA::Boolean heartbeat = 1;
! 	  omni_thread::get_time(&abs_sec,&abs_nsec);
! 	  if (Strand::Sync::WrTimedLock(s,heartbeat,abs_sec,abs_nsec))
  	    {
  	      if (heartbeat) 
  		{
--- 317,323 ----
  	  // Set the heartbeat boolean to 1 and get back its value
  	  // prior to this update.
  	  CORBA::Boolean heartbeat = 1;
! 	  if (Strand::Sync::WrTestLock(s,heartbeat))
  	    {
  	      if (heartbeat) 
  		{
***************
*** 326,332 ****
  		  }
  		  s->shutdown();
  		}
- 	      Strand::Sync::WrUnlock(s);
  	    }
  	}
      }
--- 331,336 ----
diff -cr omniORB_2.7.1/src/lib/omniORB2/orbcore/strand.cc omniORB_2.7.1_patch/src/lib/omniORB2/orbcore/strand.cc
*** omniORB_2.7.1/src/lib/omniORB2/orbcore/strand.cc	Thu Mar 11 16:31:31 1999
--- omniORB_2.7.1_patch/src/lib/omniORB2/orbcore/strand.cc	Tue Jul 27 10:24:15 1999
***************
*** 29,34 ****
--- 29,38 ----
  
  /*
    $Log: strand.cc,v $
+   Revision 1.9  1999/05/26 11:54:13  sll
+   Replaced WrTimedLock with WrTestLock.
+   New implementation of Strand_iterator.
+ 
    Revision 1.8  1999/03/11 16:25:56  djr
    Updated copyright notice
  
***************
*** 143,148 ****
--- 147,164 ----
    return idle;
  }
  
+ CORBA::Boolean
+ Strand::is_unused(CORBA::Boolean held_rope_mutex)
+ {
+   CORBA::Boolean unused;
+   if (!held_rope_mutex)
+     pd_rope->pd_lock.lock();
+   unused = ((pd_head == 0)? 1: 0);
+   if (!held_rope_mutex)
+     pd_rope->pd_lock.unlock();
+   return unused;
+ }
+ 
  Strand::
  Sync::Sync(Strand *s,CORBA::Boolean rdLock,CORBA::Boolean wrLock) 
  {
***************
*** 316,362 ****
  
  }
  
- CORBA::Boolean
- Strand::
- Sync::WrTimedLock(Strand* s,
- 		  CORBA::Boolean& heartbeat,
- 		  unsigned long secs,
- 		  unsigned long nanosecs)
- {
-   CORBA::Boolean notimeout;
-   // Note: the caller must have got the mutex s->pd_rope->pd_lock 
- 
-   s->incrRefCount(1);
- 
-   while (s->pd_wr_nwaiting < 0) {
-     s->pd_wr_nwaiting--;
-     notimeout = s->pd_wrcond.timedwait(secs,nanosecs);
-     if (s->pd_wr_nwaiting >= 0)
-       s->pd_wr_nwaiting--;
-     else
-       s->pd_wr_nwaiting++;
-     if (!notimeout && s->pd_wr_nwaiting < 0) {
-       // give up;
- 
-       s->decrRefCount(1);
- 
-       return 0;
-     }
-   }
-   s->pd_wr_nwaiting = -s->pd_wr_nwaiting - 1;
-   
-   CORBA::Boolean hb = s->pd_heartbeat;
-   s->pd_heartbeat = heartbeat;
-   heartbeat = hb;
-   return 1;
- 
-   // XXX The caller must release the write lock using 
-   //     Strand::Sync::WrUnlock(Strand* s) *ONLY*!!!
-   //     Otherwise the strand reference count would be messed up.
-   //     Fix me in future.
- }
- 
- 
  void
  Strand::
  Sync::WrUnlock(CORBA::Boolean held_rope_mutex)
--- 332,337 ----
***************
*** 375,404 ****
    return;
  }
  
! void
  Strand::
! Sync::WrUnlock(Strand* s)
  {
!   // XXX The caller must have acquired the write lock previously
!   //     with Strand::Sync::WrTimedLock. Otherwise, the strand
!   //     reference count would be messed up.
!   //     Fix me in future.
! 
!   // Note: the caller must have got the mutex s->pd_rope->pd_lock
!   assert(s->pd_wr_nwaiting < 0);
!   s->pd_wr_nwaiting = -s->pd_wr_nwaiting - 1;
!   if (s->pd_wr_nwaiting > 0)
!     s->pd_wrcond.signal();
! 
!   s->decrRefCount(1);
!   // The strand may be idle and is dying, it should be deleted but
!   // it is just to dangerous to do it here. Let the strand iterator
!   // clean it up later.
  
!   return;
  }
  
- 
  void
  Strand::
  Sync::setStrandIsDying()
--- 350,369 ----
    return;
  }
  
! 
! CORBA::Boolean
  Strand::
! Sync::WrTestLock(Strand* s,CORBA::Boolean& heartbeat)
  {
!   if (s->pd_wr_nwaiting < 0)
!     return 0;
  
!   CORBA::Boolean hb = s->pd_heartbeat;
!   s->pd_heartbeat = heartbeat;
!   heartbeat = hb;
!   return 1;
  }
  
  void
  Strand::
  Sync::setStrandIsDying()
***************
*** 464,475 ****
    while ((p = next())) {
      p->shutdown();
      omni_thread::yield();
-     if (p->is_idle(1)) {
-       if (p->pd_heapAllocated)
- 	delete p;
-       else
- 	p->~Strand();
-     }
    }
    return;
  }
--- 429,434 ----
***************
*** 484,490 ****
      while ((p = next())) {
        if (!p->_strandIsDying()) {
  	n++;
! 	if (p->is_idle(1)) {
  	  secondHand = 1;
  	  break;
  	}
--- 443,449 ----
      while ((p = next())) {
        if (!p->_strandIsDying()) {
  	n++;
! 	if (p->is_unused(1)) {
  	  secondHand = 1;
  	  break;
  	}
***************
*** 571,585 ****
  {
    if (!held_rope_mutex)
      ((Rope *)r)->pd_lock.lock();
-   pd_s = r->pd_head;
    pd_rope = r;
    pd_leave_mutex = held_rope_mutex;
    return;
  }
  
  
  Strand_iterator::~Strand_iterator()
  {
    if (!pd_leave_mutex)
      ((Rope *)pd_rope)->pd_lock.unlock();
    return;
--- 530,549 ----
  {
    if (!held_rope_mutex)
      ((Rope *)r)->pd_lock.lock();
    pd_rope = r;
    pd_leave_mutex = held_rope_mutex;
+   pd_initialised = 0;
+   pd_s = 0;
    return;
  }
  
  
  Strand_iterator::~Strand_iterator()
  {
+   if (pd_s) {
+     pd_s->decrRefCount(1);
+     pd_s = 0;                // Be paranoid
+   }
    if (!pd_leave_mutex)
      ((Rope *)pd_rope)->pd_lock.unlock();
    return;
***************
*** 588,611 ****
  Strand *
  Strand_iterator::operator() ()
  {
!   Strand *p;
!   while (1) {
!     p = pd_s;
!     if (p) {
!       pd_s = pd_s->pd_next;
!       if (p->is_idle(1) && p->_strandIsDying()) {
! 	if (p->pd_heapAllocated)
! 	  delete p;
! 	else
! 	  p->~Strand();
! 	continue;
!       }
!     }
!     break;
    }
!   return p;
  }
- 
  
  Rope_iterator::Rope_iterator(const Anchor *a)
  {
--- 552,578 ----
  Strand *
  Strand_iterator::operator() ()
  {
!   if (pd_s) {
!     pd_s->decrRefCount(1);
!     pd_s = pd_s->pd_next;
!   }
!   else if (!pd_initialised) {
!     pd_s = pd_rope->pd_head;
!     pd_initialised = 1;
!   }
!   while (pd_s && pd_s->is_idle(1) && pd_s->_strandIsDying()) {
!     Strand *p=pd_s;
!     pd_s = pd_s->pd_next;
!     if (p->pd_heapAllocated)
!       delete p;
!     else
!       p->~Strand();
    }
!   if (pd_s) {
!     pd_s->incrRefCount(1);
!   }
!   return pd_s;
  }
  
  Rope_iterator::Rope_iterator(const Anchor *a)
  {

--=-=-=





-- 
Sai-Lai Lo                                   S.Lo@uk.research.att.com
AT&T Laboratories Cambridge           WWW:   http://www.uk.research.att.com 
24a Trumpington Street                Tel:   +44 1223 343000
Cambridge CB2 1QA                     Fax:   +44 1223 313542
ENGLAND


--=-=-=--