[omniORB] Urgent: omniORB::fatalException in omni2.6.1

Randy Shoup rshoup@tumbleweed.com
Fri, 06 Aug 1999 14:48:35 -0700


Randy Shoup wrote:
> 
> Sai-Lai Lo wrote:
> >

> We have two processes involved here, one a "gateway" and the other an
> "extension".  The extension registers itself with the gateway, and
> indicates through its interface that it wants to override or augment
> certain transactions of the gateway.  As the gateway is processing
> (HTTP) transactions, it delegates some of the processing to the
> extension as appropriate.  For this call, the gateway is the client of
> the extension.  However, the extension also calls back to the gateway
> during its processing of the transaction, so the extension is also a
> client of the gateway.  In addition, during the registration phase, the
> extension is also a client of the gateway.
> 
> I should mention that the extension sometimes unregisters and
> reregisters with the gateway.  This is meant to happen only when the
> gateway has gone down and come back up, but because of the
> inconsistencies we have experienced with _non_existent(), the extension
> sometimes thinks that the gateway has restarted when in fact it never
> went down at all.  This reregistration behavior is triggered by another
> "watchdog" process, so it is effectively asynchronous with the rest of
> the processing.
> 
> The problem does not seem to occur with any particular transaction --
> that is, it does not appear to be related to any particular transaction
> that the gateway or the extension is handling.  This lack of pattern
> made us suspect the scavenger.
> 

> > > (2) What else could cause this fatalException?  It seems to occur
> > > because of a mismatch in the "idle" states between the Rope and the
> > > Strand -- the Rope is idle, but the Strand is not.  Is there any other
> > > way that a Rope could be set to idle, and the Strand not be set to idle,
> > > other than by the action of the scavenger?  Idleness appears to be
> > > related to the reference counts on these objects, so perhaps there is a
> > > problem there?
> >
> > The reference count on a Rope equals the no. of proxy objects created in
> > the address space that use the Rope. A remote address space maps to a Rope.
> >
> > One possible cause of the problem, although I think it is unlikely, is that
> > a thread has called release on an object reference while another thread is
> > using that object reference to do a remote invocation. The release causes
> > the ref count on the rope to go to 0 but a strand within the rope is
> > still active.
> >

This seems likely to have been it.

After your suggestion, we re-reviewed the code, looking for a place
where we were not properly duplicating and releasing a reference.  We
found one in the gateway code that uses the extension.  This code did
not duplicate the extension reference before invoking on it, so if the
extension unregistered itself (thereby decrementing the reference count)
while we were invoking or preparing to invoke on that reference, the
ref count could go to zero and cause the behavior you describe.  Bottom
line:  always duplicate a reference for as long as you are using it,
and release your copy when you are done! :-)
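
To make the failure mode concrete, here is a minimal sketch of the
pattern.  It is not omniORB code and not our actual gateway code:
Extension, Gateway, duplicate(), release() and acquire() are
hypothetical stand-ins for a CORBA object reference, our registry, and
the generated _duplicate() / CORBA::release() calls.  The "concurrent"
unregister is simulated in a single thread so the interleaving is
deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for an object reference with a reference count.
struct Extension {
    std::atomic<int> refs{1};        // creator holds the first reference
    static std::atomic<int> live;    // objects constructed and not yet destroyed
    Extension()  { ++live; }
    ~Extension() { --live; }
    int handle() { return 42; }      // stands in for a remote invocation
};
std::atomic<int> Extension::live{0};

// Analogues of T::_duplicate() and CORBA::release() (assumed semantics).
Extension* duplicate(Extension* e) { if (e) e->refs.fetch_add(1); return e; }
void release(Extension* e) { if (e && e->refs.fetch_sub(1) == 1) delete e; }

class Gateway {
    std::mutex m_;
    Extension* ext_ = nullptr;       // the registry's own reference
public:
    ~Gateway() { release(ext_); }
    void register_extension(Extension* e) {
        std::lock_guard<std::mutex> g(m_);
        release(ext_);
        ext_ = duplicate(e);
    }
    void unregister_extension() {
        std::lock_guard<std::mutex> g(m_);
        release(ext_);
        ext_ = nullptr;
    }
    // Hands out a duplicated reference; the caller must release() it.
    Extension* acquire() {
        std::lock_guard<std::mutex> g(m_);
        return duplicate(ext_);
    }
};

// Simulates an unregister arriving while a call is in progress.
// Returns true if the reference stayed valid for the whole call.
bool call_survives_unregister() {
    Gateway gw;
    Extension* e = new Extension;    // refs == 1
    gw.register_extension(e);        // refs == 2
    release(e);                      // refs == 1, held by the gateway only

    Extension* held = gw.acquire();  // refs == 2: the caller's own reference
    gw.unregister_extension();       // refs == 1; would be 0 without acquire()
    bool alive = (Extension::live == 1);
    int result = held->handle();     // safe: the object still exists
    release(held);                   // refs == 0, object destroyed here
    return alive && result == 42 && Extension::live == 0;
}
```

The buggy version of our code did the equivalent of using ext_
directly instead of going through acquire(), so the unregister could
drop the last reference while the invocation was still in flight.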

This is very likely to have been the problem, but we would still
welcome any other suggestions.  I'll update the list once we are more
certain.

Thanks,
-- Randy
_________________________________________________________________  
Randy Shoup                                     (650)216-2038  
Software Architect                              rshoup@tumbleweed.com  
Tumbleweed Communications Corporation