[omniORB] Urgent: omniORB::fatalException in omni2.6.1

Randy Shoup rshoup@tumbleweed.com
Fri, 06 Aug 1999 12:20:01 -0700


Sai-Lai Lo wrote:
> 
> Randy,
> 
> It would be helpful if I can have a copy of the omniORB source you are
> using. Please upload it into the directory
>  ftp://ftp.uk.research.att.com/pub/incoming/omniorb

OK.  Done.

> 
> It is also helpful to understand what interactions the client and server
> are engaging in.

We have two processes involved here, one a "gateway" and the other an
"extension".  The extension registers itself with the gateway, and
indicates through its interface that it wants to override or augment
certain transactions of the gateway.  As the gateway is processing
(HTTP) transactions, it delegates some of the processing to the
extension as appropriate.  For this call, the gateway is the client of
the extension.  However, the extension also calls back to the gateway
during its processing of the transaction, so the extension is a client
of the gateway as well.  The extension likewise acts as a client of the
gateway during the registration phase.
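
To make the shape of these interactions concrete, here is a minimal
sketch of the call pattern in plain C++.  The class names and methods
are hypothetical stand-ins; the real objects are CORBA proxies generated
by omniidl, not these classes:

```cpp
#include <string>
#include <vector>

struct Extension;  // forward declaration; defined below

// Stand-in for the gateway process.  It holds callback references to
// registered extensions and delegates part of each transaction to them.
struct Gateway {
    std::vector<Extension*> extensions;  // registered extensions
    std::string log;                     // records callbacks for the demo

    void register_extension(Extension* e) { extensions.push_back(e); }

    // Called back by an extension while it processes a transaction,
    // which is why the gateway is also a server of the extension.
    void lookup(const std::string& key) { log += "lookup(" + key + ")"; }

    void handle_transaction(const std::string& txn);  // defined below
};

// Stand-in for the extension process.
struct Extension {
    Gateway* gw = nullptr;

    // The gateway delegates part of each HTTP transaction here, and the
    // extension calls back into the gateway mid-call, so each side is
    // both client and server of the other.
    void process(const std::string& txn) { gw->lookup(txn); }
};

void Gateway::handle_transaction(const std::string& txn) {
    for (Extension* e : extensions) e->process(txn);
}

// Drive one transaction end to end and return the gateway's call log.
std::string run_demo() {
    Gateway gw;
    Extension ext;
    ext.gw = &gw;
    gw.register_extension(&ext);     // the registration phase
    gw.handle_transaction("GET /status");
    return gw.log;
}
```
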

I should mention that the extension sometimes unregisters and
reregisters with the gateway.  This is intended to happen only when the
gateway has actually gone down and come back up; however, because of the
inconsistencies we have experienced with _non_existent(), the extension
sometimes concludes that the gateway has gone down and come up when in
fact it never went down at all.  This reregistration is triggered by a
separate "watchdog" process, so it is effectively asynchronous with the
rest of the processing.

The problem does not seem to occur with any particular transaction --
that is, it does not appear to be related to any particular transaction
that the gateway or the extension is handling.  This lack of pattern
made us suspect the scavenger.

> 
>  1. Does the crash occur in the server?

For a variety of reasons, we have not been able to step through the code
to see the problem actually occurring.  We receive a fatalException
rather high up in our code.

The code where we receive the exception is on the gateway side, so the
failure is likely on the "client" side of the interaction.  However, we
do not appear to be making a call to the extension at the point where we
notice the exception, either.  It is as if something happened in the
background that caused one of the rope/strand pairs to become
inconsistent.

However, the crash does appear to happen while or just after the
extension is unregistering and reregistering itself with the gateway. 
This is definitely suspicious.

> 
>  2. Does the server make callbacks to the clients?
>       The location of the fatal exception suggests that this is
>       an outgoing rope.

Yes.  See above.  There are a lot of calls which go in both directions.

> 
>  3. Is GIOP LOCATION_FORWARD used? If both sides are omniORB and you do not
>     use the dynamic object loading hook, then LOCATION_FORWARD is not
>     generated.

We are not using LOCATION_FORWARD.

> 
> The bugs related to race conditions with the scavengers have been fixed but
> I want to be sure your source has a consistent set of fixes.

It is certainly possible that we made a mistake in applying these fixes.

> 
> > I have several questions:
> 
> > (1) We turn the scavengers off after the ORB object is created, but
> > before the BOA object is created.  Is this sufficient to make sure that
> > the scavengers stop running?  (From examining the omniORB code, I think
> > the answer is yes, but now I am not 100% sure)
> 
> YES.

Hmmm.  This appears to take the scavenger out of the list of potential
problems, then.

> 
> > (2) What else could cause this fatalException?  It seems to occur
> > because of a mismatch in the "idle" states between the Rope and the
> > Strand -- the Rope is idle, but the Strand is not.  Is there any other
> > way that a Rope could be set to idle, and the Strand not be set to idle,
> > other than by the action of the scavenger?  Idleness appears to be
> > related to the reference counts on these objects, so perhaps there is a
> > problem there?
> 
> The reference count on a Rope equals the number of proxy objects
> created in the address space that use the Rope. A remote address space
> maps to a Rope.
> 
> One possible cause of the problem, although I think it is unlikely, is
> that a thread has called release on an object reference while another
> thread is using that object reference to do a remote invocation. The
> release causes the ref count on the rope to go to 0 while a strand
> within the rope is still active.
> 
> Another scenario is that a thread is using an object reference which
> has already been released. The memory has actually been returned to the
> heap but has not been modified yet. Again the rope ref count goes to 0
> but a strand is still active.

We will look into this.
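
For our own notes, the first scenario can be reproduced deterministically
in a toy model.  The Rope and Strand classes below are hypothetical
stand-ins that only mimic the ref-counting book-keeping; they are not
omniORB's actual implementation:

```cpp
// Toy model of the Rope/Strand book-keeping (hypothetical; the names
// mirror omniORB's but this is not the omniORB source).
struct Strand {
    bool active = false;     // true while a remote call is in flight
};

struct Rope {
    int refcount = 0;        // == number of proxy objects using this rope
    Strand strand;           // a real rope holds a list of strands

    void incrRefCount() { ++refcount; }
    void decrRefCount() { --refcount; }
    bool is_idle() const { return refcount == 0; }
};

// Reproduce the hazard: one thread releases the last object reference
// (refcount drops to 0) while another thread is mid-invocation on a
// strand.  Here the interleaving is simulated sequentially.
bool race_produces_idle_mismatch() {
    Rope rope;
    rope.incrRefCount();        // proxy object created
    rope.strand.active = true;  // thread A starts a remote call
    rope.decrRefCount();        // thread B releases the last reference
    // This is the inconsistency the consistency check trips over:
    // the rope is idle, but its strand is not.
    return rope.is_idle() && rope.strand.active;
}
```
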

> 
> > (3) Could we fix the mismatch of "idle" states in another way -- i.e.,
> > could we perhaps un-idle the Rope if we discover one of the Rope's
> > Strands is not idle?  I am wondering here if we could avoid throwing
> > this exception altogether by cleaning up the inconsistency
> > automatically.
> 
> Yes, it is safe to avoid throwing the exception. The safe thing to do
> is to leave the rope alone. Alternatively, you can "un-idle" the Rope
> by calling incrRefCount on the rope.
> 
> However, I think this is just a symptom of something else being wrong,
> so "un-idling" the Rope might just shift the crash somewhere else.

That is what I was afraid of.  We may try it if we get desperate.

If we incremented the Rope ref count, would it ever get decremented?
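
To be explicit about what we would be trying, here is a sketch of the
workaround against the same kind of toy Rope stand-in (hypothetical
types; not omniORB code):

```cpp
// Hypothetical stand-ins for the Rope/Strand pair, used only to
// illustrate the suggested workaround; not omniORB's implementation.
struct Strand {
    bool active = false;
};

struct Rope {
    int refcount = 0;
    Strand strand;

    void incrRefCount() { ++refcount; }
    bool is_idle() const { return refcount == 0; }
};

// Workaround applied at the point where the consistency check would
// otherwise throw omniORB::fatalException: if the rope is idle but a
// strand is still active, "un-idle" the rope instead of throwing.
void un_idle_if_inconsistent(Rope& r) {
    if (r.is_idle() && r.strand.active) {
        r.incrRefCount();  // the question above: would anything ever
                           // decrement this again?
    }
}
```

As the comment notes, nothing in the sketch ever decrements the extra
reference, so the rope would presumably stay pinned, which is exactly
the concern behind the question above.
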

> 
> I suggest you double-check your code to make sure that it is not doing
> the two things I described above. The prime suspect is how you manage
> the callback object reference.

Thanks for the tip and all the help.

Thanks,
-- Randy
_________________________________________________________________  
Randy Shoup                                     (650)216-2038  
Software Architect                              rshoup@tumbleweed.com  
Tumbleweed Communications Corporation