[omniORB] RE: serious stability problems with omniORB4 snapshots on Solaris 8

Bastiaan Bakker Bastiaan.Bakker@lifeline.nl
Fri, 1 Feb 2002 13:45:14 +0100



> -----Original Message-----
> From: Duncan Grisby [mailto:dgrisby@uk.research.att.com]
> Sent: Thursday, January 31, 2002 5:43 PM
> To: Bastiaan Bakker
> Cc: omniorb-list@uk.research.att.com
> Subject: Re: [omniORB] RE: serious stability problems with omniORB4
> snapshots on Solaris 8
>
>
> On Thursday 31 January, "Bastiaan Bakker" wrote:
>
> > I *do* know where the 'pure virtual method call' comes from: in
> > notifyWkDone, Peek() is called on a deleted connection. What I do not
> > know is where the connection is deleted (actually decRefCount'ed) while
> > it's still in use.
>
> Something's clearly broken. I won't have a chance to look at this
> until the week after next, since I'm busy this week, and next week I'm
> at the International Python Conference.
>
> Some thoughts if you want to track this down yourself: I assume the
> problem doesn't occur under Linux?  Does the Solaris machine have more

On Linux snapshot 20011013 appears stable. With 20020103 and 20020130 I
get deadlocks after a while, but no crashes like on Solaris. So I'm not
sure whether it's the same problem.

> than one processor?  If so (and the Linux machine doesn't) it might be
> a race condition thing. Are you able to try Sun's compiler rather than
> gcc?  It might be a gcc problem.
>

Actually the Linux machine has two processors and the Solaris machine
only one....
We don't have a Sun compiler available, so we can't try that.

> Most likely, though, it's a bug in the new omniORB transport code. Try
> running with a high trace level to see if that prints anything
> interesting before the crash (if it's a race condition, the tracing
> might prevent it, of course...). To get a handle on when the

At level 20 I didn't see anything interesting; at level 40 I couldn't
get it to crash.

> connection is deleted, try adding tracing to the destructor. If you
> print out the this pointer, that'll help you find the relevant
> destruction when the crash happens. Code something like:
>
>   {
>     omniORB::logger l;
>     l << "connection deleted: " << (void*)this << "\n";
>   }
>
> Hope that helps. If you haven't tracked it down by the time I'm back,
> I'll look into it properly then.
>

Logging the deletion itself doesn't show who's responsible for it. I'm
going to add logging to all places where incrRefCount and decrRefCount
are called. Hopefully that will give some clues.
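
A minimal sketch of that kind of instrumentation, in standalone C++. The
TracedConnection class and the TRACED_INCR/TRACED_DECR macros are
hypothetical stand-ins made up for illustration; they are not omniORB's
connection code, and the real calls would write to the ORB's own log
rather than to a std::ostream. The idea is only that each reference-count
change records the object's address, the resulting count, and the call
site, so that an unbalanced decrement can be matched to its caller:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical reference-counted object (NOT an omniORB class).
// Every count change is logged with the this pointer and the call site.
class TracedConnection {
public:
  explicit TracedConnection(std::ostream& log) : log_(log), refcount_(1) {}

  void incrRefCount(const char* file, int line) {
    ++refcount_;
    trace("incrRefCount", file, line);
  }

  // Returns the count after the decrement; in this sketch the caller
  // decides what to do when it reaches zero.
  int decrRefCount(const char* file, int line) {
    --refcount_;
    trace("decrRefCount", file, line);
    return refcount_;
  }

  int refCount() const { return refcount_; }

private:
  void trace(const char* op, const char* file, int line) {
    log_ << op << ": " << static_cast<const void*>(this)
         << " count=" << refcount_
         << " at " << file << ":" << line << "\n";
  }

  std::ostream& log_;
  int refcount_;  // not thread-safe; assumes the caller holds a lock
};

// Hypothetical helper macros that capture the call site automatically.
#define TRACED_INCR(conn) (conn).incrRefCount(__FILE__, __LINE__)
#define TRACED_DECR(conn) (conn).decrRefCount(__FILE__, __LINE__)
```

Each log line then carries the same pointer value that the destructor
trace prints, so the last decrRefCount before the "connection deleted"
line identifies the call site that released the connection. As with the
higher trace levels, the extra logging may change the timing enough to
hide a race.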

Cheers,

Bastiaan

> Cheers,
>
> Duncan.
>
> --
>  -- Duncan Grisby  \  Research Engineer  --
>   -- AT&T Laboratories Cambridge          --
>    -- http://www.uk.research.att.com/~dpg1 --
>