[omniORB] Another attempt at "Reference Counting"?

Duncan Grisby dgrisby@uk.research.att.com
Fri, 22 Mar 2002 11:04:40 +0000


On Friday 22 March, bjorn rohde jensen wrote:

>  I think, you need something a little more complicated than adding
> a ref and unref method to the idl. You will need to address the
> problem of fault tolerance.

Distributed garbage collection is HARD. There is no single solution
that is appropriate for all applications.

As Bjorn says, the difficult bit about reference counting is fault
tolerance. It is trivial to do the counting bit by providing ref and
unref operations, and calling deactivate_object() when the count goes
to zero. (As an aside, I strongly recommend against using the POA
servant reference count as the count of remote clients -- the counts
are counting different things, and it's confusing and dangerous to
combine the two. There is no guarantee about how many references a POA
might choose to hold to a servant, for example.)
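To make the "trivial" counting bit concrete, here is a minimal sketch in
plain Python. It is not omniORB code: the class name CountedObject and
the deactivate callback are hypothetical, with the callback standing in
for wherever the real server would call deactivate_object(). The count
is kept separately from any servant reference count, as recommended
above.

```python
import threading


class CountedObject:
    """Hypothetical wrapper keeping its own count of remote clients."""

    def __init__(self, deactivate):
        # 'deactivate' is an assumed callback; a real server would call
        # something like poa.deactivate_object(oid) inside it.
        self._deactivate = deactivate
        self._count = 0
        self._lock = threading.Lock()
        self.active = True

    def ref(self):
        with self._lock:
            if not self.active:
                raise RuntimeError("object already deactivated")
            self._count += 1
            return self._count

    def unref(self):
        with self._lock:
            if self._count == 0:
                raise RuntimeError("unref without a matching ref")
            self._count -= 1
            if self._count == 0:
                # Last client has gone: deactivate exactly once.
                self.active = False
                self._deactivate()
            return self._count
```

Note that nothing here copes with a client that never calls unref(),
which is the whole problem discussed below.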

So, fault tolerance is the hard bit. You need to be able to properly
clean up if a client unexpectedly dies, or maliciously goes away
without cleaning up. There are basically three ways you could do this:

  1. Make no attempt to clean up automatically. A human administrator
     periodically looks through all the live objects and manually
     removes the ones that are no longer needed. This sounds stupid,
     but if you don't have many objects, and failures are rare, it's
     far simpler than a technical solution. Obviously to do this, you
     need to add an administration API. Essentially the same effect
     can be achieved by periodically killing and restarting the
     server.

  2. Have clients periodically "ping" the server to let it know they
     are still alive. Deactivate the object if no pings have been
     received recently.

  3. Have the server periodically "ping" the clients to find out if
     they are still alive. Decrement the reference count if a client
     does not respond to its ping.
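Option 2 could be sketched roughly as follows, in plain Python with no
ORB involved. The names LeasedObject, ping() and reap() are made up for
illustration, as are the timeout parameter and the deactivate callback;
the clock is injectable only so that the sketch can be exercised without
waiting.

```python
import time


class LeasedObject:
    """Hypothetical object that stays alive only while pings arrive."""

    def __init__(self, timeout, deactivate, clock=time.monotonic):
        self._timeout = timeout        # assumed: seconds of silence allowed
        self._deactivate = deactivate  # assumed callback, e.g. wraps
                                       # deactivate_object() in a real server
        self._clock = clock
        self._last_ping = clock()
        self.active = True

    def ping(self):
        """Called by clients. Returns False if the object is already gone."""
        if self.active:
            self._last_ping = self._clock()
        return self.active

    def reap(self):
        """Called periodically by the server. Returns True if still active."""
        if self.active and self._clock() - self._last_ping > self._timeout:
            self.active = False
            self._deactivate()
        return self.active
```

This version does not record who pinged, so it cannot tell one busy
client from several; it only notices total silence.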

Notice how despite the fact that we are calling this "reference
counting", both options 2 and 3 (especially 3) require that we know
_who_ our clients are, not just how many of them there are. This means
that we need to be very careful about the mechanism by which one
client passes a reference to another (or perhaps disallow it):

Imagine client A holds a reference to object O. A wants to give client
B a reference to it too. A can't just increment O's reference count
then pass the object reference to B -- something somewhere along the
line needs to tell O that it is B that is going to hold the new
reference.  One solution is for A to pass the reference to B, and for
B to contact O to register its presence. It is vital that A continues
to hold its reference for as long as it takes B to register, otherwise
the object might have been deactivated before B did its registration.
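The ordering constraint in that hand-over can be shown with a small
sketch. Everything here is hypothetical (TrackedObject, register(),
release(), hand_over()); the point is only that B registers before A
releases.

```python
class TrackedObject:
    """Hypothetical object that knows who holds references to it."""

    def __init__(self):
        self.holders = set()
        self.active = True

    def register(self, client_id):
        if not self.active:
            raise LookupError("object already deactivated")
        self.holders.add(client_id)

    def release(self, client_id):
        self.holders.discard(client_id)
        if not self.holders:
            # Nobody left: the object goes away for good.
            self.active = False


def hand_over(obj, giver, receiver):
    """Safe order: the receiver registers first, then the giver lets go."""
    obj.register(receiver)
    obj.release(giver)
```

If the two calls in hand_over() were swapped and A was the only holder,
release() would deactivate the object and B's register() would fail.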

Now, should clients ping the server or vice-versa?  If clients ping
the server, we have to invent our own means to identify clients; if
the server pings, clients' identities are implicit in the ping target.
If the server pings, it is susceptible to clients that take a long
time to respond to a ping, since it probably blocks waiting for the
client. On the other hand, an object only needs to stay alive while at
least one client is alive, so if it has 1000 clients the server might
ping just one of them, and pick a different one if that one dies; if
clients are doing the pinging, the server will be dealing with 1000
times more pings.
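The "ping only one client" idea might look something like this. It is a
sketch under assumptions: any_client_alive() and the is_alive callback
are invented names, and in a real system is_alive would be a remote call
that can block, which is exactly the susceptibility mentioned above.

```python
def any_client_alive(clients, is_alive):
    """Ping clients one at a time until one answers.

    'clients' is a list of client identifiers, modified in place: clients
    that fail their ping are dropped. 'is_alive' is an assumed callback
    that pings one client and returns True or False.
    """
    while clients:
        if is_alive(clients[0]):
            # One live client is enough to keep the object.
            return True
        # That client is dead: forget it and try the next one.
        clients.pop(0)
    return False
```

While the first client in the list stays healthy, each check costs one
ping no matter how many clients there are. The cost is that dead clients
further down the list are not noticed until they are reached.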

If we choose to have clients do the pinging, it's possible to treat
normal operation invocations as pings -- the client is told that it
must contact the object at least as often as some interval. If it has
a reason to contact the server within that time anyway, there's no
need for it to do a separate ping. That saves a bit of network use,
but makes everything more complicated for both client and server.
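One way to treat ordinary invocations as pings is to wrap every
operation so that it records when the caller was last heard from. The
sketch below is plain Python; the Account class, the counts_as_ping
decorator and the explicit client_id argument are all assumptions made
for illustration (this is the "invent our own means to identify clients"
part).

```python
import functools
import time


def counts_as_ping(method):
    """Record the caller's last-seen time before running the operation."""
    @functools.wraps(method)
    def wrapper(self, client_id, *args, **kwargs):
        self.last_seen[client_id] = self.clock()
        return method(self, client_id, *args, **kwargs)
    return wrapper


class Account:
    """Hypothetical object whose operations double as pings."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_seen = {}  # client_id -> time of last contact
        self.balance = 0

    @counts_as_ping
    def ping(self, client_id):
        """Explicit ping for clients with nothing else to say."""

    @counts_as_ping
    def deposit(self, client_id, amount):
        self.balance += amount
        return self.balance

    def stale_clients(self, interval):
        """Clients that have been silent for longer than 'interval'."""
        now = self.clock()
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > interval)
```

The extra complication on the client side is that it has to track when
it last made any call, so it knows whether a separate ping is due.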

Whatever way we choose to do it, the big danger with a pinging scheme
is that we might use a significant proportion of the network and
processing capacity just for the pings. There are various ways to
limit this, such as coalescing pings on a per-process or per-machine
basis: if a client process holds references to ten objects all in the
same server process, it only really needs to send a single ping to
keep all ten references alive. Doing that adds more management
overhead, of course. Choosing the ping interval is an important
decision too.
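Per-process coalescing could be sketched on the server side like this.
The PingTable class and its method names are hypothetical, and process
identifiers are assumed to be supplied by the clients somehow; times are
passed in explicitly to keep the sketch simple.

```python
class PingTable:
    """Hypothetical table mapping client processes to the objects they hold."""

    def __init__(self):
        self.refs = {}       # process_id -> set of object ids
        self.last_ping = {}  # process_id -> time of last ping

    def add_ref(self, process_id, object_id, now):
        self.refs.setdefault(process_id, set()).add(object_id)
        self.last_ping[process_id] = now

    def ping(self, process_id, now):
        """One ping keeps every reference held by that process alive."""
        if process_id not in self.refs:
            return False
        self.last_ping[process_id] = now
        return True

    def live_objects(self):
        return set().union(*self.refs.values())

    def expire(self, now, timeout):
        """Drop silent processes; return objects that nobody holds any more."""
        before = self.live_objects()
        silent = [p for p, t in self.last_ping.items() if now - t > timeout]
        for process_id in silent:
            del self.refs[process_id]
            del self.last_ping[process_id]
        return before - self.live_objects()
```

The management overhead shows up as the extra table: the server now has
to keep the process-to-object mapping consistent as references come and
go.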

Then there's the question of what form the pings themselves take. The
obvious thing is to add operations to the IDL interfaces, and do the
pings as CORBA calls. That's easy, but a full CORBA call is a pretty
heavy-weight thing just for a ping. Perhaps something lighter, such as
UDP packets, would be better.

Finally, what should be done about transient network problems?  If the
network between client and server goes down, whatever pinging scheme
is in use will decide that the client has died. When the network comes
back up, the client might not have noticed that there was anything
wrong, and so expect the object to still be there.

As I said, hard.

Cheers,

Duncan.

-- 
 -- Duncan Grisby  \  Research Engineer  --
  -- AT&T Laboratories Cambridge          --
   -- http://www.uk.research.att.com/~dpg1 --