Possible recovery strategies to handle transient COMM_FAILURE (was Re: problem with dying server)

Thu, 10 Jul 1997 21:44:48 +0100

Sorry for being a bit long-winded. Below I describe a possible enhancement
to omniORB that may be of interest to you. Comments are welcome.

>>>>> Matthew Newhook writes:

>   I have a server that always uses the same object key, and port (ie. it's
>   a persistent server) - server A.  I have another persistent server B
>   that uses server A.

>   If server A dies, and is restarted it seems that the Strand in server B
>   still points to the old connection that was established with server A.

>   Of course, server B attempts to invoke methods on A which fail 'cause
>   the socket is bad (COMM_FAILURE).  The next attempt re-establishes
>   the connection (because the strand is missing).  It seems to me that
>   a COMM_FAILURE should result in *one* reconnection attempt before
>   resulting in a total COMM_FAILURE.

Your analysis is correct: the current implementation does throw a
COMM_FAILURE under the conditions you described. This issue was also raised
by Hans Huebner earlier on this mailing list.

While the CORBA spec. does not mandate any error handling strategy, I agree
that the ORB should not just give up when a cached connection is broken.

------------------------------------------------

There are a number of possible improvements:

1. When a cached connection is broken, throw a TRANSIENT exception.
   Reconnection will be attempted when the application repeats the call.
   If an attempt to connect fails or the call is the first one to use a
   connection that breaks, then a COMM_FAILURE exception is raised.

2. When a cached connection is broken, retry the call *once* and throw
   a COMM_FAILURE if either the reconnection attempt fails or the
   connection is again broken during the call.

3. Same as 2 but retry a few more times/indefinitely instead of just
   *once*.

4. A variation of 1 that I will explain below.
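
To make alternative 2 concrete, here is a minimal sketch of "retry once".
It does not use omniORB: BrokenConnection, CommFailure, Connection and
invoke_retry_once are hypothetical stand-ins invented for illustration, and
the two callbacks stand for whatever the ORB does to send a request and to
open a new connection.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-ins, not omniORB types.
struct BrokenConnection : std::runtime_error {
    BrokenConnection() : std::runtime_error("connection broken") {}
};
struct CommFailure : std::runtime_error {
    CommFailure() : std::runtime_error("COMM_FAILURE") {}
};

// Records whether a connection came from the cache (and so may be stale).
struct Connection {
    bool from_cache;
};

// Alternative 2: if a *cached* connection breaks, reconnect and retry once.
// `call` performs the invocation and throws BrokenConnection on failure.
// `reconnect` opens a new connection or throws BrokenConnection.
int invoke_retry_once(Connection conn,
                      const std::function<int(const Connection&)>& call,
                      const std::function<Connection()>& reconnect) {
    assert(call && reconnect);
    try {
        return call(conn);
    } catch (const BrokenConnection&) {
        // A freshly opened connection that breaks is a real failure.
        if (!conn.from_cache) throw CommFailure();
    }
    try {
        Connection fresh = reconnect();
        return call(fresh);  // the single retry
    } catch (const BrokenConnection&) {
        // Either the reconnection or the retried call failed.
        throw CommFailure();
    }
}
```

Alternative 3 would replace the single retry with a loop, which raises the
question of who decides when to stop.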

------------------------------------------------

There is a related issue with *location forwarded objects*, i.e. objects
that have been redirected by IIOP LOCATION_FORWARD messages.

If a call to a location forwarded object fails because the connection to
it is broken, instead of COMM_FAILURE, the current implementation throws a
TRANSIENT exception.

A COMM_FAILURE exception is not appropriate in this case because the
original location, given in the object's IOR, should be taken as the true
home of the object. The forwarded location is only a hint. If it
fails, it should be dropped and the original re-used.

Then why doesn't the ORB just retry transparently using the original,
instead of throwing a TRANSIENT?

Well, if the ORB retries, it may receive the same LOCATION_FORWARD message
from the object's home location, redirect the call to the same forwarded
location, fail again, and repeat the cycle indefinitely.

This is the reason why the current implementation throws a TRANSIENT
exception and the application can retry if it wants to. If it does so, the
object's home location will be contacted.

-----------------------------------------------------

I think that in most applications it is too troublesome to write code to
catch the TRANSIENT exception and reissue the call. Typically, one would
retry up to a certain number of times, possibly with an increasing delay
between calls, and throw a COMM_FAILURE exception if none of the attempts
succeeds. There may also be situations in which one does not want to retry
at all, or wants to retry forever.
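
For illustration, this is roughly the boilerplate an application would have
to write around each call today. It is only a sketch: Transient and
CommFailure are hypothetical stand-ins for the CORBA exceptions, and
call_with_retry, the attempt limit and the doubling delay are assumptions of
mine, not part of omniORB.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical stand-ins for the CORBA system exceptions.
struct Transient : std::runtime_error {
    Transient() : std::runtime_error("TRANSIENT") {}
};
struct CommFailure : std::runtime_error {
    CommFailure() : std::runtime_error("COMM_FAILURE") {}
};

// Invoke `call` up to `max_attempts` times, doubling the delay after each
// TRANSIENT. When the last attempt also fails, report COMM_FAILURE.
template <typename Call>
auto call_with_retry(Call call, unsigned max_attempts,
                     std::chrono::milliseconds initial_delay)
    -> decltype(call()) {
    assert(max_attempts > 0);
    std::chrono::milliseconds delay = initial_delay;
    for (unsigned attempt = 1;; ++attempt) {
        try {
            return call();
        } catch (const Transient&) {
            if (attempt == max_attempts) throw CommFailure();
            std::this_thread::sleep_for(delay);
            delay *= 2;  // increasing delay between calls
        }
    }
}
```

Every call site needs such a wrapper, and the policy (how many attempts, how
long to wait) is fixed wherever the wrapper is written.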

Therefore, I'm thinking of enhancing the implementation to allow the
application to install a handler for TRANSIENT exceptions. The ORB lets the
handler decide when to perform a retry. The handler's signature looks like
this:

typedef void (*transient_handler_t)(CORBA::ULong n_retries,
                                    const CORBA::TRANSIENT& ex)
                                        throw (CORBA::SystemException);

The new scheme works as follows:

The stub code catches the TRANSIENT exception and calls the
transient_handler with the exception value and the number of retries
attempted so far as arguments. The handler can return normally, in which
case the stub code will retry the call. Alternatively, the handler can throw
a system exception, which will be propagated out of the stub code into the
application code.
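
The control flow just described can be sketched as follows. This is not the
real stub code: SystemException, Transient and CommFailure are hypothetical
stand-ins for the CORBA classes, unsigned long stands in for CORBA::ULong,
and stub_invoke and retry_three_times are names made up for the example.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for the CORBA exception hierarchy.
struct SystemException : std::runtime_error {
    explicit SystemException(const char* what) : std::runtime_error(what) {}
};
struct Transient : SystemException {
    Transient() : SystemException("TRANSIENT") {}
};
struct CommFailure : SystemException {
    CommFailure() : SystemException("COMM_FAILURE") {}
};

// Modelled on the proposed signature.
typedef void (*transient_handler_t)(unsigned long n_retries,
                                    const Transient& ex);

// Example policy: allow three retries, then give up with COMM_FAILURE.
void retry_three_times(unsigned long n_retries, const Transient&) {
    if (n_retries >= 3) throw CommFailure();
}

// What the stub would do around the remote invocation `call`.
template <typename Call>
auto stub_invoke(Call call, transient_handler_t handler)
    -> decltype(call()) {
    assert(handler != nullptr);
    unsigned long n_retries = 0;
    for (;;) {
        try {
            return call();
        } catch (const Transient& ex) {
            // Returning normally means "retry"; an exception thrown by
            // the handler propagates to the application.
            handler(n_retries, ex);
            ++n_retries;
        }
    }
}
```

With retry_three_times installed, a call that keeps failing is attempted
four times in total (the original attempt plus three retries) before the
application sees COMM_FAILURE.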

The handler is installed by either of these functions:

void omniORB::install_transient_handler(transient_handler_t fn);
void omniORB::install_transient_handler(CORBA::Object_ptr obj,
                                        transient_handler_t fn);

The first installs a handler for all object references. The second applies
to an individual object reference.
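
One way the two installation functions could interact is sketched below. I
am assuming that a handler installed for an individual object reference
takes precedence over the global one; nothing above fixes that rule.
HandlerRegistry, Object, Transient and the two example handlers are
hypothetical names, not omniORB code.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical stand-ins for CORBA::TRANSIENT and an object reference.
struct Transient {};
struct Object {};

typedef void (*transient_handler_t)(unsigned long n_retries,
                                    const Transient& ex);

// Two example policies.
void retry_forever(unsigned long, const Transient&) {}  // always retry
void never_retry(unsigned long, const Transient& ex) { throw ex; }

class HandlerRegistry {
public:
    // Handler for all object references.
    void install(transient_handler_t fn) { global_ = fn; }

    // Handler for one object reference.
    void install(const Object* obj, transient_handler_t fn) {
        assert(obj != nullptr);
        per_object_[obj] = fn;
    }

    // Assumed rule: the per-object handler wins, else the global one.
    transient_handler_t lookup(const Object* obj) const {
        auto it = per_object_.find(obj);
        return it != per_object_.end() ? it->second : global_;
    }

private:
    transient_handler_t global_ = nullptr;
    std::unordered_map<const Object*, transient_handler_t> per_object_;
};
```

The stub would call lookup() for the target object each time it catches a
TRANSIENT, so an application could retry forever by default and still fail
fast on selected objects.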

---------------------------------------------------------------

Coming back to the fourth alternative for handling a broken cached
connection: I suggest treating it as a TRANSIENT exception and retrying the
call transparently, based on the policy determined by the transient_handler.

Sai-Lai Lo

-- 
E-mail:         S.Lo@orl.co.uk          |       Olivetti & Oracle Research Lab
                                        |       24a Trumpington Street
Tel:            +44 223 343000          |       Cambridge CB2 1QA
Fax:            +44 223 313542          |       ENGLAND