[omniORB] Canceling a blocking function

Tres Seaver tseaver@palladion.com
Sun, 02 Apr 2000 14:12:57 -0500


Han Kiliccote wrote:
> 
> At Carnegie Mellon University, we are developing a prototype for a
> distributed system that contains a very large number of servers (e.g.,
> 100,000 servers). In this prototype, we need to send a request to a large
> subset of these servers (e.g., 100).
> 
> Currently we have a loop that uses a thread pool to attach a thread to a
> server, and each thread calls a function on a different server.
> 
> When a percentage (e.g., 50%) of these functions return, we would like to
> cancel the operation in the remaining threads, which are blocked either
> because the servers are down/faulty or because they are just about to
> complete but have not yet completed.
> 
> Currently we don't know how to do this. In each remaining thread, there is a
> call
> 
> server[i]->do_function(argument)   // blocked (no reply yet)
> 
> How can we unblock this? We don't want to wait more than 10 sec for these
> functions to time out because, since the overall function is deemed
> completed, there will be another request soon and this would cause a very
> large number of threads to exist in the system at any given point. We
> don't want to lower the timeout to anything less than 10 sec because this
> would cause an early abort in some cases.
> 
> Your advice and help are greatly appreciated.
> 
> P.S. Shall we switch to one-way functions?

Consider very carefully using something like CosEvents/CosNotifications
to manage the NxM communications you need here.  One way to handle your
scenario:

 1.  Create a notification channel (the "result" channel) within your
     "master" server (or in a separate server, perhaps for scalability).

 2.  Create another channel, the "request" channel, on which to
     broadcast requests from the master to the slave servers.

 3.  In each slave server, subscribe a pull consumer to the "request"
     channel.  One thread loops as follows:

       - pull new requests from the request channel and enqueue them

       - pull cancellations from the request channel and mark the
         corresponding queued requests as cancelled.

     Another thread pulls requests from the queue, processing each one
     while checking at intervals to see if it has been cancelled.  On
     completion, the processing thread pushes the result to the "result"
     channel.  (See the first sketch after this list.)

     This server could perhaps be single-threaded, since you have to
     break the "work" up into segments to allow checking for
     cancellation.
  
 4.  From the master server, BEFORE broadcasting your request,
     register a pull consumer on the "result" channel, using a filter
     for the request ID you are about to broadcast.

 5.  On the master server, push the request onto the "request" channel.
     Repeatedly pull results from the "result" channel until reaching
     your desired threshold.  Unsubscribe from the channel (results not
     yet received will go into the bit bucket).  Broadcast a cancel for
     the current request.  (See the second sketch after this list.)
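
For what it's worth, here is a rough sketch of the slave-side structure
from step 3.  It uses standard C++ threading primitives rather than
omni_thread, and pull_request(), pull_cancel(), push_result(),
work_finished() and do_one_segment() are placeholders for whatever
channel-proxy calls and work slicing you end up with -- they are not
CosNotification API:

    // Sketch only: the declared-but-undefined functions below are
    // placeholders to be replaced by your own channel-proxy calls
    // and sliced-up work.
    #include <deque>
    #include <mutex>
    #include <set>
    #include <string>
    #include <thread>

    struct Request { std::string id; /* arguments ... */ };

    bool pull_request(Request&);           // placeholder
    bool pull_cancel(std::string&);        // placeholder
    void push_result(const Request&);      // placeholder
    bool work_finished(const Request&);    // placeholder
    void do_one_segment(Request&);         // placeholder

    std::mutex            q_lock;
    std::deque<Request>   pending;         // work awaiting the worker thread
    std::set<std::string> cancelled;       // request IDs marked as cancelled

    bool is_cancelled(const std::string& id) {
        std::lock_guard<std::mutex> g(q_lock);
        return cancelled.count(id) != 0;
    }

    // Thread 1: drain the "request" channel, enqueueing new work and
    // recording cancellations.
    void channel_loop() {
        for (;;) {
            Request r;
            if (pull_request(r)) {
                std::lock_guard<std::mutex> g(q_lock);
                pending.push_back(r);
            }
            std::string id;
            if (pull_cancel(id)) {
                std::lock_guard<std::mutex> g(q_lock);
                cancelled.insert(id);
            }
        }
    }

    // Thread 2: process each request in segments, checking between
    // segments whether it has been cancelled; push the result only if
    // the request survived to completion.
    void worker_loop() {
        for (;;) {
            Request r;
            {
                std::lock_guard<std::mutex> g(q_lock);
                if (pending.empty()) continue; // real code would block on a condition
                r = pending.front();
                pending.pop_front();
            }
            while (!work_finished(r) && !is_cancelled(r.id))
                do_one_segment(r);
            if (!is_cancelled(r.id))
                push_result(r);
        }
    }

    int main() {
        std::thread t1(channel_loop), t2(worker_loop);
        t1.join(); t2.join();
    }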
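
And a corresponding sketch of the master side (steps 4-5): broadcast
the request, pull results until the threshold is reached, then
unsubscribe and broadcast the cancel.  Again, push_request(),
pull_result(), unsubscribe_results() and push_cancel() are stand-ins
for the real channel operations, not actual API:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Placeholders -- substitute the real proxy calls on your
    // "request" and "result" channels.
    void push_request(const std::string& request_id);
    bool pull_result(std::string& result);          // false on timeout
    void unsubscribe_results(const std::string& request_id);
    void push_cancel(const std::string& request_id);

    // Broadcast one request and return once `threshold` replies have
    // arrived; tell the remaining slaves to stop.
    std::vector<std::string> scatter_gather(const std::string& request_id,
                                            std::size_t threshold) {
        std::vector<std::string> results;

        // The pull consumer for this request_id must already be
        // registered on the "result" channel (step 4) before we
        // broadcast, or early replies may be missed.
        push_request(request_id);

        while (results.size() < threshold) {
            std::string r;
            if (pull_result(r))
                results.push_back(r);
        }

        // Stop listening: late results go into the bit bucket.
        unsubscribe_results(request_id);
        // Ask the slaves still working on this request to give up.
        push_cancel(request_id);

        return results;
    }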

One-ways won't help a whole lot here, unless the request-processing time
is very small.  The new Asynchronous Method Invocation (AMI) spec might
help, but I imagine that you are truly CPU bound here (else why 10E5
servers?), so the network latency is likely not a big problem.  The
"scatter-gather" solution I proposed has the advantage of decoupling the
master and the slaves, which becomes especially critical for issues
involving large numbers of peers (yours is the largest number I have ever
seen seriously proposed!).

Best,

Tres.
-- 
=========================================================
Tres Seaver  tseaver@palladion.com