[omniORB] Canceling a blocking function

Tres Seaver tseaver@palladion.com
Mon, 03 Apr 2000 15:16:09 -0500


Han Kiliccote wrote:
> 
> Thanks for your suggestion, however see below.
> 
> -----Original Message-----
> From: Tres Seaver <tseaver@palladion.com>
> To: Han Kiliccote <kiliccote@cmu.edu>
> Cc: omniorb-list@uk.research.att.com <omniorb-list@uk.research.att.com>
> Date: Sunday, April 02, 2000 3:16 PM
> Subject: Re: [omniORB] Canceling a blocking function
> 
> >Han Kiliccote wrote:
> >>
> >> At Carnegie Mellon University, we are developing a prototype for a
> >> distributed system that contains a very large number of servers (e.g.,
> >> 100,000 servers). In this prototype, we need to send a request to a large
> >> subset of these servers (e.g., 100).
> >>
> >> Currently we have a loop that uses a thread pool to attach a thread to a
> >> server and each thread calls a function in a different server.
> >>
> >> When a percentage (e.g., 50) of these functions return, we would like to
> >> cancel the operation in the remaining threads which are blocking either
> >> because the servers are down/faulty or just about to complete but not yet
> >> completed.
> >>
> >> Currently we don't know how to do this. In each remaining thread, there
> >> is a call
> >>
> >> server[i]-> do_function(argument) // blocked (no reply yet)
> >>
> >> How can we unblock this? We don't want to wait more than 10 sec for these
> >> functions to time out: since the overall function is deemed completed,
> >> there will be another request soon, and this would cause a very large
> >> number of threads to exist in the system at any given point. We don't want
> >> to lower the timeout to anything less than 10 sec because this would cause
> >> an early abort in some cases.
> >>
> >> Your advice and help are greatly appreciated.
> >>
> >> P.S. Shall we switch to one-way functions?
> >
> >Consider very carefully using something like CosEvents/CosNotifications
> >to manage the NxM communications you need here.  One way to handle your
> >scenario:
> >
> > 1.  Create a notification channel within your "master" server (or in
> >     a separate server, perhaps for scalability).
> >
> 
> I should have been clearer about the research. Our goal is to remove any
> central server from the system (or replace each server with a large,
> randomly chosen set of servers). Each client acts like a mini-server. Master
> or monolithic servers are something we are trying to replace. The goal of
> the research is to show that we can create a large system and still not have
> any single point of failure due to centralized servers.

CosNotification and CosEvents servers can be federated, to provide for redundancy
(and other benefits), while still insulating clients and servers (consumers and
suppliers) from each other.  Your requirement essentially mandates that _each_
client have knowledge of the entire system, with all the attendant complexity
that involves.  If there is any possibility of heterogeneity among the clients
(different hardware/OS/GUI/anything), you now have to port that complexity to
each combination.  I don't buy this as increasing scalability in the least.

As Brooks (I think) noted, you can't remove essential complexity from a
system, but only that which is accidental;  the essential stuff just gets moved
around.

> 
> > 2.  Create another channel on which to broadcast requests from the
> >     master to the slave servers.
> >
> > 3.  In each slave server, subscribe a pull consumer to the "request"
> >     channel.  One thread loops as follows:
> >
> >       - pull new requests from the request channel and enqueue them
> >
> >       - pull cancellations from the request channel and mark their
> >         requests as cancelled.
> >
> >     Another thread pulls requests from the queue, processing each one
> >     while checking at intervals to see if it has been cancelled.  On
> >     completion, the processing thread pushes the result to the "result"
> >     channel.
> >
> >     This server could perhaps be single-threaded, since you have to
> >     break the "work" up into segments to allow checking for
> >     cancellation.
> >
> > 4.  From the master server, BEFORE broadcasting your requests,
> >     register a pull consumer on the "result" channel, using a filter
> >     for the request ID you are about to broadcast.
> >
> > 5.  On the master server, push the request onto the "request" channel.
> >     Repeatedly pull results from the "result" channel until reaching your
> >     desired threshold.  Unsubscribe from the channel (results not
> >     yet received will go into the bit bucket).  Broadcast a cancel
> >     on the current request.
> >
> >One-ways won't help a whole lot here, unless the request-processing time
> >is very small.  The new asynchronous message invocation (AMI) spec might
> >help, but I imagine that you are truly CPU bound here (else why 1E5
> >servers), so the network latency is likely not a big problem.  The
> >"scatter-gather" solution I proposed has the advantage of decoupling the
> >master and the slaves, which becomes especially critical for issues
> >involving large numbers of peers (yours is the largest number I have ever
> >seen seriously proposed!)
> >
> 
> Actually (in our current implementation) we are network bound. Each request
> takes 0.1ms to complete and 4ms is wasted on the network.

I guess I'm dense, but I can't see the point of distributing such a request
across 1E5 peers -- each client will spend vastly more time negotiating with
its peers than it would spend processing the whole request internally.  In
particular, what is the point of cancelling requests?  The chances of being able
to signal a cancellation to the peer before the peer completes the request are
vanishingly small.
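
To put rough numbers on that, here is a back-of-the-envelope sketch in Python.
It uses the two figures you quoted (0.1ms of processing, 4ms on the network)
and assumes, purely for illustration, that the 4ms is a round trip split evenly
between the two directions and that the peer starts work as soon as the request
arrives.  The names are mine, not anything from omniORB.

```python
# Figures quoted in this thread; the even split of the round trip is an
# assumption made only for this estimate.
PROCESSING_MS = 0.1
NETWORK_ROUND_TRIP_MS = 4.0
ONE_WAY_MS = NETWORK_ROUND_TRIP_MS / 2


def cancel_arrives_in_time(cancel_sent_ms, request_sent_ms=0.0):
    """Return True if a cancel sent at cancel_sent_ms beats the peer's completion."""
    # The peer finishes once the request has arrived and been processed.
    peer_finishes_ms = request_sent_ms + ONE_WAY_MS + PROCESSING_MS
    # The cancel needs one network hop to reach the peer.
    cancel_arrives_ms = cancel_sent_ms + ONE_WAY_MS
    return cancel_arrives_ms < peer_finishes_ms


# Earliest moment the client can have seen any reply at all:
# request travels out, peer processes it, reply travels back.
first_reply_ms = ONE_WAY_MS + PROCESSING_MS + ONE_WAY_MS
```

Under those assumptions a cancel only helps if it leaves the client within the
processing time of the request itself, long before the first reply can have
come back, so a healthy peer has always finished by the time the client knows
it wants to cancel.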

Looking back at your original question, I guess you may not need to signal
cancellation to peers -- you seem merely to want to free up client-side
resources associated with the request.  You also seem to be talking only to 1E2
of the possible 1E5 at a time:  How do you manage this without some form of
centralization?  Random selection from a known universe of object references
requires either a centralized server which performs the selection, or one
which serves up the entire list of ORs to clients (ICK!).  Do you intend to
"compute" the ORs, somehow?
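
If freeing client-side resources really is the goal, one possible shape is
sketched below.  It is plain Python rather than omniORB code, and `call_peer`,
`gather_quorum`, and the simulated delays are hypothetical stand-ins for
`server[i]->do_function(argument)`.  It needs Python 3.9 or later for
`cancel_futures`.  The idea is to return as soon as the quorum of replies is
in and to stop waiting on the rest.  Note the limitation: a thread already
blocked inside a call is not interrupted; only work that has not started is
dropped, and late replies are simply ignored.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def call_peer(peer_id, argument):
    """Hypothetical stand-in for server[i]->do_function(argument)."""
    # Pretend odd-numbered peers are slow, so some calls are still
    # pending when the quorum is reached.
    time.sleep(0.5 if peer_id % 2 else 0.01)
    return (peer_id, argument)


def gather_quorum(peers, argument, quorum, timeout_s=10.0):
    """Call every peer and return once `quorum` replies have arrived."""
    if not peers:
        return []
    pool = ThreadPoolExecutor(max_workers=len(peers))
    futures = [pool.submit(call_peer, p, argument) for p in peers]
    replies = []
    try:
        for fut in as_completed(futures, timeout=timeout_s):
            try:
                replies.append(fut.result())
            except Exception:
                continue  # a down or faulty peer does not count toward the quorum
            if len(replies) >= quorum:
                break
    except FuturesTimeout:
        pass  # overall timeout expired: return whatever arrived in time
    finally:
        # Do not wait for stragglers; drop any call that has not started yet.
        pool.shutdown(wait=False, cancel_futures=True)
    return replies
```

With `import random`, a call such as
`gather_quorum(random.sample(range(100_000), 100), "ping", quorum=50)` mirrors
the numbers in this thread, though it sidesteps the question of where that
universe of 100,000 references comes from.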

> 
> I thought omniORB does not support AMI. Am I wrong?

I don't know -- TAO has increasingly good support for it;  comp.soft-sys.ace is
where I have seen the benefits of AMI discussed.

-- 
=========================================================
Tres Seaver  tseaver@digicool.com   tseaver@palladion.com