[omniORB] Canceling a blocking function

Han Kiliccote kiliccote@cmu.edu
Wed, 5 Apr 2000 11:15:54 -0400


> -----Original Message-----
> From: Tres Seaver [mailto:tseaver@palladion.com]
> Sent: Monday, April 03, 2000 4:16 PM
> To: Han Kiliccote
> Cc: omniorb-list@uk.research.att.com
> Subject: Re: [omniORB] Canceling a blocking function
>
>
>
>
> Han Kiliccote wrote:
> >
> > Thanks for your suggestion, however see below.
> >
> > -----Original Message-----
> > From: Tres Seaver <tseaver@palladion.com>
> > To: Han Kiliccote <kiliccote@cmu.edu>
> > Cc: omniorb-list@uk.research.att.com <omniorb-list@uk.research.att.com>
> > Date: Sunday, April 02, 2000 3:16 PM
> > Subject: Re: [omniORB] Canceling a blocking function
> >
> > >Han Kiliccote wrote:
> > >>
> > >> At Carnegie Mellon University, we are developing a prototype for a
> > >> distributed system that contains a very large number of servers
> > >> (e.g., 100,000 servers). In this prototype, we need to send a request
> > >> to a large subset of these servers (e.g., 100).
> > >>
> > >> Currently we have a loop that uses a thread pool to attach a thread
> > >> to each server, and each thread calls a function on a different
> > >> server.
> > >>
> > >> When a percentage (e.g., 50%) of these functions return, we would
> > >> like to cancel the operation in the remaining threads, which are
> > >> blocked either because their servers are down or faulty, or because
> > >> the calls are about to complete but have not yet completed.
> > >>
> > >> Currently we don't know how to do this. In each remaining thread,
> > >> there is a call
> > >>
> > >>   server[i]->do_function(argument);  // blocked (no reply yet)
> > >>
> > >> How can we unblock this? We don't want to wait the 10 sec or more it
> > >> takes for these calls to time out: since the overall operation is
> > >> already deemed complete, another request will arrive soon, and
> > >> waiting would leave a very large number of threads in the system at
> > >> any given time. We don't want to lower the timeout below 10 sec
> > >> either, because that would cause an early abort in some cases.
> > >>
> > >> Your advice and help are greatly appreciated.
> > >>
> > >> P.S. Shall we switch to one-way functions?
> > >
> > >Consider very carefully using something like CosEvents/CosNotification
> > >to manage the NxM communications you need here.  One way to handle your
> > >scenario:
> > >
> > > 1.  Create a notification channel within your "master" server (or in
> > >     a separate server, perhaps for scalability).
> > >
> >
> > I should have been clearer about the research. Our goal is to remove
> > any central server from the system (or replace each server with a
> > large, randomly chosen set of servers). Each client acts like a
> > mini-server. Master or monolithic servers are exactly what we are
> > trying to replace. The goal of the research is to show that we can
> > build a large system that still has no single point of failure due
> > to centralized servers.
>
> CosNotification and CosEvents servers can be federated to provide
> redundancy (and other benefits), while still insulating clients and
> servers (consumers and suppliers) from each other.  Your requirement
> essentially mandates that _each_ client have knowledge of the entire
> system, with all the attendant complexity that involves.  If there is
> any possibility of heterogeneity among the clients (different
> hardware/OS/GUI/anything), you now have to port that complexity to
> each combination.  I don't buy this as increasing scalability in the
> least.
>
> As Fred Brooks argued in "No Silver Bullet", you can't remove essential
> complexity from a system, only the accidental kind;  the essential
> stuff just gets moved around.
>
> >
> > > 2.  Create another channel on which to broadcast requests from the
> > >     master to the slave servers.
> > >
> > > 3.  In each slave server, subscribe a pull consumer to the "request"
> > >     channel.  One thread loops as follows:
> > >
> > >       - pull new requests from the request channel and enqueue them
> > >
> > >       - pull cancellations from the request channel and mark their
> > >         requests.
> > >
> > >     Another thread pulls requests from the queue, processing each one
> > >     while checking at intervals to see if it has been cancelled.  On
> > >     completion, the processing thread pushes the result to the
> > >     "result" channel.
> > >
> > >     This server could perhaps be single-threaded, since you have to
> > >     break the "work" up into segments to allow checking for
> > >     cancellation.
> > >
> > > 4.  From the master server, BEFORE broadcasting your requests,
> > >     register a pull consumer on the "result" channel, using a filter
> > >     for the request ID you are about to broadcast.
> > >
> > > 5.  On the master server, push the request onto the "request" channel.
> > >     Repeatedly pull results from the "result" channel until reaching
> > >     your desired threshold.  Unsubscribe from the channel (results not
> > >     yet received will go into the bit bucket).  Broadcast a cancel
> > >     on the current request.
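
The slave-side loop described above (enqueue incoming requests, record cancellations, process each request in segments and check for a cancel between segments) can be pictured with a small in-process C++ sketch. It does not use the CosNotification API: the class `Slave`, the `Request`/`Result` structs and the `do_segment` callback are hypothetical stand-ins, and the "request" and "result" channels are replaced by direct method calls.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <set>

// Hypothetical stand-ins for what would travel over the channels.
struct Request { std::uint64_t id; int segments; };
struct Result  { std::uint64_t id; bool completed; int segments_done; };

class Slave {
 public:
  // In the real design these two would be fed by the pull consumer
  // subscribed to the "request" channel.
  void enqueue(const Request& r) {
    std::lock_guard<std::mutex> lock(mu_);
    queue_.push_back(r);
  }
  void cancel(std::uint64_t id) {
    std::lock_guard<std::mutex> lock(mu_);
    cancelled_.insert(id);  // a real slave would expire old IDs eventually
  }
  bool is_cancelled(std::uint64_t id) {
    std::lock_guard<std::mutex> lock(mu_);
    return cancelled_.count(id) != 0;
  }
  std::optional<Request> next() {
    std::lock_guard<std::mutex> lock(mu_);
    if (queue_.empty()) return std::nullopt;
    Request r = queue_.front();
    queue_.pop_front();
    return r;
  }
  // Runs do_segment once per segment and checks for cancellation before
  // each one; the lock is not held while a segment runs. The returned
  // Result is what would be pushed to the "result" channel.
  Result process(const Request& r, const std::function<void(int)>& do_segment) {
    assert(r.segments >= 0);
    for (int i = 0; i < r.segments; ++i) {
      if (is_cancelled(r.id)) return {r.id, false, i};
      do_segment(i);
    }
    return {r.id, true, r.segments};
  }

 private:
  std::mutex mu_;
  std::deque<Request> queue_;
  std::set<std::uint64_t> cancelled_;
};
```

The number of segments bounds how quickly a cancel takes effect: a request split into more segments notices a cancellation sooner, at the cost of more checks.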
> > >
> > >One-ways won't help a whole lot here, unless the request-processing
> > >time is very small.  The new asynchronous method invocation (AMI) spec
> > >might help, but I imagine that you are truly CPU bound here (else why
> > >10^5 servers), so the network latency is likely not a big problem.
> > >The "scatter-gather" solution I proposed has the advantage of
> > >decoupling the master and the slaves, which becomes especially
> > >critical for issues involving large numbers of peers (yours is the
> > >largest number I have ever seen seriously proposed!)
> > >
> >
> > Actually (in our current implementation) we are network bound. Each
> > request takes 0.1 ms to process, and 4 ms is wasted on the network.
>
> I guess I'm dense, but I can't see the point of distributing such a
> request across 10^5 peers -- each client will spend vastly more time
> negotiating with its peers than it would spend processing the whole
> request internally.  In particular, what is the point of cancelling
> requests?  The chances of being able to signal a cancellation to the
> peer before the peer completes the request are vanishingly small (the
> cancel needs roughly 4 ms on the network to reach a peer whose work
> takes 0.1 ms).
>
> Looking back at your original question, I guess you may not need to
> signal cancellation to peers -- you seem merely to want to free up
> client-side resources associated with the request.  You also seem to
> be talking only to 10^2 of the possible 10^5 at a time:  How do you
> manage this without some form of centralization?  Random selection
> from a known universe of object references requires either a
> centralized server which performs the selection, or one which serves
> up the entire list of ORs to clients (ICK!).  Do you intend to
> "compute" the ORs, somehow?
>

Excellent points. This is why what we are doing is called a research
project. Our solution involves building a virtual interconnection network
(e.g., a hypercube) on top of the existing network. This way every client
(or server; there is no difference) knows and maintains connectivity with
only a small fraction of the other clients (e.g., 20-1000 out of 10^5),
yet their combined effort keeps the network reliable.
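
As an illustration of the hypercube idea, here is a C++ sketch assuming a complete hypercube of 2^d nodes numbered 0 to 2^d - 1 (the thread does not say how IDs are assigned). Each node links only to the d nodes whose IDs differ from its own in exactly one bit, and a message reaches any node by fixing one differing bit per hop. The function names are hypothetical, and a real overlay would also have to handle unassigned or failed IDs, which this ignores.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Smallest dimension d with 2^d >= n nodes (n limited so that d < 32).
unsigned hypercube_dimension(std::uint64_t n) {
  assert(n <= (std::uint64_t{1} << 31));
  unsigned d = 0;
  while ((std::uint64_t{1} << d) < n) ++d;
  return d;
}

// The d neighbours of `node`: IDs that differ from it in exactly one bit.
std::vector<std::uint32_t> hypercube_neighbors(std::uint32_t node, unsigned d) {
  assert(d < 32);
  std::vector<std::uint32_t> out;
  out.reserve(d);
  for (unsigned b = 0; b < d; ++b) out.push_back(node ^ (std::uint32_t{1} << b));
  return out;
}

// Greedy bit-fixing routing: flip the lowest bit in which cur and dst
// differ. Every hop goes to a direct neighbour, so dst is reached in
// popcount(cur ^ dst) <= d hops.
std::uint32_t next_hop(std::uint32_t cur, std::uint32_t dst) {
  std::uint32_t diff = cur ^ dst;
  if (diff == 0) return cur;
  return cur ^ (diff & (~diff + 1));  // isolate the lowest set bit
}
```

For 10^5 nodes this gives d = 17, so a plain hypercube keeps 17 links per node and reaches any node in at most 17 hops.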

The location of the data/services is selected deterministically, so that
when users need to read the data or request a service, they can determine
which 100 of the 10^5 nodes store the data (or provide the service) and
use the interconnection network to eliminate dependencies on any
individual node.
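
Deterministic placement can be sketched by hashing a data key to a set of node IDs, so that every client computes the same 100 nodes without asking a central server. This is an assumed scheme for illustration, not the one in the paper: `replica_servers` and the "key#salt" convention are hypothetical. The sketch uses a fixed hash (64-bit FNV-1a) rather than `std::hash`, because `std::hash` is implementation-defined and could differ between heterogeneous clients.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// 64-bit FNV-1a: a fixed, platform-independent hash.
std::uint64_t fnv1a64(const std::string& s) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// Hypothetical placement rule: the first k distinct node IDs in [0, n)
// obtained by hashing "key#0", "key#1", ... Any client that knows the key
// and n gets the same list, in the same order.
std::vector<std::uint64_t> replica_servers(const std::string& key,
                                           std::size_t k, std::uint64_t n) {
  assert(n > 0 && k <= n);  // k much smaller than n keeps the loop short
  std::vector<std::uint64_t> out;
  std::set<std::uint64_t> seen;
  for (std::uint64_t salt = 0; out.size() < k; ++salt) {
    std::uint64_t id = fnv1a64(key + "#" + std::to_string(salt)) % n;
    if (seen.insert(id).second) out.push_back(id);  // skip duplicate IDs
  }
  return out;
}
```

Because the IDs are computed rather than looked up, a scheme like this is one possible answer to the question above about choosing 10^2 of 10^5 object references without a directory, assuming each client can map a node ID to an object reference locally.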

I can send a copy of the paper we are submitting to IEEE SRDS if you are
interested.

> >
> > I thought omniORB did not support AMI. Am I wrong?
>
> I don't know -- TAO has increasingly good support for it;
> comp.soft-sys.ace is where I have seen the benefits of AMI discussed.

We have already invested a lot of time in omniORB (and we really like it),
so I'm looking for a solution that uses omniORB. For example, can we send a
dummy message to the port that is waiting for the responses, to cancel
these calls? Will this work? And if so, how do we find the port number
that the waiting thread is using?

Or can I kill the thread and expect not to leave omniORB in an unstable
state?
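
Killing a thread that is blocked inside an ORB call is risky in general, since at that moment it may hold ORB-internal locks or own a connection. A commonly used alternative is to stop _waiting_ rather than stop the call: the caller returns as soon as enough replies have arrived, and the remaining threads finish, or hit the ORB's own call timeout (the 10 sec limit mentioned in the original question), in the background. The following is a minimal C++ sketch of that idea using only the standard library; each `std::function<int()>` stands in for a blocking call such as `server[i]->do_function(argument)`, and `call_until_quorum` is a hypothetical name. Nothing is cancelled on the server side.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// State shared by the caller and every call thread. It is owned through a
// shared_ptr, so threads still blocked after the caller has returned keep
// it alive and never touch freed memory.
struct QuorumState {
  std::mutex mu;
  std::condition_variable cv;
  std::vector<int> results;  // values from calls that returned normally
  std::size_t failures = 0;  // calls that threw (e.g. a CORBA exception)
};

// Starts one thread per call and returns as soon as `quorum` calls have
// succeeded, every call has finished, or `deadline` has passed. Threads
// still blocked are not killed: they run to completion in the background
// and their late results are ignored.
std::vector<int> call_until_quorum(
    const std::vector<std::function<int()>>& calls, std::size_t quorum,
    std::chrono::milliseconds deadline) {
  assert(quorum <= calls.size());
  auto state = std::make_shared<QuorumState>();
  for (const auto& call : calls) {
    std::thread([state, call] {
      bool ok = true;
      int value = 0;
      try {
        value = call();  // stands in for server[i]->do_function(argument)
      } catch (...) {
        ok = false;
      }
      std::lock_guard<std::mutex> lock(state->mu);
      if (ok) {
        state->results.push_back(value);
      } else {
        ++state->failures;
      }
      state->cv.notify_all();
    }).detach();
  }

  std::unique_lock<std::mutex> lock(state->mu);
  state->cv.wait_for(lock, deadline, [&] {
    return state->results.size() >= quorum ||
           state->results.size() + state->failures == calls.size();
  });
  // Copy at most `quorum` results while still holding the lock.
  std::size_t n = std::min(quorum, state->results.size());
  return std::vector<int>(state->results.begin(),
                          state->results.begin() + static_cast<std::ptrdiff_t>(n));
}
```

On its own this does not bound the number of straggler threads: if new batches arrive faster than stragglers time out, they still accumulate, which is the concern raised in the original question. Running the calls on a fixed-size pool instead of one thread per call would cap that.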

Thanks in advance for your help, and best wishes.



>
> --
> =========================================================
> Tres Seaver  tseaver@digicool.com   tseaver@palladion.com