[omniORB] Hang up in ThreadPool mode

Thu May 19 10:43:30 BST 2005

Hello

For very large configurations (a couple of hundred clients for a single
server), I have noticed that my server, which uses the ThreadPool model,
hangs very quickly. The problem is easily reproducible. I have traced it
down and found a complicated scenario that results in this hang.
Could somebody please comment on this behavior? It is very important for
our system. I'm using omniORB 4.0.5 on SLC3 Linux, kernel 2.4.21-27,
G++ 3.2.3.

NB: We never noticed this problem until we ran our server on a machine
with 4 Pentium IV processors. Maybe the problem shows up because of the
genuinely high parallelism between different omniORB threads.

HANG UP SCENARIO:

1. Assume that there is a connection C0 with a corresponding file
    descriptor FD0, which is watched by the select function in the
    SocketCollection class.
2. select notices that there is some data on FD0 and calls the
    notifyRzReadable function, which in turn creates a new task to
    process this data.
3. Sometimes giopServer::notifyWkDone is called at that moment, i.e.
    after the task has been created but before the data has been read
    from FD0. The giopServer::notifyWkDone function sometimes (I don't
    know under what conditions) enters the following code (lines
    1009-1024 of giopServer.cc), which makes the FD0 socket selectable
    again by adding it back to the SocketCollection object.

      if (conn->pd_n_workers > 1 ||
          pd_n_temporary_workers > orbParameters::maxServerThreadPoolSize) {

        w->remove();
        delete w;
        conn->pd_n_workers--;
        pd_n_temporary_workers--;

        select_and_return = 1;
      }
    }
    if (select_and_return) {
      // Connection is selectable now
      conn->setSelectable(1);
      return 0;
    } 

4. The situation is now the following:
       - there is a dedicated task which should process the input on FD0,
         but which HAS NOT DONE IT YET
       - FD0 is selectable and will be included in the next invocation
         of the select function

5. The select function is called and marks FD0 as having input data
    (this is the same data as for the previous select invocation).
    notifyRzReadable is called again, which creates one more task to
    process this input (basically steps 2 and 3 are repeated here).

6. The first task, created at step 2, reads the data from FD0 and
    processes the request successfully.

7. The second task hangs trying to read from FD0 until another request
    arrives over the C0 connection.

The problem is that for large configurations all the threads in the
server's ThreadPool go into this state very quickly and the server hangs
completely.

SOLUTION (OR WORKAROUND):
It seems that I managed to solve the issue by commenting out line 1022 of
giopServer.cc ("conn->setSelectable(1);"), but I'm not sure that this
solution is correct and has no other undesirable effects on omniORB.

Cheers,
Sergei