[omniORB] Infinite polling loop with HUP socket causing 100% CPU usage

Erik Cumps erik.cumps at esaturnus.com
Fri May 8 15:43:28 BST 2015


Hi Duncan,

thanks for looking into this.

On wo, 2015-05-06 at 15:27 +0100, Duncan Grisby wrote: 
> On Tue, 2015-04-21 at 11:27 +0200, Erik Cumps wrote:
> 
> [...]
> > Context: services implemented in python using omniORB and the omniORB python bindings, using
> > the following debian packages:
> > 
> > 	ii  libomniorb4-1   4.1.6-2  i386  omniORB core libraries
> > 	ii  python-omniorb  3.6-1    i386  Python bindings for omniORB
> 
> Can you test with omniORB 4.2.0, or the current svn snapshot?  It would
> be easier to look into this if you were using a modern version of
> omniORB.

We are using an older version of omniORB because we are tracking debian
wheezy, we have no immediate plans to change this.

Also I've noticed that 4.1.6 is still your latest stable release and
that there wasn't much activity in omniORB/src/lib/omniORB/orbcore/tcp/
so I'm hoping we can track this down in 4.1.6?

> [...]
> > With strace the first case looks like this:
> >         ...
> >         poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 1) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
> >         accept(5, 0, NULL)                      = -1 EINVAL (Invalid argument)
> 
> The fact that it's calling accept() shows that the socket in question is
> the listening socket used to accept new incoming connections. I don't
> see how that can possibly receive a "hang-up" condition. There is
> nothing to hang up. The EINVAL return from accept() shows that it thinks
> the socket is invalid, but I don't know how that could be. Has the
> network interface been disabled or something?
> 
> If the socket omniORB is using to listen for new connections has become
> invalid, I'm not sure what it should do. It can't usefully carry on.

I agree. But it should do something different than performing an
infinite loop because it doesn't see the POLLERR on the fd or the
accept() call failing with EINVAL.

Would it make sense to handle the EINVAL as an EBADF here?

> [...]
> > The second case looks like this:
> >         ...
> >         poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}], 2, 50) = 1 ([{fd=4, revents=POLLHUP}])
> >         gettimeofday({1429607075, 414623}, NULL) = 0  
> [...]
> > Again, notice how each poll() returns with the POLLHUP event for the socket.
> 
> I assume that this socket is one associated with a connection. In this
> case, omniORB doesn't think anything of interest has occurred on the
> socket, because it's only looking for POLLIN. Perhaps it should look for
> POLLIN | POLLHUP, then it would try to read from the socket and
> (presumably) notice an error condition. Does the attached patch help?
> The patch is against omniORB 4.2.x, but you should be able to apply
> something similar to your 4-year-old 4.1.6 if you need to.

The patch applies alright and I believe it will break the infinite loop,
but I don't think it is sufficient in and of itself. As you say,
treating the POLLHUP as a POLLIN event would lead subsequent reads to
read 0 bytes (an EOF is not an error condition). If that happens, the
caller of those read()s should detect the EOF condition and handle it
appropriately.

When a Recv() is called for a tcp, unix or SSL socket it will return -1
in case of EOF, so callers will notice this and act appropriately.

However, in the else block of the if() statement modified by the patch,
we find:

                 else if (pd_pollfds[index].fd == pd_pipe_read) {
        #ifdef UnixArchitecture
                    char data;
                    read(pd_pipe_read, &data, 1);
                    pd_pipe_full = 0;
        #endif
                  }
        
So if the POLLHUP happens for the pd_pipe_read file descriptor, the code
will attempt to read a single byte and although it will actually read
zero bytes it will not detect this, so it cannot handle it. The
SocketCollection::Select() call will return 1, and it will keep on doing
this each time it is called again later.

Are there any other places in the code where an EOF condition on a pipe
or socket would not be detected because the caller does not check the
return value of recv() or read()?

Kind regards,
Erik Cumps




More information about the omniORB-list mailing list