[omniORB] Infinite polling loop with HUP socket causing 100% CPU usage

Erik Cumps erik.cumps at esaturnus.com
Mon May 4 09:53:55 BST 2015


<PING>

Ideas anyone?

Thx,
Erik Cumps

-----Original Message-----
From: Erik Cumps <erik.cumps at esaturnus.com>
Reply-to: erik.cumps at esaturnus.com
To: omniorb-list at omniorb-support.com
Cc: erik.cumps at esaturnus.com
Subject: Infinite polling loop with HUP socket causing 100% CPU usage
Date: Tue, 21 Apr 2015 11:27:11 +0200


Hi,

we just noticed a strange phenomenon with omniORB and we think this may be caused by a bug
in the socket handling implementation of omniORB.

Context: services implemented in python using omniORB and the omniORB python bindings, using
the following debian packages:

	ii  libomniorb4-1   4.1.6-2  i386  omniORB core libraries
	ii  python-omniorb  3.6-1    i386  Python bindings for omniORB

The service has a TCP socket open and when the remote client disconnects or when the connection
is broken the service has a thread that ends up in an infinite polling loop, using 100% CPU.

We have two cases, one service which has a thread repeatedly calling poll() and accept() on the
socket and the other service has a thread which is repeatedly calling poll() on the socket.

With strace the first case looks like this:

        ...
        poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 1) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
        accept(5, 0, NULL)                      = -1 EINVAL (Invalid argument)
        gettimeofday({1429606184, 378281}, NULL) = 0
        poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 0) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
        accept(5, 0, NULL)                      = -1 EINVAL (Invalid argument)
        gettimeofday({1429606184, 378671}, NULL) = 0
        poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 0) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
        accept(5, 0, NULL)                      = -1 EINVAL (Invalid argument)
        gettimeofday({1429606184, 379108}, NULL) = 0
        gettimeofday({1429606184, 379209}, NULL) = 0
        poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 50) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
        accept(5, 0, NULL)                      = -1 EINVAL (Invalid argument)
        gettimeofday({1429606184, 379729}, NULL) = 0
        poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 49) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
        accept(5, 0, NULL)                      = -1 EINVAL (Invalid argument)
        gettimeofday({1429606184, 380145}, NULL) = 0
        ...
        
As you can see each poll() returns with the POLLERR and POLLHUP events for the socket, and each
accept() returns with the EINVAL errno.

Also notice how the timeout for each poll() starts at 50 and becomes smaller until it reaches 0,
after which it starts again at 50.

The second case looks like this:

        ...
        poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}], 2, 50) = 1 ([{fd=4, revents=POLLHUP}])
        gettimeofday({1429607075, 414623}, NULL) = 0
        poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}], 2, 50) = 1 ([{fd=4, revents=POLLHUP}])
        gettimeofday({1429607075, 414870}, NULL) = 0
        ...
        
Again, notice how each poll() returns with the POLLHUP event for the socket.

So it looks like the code is not noticing the POLLHUP event and is also not
handling the accept() EINVAL error. This is presumably leading to the caller
continuously retrying and hence the polling loop.

We looked a bit at the code and we noticed that here is no POLLHUP handling.
In fact, grepping the code for POLLHUP returns no results at all.

And with regards to the EINVAL result of the accept() calls, we believe that
this error is reported (if tracing is on) but otherwise not acted upon. From
the CORBA::Boolean tcpEndpoint::notifyReadable(SocketHolder* sh) method in
src/lib/omniORB/orbcore/tcp/tcpEndpoint.cc:

          if (sh == (SocketHolder*)this) {
            // New connection
            SocketHandle_t sock;
        again:
            sock = ::accept(pd_socket,0,0);

===> when the accept() returns, sock == RC_SOCKET_ERROR and ERRNO = EINVAL

            if (sock == RC_SOCKET_ERROR) {
              if (ERRNO == RC_EBADF) {
        
===> the EINVAL case does not end up here

                omniORB::logs(20, "accept() returned EBADF, unable to continue");
                return 0;
              }
              else if (ERRNO == RC_EINTR) {
        
===> the EINVAL case does not end up here either

                omniORB::logs(20, "accept() returned EINTR, trying again");
                goto again;
              }
        #ifdef UnixArchitecture
              else if (ERRNO == RC_EAGAIN) {
        
===> the EINVAL case does not end up here either

                omniORB::logs(20, "accept() returned EAGAIN, will try later");
              }
        #endif
              if (omniORB::trace(20)) {
        
===> the EINVAL case ends up here only if omniORB::trace(20) returns true
===> even so it doesn't do anything useful with it

                omniORB::logger log;
                log << "accept() failed with unknown error " << ERRNO << "\n";
              }
        
===> the if block is exited here

            }
            else {
        #if defined(__vxWorks__)
              // vxWorks "forgets" socket options
              static const int valtrue = 1;
              if(setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                            (char*)&valtrue, sizeof(valtrue)) == ERROR) {
                return 0;
              }
        #endif
              // On some platforms, the new socket inherits the non-blocking
              // setting from the listening socket, so we set it blocking here
              // just to be sure.
              SocketSetblocking(sock);
        
              pd_new_conn_socket = sock;
            }
        
===> finally the EINVAL case ends up here:
===> the socket holder is indicated to be selectable
===> and the functione returns true.

            setSelectable(1,0,1);
            return 1;
          }


Kind regards,
Erik Cumps

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.omniorb-support.com/pipermail/omniorb-list/attachments/20150504/768658c3/attachment.html>


More information about the omniORB-list mailing list