[omniORB] Unrecoverable error for this endpoint - EBADF

Norrie Quinn norrie.quinn@tumbleweed.com
Wed Sep 25 22:59:01 2002


Hi,

Was this race condition ever fixed for omniORB4?

We are seeing the same behaviour on SMP Windows 2000 machines under heavy
load, and the patch below (or similar) does not seem to have been applied to
the CVS source.

Regards
Norrie

> -----Original Message-----
> From: Bastiaan Bakker [mailto:Bastiaan.Bakker@lifeline.nl]
> Sent: Tuesday, February 05, 2002 1:43 AM
> To: Duncan Grisby
> Cc: omniorb-list@uk.research.att.com
> Subject: RE: [omniORB] RE: serious stability problems with omniORB4
> snapshots on Solaris 8: bug located!
> 
> 
> Hi,
> 
> I've created a small patch to work around the EBADF problem. 
> As I suggested yesterday, it simply retries the fd_set 
> creation and select() in case of EBADF. In a couple of quick
> tests using 20 concurrent eg2_clts, it retried once every
> 1000 to 4000 SocketCollection::Select() calls. Of course, on
> very busy systems the retries may become impractically frequent.
> 
> Please let me know what you think.
> 
> Cheers,
> 
> Bastiaan Bakker
> LifeLine Networks bv
> 
> 
> -----Original Message-----
> From: Bastiaan Bakker [mailto:Bastiaan.Bakker@lifeline.nl]
> Sent: Monday, February 04, 2002 7:05 AM
> To: Duncan Grisby
> Cc: omniorb-list@uk.research.att.com
> Subject: RE: [omniORB] RE: serious stability problems with omniORB4
> snapshots on Solaris 8: bug located!
> 
> 
> Hi all,
> 
> I've located a race condition in SocketCollection::Select, 
> which causes at least one of my problems:
> 
> The 'Unrecoverable error for this endpoint:
> giop:unix:/tmp/echo.bb, it will no longer be serviced.' message is
> caused by a race condition in SocketCollection::Select. This
> method first creates a file descriptor set and then performs
> a select() on it. However, between the fd_set creation and the
> select() call, another thread may have called close() on a connection
> file descriptor in this set. This causes select() to fail with
> EBADF ('bad file descriptor'). Way up in the call chain
> this is translated to an 'unrecoverable error', with known results...
> 
> I guess the easiest solution to this problem is to check for 
> EBADF and retry the fd_set creation and select() in that case. 
> 
> Any suggestions?
> 
> Cheers,
> 
> Bastiaan Bakker
> LifeLine Networks bv
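
For anyone reading this in the archive, here is a minimal sketch of the retry idea described above. It is not the omniORB source and not Bastiaan's actual patch; select_with_retry, snapshot_fds and max_retries are hypothetical names chosen for illustration. It assumes a POSIX platform where select() fails with errno set to EBADF (Winsock reports a stale socket differently), and it rebuilds the fd_set from a fresh snapshot on every attempt.

```cpp
#include <sys/select.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <functional>
#include <vector>

// Hypothetical helper: wait for readable descriptors, rebuilding the
// fd_set and retrying when select() fails with EBADF or EINTR.
// snapshot_fds is assumed to return the currently open descriptors.
int select_with_retry(const std::function<std::vector<int>()>& snapshot_fds,
                      fd_set& ready, timeval timeout, int max_retries) {
  assert(max_retries >= 0);
  for (int attempt = 0; attempt <= max_retries; ++attempt) {
    // Take a fresh snapshot each time, so a descriptor closed by another
    // thread since the last attempt is no longer in the set.
    std::vector<int> fds = snapshot_fds();
    FD_ZERO(&ready);
    int maxfd = -1;
    for (int fd : fds) {
      if (fd < 0 || fd >= FD_SETSIZE) continue;  // cannot go in an fd_set
      FD_SET(fd, &ready);
      if (fd > maxfd) maxfd = fd;
    }
    timeval tv = timeout;  // select() may modify its timeout argument
    int n = select(maxfd + 1, &ready, nullptr, nullptr, &tv);
    if (n >= 0) return n;  // n descriptors ready, or 0 on timeout
    // EBADF and EINTR each consume one attempt; other errors are fatal.
    if (errno != EBADF && errno != EINTR) return -1;
  }
  return -1;  // retries exhausted; errno is left from the last select()
}
```

Bounding the number of retries is one way to address the concern about very busy systems: the caller can still treat the endpoint as failed if the set keeps changing underneath it, rather than spinning indefinitely.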