[omniNotify] omniNotify subscription bug

Thu Aug 7 10:58:06 BST 2003

Robert,

I have discovered quite a nasty bug that causes the Notification Service to
stop pushing events to consumers and stop accepting new consumers/suppliers.
The service still receives events from connected suppliers. As a result,
memory use grows rapidly as events back up in the Notification Service, and
all communication in the system stops because no new consumers or suppliers
can connect. I will explain the scenario under which this occurs - please
let me know what other information you may require.

Environment:
	- omniNotify latest CVS snapshot (4 Aug 03), omniORB latest CVS
	  snapshot (4 Aug 03).
	- omniNotify 2.0 alpha, omniORB 4.0.1.
	- Solaris 2.8 (gcc 3.2.1) and Win2k (VS 6.0).

Another VERY interesting phenomenon is that the problem does not occur with
the Win32 build of the Notification Service (NS), only with the Solaris
build. Unfortunately, all our server systems run on Solaris.

As I stated in my last email regarding performance, I have started using a
single consumer connection (per channel) with a more complex filter.
Basically, our applications dynamically add and remove subscriptions on the
fly, which means the constraints in place on the filter have to be updated.

My test rigs have picked up the problem mentioned above when subscriptions
are added to a filter while the Notification Service is under load. The test
rig creates a filter on the Notification Service and adds subscribers in a
loop, from 1 to 512, using filter->add_constraints(). (NOTE: If the same test
is done with 512 unique subscribers, each with its own filter, the problem
does not appear. It only occurs when constraints are added to an existing
filter.) Each constraint expression is very simple and has the following
format:

{ [ context, 1_type ], "true" }
{ [ context, 2_type ], "true" }
	...
{ [ context, 512_type ], "true" }
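
To make the constraint list above concrete, here is a rough, self-contained
sketch of how the 512 constraints could be generated. The EventType and
ConstraintExp structs are plain stand-ins modelled on
CosNotification::EventType and CosNotifyFilter::ConstraintExp, and
build_constraints() is a hypothetical helper; none of this is the actual test
rig code, which passes each constraint to filter->add_constraints().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for CosNotification::EventType (illustration only).
struct EventType {
    std::string domain_name;
    std::string type_name;
};

// Stand-in for CosNotifyFilter::ConstraintExp (illustration only).
struct ConstraintExp {
    std::vector<EventType> event_types;
    std::string constraint_expr;
};

// Hypothetical helper: builds { [ context, 1_type ], "true" } through
// { [ context, <count>_type ], "true" }, one constraint per subscriber.
std::vector<ConstraintExp> build_constraints(int count) {
    assert(count >= 0);
    std::vector<ConstraintExp> constraints;
    constraints.reserve(static_cast<std::size_t>(count));
    for (int i = 1; i <= count; ++i) {
        ConstraintExp exp;
        exp.event_types.push_back(
            EventType{"context", std::to_string(i) + "_type"});
        exp.constraint_expr = "true";
        constraints.push_back(exp);
    }
    return constraints;
}
```

Calling build_constraints(512) gives the list shown above; in the real test
each entry is added to the one existing filter in turn rather than in bulk.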

Once all the constraints have been added, the test app sends a fixed number
of events (say 10,000), as fast as it can, that match only one of the
constraints (say { [ context, 256_type ], "true" }). With a single test
application the behaviour is as expected: the subscriber for [context,
256_type] receives the expected number of events.
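
The expected single-app behaviour can be illustrated with a simplified
matching sketch. It assumes that a constraint matches when its domain and
type names equal those of the event exactly, and that the "true" expression
always passes; wildcard handling and everything else the Notification
Service really does is left out. TypeKey, make_keys() and count_matches() are
hypothetical names used only for this example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (domain name, type name) pair identifying one constraint or event.
using TypeKey = std::pair<std::string, std::string>;

// Hypothetical helper: keys for [context, 1_type] .. [context, <count>_type].
std::vector<TypeKey> make_keys(int count) {
    std::vector<TypeKey> keys;
    for (int i = 1; i <= count; ++i) {
        keys.push_back(TypeKey("context", std::to_string(i) + "_type"));
    }
    return keys;
}

// Counts the constraints whose event type equals the event's type.
// Simplification: exact string comparison, expression always "true".
std::size_t count_matches(const std::vector<TypeKey>& keys,
                          const TypeKey& event) {
    std::size_t matches = 0;
    for (const TypeKey& key : keys) {
        if (key == event) {
            ++matches;
        }
    }
    return matches;
}
```

Under these assumptions an event of type [context, 256_type] matches exactly
one of the 512 constraints, so each event sent should be delivered once.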

Now - I will outline where the problem occurs.

1) Launch the test app mentioned above.
2) Once messages start arriving, launch a second copy of the test app with
the same number of subscribers (1 ... 512) but sending only 1,000 events (the
number of events does not really matter).
3) Once a few subscriptions have been added to the new consumer filter, the
Notification Service will stop responding. The NS will still receive events
from suppliers already connected (i.e. the first test app), but it will no
longer push any events to consumers.
4) The second test app will start throwing exceptions on the call to
filter->add_constraints( ), after the CORBA timeout period, as the NS has
stopped responding.
5) Once the first test app finishes sending 10,000 events it will attempt to
remove the constraints it has put in place. This will also fail, on the call
to filter->modify_constraints( ), after the CORBA timeout period (currently
30 seconds).
6) Even after the test apps have been killed, the NS will never recover from
this state.

As you can imagine, this is a bit of a showstopper. Our applications perform
operations similar to this quite often, and the system usually stops
responding after about 20 minutes.

Robert - please let me know if you need any more information. It is not
simple for me to send you the code, as it is heavily tailored to our own
applications and libraries. However, if you have problems replicating the
issue, I should be able to modify one of the example applications to emulate
the above scenario.

Thanks for your help
Cameron Rochester