Race condition causing acking to fail from AutoRecoveringConnection after a recovery #47
I will review this, but it might be time to start the connection pool design over based on the current functionality being implemented.

The one thing I really do dislike is that Pivotal implemented the consumer tag recovery event on the connection, not the channel. That makes no sense to me, as it should be the channel - or you could argue even the consumer itself. I didn't handle the event in ConnectionHost for this reason (it would need a concurrent dictionary, and then ChannelHost would have to call some method on ConnectionHost to check for a recovered tag, which ConnectionHost would then need to remove). Ugly and an anti-pattern.

I will start working on a reliable test - all the chaos engineering I used for its original design has fallen to the dust, or I no longer work at those companies. I will need to start over: configure a RabbitMQ server, start closing connections, restarting servers... Unfortunately I will not have multiple nodes to retest on. Once I see the behaviors at play, I can start planning a new design around that.

If it helps, I can try to get a test up similar to the one I used to find the issue. It uses the EasyNetQ package, but I haven't yet updated that to the latest 2.0.0 (it has lots of breaking changes over 1.x.x).

Just to let you know I have begun work on this. I've made a Dockerfile and docker-compose to build and run the tests, plus a separate compose for RabbitMQ which exposes 5672 and 15672. This is then spun up on a network, with a Makefile for ease of use (both local and CI/CD). It's still a bit of a work in progress, but I managed to make a good chunk of the "skipped" Facts no longer so (there's a check in RabbitFixture on connectivity to the host on port 5672; if it fails, the tests just return).
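The "skip when no broker is reachable" guard described above could look something like the following minimal sketch. `RabbitProbe` and the exact check are assumptions for illustration, not the repo's actual `RabbitFixture` code; it simply probes TCP port 5672.

```csharp
using System.Net.Sockets;

// Hypothetical sketch of a broker-availability probe: tests can bail out
// early when nothing is listening on the RabbitMQ port.
public static class RabbitProbe
{
    public static bool CanConnect(string host = "localhost", int port = 5672, int timeoutMs = 1000)
    {
        try
        {
            using var client = new TcpClient();
            // Wait(timeoutMs) returns false if the connect attempt times out.
            return client.ConnectAsync(host, port).Wait(timeoutMs) && client.Connected;
        }
        catch
        {
            return false; // refused, unreachable, or timed out -> treat broker as unavailable
        }
    }
}
```

A test would then start with something like `if (!RabbitProbe.CanConnect()) return;` to keep CI green when no RabbitMQ container is present.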
Great minds think alike - I was thinking of just fully integrating this type of setup for the tests projects.
Sorry for the delay with this @houseofcat. It's now in #48 - minimal changes made to the actual code (only those necessary to override/access particular methods/properties to prove that using the built-in recovery fixes the issues). You can run it locally with

The tests are now passing on your automated checks, albeit the new ones will be skipped there, of course, as there's no rabbit connection.
Now that the 6.5.0 client is out, there may be a tidier way by hooking into rabbitmq/rabbitmq-dotnet-client#1304 - will check it out later (I have a plan anyway).

Edit: Yep, the new event made things quite a lot neater, with fewer changes overall to the base classes. I also made a new library, RabbitMQ.Recoverable, for everything (I decided to move away from the default/current classes having Arguments with an Id they didn't use, etc.; that seemed wrong).
With the latest RabbitMQ.Client 6.5.0, if we always call (around try/catch)

Edit: Actually, they sometimes ack the messages now; I guess that is the race condition, though, as more often than not they don't.
Hmmm. So I added an extra bit to the tests whereby it publishes and consumes an initial message (just as a sanity check), which of course works for everything; then it recovers the connections (closes them and waits for them to reconnect), pauses processing, publishes half the prefetch, and recovers them again (once that half are unacknowledged). It then publishes the remainder (plus an extra 10); once the unacknowledged count reaches prefetch, it resumes processing and waits for all messages to be processed.

Now, this works fine (all passes) for the new recoverable channel host/pool in the tests; but without them, the connections are never reconnected the second time, which I didn't expect. Because of this it doesn't even get as far as the failure to acknowledge the messages (there's not even a connection to publish to, let alone consume from). Any ideas why that might be happening? I'm perplexed.
I believe this is coincidentally solved by #50
This was a nasty one to track down and (I think) got introduced by Pivotal from RabbitMQ.Client 6.2.4 onwards.

Looking at `TryPerformAutomaticRecovery` and `CreateModel` in `AutorecoveringConnection` (snipped for relevant parts), there's a big issue around locking on `_models` in particular, which can lead to race conditions (especially with `TopologyRecoveryEnabled`, as that takes several CPU cycles).

When `RecoverModelsAndItsConsumers` is hit, it calls `AutomaticallyRecover` in `AutorecoveringModel`, which sets the offset delivery tags via `InheritOffsetFrom`. If this happens after `CreateModel` is called to create a new channel/model (via `MakeChannelAsync` in `ChannelHost`), then the delivery tags for the `ReceivedData` channels written to the underlying `Channel` are not adjusted and thus never get acked.

There is another, separate issue around `TopologyRecoveryEnabled`, in that the same flag is used for `RecoverConsumers` (?!), which then recreates the old `EventingBasicConsumer`s, with their events recreated as well. These recreated consumers pick up the messages instead of the newly created ones from `ChannelHost`, so the `Consumer` classes never receive the events and nothing ends up in the `Channel`s.

The best way around this I could find was to implement a `RecoverChannelAsync` via a `RecoveryAwareChannelHost`, which attempts to use what the library gives us back (hooking into the relevant events, as long as the `IConnection` and `IModel` are `IAutorecoveringConnection` and `IRecoverable` respectively), `Record`ing (or `DeleteRecord`ing when `basic.cancel` is handled) so that it can hook into the `ConsumerTagChangeAfterRecovery` event from the connection. It's not perfect, as `RecoveryAwareChannelHost` has to hook into events from the `IAutorecoveringConnection`, but everything is added/removed and the events protected by locks (via protected functions in the base `ChannelHost` class, which also now has protected `Channel` and `ConnHost` - only settable by the base class, however).

I also noticed a small oversight in `ChannelHost`: `Close` checks `!Closed || !Channel.IsOpen`, which feels like it should be `!Closed && Channel.IsOpen`, so I changed that. I also added a `TransientChannelPool` as a base class for `ChannelPool` for when you really just want transient channels - e.g. a separate channel pool for consumers (which has no need for creating ackable and non-ackable collections of channels). It helped me track a lot of this down, as there was considerable noise from the publisher channels when using a shared channel pool.

Anyway, there is now a PR which I have battle-tested locally, and it works both with `TopologyRecoveryEnabled` and without. I would appreciate your feedback on it @houseofcat, as this is a blocker to releasing this to production (thankfully, the issue was spotted in lower environments before it really had an impact there!)
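The delivery-tag half of the race can be illustrated with a toy model. Everything below is an illustrative assumption sketched from the behaviour described above, not the library's actual internals: after a recovery the broker restarts delivery tags at 1, and the client keeps client-visible tags monotonic by adding an offset inherited from the old model.

```csharp
// Toy model (NOT RabbitMQ.Client code) of why the ordering of
// InheritOffsetFrom vs. channel creation matters.
public sealed class DeliveryTagOffsetModel
{
    private long _offset; // 0 until the recovery step inherits it

    // Called during recovery with the highest tag seen on the old model.
    public void InheritOffsetFrom(long maxDeliveryTagSeen) => _offset = maxDeliveryTagSeen;

    // Translates a broker-assigned tag into the client-visible tag.
    public long ToClientTag(long serverTag) => _offset + serverTag;
}
```

If a consumer starts translating tags before `InheritOffsetFrom` has run (the race described above), `ToClientTag(1)` yields `1` instead of, say, `11`, so the eventual ack targets a tag the broker no longer recognizes and the message is never acknowledged.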
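The `Close` oversight mentioned above is easiest to see as a truth-table check. This is a minimal sketch; the `Closed`/`Channel.IsOpen` names mirror the issue text, and the wrapper class is illustrative:

```csharp
// Sketch of the ChannelHost.Close guard discussed above.
public static class CloseGuard
{
    // Original check: also fires when the host is already closed or the
    // channel is already shut (!Closed || !Channel.IsOpen).
    public static bool ShouldCloseOriginal(bool closed, bool channelIsOpen)
        => !closed || !channelIsOpen;

    // Proposed check: close only when not yet closed AND the channel is
    // actually open (!Closed && Channel.IsOpen).
    public static bool ShouldCloseFixed(bool closed, bool channelIsOpen)
        => !closed && channelIsOpen;
}
```

For an already-closed host with a dead channel (`closed: true`, `channelIsOpen: false`), the original condition still evaluates to true and attempts a second close, while the corrected condition does not.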