
.AddRebusService, bus stops processing messages #93

Open
Meyce opened this issue Aug 13, 2024 · 4 comments

Meyce commented Aug 13, 2024

I've been encountering an issue where, when using the "modular monolith" setup, the individual Rebus services stop processing messages. The app currently uses the following:

  1. .NET 8 console app
  2. Calling .AddRebusService 5 times from the host builder to process 5 different queues (a sketch of the registration follows this list). All the queues are related to the same domain.
  3. Using K8s workload identities for credentials to Azure Service Bus and to the backing database used for saga storage
  4. Using OpenTelemetry for logging
  5. Runs in Kubernetes
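
For reference, the shape of the registration is roughly the sketch below. The queue names and the connection setup are placeholders; the real app authenticates with a workload identity rather than a connection string, and the saga storage and OpenTelemetry wiring are omitted.

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Rebus.Config;

var builder = Host.CreateApplicationBuilder(args);

// One independent, hosted Rebus instance per queue ("modular monolith" style).
// The queue names here are placeholders for the five domain queues.
foreach (var queue in new[] { "queue-a", "queue-b", "queue-c", "queue-d", "queue-e" })
{
    builder.Services.AddRebusService(services =>
    {
        // Handlers for this bus would be registered here, e.g. services.AddRebusHandler<...>().
        services.AddRebus(configure => configure
            .Transport(t => t.UseAzureServiceBus("<asb-connection-string>", queue)));
        // Saga storage and OpenTelemetry wiring omitted for brevity.
    });
}

await builder.Build().RunAsync();
```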

The problem I've run into is that after the service has been up and running successfully for a period of time, some or all of the queues stop being processed and messages sit in ASB until the service is restarted. I don't know that this is necessarily a Rebus problem, but I thought I would post here to see if anyone has any ideas. Currently it is hard for me to say how long it takes before this occurs. This isn't a high-volume service, and I only notice the issue after messages start backing up, which means it could enter a state where it isn't processing long before I know about it; a new message may not arrive for hours, since the service does little or no processing outside of regular business hours.

I'm making multiple .AddRebusService calls in other apps as well, but those apps are on .NET 6, don't use workload identities, and I don't believe they use OTEL either. I have the logging turned up to debug, and nothing useful is showing in the logs: nothing that indicates when the services stop processing messages. There's no indication that anything is wrong until messages start backing up in the queues.

When it enters this state, there are differences in the logging when the service shuts down. I've noticed the following logging when the service starts and is shut down successfully: the buses start in the order they are registered with the HostBuilder and shut down in the reverse order. So if the buses were named A, B, C, D, E, when the console app starts you see them started in that order, and when it shuts down they are shut down in the opposite order (E, D, C, B, A). That seems to be consistent in my observations.

When I've noticed the problem occur and I shut down the service, only the last two buses registered via .AddRebusService log that they've shut down. Nothing is logged for the other three. This is the only thing I've been able to find from the logs thus far.

Please let me know if you have any ideas. I'm still trying to gather info, but with the logs not supplying anything useful, I'm guessing at what the problem is. My next steps are the following:

  1. Reorder the .AddRebusService calls. I don't expect this to change anything, but I want to see whether the logging output changes when the service stops processing messages.
  2. Put in a hosted service that sends a periodic message to each of the queues, with an appropriate handler on each bus to handle the message (see the sketch after this list). This will tell me more quickly when the problem occurs and which of the 5 buses is no longer processing messages.
  3. Switch from workload identities back to connection strings with an id/password, to see if there is any change in behavior.
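
For step 2, the heartbeat sender would look something like the sketch below. The message type, interval, and queue names are placeholders, each bus would also need a small handler for the heartbeat message, and the sender needs to be given an IBus that is allowed to send to those queues (for example one registered in the root container as a one-way client).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Rebus.Bus;

public record HeartbeatMessage(string Queue, DateTimeOffset SentAt);

public class HeartbeatSender : BackgroundService
{
    // Placeholder queue names for the five buses.
    private static readonly string[] Queues = { "queue-a", "queue-b", "queue-c", "queue-d", "queue-e" };

    private readonly IBus _bus;

    public HeartbeatSender(IBus bus) => _bus = bus;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromMinutes(5));

        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            foreach (var queue in Queues)
            {
                // Advanced.Routing.Send addresses a specific queue explicitly,
                // so one sender can probe all five input queues.
                await _bus.Advanced.Routing.Send(queue, new HeartbeatMessage(queue, DateTimeOffset.UtcNow));
            }
        }
    }
}
```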

Thanks

@mookid8000 mookid8000 self-assigned this Aug 14, 2024
mookid8000 (Member) commented:

Hi @Meyce, thanks for reporting your findings!

I've just gone over the code in Rebus.ServiceProvider responsible for setting up "the modular monolith" type of independent Rebus instance, and I could not find any obvious ways that e.g. an intermittent error would be able to crash that particular instance or cause it to stop processing messages in any other way.

If you find any indications of what could be wrong during your investigation, please let me know. I'll go and do a couple of clean-up cycles on the code of Rebus.ServiceProvider to see if I come across anything while cleaning/refactoring.

Meyce commented Aug 16, 2024

@mookid8000,

Thanks for checking. The problem came up in production again last night. One of the queues stopped being processed.

I think I am starting to see a pattern emerge between the queue that stops getting processed and some weirdness in the logging during shutdown. It seems that for whichever queue/bus has the issue, during shutdown I start to see some messages related to the shutdown and disposal of that bus, but only a few of the expected messages are written to the log, and then nothing further is logged for that bus or for the service as a whole. So I'll see the following messages at the tail of the log:

  1. Stopping token signaled - disposing bus instance RebusBus Routing
  2. Worker "Routing worker 1" stopped
  3. Waiting for message handler continuations to finish...

Once that last message is written, no further logs appear for that bus or for the service as a whole. There should be more log entries just for the shutdown of that bus, let alone for the shutdown of the whole service.

I'm wondering if that last message is an indication of something async that is blocking, and is perhaps the reason the bus stopped receiving/processing messages in the first place. I can say that it is not always the same bus that causes the issue. I'm going to look back through the handlers registered on the bus to make sure I don't have something that could be blocking. I think it is probably unlikely that it is a poorly written handler, as the service works fine for a period of time, and you would expect the error to occur in a specific handler consistently, which doesn't seem to be the case.
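
The kind of blocking I'm going to look for is a sync-over-async call inside a handler, something like the hypothetical sketch below (the message and handler names are made up, not taken from the actual app):

```csharp
using System.Threading.Tasks;
using Rebus.Handlers;

public record SomeMessage;

public class SomeMessageHandler : IHandleMessages<SomeMessage>
{
    public async Task Handle(SomeMessage message)
    {
        // The pattern to look for: blocking on a task inside the handler.
        // This ties up the worker and can keep the handler continuation
        // from ever completing:
        //
        //   var result = CallExternalServiceAsync(message).Result;   // sync-over-async
        //
        // The safe version awaits all the way down:
        var result = await CallExternalServiceAsync(message);
    }

    // Stand-in for real I/O (database call, HTTP request, etc.).
    private static Task<string> CallExternalServiceAsync(SomeMessage message) =>
        Task.FromResult("ok");
}
```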

I'll report back after a deeper look into the handlers.

mookid8000 (Member) commented:

Hi @Meyce, have you looked further into this? Did it happen to you again?


Meyce commented Sep 2, 2024 via email
