
.AddRebusService, bus stops processing messages #93

Open
Meyce opened this issue Aug 13, 2024 · 4 comments

Meyce commented Aug 13, 2024

I've been encountering an issue where, when using the "modular monolith" setup, the individual Rebus services stop processing messages. The app currently uses the following:

  1. .NET 8 console app
  2. Calling .AddRebusService 5 times from the host builder to process 5 different queues (a sketch of the registration follows this list). All the queues are related to the same domain.
  3. Using K8s workload identities for credentials to Azure Service Bus and to the backing database used for saga storage
  4. Using OpenTelemetry for logging
  5. Runs in Kubernetes
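
For reference, the shape of the registration is roughly the sketch below. The queue names and the connection setup are placeholders; the real app authenticates with a workload identity rather than a connection string, and the saga storage and OpenTelemetry wiring are omitted.

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Rebus.Config;

var builder = Host.CreateApplicationBuilder(args);

// One independent, hosted Rebus instance per queue ("modular monolith" style).
// The queue names here are placeholders for the five domain queues.
foreach (var queue in new[] { "queue-a", "queue-b", "queue-c", "queue-d", "queue-e" })
{
    builder.Services.AddRebusService(services =>
    {
        // Handlers for this bus would be registered here, e.g. services.AddRebusHandler<...>().
        services.AddRebus(configure => configure
            .Transport(t => t.UseAzureServiceBus("<asb-connection-string>", queue)));
        // Saga storage and OpenTelemetry wiring omitted for brevity.
    });
}

await builder.Build().RunAsync();
```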

The problem I've run into is that after the service has been up and running successfully for a period of time, some or all of the queues stop being processed and messages sit in ASB until the service is restarted. I don't know that this is necessarily a Rebus problem, but I thought I would post here to see if anyone has any ideas. Currently it is hard for me to say how long it takes before this occurs. This isn't a high-volume service, and I only notice the issue after messages start backing up, which means it could enter a state where it isn't processing long before I know about it; a new message may not arrive for hours, since the service does little or no processing outside of regular business hours.

I'm making multiple .AddRebusService calls in other apps as well, but those apps are on .NET 6, don't use workload identities, and I don't believe they use OTEL either. I have the logging turned up to debug, and nothing useful is showing in the logs: nothing that indicates when the services stop processing messages. There's no indication that anything is wrong until messages start backing up in the queues.

When it enters this state, there are differences in the logging when the service shuts down. I've noticed the following logging when the service starts and is shut down successfully: the buses start in the order they are registered with the HostBuilder and shut down in the reverse order. So if the buses were named A, B, C, D, E, when the console app starts you see them started in that order, and when it shuts down they are shut down in the opposite order (E, D, C, B, A). That seems to be consistent in my observations.

When I've noticed the problem occur and I shut down the service, only the last two buses registered via .AddRebusService log that they've shut down. Nothing is logged for the other three. This is the only thing I've been able to find from the logs thus far.

Please let me know if you have any ideas. I'm still trying to gather info, but with the logs not supplying anything useful, I'm guessing at what the problem is. My next steps are the following:

  1. Reorder the .AddRebusService calls. I don't expect this to change anything, but I want to see whether the logging output changes when the service stops processing messages.
  2. Put in a hosted service that sends a periodic message to each of the queues, with an appropriate handler on each bus to handle the message (see the sketch after this list). This will tell me more quickly when the problem occurs and which of the 5 buses is no longer processing messages.
  3. Switch from workload identities back to connection strings with an id/password, to see if there is any change in behavior.
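
For step 2, the heartbeat sender would look something like the sketch below. The message type, interval, and queue names are placeholders, each bus would also need a small handler for the heartbeat message, and the sender needs to be given an IBus that is allowed to send to those queues (for example one registered in the root container as a one-way client).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Rebus.Bus;

public record HeartbeatMessage(string Queue, DateTimeOffset SentAt);

public class HeartbeatSender : BackgroundService
{
    // Placeholder queue names for the five buses.
    private static readonly string[] Queues = { "queue-a", "queue-b", "queue-c", "queue-d", "queue-e" };

    private readonly IBus _bus;

    public HeartbeatSender(IBus bus) => _bus = bus;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromMinutes(5));

        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            foreach (var queue in Queues)
            {
                // Advanced.Routing.Send addresses a specific queue explicitly,
                // so one sender can probe all five input queues.
                await _bus.Advanced.Routing.Send(queue, new HeartbeatMessage(queue, DateTimeOffset.UtcNow));
            }
        }
    }
}
```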

Thanks

@mookid8000 mookid8000 self-assigned this Aug 14, 2024
mookid8000 (Member) commented:

Hi @Meyce, thanks for reporting your findings!

I've just gone over the code in Rebus.ServiceProvider responsible for setting up "the modular monolith" type of independent Rebus instance, and I could not find any obvious ways that e.g. an intermittent error would be able to crash that particular instance or cause it to stop processing messages in any other way.

If you find any indications of what could be wrong during your investigation, please let me know. I'll go and do a couple of clean-up cycles on the code of Rebus.ServiceProvider to see if I come across anything while cleaning/refactoring.

Meyce commented Aug 16, 2024

@mookid8000,

Thanks for checking. The problem came up in production again last night. One of the queues stopped being processed.

I think I am starting to see a pattern emerge between the queue that stops getting processed and some weirdness in the logging during shutdown. It seems that for whichever queue/bus has the issue, during shutdown I start to see some messages related to the shutdown and disposal of that bus, but only a few of the expected messages are written to the log, and then nothing further is logged for that bus or for the service as a whole. So I'll see the following messages at the tail of the log:

  1. Stopping token signaled - disposing bus instance RebusBus Routing
  2. Worker "Routing worker 1" stopped
  3. Waiting for message handler continuations to finish...

Once that last message is written, no further logs appear for that bus or for the service as a whole. There should be more log entries just for the shutdown of that bus, let alone for the shutdown of the whole service.

I'm wondering if that last message is an indication of something async that is blocking, and is perhaps the reason the bus stopped receiving/processing messages in the first place. I can say that it is not always the same bus that causes the issue. I'm going to look back through the handlers registered on the bus to make sure I don't have something that could be blocking. I think it is probably unlikely that it is a poorly written handler, as the service works fine for a period of time, and you would expect the error to occur in a specific handler consistently, which doesn't seem to be the case.
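
The kind of blocking I'm going to look for is a sync-over-async call inside a handler, something like the hypothetical sketch below (the message and handler names are made up, not taken from the actual app):

```csharp
using System.Threading.Tasks;
using Rebus.Handlers;

public record SomeMessage;

public class SomeMessageHandler : IHandleMessages<SomeMessage>
{
    public async Task Handle(SomeMessage message)
    {
        // The pattern to look for: blocking on a task inside the handler.
        // This ties up the worker and can keep the handler continuation
        // from ever completing:
        //
        //   var result = CallExternalServiceAsync(message).Result;   // sync-over-async
        //
        // The safe version awaits all the way down:
        var result = await CallExternalServiceAsync(message);
    }

    // Stand-in for real I/O (database call, HTTP request, etc.).
    private static Task<string> CallExternalServiceAsync(SomeMessage message) =>
        Task.FromResult("ok");
}
```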

I'll report back after a deeper look into the handlers.

mookid8000 (Member) commented:

Hi @Meyce, have you looked further into this? Did it happen to you again?


Meyce commented Sep 2, 2024 via email
