[azservicebus] Receiver indefinitely "stuck" after a long idle period #18517
Comments
Hi @tkent, thank you for filing this issue. I know it's frustrating to deal with a bug, so I appreciate you working with me on this. We have tests for these kinds of scenarios but clearly, since you're seeing a bug, I'm missing something. I'll see what I'm missing there.
@richardpark-msft - Hey, I appreciate you looking into it. Frustrating, yes, but it would be much more frustrating if we didn't have a workaround or if we filed an issue that got dismissed/ignored. Priority-wise, since we have a workaround, it's not high on our list. That said, I'd imagine others don't want to have to go through the learning process on this one.
…alls (#19506) I added in a simple idle timer in #19465, which would expire the link if our internal message receive went longer than 5 minutes. This extends that to track it across multiple consecutive calls as well, in case the user calls and cancels multiple times in a row, eating up 5 minutes of wall-clock time. This is actually pretty similar to the workaround applied by the customer here in #18517 but tries to take into account multiple calls and also recovers the link without exiting ReceiveMessages().
Hey @tkent, I added a client-side idle timer that does something similar to what you outlined above. Under the covers, it recycles the link if nothing is received for 5 minutes. It was released in azservicebus 1.1.2. Closing this now as we've formally implemented something similar to your workaround :). This should help combat a situation I've been worried about for a bit: if the server idles out our link or detaches it and we miss it, our link will still look alive during these quiet times, even though it's never going to work. We now close out the link and attempt to recreate it, which forces a reconciliation between the service and client.
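For readers on older package versions, here is a rough application-level analogue of that idle timer. This is a sketch under assumptions, not the SDK's internal implementation; the queue name, batch size, and 5-minute window are illustrative, and it assumes `ReceiveMessages` returns (possibly with `context.DeadlineExceeded`) once the per-call context expires.

```go
package example

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

// receiveWithIdleRecycle closes and recreates the receiver if nothing has been
// received for 5 minutes, so a silently dead link gets replaced with a fresh one.
func receiveWithIdleRecycle(client *azservicebus.Client, queueName string) error {
	receiver, err := client.NewReceiverForQueue(queueName, nil)
	if err != nil {
		return err
	}

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
		messages, err := receiver.ReceiveMessages(ctx, 10, nil)
		cancel()

		if err != nil && !errors.Is(err, context.DeadlineExceeded) {
			return err
		}

		if len(messages) == 0 {
			// Idle window elapsed: recycle the receiver so a new link is negotiated.
			_ = receiver.Close(context.Background())
			if receiver, err = client.NewReceiverForQueue(queueName, nil); err != nil {
				return err
			}
			continue
		}

		for _, msg := range messages {
			// Process the message here, then settle it.
			if err := receiver.CompleteMessage(context.Background(), msg, nil); err != nil {
				log.Printf("failed to complete message: %v", err)
			}
		}
	}
}
```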
Reopening as there's still work for this.
Hello @tkent 🙂 Are you still using the same workaround? Have you perhaps tried the same solution but with a shorter timeout duration? If yes, have you noticed any differences?
@rokf, are you seeing this issue as well?
👋 @rokf - I am still using this workaround. It has been stable for the past several months, so we haven't touched it. I picked 10 minutes arbitrarily and we've not tried a lower time (though I'm sure that would work). However... @richardpark-msft it's important to note that just last Wednesday (01/18), we might have experienced the issue in our prod environment. We had a situation where we again stopped receiving messages and had no errors emitted from the SDK. This time, unlike others, our Azure Service Bus … We're adding more debugging to try to figure out what on earth happened, but that may have been this same problem showing up after 6+ months with the workaround. The lack of visibility into the problem makes it very hard to tell.
@tkent Thank you 👍 @richardpark-msft yes, it seems so. We've had similar problems in the past with the previous SDK (messages not being read). Things were looking good for a while with the new SDK (this one) since we migrated. Last week something started happening (see graphs below). The version that we're using is:
Restarting indeed works - the messages get picked up immediately. We're planning to set up alerts on the production Service Bus namespaces to catch stuck messages, and we'll try to put additional logs into the client for the time being. We're also looking into the possibility of introducing some kind of periodic queue connection health check.
@tkent are you perhaps recreating the receiver (https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus#Client.NewReceiverForQueue) and/or client (https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus#NewClient) before every https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus#Receiver.ReceiveMessages call?
@rokf - We are not, we keep the same receiver for the lifetime of the application execution. It may be helpful to know that the service using this workaround isn't processing many messages yet. In our highest volume environment, a given instance will receive fewer than 500 messages a day, and our peak messages per second is around 2-3. We wouldn't have seen any intermittent issues that come up under even moderate load yet.
We are also experiencing the same problems since 14.1.2023. This can be seen in the graph below, which shows active connections. The spikes back up indicate that we have restarted containers. I must also mention that our traffic isn't really high and most of the time the receivers are idling. @richardpark-msft Do you perhaps have some more info on this?
@alesbrelih, can you generate a log for your failures? I'm working on a few fixes - the first one that should come out in the next release will introduce a timeout when we close links, which can hang indefinitely in some situations.
I've checked the logs and here are the results. Stuck consumer:
If I compare the logs of the broken consumer with a working one, the refreshing/negotiate claims [Auth] logs are missing 🤔 The last Auth refresh happened 2 days ago, and since then there is no mention of it, even though it says it will expire in 15 minutes. It is also accompanied by context deadline exceeded errors that were not present before this:
If you need more data or something specific, please let me know.
@richardpark-msft Hey Richard, we've noticed that a new version has been released - https://github.com/Azure/azure-sdk-for-go/tree/sdk/messaging/azservicebus/v1.2.0. Do you suggest that we switch to that version?
I always do, but I'm a bit biased. :) I'm still working on adding some resiliency to see if we can help more with part of this situation. From what I can tell, we're getting into a situation where the link hasn't re-issued credits (which tell Service Bus to send us more messages). I have a fix in the works for that, but I'm still going over the ramifications of it. The bug fix I added for this release was to make our internal Close() function time out - prior to this it could actually hang for a long time, and if it was cancelled it could leave things in an inconsistent state. This can affect things even if you aren't specifically calling Close(), since we close things internally during network recovery. So yes, definitely recommended. There are some additional logging messages now that show if the code path is being triggered and is remediating the issue (note, these are always subject to change in the future):
If your client was stuck or seemed to disappear during the recovery process, this could be the root cause.
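For application code that also closes receivers or clients directly, the same protection can be applied on the caller's side by bounding Close() with a context. A minimal sketch; the 30-second limit is an arbitrary choice, not a documented requirement:

```go
package example

import (
	"context"
	"log"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

// closeReceiver closes a receiver but gives up after 30 seconds so a hung
// close can't stall shutdown indefinitely.
func closeReceiver(receiver *azservicebus.Receiver) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := receiver.Close(ctx); err != nil {
		log.Printf("closing receiver failed (continuing anyway): %v", err)
	}
}
```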
We are still having issues with this. Is this something that is still being worked on?
There have been several fixes in this area - can you give me more details, @alesbrelih? Specifically, there have also been some changes on the service side, and we have been working on the go-amqp stack as well to improve reliability. I'd need internal logs from your situation as well. See here on how to enable them:
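As a rough sketch of what enabling those internal logs can look like (this assumes the azcore/log listener and the azservicebus event constants; double-check against the linked instructions):

```go
package example

import (
	"fmt"

	azlog "github.com/Azure/azure-sdk-for-go/sdk/azcore/log"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

// enableServiceBusLogging prints the SDK's internal diagnostics; in production
// you would route these lines to your own logger instead of stdout.
func enableServiceBusLogging() {
	azlog.SetListener(func(event azlog.Event, msg string) {
		fmt.Printf("[%s] %s\n", event, msg)
	})

	// Limit output to the Service Bus related events (connections, auth, links).
	azlog.SetEvents(
		azservicebus.EventConn,
		azservicebus.EventAuth,
		azservicebus.EventReceiver,
		azservicebus.EventSender,
	)
}
```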
@alesbrelih, sorry, I missed that you'd provided details above. I'm following up on a fix that was made on the service side that matches what you were seeing, in that you have valid credits on the link but no messages are being returned. I'll post back here when I have details.
That's a very useful insight. I think that explains what we saw a few times and means I can stop chasing it.
Can this issue affect Senders as well? We have some code that opens a client, sends a message and then closes the client. This isn't performant, and it can take a couple of seconds before the message is sent to ASB. We'd like to reduce the time it takes to send the message by using long-lived clients (senders), just as for the receivers, and I was wondering if we could do more harm than good in the current situation?
@rokf, the approach you have is basically eating the cost of starting up a TCP connection each time. You can improve this a few ways:
Now, in either case, it's possible for the connection to have to restart if you idle out, so there might still be some initialization cost based on network conditions, etc. But the tactics above are the easiest ways to avoid having to eat the startup cost on every send.
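As a rough illustration of the long-lived approach (the namespace, queue name, and message body are placeholders; this is a sketch, not prescribed usage): create the client and sender once at startup and reuse them for every send.

```go
package main

import (
	"context"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

func main() {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatalf("failed to create credential: %v", err)
	}

	// Create the client and sender once, rather than per message, so each send
	// reuses the existing AMQP connection and link.
	client, err := azservicebus.NewClient("<namespace>.servicebus.windows.net", cred, nil)
	if err != nil {
		log.Fatalf("failed to create client: %v", err)
	}

	sender, err := client.NewSender("example-queue", nil)
	if err != nil {
		log.Fatalf("failed to create sender: %v", err)
	}
	defer sender.Close(context.Background())

	if err := sender.SendMessage(context.Background(), &azservicebus.Message{
		Body: []byte("hello"),
	}, nil); err != nil {
		log.Fatalf("failed to send message: %v", err)
	}
}
```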
Hi all, just want to update with some info. There have been a lot of fixes in the latest release of this package, azservicebus@v1.3.0, to improve reliability/resilience. In this release, we incorporated the GA version of go-amqp (the underlying protocol stack for this package). A lot of the work that went into this was around deadlocks and race conditions within the library that could cause it to become unresponsive. In addition, there have been service-side fixes related to messages not being delivered despite being available. Every release brings improvements, but this one hits at the core of the stack and should yield improvements in overall reliability, especially in some potential corner cases with connection/link recovery. I'd encourage everyone involved in this thread to upgrade.
Thank you, Richard, we'll update ASAP 👍
@richardpark-msft - I will roll out the new library version in our lower envs and let it run for a bit. However, I'm not going to roll it out anywhere in our "live paths". While validation of this type of thing isn't nearly as useful without real traffic, the workaround I have in place is stable and I don't want to introduce risk there. In the original description, I included some terraform and a description of how to reproduce the issue. Has this library been tested against that use case or some comparable one?
I appreciate all the testing you've done, and your workaround is harmless. It should work just fine either way. It would be interesting, in production code, if you still see your workaround trigger - at that point we'd want to involve the service team to see if there's something interesting happening there, instead of just focusing on the client SDK.
We run three different kinds of tests: unit, live/integration and long-term. We've added a lot of tests in all three areas based on feedback from you (and others) to see if we could get to the bottom of it. We definitely found and fixed bugs, but I never replicated the ease of your scenario despite trying a lot of variations. However, I do have your scenario covered here: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azservicebus/internal/stress/tests/infinite_send_and_receive.go. We generally run these tests for a week to give them space to fail, and we have a few other tricks using chaos-mesh to try to induce failures earlier. (another one inspired by some bugs: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azservicebus/internal/stress/tests/mostly_idle_receiver.go)
Hi @tkent. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario, please provide a response to the question asked or the information requested above. This will help us address your issue more accurately.
Hi @tkent, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment, the issue will be automatically closed. Thank you!
Bug Report
After a long period of inactivity, a receiver will stop receiving new messages. A "long period" is somewhere between 8 hours and 13 days; exactly how long is unknown.
(The problem was originally brought up in this comment)
This is very straightforward to demonstrate if you are willing to wait and set up a dedicated service bus instance. It occurs frequently with infrastructure used for QA, since that often doesn't receive any activity over weekends and holidays.
SDK Versions Used
I have seen this behavior across many versions of the azure-sdk-for-go, but the most recent test was conducted using these versions:

About the most recent time this was reproduced
We most recently reproduced this by running a small golang app in an AKS cluster using a managed identity assigned by aad-pod-identity.

In this test, we set up a dedicated Azure Service Bus + managed identity (terraform below) and let the app run. After 13 days, we came back to it. No errors had been emitted, just the regular startup message for the app. I then entered a message into the bus. The receiver in the app did not pick up the message after waiting 30 minutes. We deleted the pod running the app and allowed it to be recreated by the deployment. The replacement pod immediately picked up the message.
Workaround
We can work around this issue by polling for messages using a 10-minute timeout and restarting in a loop. Our workaround looks like this and is known to work for weeks without an issue.
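A minimal sketch of that pattern, assuming a long-lived client and receiver with placeholder namespace and queue names (illustrative, not the exact production code):

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

func main() {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatalf("failed to create credential: %v", err)
	}

	client, err := azservicebus.NewClient("<namespace>.servicebus.windows.net", cred, nil)
	if err != nil {
		log.Fatalf("failed to create client: %v", err)
	}

	// One receiver for the lifetime of the process; it is not recreated per call.
	receiver, err := client.NewReceiverForQueue("example-queue", nil)
	if err != nil {
		log.Fatalf("failed to create receiver: %v", err)
	}

	for {
		// Poll with a 10-minute timeout; when it expires, loop and call
		// ReceiveMessages again instead of waiting on a possibly dead link forever.
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
		messages, err := receiver.ReceiveMessages(ctx, 10, nil)
		cancel()

		if err != nil && !errors.Is(err, context.DeadlineExceeded) {
			log.Printf("receive failed: %v", err)
			time.Sleep(5 * time.Second) // avoid a hot loop on persistent errors
			continue
		}

		for _, msg := range messages {
			// Process the message here, then settle it.
			if err := receiver.CompleteMessage(context.Background(), msg, nil); err != nil {
				log.Printf("failed to complete message: %v", err)
			}
		}
	}
}
```

The key point is that a single ReceiveMessages call never waits more than 10 minutes; when the timeout fires with nothing received, the loop simply issues the call again on the same receiver.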
The terraform for the test bus
The terraform below was used to set up the test bus and assign the app identity access to it.