Picker poisoning after calling `Listener(...)` and `PickAsync(...)` in parallel #2407
Comments
@JamesNK Hi. Please look at this issue.
Hi, I took a look at the sample. When I ran the new test, it eventually failed, but I didn't see it go into a state where the picker stops working forever. The picker switches between a successful one and a failing one a number of times until eventually there are 50 failing calls and the test fails. What do I need to do to the test to make it show the problem? What does it look like when the error is reproduced? Log output:
I updated the test slightly to reset the error count when there is a successful pick, and to increase the number of allowed concurrent errors. I made these changes on the assumption that the problem shows up when the channel goes into an error state forever. After these changes the test passes for me.
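For reference, a minimal sketch of the shape of that change, assuming the test drives calls in a loop; `totalCalls`, `allowedConcurrentErrors`, and `MakeCallAsync` are illustrative placeholders, not identifiers from the actual test:

```csharp
// Sketch only: reset the error counter on a successful call and allow more
// consecutive failures before declaring the channel permanently broken.
// MakeCallAsync is a placeholder for the test's gRPC call.
var errorCount = 0;
const int allowedConcurrentErrors = 50; // increased limit (placeholder value)
const int totalCalls = 10_000;          // placeholder

for (var i = 0; i < totalCalls; i++)
{
    var succeeded = await MakeCallAsync();
    if (succeeded)
    {
        errorCount = 0; // a successful pick resets the counter
    }
    else if (++errorCount > allowedConcurrentErrors)
    {
        throw new Exception("Channel appears stuck in an error state forever.");
    }
}
```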
Hi! Call to … After you changed the test, you count 5 errors in line 638 and then start doing … But if you change the left interval border on line 648 (…)
Sorry, I still don't understand what the bug is that your test is showing.

While looking at this, I noticed a bug about a pending connection attempt not being canceled when subchannel addresses were updated. While fixing that, I added some more logging. PR: #2410

I rebased your changes on top of that PR and increased the time between resolver calls, and the total number of calls and errors allowed. These are the logs:

I don't see anything unexpected in the logs. If a resolver provides new results and the subchannel has a transient error when trying to connect, then the connection is retried with a backoff. Eventually the connection succeeds on a retry and the subchannel goes back to a ready state. For example, for a subchannel that is erroring: once the backoff is complete, TryConnectAsync is called again, and this time it succeeds, so the picker goes back to a ready state.
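To illustrate the retry behavior described above, here is a rough sketch of connect-with-backoff; it is not the actual `Subchannel` implementation, and the `tryConnectAsync` delegate merely stands in for the real `TryConnectAsync`:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static async Task ConnectWithBackoffAsync(
    Func<CancellationToken, Task<bool>> tryConnectAsync,
    CancellationToken cancellationToken)
{
    var backoff = TimeSpan.FromSeconds(1);
    var maxBackoff = TimeSpan.FromSeconds(30);

    // While attempts fail, the subchannel stays in TransientFailure and the
    // picker reports errors for its address.
    while (!await tryConnectAsync(cancellationToken))
    {
        await Task.Delay(backoff, cancellationToken);
        backoff = TimeSpan.FromTicks(Math.Min(backoff.Ticks * 2, maxBackoff.Ticks));
    }

    // A successful attempt moves the subchannel back to Ready, and the picker
    // goes back to a ready state.
}
```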
Hi! It seems that with your fix it works better and the errors stop quickly after they start. Could you please tell me when it will be released?
Reconnect and connection backoff is already there. It's been a feature for years. I fixed what I think might be an unrelated bug here (#2410) when investigating this issue, but that's it. You're welcome to get the latest package from the nightly feed and try again.
Yes, I mean the fix in #2410.
Should be fixed in 2.63.0.
@JamesNK Hi!
We have a problem with using `PollingResolver` and `SubchannelsLoadBalancer`. There is a case (occurring mainly under very intensive use) when `ConnectionManager.PickAsync(...)` becomes poisoned and always returns an `ErrorPicker` result even if there are valid addresses. Below I try to describe when it can happen.
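For context, the setup looks roughly like the documented client-side load-balancing configuration; `CustomResolverFactory` and `CustomBalancerFactory` below are placeholders for our own `PollingResolver` and `SubchannelsLoadBalancer` implementations, and the `custom` scheme/policy names are illustrative:

```csharp
using Grpc.Net.Client;
using Grpc.Net.Client.Balancer;
using Grpc.Net.Client.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Register placeholder resolver/balancer factories and create a channel that
// resolves its addresses through them.
var services = new ServiceCollection();
services.AddSingleton<ResolverFactory, CustomResolverFactory>();
services.AddSingleton<LoadBalancerFactory, CustomBalancerFactory>();

var channel = GrpcChannel.ForAddress("custom:///my-service", new GrpcChannelOptions
{
    ServiceConfig = new ServiceConfig
    {
        LoadBalancingConfigs = { new LoadBalancingConfig("custom") }
    },
    ServiceProvider = services.BuildServiceProvider()
});
```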
What happened:
1. `ConnectionManager.PickAsync(...)` was called; it started to execute and called `await ConnectionManager.GetPickerAsync(...)`.
2. `GetPickerAsync(...)` finished execution and returned `_nextPickerTcs.Task.WaitAsync(...)`, which is then awaited inside `PickAsync(...)`.
3. In parallel with the pick above, `PollingResolver.Listener(...)` was called with new addresses. If the addresses are really new, then new subchannels are created, and at least one of them can get `TransientFailure` when trying to connect.
4. If `Listener(...)` with new addresses is called just after `GetPickerAsync(...)` returns but before the returned task completes (and one of the subchannels has the `TransientFailure` state), then the `Picker` becomes poisoned forever. It will always return `ErrorPicker`, and `currentPicker.Pick(...).Type` will be `PickResultType.Fail` (see the sketch below).
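To make the interleaving concrete, here is a schematic sketch of the race; this is illustrative code rather than the library internals, and `connectionManager`, `pickContext`, `listener`, and `newAddresses` are assumed to be set up elsewhere:

```csharp
// Steps 1-2: PickAsync starts, GetPickerAsync returns, and PickAsync is now
// awaiting _nextPickerTcs.Task.WaitAsync(...).
var pickTask = connectionManager.PickAsync(pickContext, cancellationToken);

// Step 3: before that awaited task completes, the resolver pushes new
// addresses; new subchannels are created and one of them enters
// TransientFailure while connecting.
listener(ResolverResult.ForResult(newAddresses));

// Step 4: the current picker is now an ErrorPicker and never recovers, so
// this pick (and every later one) fails with PickResultType.Fail even
// though valid addresses exist.
var result = await pickTask;
```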
For this situation to happen, client-side balancing with a custom resolver must be enabled, `ConnectionManager.PickAsync(...)` calls must be frequent, and connections must sometimes break. Even if it seems to be a rare case, it nevertheless happened often enough under high load and intensive use.

Reproducing
I've made a test in my fork of this repository to reproduce this case: https://github.com/kolonist/grpc-dotnet/pull/1
Feel free to take and use that code as you wish.
About system:
- We use `grpc-dotnet` v2.61, but the problem also exists in v2.62.x.
- It can be reproduced at least under Linux and macOS (ARM CPU).
- I used .NET SDK 8.0.202 and runtime 8.0.3 to reproduce it, but I don't think it matters.

I couldn't find the exact place where it could be fixed quickly, so I'm really hoping you can look into the problem and fix it.
Best regards