The first retry does not require backoff #490
Conversation
Something seems off with the simulation; it's not clear why the metrics added in #486 are only showing up now.
The images contain a superset of the data we verify against; they don't appear to be regenerated unless the report is modified.
Forced them to update here: #491
Since both pinUntilError and RoundRobin will move to the next node if the request fails or if the channel is limited, this immediate retry should always hit a subsequent node. As a side note: it seems slightly unexpected that pinUntilError effectively round-robins through its channels until it finds a channel that is not limited.
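For illustration only, here is a minimal sketch of the selection behavior being described, using hypothetical names rather than the library's real API: a pin-until-error style selector prefers its pinned channel but scans forward past limited channels, which is why an immediate retry of a failed or limited request should land on a subsequent node.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical channel abstraction for this sketch; not the library's actual interface. */
interface Channel {
    boolean isLimited();
}

/**
 * Sketch of a pin-until-error style selector: it prefers the pinned channel, but if that
 * channel is limited it scans forward to the next non-limited one, which is why it
 * effectively round-robins while channels are limited.
 */
final class PinUntilErrorSketch {
    private final List<Channel> channels;
    private int pinned = 0;

    PinUntilErrorSketch(List<Channel> channels) {
        this.channels = channels;
    }

    Optional<Channel> select() {
        for (int offset = 0; offset < channels.size(); offset++) {
            Channel candidate = channels.get((pinned + offset) % channels.size());
            if (!candidate.isLimited()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // every channel is limited
    }

    /** Only an actual failure moves the pin to the next channel. */
    void onError() {
        pinned = (pinned + 1) % channels.size();
    }
}
```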
Agreed. I don't think we have a great definition of what it means to be limited; I can imagine strong arguments in either direction.
There are many scenarios where requests fail and can be retried immediately. This can produce higher load, but it's limited to the first retry, not the total allowed retries. In many scenarios retries occur on another node, producing an even greater probability of success.
force-pushed from 679cbef to 836439b
The point I'm getting at is that this change only seems beneficial if your node selection strategy causes the immediate retry to attempt a different node; otherwise you're effectively burning a retry, hence coupling the retryer to the node selection strategy. Therefore I'm not sure we want to make this change right now, especially if we're not in agreement on how pinUntilError should behave in the face of limited channels.
Happy to punt on this for now. I'm not sure we'd necessarily be burning a retry, though our simulation makes it seem that way. Our simulations don't currently model the sort of network-layer flakes that this would improve. Not something worth spending much time optimizing for now.
There are many scenarios where requests fail and can be retried immediately. This can produce higher load, but it's limited to the first retry, not the total allowed retries. In many scenarios retries occur on another node, producing an even greater probability of success.
Before this PR
Any time a request failed (potentially because the channel was limited), it would wait for the backoff before being retried.
After this PR
==COMMIT_MSG==
The first retry does not incur a backoff penalty
==COMMIT_MSG==
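As an illustration of the behavior above, here is a minimal sketch (hypothetical names and an illustrative 250ms base slot, not the project's actual retryer API) in which the first retry is scheduled immediately and exponential backoff with full jitter only applies from the second retry onward:

```java
import java.time.Duration;
import java.util.Random;

/** Hypothetical sketch of the retry schedule described above; not the project's real retryer. */
final class BackoffSketch {
    private static final Duration BASE = Duration.ofMillis(250); // illustrative base slot
    private final Random random = new Random();

    /** @param retryNumber 1 for the first retry, 2 for the second, and so on. */
    Duration delayBeforeRetry(int retryNumber) {
        if (retryNumber <= 1) {
            // After this PR: the first retry is scheduled immediately.
            return Duration.ZERO;
        }
        // Subsequent retries keep exponential backoff with full jitter.
        long upperBoundMillis = BASE.toMillis() * (1L << (retryNumber - 1));
        return Duration.ofMillis((long) (random.nextDouble() * upperBoundMillis));
    }
}
```

With these illustrative numbers, a request that keeps failing is retried immediately once, then after at most roughly 500ms, 1s, and so on, which is why the extra load is confined to the first retry.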
Possible downsides?
Failures produce more immediate load: two requests within a millisecond instead of waiting. This risk doesn't scale with the number of retries because backoffs are still applied after the first retry.
As requests traverse services deeper in the stack, there is more risk that doubling the initial spike may produce load, but a factor of two is more palatable than the risk of our previous immediate-retry-and-queue approach. This will provide insight into the benefits and risks of such a design while limiting the downside.
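To make the factor-of-two point concrete, a back-of-envelope sketch (illustrative numbers only) comparing the expected number of immediate requests per original call when only the first retry is immediate versus a hypothetical approach where every retry is immediate:

```java
/** Back-of-envelope estimate of immediate load per original call; illustrative values only. */
final class ImmediateLoadEstimate {
    public static void main(String[] args) {
        double failureProbability = 0.1; // illustrative value
        int maxRetries = 4;              // illustrative value

        // First retry immediate, later retries backed off: at most one extra immediate request.
        double firstRetryOnly = 1 + failureProbability;

        // Hypothetical "all retries immediate" approach: geometric series up to maxRetries.
        double allRetriesImmediate = 0;
        for (int i = 0; i <= maxRetries; i++) {
            allRetriesImmediate += Math.pow(failureProbability, i);
        }

        System.out.printf("first-retry-only immediate load factor: %.3f%n", firstRetryOnly);
        System.out.printf("all-retries-immediate load factor:      %.3f%n", allRetriesImmediate);
    }
}
```

Even as the failure probability approaches 1, the first-retry-only factor is bounded by 2, while the all-retries-immediate factor grows with the retry budget.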