The first retry does not require backoff #490

Closed
carterkozak wants to merge 1 commit from the ckozak/first_retry_without_backoff branch

Conversation

carterkozak
Contributor

There are many scenarios where requests fail and can be retried
immediately. This can produce higher load, but the extra load is limited to
the first retry, not the total allowed retries.
In many scenarios the retry occurs on another node, producing
an even greater probability of success.

Before this PR

Any time a request failed (potentially because the channel was limited), it would wait before being retried.

After this PR

==COMMIT_MSG==
The first retry does not incur a backoff penalty
==COMMIT_MSG==

Possible downsides?

Failures produce more immediate load: two requests within a millisecond instead of waiting for a backoff. This risk doesn't scale with the number of retries because backoff still applies after the first retry.
As requests traverse services deeper in the stack, there is more risk that doubling the initial spike may amplify load, but a factor of two is more palatable than the risk of our previous immediate-retry-and-queue approach. This change will provide insight into the benefits and risks of such a design while limiting the downside.
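
To make the intended behavior concrete, here is a minimal Java sketch of a backoff schedule in which the first retry is immediate and later retries back off exponentially with jitter. The class name, the 250ms base delay, and the jitter strategy are illustrative assumptions, not the library's actual implementation.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.Random;

/**
 * Minimal sketch (not the actual implementation): the first retry proceeds
 * immediately, while subsequent retries back off exponentially with jitter.
 */
final class FirstRetryImmediateBackoff {
    private static final Duration BASE = Duration.ofMillis(250); // illustrative base delay
    private final Random random = new Random();
    private final int maxRetries;
    private int retriesSoFar;

    FirstRetryImmediateBackoff(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Returns the delay before the next retry, or empty once retries are exhausted. */
    Optional<Duration> nextBackoff() {
        retriesSoFar++;
        if (retriesSoFar > maxRetries) {
            return Optional.empty();
        }
        if (retriesSoFar == 1) {
            // The first retry is not penalized; it can run immediately,
            // ideally against a different node chosen by the selection strategy.
            return Optional.of(Duration.ZERO);
        }
        // Later retries still back off, so the extra load from the immediate
        // retry does not compound across the total allowed retries.
        long boundMillis = BASE.toMillis() * (1L << (retriesSoFar - 1));
        return Optional.of(Duration.ofMillis((long) (random.nextDouble() * boundMillis)));
    }
}
```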

@changelog-app

changelog-app bot commented Mar 5, 2020

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

The first retry does not incur a backoff penalty

Check the box to generate changelog(s)

  • Generate changelog entry

@ferozco
Contributor

ferozco commented Mar 5, 2020

Something seems off with the simulation; it's not clear why the metrics added in #486 are only showing up now.

@carterkozak
Contributor Author

The images contain a superset of the data we verify against; they don't appear to be regenerated unless the report is modified.

@carterkozak
Contributor Author

forced them to update here: #491

@ferozco
Contributor

ferozco commented Mar 5, 2020

Since both pinUntilError and RoundRobin will move to the next node if the request fails or if the channel is limited, this immediate retry should always hit a subsequent node.

As a side note: it seems slightly unexpected that pin until error effectively round-robins through its channels until it finds a channel that is not limited.
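
To illustrate that interaction, here is a minimal sketch (hypothetical names, not the actual node-selection API) of a selection strategy that advances past the node that just failed, so a zero-backoff first retry naturally targets a different host.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: a round-robin style selection that rotates on every
 * attempt, so an immediate (zero-backoff) first retry targets a different
 * host than the attempt that just failed.
 */
final class AdvancingNodeSelection {
    private final List<String> nodes;
    private final AtomicInteger index = new AtomicInteger();

    AdvancingNodeSelection(List<String> nodes) {
        this.nodes = nodes;
    }

    /** Picks the next node, rotating through the list on every call. */
    String next() {
        int i = Math.floorMod(index.getAndIncrement(), nodes.size());
        return nodes.get(i);
    }
}

// Usage: if the failed attempt used nodes.get(0), the immediate retry picks
// nodes.get(1), so the retry has a fresh chance of success without any backoff.
```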

@carterkozak
Contributor Author

it seems slightly unexpected that pin until error effectively round robins through its channels until it finds a channel that is not limited

Agreed. I don't think we have a great definition for what it means to be limited; I can imagine strong arguments in either direction.

carterkozak force-pushed the ckozak/first_retry_without_backoff branch from 679cbef to 836439b on March 5, 2020 16:23
@ferozco
Contributor

ferozco commented Mar 5, 2020

The point I'm getting at is that this change only seems beneficial if your node selection strategy causes the immediate retry to attempt a different node; otherwise you're effectively burning a retry. That couples the retryer to the node selection strategy. Therefore I'm not sure we want to make this change right now, especially if we're not in agreement on how pinUntilError should work in the face of limited channels.

@carterkozak
Contributor Author

Happy to punt this for now.

I'm not sure we'd necessarily be burning a retry, though our simulation makes it seem that way. Our simulations don't currently model the sort of network-layer flakes that this would improve. Not something worth spending much time optimizing for now.

carterkozak closed this Mar 5, 2020
iamdanfox deleted the ckozak/first_retry_without_backoff branch May 12, 2020 21:07