The first retry does not require backoff #490
Conversation
Something seems off with the simulation; it's not clear why the metrics added in #486 are only showing up now.
The images contain a superset of the data we verify against; they don't appear to be regenerated unless the report is modified.
Forced them to update here: #491
Since both pinUntilError and RoundRobin will move to the next node if the request fails or if the channel is limited, this immediate retry should always hit a subsequent node. As a side note: it seems slightly unexpected that pinUntilError effectively round-robins through its channels until it finds a channel that is not limited.
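For illustration only, here is a minimal sketch of the selection behavior being described, using hypothetical names rather than the library's real API: a pin-until-error style selector prefers its pinned channel but scans forward past limited channels, which is why an immediate retry of a failed or limited request should land on a subsequent node.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical channel abstraction for this sketch; not the library's actual interface. */
interface Channel {
    boolean isLimited();
}

/**
 * Sketch of a pin-until-error style selector: it prefers the pinned channel, but if that
 * channel is limited it scans forward to the next non-limited one, which is why it
 * effectively round-robins while channels are limited.
 */
final class PinUntilErrorSketch {
    private final List<Channel> channels;
    private int pinned = 0;

    PinUntilErrorSketch(List<Channel> channels) {
        this.channels = channels;
    }

    Optional<Channel> select() {
        for (int offset = 0; offset < channels.size(); offset++) {
            Channel candidate = channels.get((pinned + offset) % channels.size());
            if (!candidate.isLimited()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // every channel is limited
    }

    /** Only an actual failure moves the pin to the next channel. */
    void onError() {
        pinned = (pinned + 1) % channels.size();
    }
}
```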
Agreed. I don't think we have a great definition of what it means to be limited; I can imagine strong arguments in either direction.
There are many scenarios where requests fail and can be retried immediately. This can produce higher load, but it's limited to the first retry, not the total allowed retries. In many scenarios retries occur on another node, producing an even greater probability of success.
force-pushed from 679cbef to 836439b
The point I'm getting at is that this change only seems beneficial if your node selection strategy causes the immediate retry to attempt a different node; otherwise you're effectively burning a retry, hence coupling the retryer to the node selection strategy. Therefore I'm not sure we want to make this change right now, especially if we're not in agreement on how pinUntilError should behave in the face of limited channels.
Happy to punt on this for now. I'm not sure we'd necessarily be burning a retry, though our simulation makes it seem that way. Our simulations don't currently model the sort of network-layer flakes that this would improve. Not something worth spending much time optimizing for now.
There are many scenarios where requests fail and can be retried immediately. This can produce higher load, but it's limited to the first retry, not the total allowed retries. In many scenarios retries occur on another node, producing an even greater probability of success.
Before this PR
Any time a request failed (potentially because the channel was limited), it would wait for the backoff before being retried.
After this PR
==COMMIT_MSG==
The first retry does not incur a backoff penalty
==COMMIT_MSG==
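As an illustration of the behavior above, here is a minimal sketch (hypothetical names and an illustrative 250ms base slot, not the project's actual retryer API) in which the first retry is scheduled immediately and exponential backoff with full jitter only applies from the second retry onward:

```java
import java.time.Duration;
import java.util.Random;

/** Hypothetical sketch of the retry schedule described above; not the project's real retryer. */
final class BackoffSketch {
    private static final Duration BASE = Duration.ofMillis(250); // illustrative base slot
    private final Random random = new Random();

    /** @param retryNumber 1 for the first retry, 2 for the second, and so on. */
    Duration delayBeforeRetry(int retryNumber) {
        if (retryNumber <= 1) {
            // After this PR: the first retry is scheduled immediately.
            return Duration.ZERO;
        }
        // Subsequent retries keep exponential backoff with full jitter.
        long upperBoundMillis = BASE.toMillis() * (1L << (retryNumber - 1));
        return Duration.ofMillis((long) (random.nextDouble() * upperBoundMillis));
    }
}
```

With these illustrative numbers, a request that keeps failing is retried immediately once, then after at most roughly 500ms, 1s, and so on, which is why the extra load is confined to the first retry.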
Possible downsides?
Failures produce more immediate load: two requests within a millisecond instead of waiting. This risk doesn't scale with the number of retries because backoffs are still applied after the first retry.
As requests traverse services deeper in the stack, there is more risk that doubling the initial spike may produce load, but a factor of two is more palatable than the risk of our previous immediate-retry-and-queue approach. This will provide insight into the benefits and risks of such a design while limiting the downside.
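To make the factor-of-two point concrete, a back-of-envelope sketch (illustrative numbers only) comparing the expected number of immediate requests per original call when only the first retry is immediate versus a hypothetical approach where every retry is immediate:

```java
/** Back-of-envelope estimate of immediate load per original call; illustrative values only. */
final class ImmediateLoadEstimate {
    public static void main(String[] args) {
        double failureProbability = 0.1; // illustrative value
        int maxRetries = 4;              // illustrative value

        // First retry immediate, later retries backed off: at most one extra immediate request.
        double firstRetryOnly = 1 + failureProbability;

        // Hypothetical "all retries immediate" approach: geometric series up to maxRetries.
        double allRetriesImmediate = 0;
        for (int i = 0; i <= maxRetries; i++) {
            allRetriesImmediate += Math.pow(failureProbability, i);
        }

        System.out.printf("first-retry-only immediate load factor: %.3f%n", firstRetryOnly);
        System.out.printf("all-retries-immediate load factor:      %.3f%n", allRetriesImmediate);
    }
}
```

Even as the failure probability approaches 1, the first-retry-only factor is bounded by 2, while the all-retries-immediate factor grows with the retry budget.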