
Retry GET & HEAD requests on server 500s #629

Merged · 18 commits merged into develop on Apr 14, 2020
Conversation

@iamdanfox (Contributor) commented Apr 8, 2020

Before this PR

Our simulations show a lot of user-facing failures because whenever a server returns a 500, we just give up and die. Retrying all 500s would be too dangerous, because servers may have already performed side-effects before failing, so blind retries could trigger duplicate side-effects. We also don't have a first-class concept of 'idempotence' in conjure.

HOWEVER, it turns out RFC 7231 actually defines which HTTP methods are ok to retry, by specifying the concepts of 'safe' and 'idempotent'.
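As an illustration, a minimal sketch of gating retries on the RFC 7231 "safe" methods (class and method names here are hypothetical, not dialogue's actual code):

```java
import java.util.Set;

/** Sketch: only RFC 7231 "safe" methods (GET, HEAD) are retried after a 500. */
final class SafeMethods {
    private static final Set<String> SAFE = Set.of("GET", "HEAD");

    static boolean isRetryableAfter500(String httpMethod) {
        return SAFE.contains(httpMethod);
    }

    public static void main(String[] args) {
        System.out.println(isRetryableAfter500("GET"));  // prints "true"
        System.out.println(isRetryableAfter500("POST")); // prints "false"
    }
}
```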

After this PR

==COMMIT_MSG==
dialogue now retries GET and HEAD requests after server 500s.
==COMMIT_MSG==

(Amusingly, it seems nginx actually retried everything for a while, and only stopped resending POST requests in 2016.)

GRAPHS

Possible downsides?

  • people may have defined non-idempotent endpoints using GET/HEAD/PUT/DELETE methods, and this PR would cause them to receive retries, possibly triggering duplicate side-effects. I think this is unlikely, because nginx and squid already retry these.
  • for a truly-broken server, this will increase the time it takes a user to receive a 5xx response
  • if a server suddenly starts returning 5xxs for everything, it might suddenly see roughly 4x the request rate, because all clients will start retrying!
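The 4x figure in the last bullet follows from a per-request attempt cap; a minimal sketch (MAX_ATTEMPTS and the loop shape are assumptions for illustration, not dialogue's actual implementation):

```java
import java.util.function.IntSupplier;

/** Sketch: retry a safe request up to MAX_ATTEMPTS times on 5xx responses. */
final class RetryLoop {
    // Hypothetical cap: with 4 attempts, a server that fails every request
    // sees up to 4x the traffic from retrying clients.
    static final int MAX_ATTEMPTS = 4;

    static int execute(IntSupplier sendRequest) {
        int status = 0;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            status = sendRequest.getAsInt();
            if (status < 500) {
                return status; // success (or a non-retryable 4xx): stop retrying
            }
            // exponential backoff between attempts would go here
        }
        return status; // retries exhausted: the caller sees the final 5xx
    }

    public static void main(String[] args) {
        int[] calls = {0};
        int status = execute(() -> { calls[0]++; return 500; });
        System.out.println(status + " after " + calls[0] + " attempts"); // prints "500 after 4 attempts"
    }
}
```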

@changelog-app bot commented Apr 8, 2020

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

dialogue now retries GET and HEAD requests after server 500s.

Check the box to generate changelog(s)

  • Generate changelog entry

one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=64.3% client_mean=PT0.6036528S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1608, 500=892}
one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=65.5% client_mean=PT0.6S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1638, 500=862}
one_endpoint_dies_on_each_server[UNLIMITED_ROUND_ROBIN].txt: success=65.6% client_mean=PT0.6S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1639, 500=861}
one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=97.6% client_mean=PT1.970194456S server_cpu=PT41M42.6S client_received=2500/2500 server_resps=4171 codes={200=2441, 500=59}
@iamdanfox (Contributor, Author) commented Apr 8, 2020

black_hole[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=30.0% client_mean=PT0.6S server_cpu=PT6M client_received=600/2000 server_resps=600 codes={200=600}
black_hole[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=90.0% client_mean=PT0.600011117S server_cpu=PT17M59.4S client_received=1799/2000 server_resps=1799 codes={200=1799}
black_hole[UNLIMITED_ROUND_ROBIN].txt: success=68.3% client_mean=PT0.6S server_cpu=PT13M39.6S client_received=1366/2000 server_resps=1366 codes={200=1366}
drastic_slowdown[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT0.075961S server_cpu=PT5M3.844S client_received=4000/4000 server_resps=4000 codes={200=4000}
drastic_slowdown[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=100.0% client_mean=PT2.060500083S server_cpu=PT2H17M22.000333313S client_received=4000/4000 server_resps=4000 codes={200=4000}
drastic_slowdown[UNLIMITED_ROUND_ROBIN].txt: success=100.0% client_mean=PT9.158069333S server_cpu=PT10H10M32.277333313S client_received=4000/4000 server_resps=4000 codes={200=4000}
fast_500s_then_revert[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT0.080925333S server_cpu=PT5M3.47S client_received=3750/3750 server_resps=3750 codes={200=3750}
fast_500s_then_revert[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=76.0% client_mean=PT0.055333733S server_cpu=PT3M27.501499711S client_received=3750/3750 server_resps=3750 codes={200=2849, 500=901}
fast_500s_then_revert[UNLIMITED_ROUND_ROBIN].txt: success=76.0% client_mean=PT0.055333733S server_cpu=PT3M27.501499711S client_received=3750/3750 server_resps=3750 codes={200=2849, 500=901}
fast_500s_then_revert[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=99.2% client_mean=PT0.173513629S server_cpu=PT4M53.194832836S client_received=3750/3750 server_resps=5217 codes={200=3721, 500=29}
@iamdanfox (Contributor, Author) commented:

dramatic improvement here too. I know multipass had a campaign of getting all their heavy users to switch to ROUND_ROBIN, so I think this is not pointless.

https://github.com/palantir/dialogue/blob/dfox/retry-5xx/simulation/src/test/resources/report.md#fast_500s_then_revertconcurrency_limiter_round_robin

@carterkozak (Contributor) commented:

I think our nginx instances only retry on 503s and connection failures, not on other 5xx responses, but I may be misremembering.

I'm generally in favor of this idea, but worried about wrecking non-idempotent PUT endpoints, where this could have catastrophic consequences. Worth a broad email and a bit of runway before we roll it out broadly. Let's discuss tomorrow.

@iamdanfox (Contributor, Author) commented Apr 9, 2020

Also mention this in the README, and possibly include the keyword 'may' in the Conjure spec.

@iamdanfox iamdanfox changed the title Retry 5xx for 'safe' and idempotent requests only Retry GET & HEAD requests on server 500s Apr 9, 2020
@iamdanfox (Contributor, Author) commented:

Just a thought: do you think it would be interesting to also emit metrics for when we fully exhaust retries? That seems interesting in the case of a big spike of requests, where concurrency limiters quickly become saturated.

@carterkozak (Contributor) commented:

I put this together a while ago but never merged it: #527

@iamdanfox (Contributor, Author) commented Apr 9, 2020

Mmm, interesting. Maybe another way would be to just dial the "Exhausted {} retries, returning a retryable response with status {}" log message up from debug to info? Seems like it shouldn't be noisy most of the time.

@carterkozak (Contributor) commented:

It depends: we have a separate debug log line for the retryable-http-response path as well as for the exception path. I don't think we want to log the exception path at info, because the failure will be thrown at the callsite and eventually logged elsewhere, but I suppose we could log the rest of that data at info and leave out the trace?
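A sketch of the split being discussed, using java.util.logging so the example is self-contained (the level choice mirrors the proposal in this thread, not shipped dialogue behaviour, and all names here are hypothetical):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch: log retry exhaustion at INFO for the HTTP-response path only. */
final class RetryExhaustionLogging {
    private static final Logger log = Logger.getLogger("RetryExhaustionLogging");

    // The exception path stays at debug (FINE) because the throwable is
    // rethrown and eventually logged at the callsite; the response path is
    // promoted to INFO so operators can see exhausted retries.
    static Level levelFor(boolean isResponsePath) {
        return isResponsePath ? Level.INFO : Level.FINE;
    }

    static void logExhausted(boolean isResponsePath, int retries, int status) {
        log.log(levelFor(isResponsePath),
                "Exhausted {0} retries, returning a retryable response with status {1}",
                new Object[] {retries, status});
    }
}
```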

success=100.0% client_mean=PT4.921208174S server_cpu=PT25M43.2S client_received=2000/2000 server_resps=2572 codes={200=2000}

I'm not sure we want to simulate using GET requests as they're underrepresented across the fleet. Thoughts?

@carterkozak (Contributor) commented:

I'm happy with this 👍, it's as safe as it can be :-)

@bulldozer-bot merged commit 1d55b61 into develop on Apr 14, 2020
@bulldozer-bot deleted the dfox/retry-5xx branch on April 14, 2020 17:01
@svc-autorelease (Collaborator) commented:

Released 1.17.0
