Retry GET & HEAD requests on server 500s #629
Conversation
one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=64.3% client_mean=PT0.6036528S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1608, 500=892}
one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=65.5% client_mean=PT0.6S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1638, 500=862}
one_endpoint_dies_on_each_server[UNLIMITED_ROUND_ROBIN].txt: success=65.6% client_mean=PT0.6S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1639, 500=861}
one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=97.6% client_mean=PT1.970194456S server_cpu=PT41M42.6S client_received=2500/2500 server_resps=4171 codes={200=2441, 500=59}
Big improvements in user-perceived success rate (at the cost of slightly slower mean): https://github.com/palantir/dialogue/blob/dfox/retry-5xx/simulation/src/test/resources/report.md#one_endpoint_dies_on_each_serverconcurrency_limiter_pin_until_error
black_hole[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=30.0% client_mean=PT0.6S server_cpu=PT6M client_received=600/2000 server_resps=600 codes={200=600}
black_hole[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=90.0% client_mean=PT0.600011117S server_cpu=PT17M59.4S client_received=1799/2000 server_resps=1799 codes={200=1799}
black_hole[UNLIMITED_ROUND_ROBIN].txt: success=68.3% client_mean=PT0.6S server_cpu=PT13M39.6S client_received=1366/2000 server_resps=1366 codes={200=1366}
drastic_slowdown[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT0.075961S server_cpu=PT5M3.844S client_received=4000/4000 server_resps=4000 codes={200=4000}
drastic_slowdown[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=100.0% client_mean=PT2.060500083S server_cpu=PT2H17M22.000333313S client_received=4000/4000 server_resps=4000 codes={200=4000}
drastic_slowdown[UNLIMITED_ROUND_ROBIN].txt: success=100.0% client_mean=PT9.158069333S server_cpu=PT10H10M32.277333313S client_received=4000/4000 server_resps=4000 codes={200=4000}
fast_500s_then_revert[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT0.080925333S server_cpu=PT5M3.47S client_received=3750/3750 server_resps=3750 codes={200=3750}
fast_500s_then_revert[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=76.0% client_mean=PT0.055333733S server_cpu=PT3M27.501499711S client_received=3750/3750 server_resps=3750 codes={200=2849, 500=901}
fast_500s_then_revert[UNLIMITED_ROUND_ROBIN].txt: success=76.0% client_mean=PT0.055333733S server_cpu=PT3M27.501499711S client_received=3750/3750 server_resps=3750 codes={200=2849, 500=901}
fast_500s_then_revert[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=99.2% client_mean=PT0.173513629S server_cpu=PT4M53.194832836S client_received=3750/3750 server_resps=5217 codes={200=3721, 500=29}
Dramatic improvement here too. I know multipass had a campaign of getting all their heavy users to switch to ROUND_ROBIN, so I think this is not pointless.
I think our nginx instances only retry on 503s and connection failures, not on other 5xx responses, but I may be misremembering. I'm generally in favor of this idea, but worried about wrecking non-idempotent PUT endpoints, where this could have catastrophic consequences. Worth a broad email and a bit of runway before we roll it out broadly. Let's discuss tomorrow.
Also mention in the README, possibly include the keyword
Just a thought: do you think it would be interesting to also get metrics for when we fully exhaust retries? This seems kinda interesting in the case of a big spike of requests where concurrency limiters quickly become saturated.
I put this together a while ago but never merged it: #527
Mmm interesting. Maybe another way would be to just dial up the
It depends; we have a separate debug log line for the retryable http response path as well as the exception path. I don't think we want to log the exception path at info, because the failure will be thrown at the callsite and eventually logged elsewhere, but I suppose we could log the rest of that data at info and leave out the trace?
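For what it's worth, here is a minimal sketch of the retries-exhausted metric idea from this thread, using plain Dropwizard Metrics rather than whatever dialogue actually wires in; the metric name `client.request.retries-exhausted` is made up for illustration and is not an existing dialogue metric.

```java
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

/**
 * Sketch of the "retries exhausted" metric idea discussed above. Uses plain
 * Dropwizard Metrics; the metric name is hypothetical, not an existing
 * dialogue metric.
 */
final class RetryExhaustionMetrics {
    private final Meter retriesExhausted;

    RetryExhaustionMetrics(MetricRegistry registry) {
        // Marked once per request whose final retry attempt has failed.
        this.retriesExhausted = registry.meter("client.request.retries-exhausted");
    }

    /** Call when the last attempt fails and the error is about to propagate to the caller. */
    void markExhausted() {
        retriesExhausted.mark();
    }
}
```

A spike in that meter relative to overall request volume would surface the saturated-concurrency-limiter scenario described above without having to grep debug logs.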
success=100.0% client_mean=PT4.921208174S server_cpu=PT25M43.2S client_received=2000/2000 server_resps=2572 codes={200=2000}
I'm not sure we want to simulate using GET requests as they're underrepresented across the fleet. Thoughts?
I'm happy with this 👍, it's as safe as it can be :-)
Released 1.17.0
Before this PR
Our simulations show a lot of user-facing failures because whenever a server returns a 500 we just give up and die. Obviously retrying all 500s is too dangerous, because servers may have already performed side effects before they fail, and a retry would repeat those side effects. We also don't have a first-class concept of 'idempotence' in Conjure.
HOWEVER, it turns out RFC 7231 actually defines which HTTP methods are OK to retry, by specifying the concepts of `safe` and `idempotent`.

After this PR
==COMMIT_MSG==
dialogue now retries GET and HEAD requests after server 500s.
==COMMIT_MSG==
(Amusingly, it seems nginx actually did retry everything for a while and only stopped resending POST requests in 2016).
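For illustration only, here is a minimal sketch of the kind of method check RFC 7231 makes possible; the class and method names are hypothetical, and this is not the actual RetryingChannel code.

```java
import java.util.Locale;
import java.util.Set;

/**
 * Illustrative sketch only -- not the actual RetryingChannel implementation.
 * RFC 7231 calls GET and HEAD "safe" (no server-side side effects), which is
 * what makes resending them after a server error acceptable.
 */
final class RetryableMethods {
    // RFC 7231 also lists OPTIONS and TRACE as safe; this PR limits retries to GET and HEAD.
    private static final Set<String> RETRYABLE = Set.of("GET", "HEAD");

    private RetryableMethods() {}

    /** True when a request with this method can be resent without risking duplicated side effects. */
    static boolean safeToRetry(String httpMethod) {
        return RETRYABLE.contains(httpMethod.toUpperCase(Locale.ROOT));
    }
}
```

A real channel would additionally check that the response was a retryable server error, and respect the configured retry budget, before resending.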
GRAPHS
Possible downsides?