
Retry GET & HEAD requests on server 500s #629

Merged · 18 commits merged into develop on Apr 14, 2020
Conversation

@iamdanfox (Contributor) commented Apr 8, 2020

Before this PR

Our simulations show a lot of user-facing failures because whenever a server returns a 500, we just give up and die. Retrying all 500s would be too dangerous, because servers may have already performed side-effects before failing, so blind retries could trigger duplicate side-effects. We also don't have a first-class concept of 'idempotence' in conjure.

HOWEVER, it turns out RFC 7231 actually defines which HTTP methods are ok to retry, by specifying the concepts of 'safe' and 'idempotent'.
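As an illustration, a minimal sketch of gating retries on the RFC 7231 "safe" methods (class and method names here are hypothetical, not dialogue's actual code):

```java
import java.util.Set;

/** Sketch: only RFC 7231 "safe" methods (GET, HEAD) are retried after a 500. */
final class SafeMethods {
    private static final Set<String> SAFE = Set.of("GET", "HEAD");

    static boolean isRetryableAfter500(String httpMethod) {
        return SAFE.contains(httpMethod);
    }

    public static void main(String[] args) {
        System.out.println(isRetryableAfter500("GET"));  // prints "true"
        System.out.println(isRetryableAfter500("POST")); // prints "false"
    }
}
```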

After this PR

==COMMIT_MSG==
dialogue now retries GET and HEAD requests after server 500s.
==COMMIT_MSG==

(Amusingly, it seems nginx actually retried everything for a while, and only stopped resending POST requests in 2016.)

GRAPHS

Possible downsides?

  • people may have defined non-idempotent endpoints using GET/HEAD/PUT/DELETE methods, and this PR would cause them to receive retries, possibly triggering duplicate side-effects. I think this is unlikely, because nginx and squid already retry these.
  • for a truly-broken server, this will increase the time it takes a user to receive a 5xx response
  • if a server suddenly starts returning 5xxs for everything, it might suddenly see roughly 4x the request rate, because all clients will start retrying!
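The 4x figure in the last bullet follows from a per-request attempt cap; a minimal sketch (MAX_ATTEMPTS and the loop shape are assumptions for illustration, not dialogue's actual implementation):

```java
import java.util.function.IntSupplier;

/** Sketch: retry a safe request up to MAX_ATTEMPTS times on 5xx responses. */
final class RetryLoop {
    // Hypothetical cap: with 4 attempts, a server that fails every request
    // sees up to 4x the traffic from retrying clients.
    static final int MAX_ATTEMPTS = 4;

    static int execute(IntSupplier sendRequest) {
        int status = 0;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            status = sendRequest.getAsInt();
            if (status < 500) {
                return status; // success (or a non-retryable 4xx): stop retrying
            }
            // exponential backoff between attempts would go here
        }
        return status; // retries exhausted: the caller sees the final 5xx
    }

    public static void main(String[] args) {
        int[] calls = {0};
        int status = execute(() -> { calls[0]++; return 500; });
        System.out.println(status + " after " + calls[0] + " attempts"); // prints "500 after 4 attempts"
    }
}
```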

@changelog-app bot commented Apr 8, 2020

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

dialogue now retries GET and HEAD requests after server 500s.

Check the box to generate changelog(s)

  • Generate changelog entry

one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=64.3% client_mean=PT0.6036528S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1608, 500=892}
one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=65.5% client_mean=PT0.6S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1638, 500=862}
one_endpoint_dies_on_each_server[UNLIMITED_ROUND_ROBIN].txt: success=65.6% client_mean=PT0.6S server_cpu=PT25M client_received=2500/2500 server_resps=2500 codes={200=1639, 500=861}
one_endpoint_dies_on_each_server[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=97.6% client_mean=PT1.970194456S server_cpu=PT41M42.6S client_received=2500/2500 server_resps=4171 codes={200=2441, 500=59}
@iamdanfox (Contributor, Author) commented Apr 8, 2020

black_hole[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=30.0% client_mean=PT0.6S server_cpu=PT6M client_received=600/2000 server_resps=600 codes={200=600}
black_hole[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=90.0% client_mean=PT0.600011117S server_cpu=PT17M59.4S client_received=1799/2000 server_resps=1799 codes={200=1799}
black_hole[UNLIMITED_ROUND_ROBIN].txt: success=68.3% client_mean=PT0.6S server_cpu=PT13M39.6S client_received=1366/2000 server_resps=1366 codes={200=1366}
drastic_slowdown[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT0.075961S server_cpu=PT5M3.844S client_received=4000/4000 server_resps=4000 codes={200=4000}
drastic_slowdown[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=100.0% client_mean=PT2.060500083S server_cpu=PT2H17M22.000333313S client_received=4000/4000 server_resps=4000 codes={200=4000}
drastic_slowdown[UNLIMITED_ROUND_ROBIN].txt: success=100.0% client_mean=PT9.158069333S server_cpu=PT10H10M32.277333313S client_received=4000/4000 server_resps=4000 codes={200=4000}
fast_500s_then_revert[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT0.080925333S server_cpu=PT5M3.47S client_received=3750/3750 server_resps=3750 codes={200=3750}
fast_500s_then_revert[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=76.0% client_mean=PT0.055333733S server_cpu=PT3M27.501499711S client_received=3750/3750 server_resps=3750 codes={200=2849, 500=901}
fast_500s_then_revert[UNLIMITED_ROUND_ROBIN].txt: success=76.0% client_mean=PT0.055333733S server_cpu=PT3M27.501499711S client_received=3750/3750 server_resps=3750 codes={200=2849, 500=901}
fast_500s_then_revert[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=99.2% client_mean=PT0.173513629S server_cpu=PT4M53.194832836S client_received=3750/3750 server_resps=5217 codes={200=3721, 500=29}
@iamdanfox (Contributor, Author) commented:

dramatic improvement here too. I know multipass had a campaign of getting all their heavy users to switch to ROUND_ROBIN, so I think this is not pointless.

https://github.com/palantir/dialogue/blob/dfox/retry-5xx/simulation/src/test/resources/report.md#fast_500s_then_revertconcurrency_limiter_round_robin

@carterkozak (Contributor) commented:

I think our nginx instances only retry on 503s and connection failures, not on other 5xx responses, but I may be misremembering.

I'm generally in favor of this idea, but worried about wrecking non-idempotent PUT endpoints, where this could have catastrophic consequences. Worth a broad email and a bit of runway before we roll it out broadly. Let's discuss tomorrow.

@iamdanfox (Contributor, Author) commented Apr 9, 2020

Also mention this in the README, and possibly include the keyword 'may' in the Conjure spec.

@iamdanfox iamdanfox changed the title Retry 5xx for 'safe' and idempotent requests only Retry GET & HEAD requests on server 500s Apr 9, 2020
@iamdanfox (Contributor, Author) commented:

Just a thought: do you think it would be interesting to also emit metrics for when we fully exhaust retries? That seems interesting in the case of a big spike of requests, where concurrency limiters quickly become saturated.

@carterkozak (Contributor) commented:

I put this together a while ago but never merged it: #527

@iamdanfox (Contributor, Author) commented Apr 9, 2020

Mmm, interesting. Maybe another way would be to just dial the "Exhausted {} retries, returning a retryable response with status {}" log message up from debug to info? Seems like it shouldn't be noisy most of the time.

@carterkozak (Contributor) commented:

It depends: we have a separate debug log line for the retryable-http-response path as well as for the exception path. I don't think we want to log the exception path at info, because the failure will be thrown at the callsite and eventually logged elsewhere, but I suppose we could log the rest of that data at info and leave out the trace?
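A sketch of the split being discussed, using java.util.logging so the example is self-contained (the level choice mirrors the proposal in this thread, not shipped dialogue behaviour, and all names here are hypothetical):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch: log retry exhaustion at INFO for the HTTP-response path only. */
final class RetryExhaustionLogging {
    private static final Logger log = Logger.getLogger("RetryExhaustionLogging");

    // The exception path stays at debug (FINE) because the throwable is
    // rethrown and eventually logged at the callsite; the response path is
    // promoted to INFO so operators can see exhausted retries.
    static Level levelFor(boolean isResponsePath) {
        return isResponsePath ? Level.INFO : Level.FINE;
    }

    static void logExhausted(boolean isResponsePath, int retries, int status) {
        log.log(levelFor(isResponsePath),
                "Exhausted {0} retries, returning a retryable response with status {1}",
                new Object[] {retries, status});
    }
}
```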

success=100.0% client_mean=PT4.921208174S server_cpu=PT25M43.2S client_received=2000/2000 server_resps=2572 codes={200=2000}

I'm not sure we want to simulate using GET requests as they're underrepresented across the fleet. Thoughts?

@carterkozak (Contributor) commented:

I'm happy with this 👍, it's as safe as it can be :-)

@bulldozer-bot merged commit 1d55b61 into develop on Apr 14, 2020
@bulldozer-bot deleted the dfox/retry-5xx branch on April 14, 2020 17:01
@svc-autorelease (Collaborator) commented:

Released 1.17.0
