
fix #351 fix #302 fix #312 Replace QueuedChannel with a backoff based retryer #432

Merged: 17 commits merged into develop from ckozak/no_queue on Feb 26, 2020

Conversation

carterkozak (Contributor, Author):

==COMMIT_MSG==
Replace QueuedChannel with a backoff based retryer
==COMMIT_MSG==

@changelog-app (bot) commented Feb 25, 2020:

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Replace QueuedChannel with a backoff based retryer

Check the box to generate changelog(s)

  • Generate changelog entry


ExponentialBackoffStrategy(Duration backoffSlotSize, DoubleSupplier random) {
    this.backoffSlotSize = backoffSlotSize;
    this.random = random;
Contributor:
s/random/jitter/ - it's fine for this to always return 1 in testing, and be a random implementation in prod?

Contributor (Author):
It's important to get jitter in our simulations as well; otherwise we may cause waves of requests. In tests a constant is fine, but ideally we don't measure wall-clock time in tests and instead watch the request rate, etc.

        return 0L;
    }
    int upperBound = (int) Math.pow(2, failures - 1);
    return Math.round(backoffSlotSize.toNanos() * random.getAsDouble() * upperBound);
Contributor:
We're gonna plumb the BackoffStrategy thing in here right?

Contributor (Author):
We could, but I don't like it as written in CJR; I find it confusing that getting the duration mutates state and increments failures. I had an implementation that took a BackoffStrategy as a function from failure count to backoff duration, which worked fine, but without a need for alternative implementations it didn't add anything.
Thoughts?

Contributor:
Yeah, I'd definitely like to avoid doing the mutation as part of the getter. Fair enough that if we're always going to use the exponential strategy, then maybe factoring it out is unnecessary!
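
To make the shape being discussed concrete, here is a minimal sketch of a stateless exponential-backoff strategy expressed as a pure function from failure count to backoff duration, with jitter supplied externally (constant in tests, random in production). The class and method names are illustrative only, not the merged implementation.

import java.time.Duration;
import java.util.function.DoubleSupplier;

// Illustrative sketch only: full-jitter exponential backoff as a pure function of the
// failure count, so computing a duration never mutates strategy state.
final class ExponentialBackoffSketch {
    private final Duration backoffSlotSize;
    private final DoubleSupplier jitter; // value in [0, 1); constant in tests, random in prod

    ExponentialBackoffSketch(Duration backoffSlotSize, DoubleSupplier jitter) {
        this.backoffSlotSize = backoffSlotSize;
        this.jitter = jitter;
    }

    // Backoff in nanoseconds after the given number of consecutive failures.
    long backoffNanos(int failures) {
        if (failures <= 0) {
            return 0L;
        }
        int upperBound = (int) Math.pow(2, failures - 1);
        return Math.round(backoffSlotSize.toNanos() * jitter.getAsDouble() * upperBound);
    }
}

With this shape the retrying channel owns the failure counter, and the strategy stays trivially testable with a constant jitter supplier.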

@iamdanfox (Contributor):
Since this is quite a big change in behaviour, I'd like to try and articulate the pros and cons of this switch.

  • An upside of the old 'queued channel & instant retry' approach was that if at least one channel was not currently limited, you could get a request out the door instantly. Now, however, if the one node you were pinned to limits you, you may get the "Failed to make a request" exception, which triggers a backoff.

@carterkozak (Contributor, Author):
Upsides:

  • This allows us to implement per-endpoint limiting, because we don't need to worry about a queue. (We could make LimitedChannel extend Channel and try to find a non-limited channel before attempting to force our way through with a channel+endpoint combination that is likely broken.)
  • Easier to reason about, because a single limited request doesn't cause jitter for other requests triggered from separate threads slightly later.

@carterkozak (Contributor, Author):
(sorry to split these, poking at this between cleaning/dishes/etc)
Downsides:

  • Potentially lower throughput due to backoff rather than edge-triggering. In practice the BlacklistingChannel resulted in an approximation of backoff without the upsides, and it was too easy to introduce regressions.

@carterkozak carterkozak changed the title Replace QueuedChannel with a backoff based retryer fix #351 fix #302 fix #312 Replace QueuedChannel with a backoff based retryer Feb 26, 2020
if (config.maxNumRetries() > 0) {
    channel =
            new RetryingChannel(channel, config.maxNumRetries(), config.backoffSlotSize(), config.serverQoS());
}
Contributor:
cute little optimization :)

                return false;
            }
        }
    }
Contributor:
so much spiciness into the bin 🌶

Contributor (Author):
🥛:-)

if (log.isInfoEnabled()) {
    log.info(
            "Retrying call after failure",
            SafeArg.of("failures", failures),
            SafeArg.of("maxRetries", maxRetries),
            SafeArg.of("backoffNanoseconds", backoffNanoseconds),
Contributor:
could we do a TimeUnit.MILLISECONDS.convert here and log backoffMillis instead? It's really hard to eyeball such big numbers when they appear in the logs

Contributor (Author):
good call
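
For reference, a sketch of that tweak, assuming the same surrounding RetryingChannel context as the snippet above (log, failures, maxRetries, and backoffNanoseconds are the names already in scope there); only the backoffMillis conversion is new, and it requires java.util.concurrent.TimeUnit.

// Sketch of the suggested change: convert before logging so the value is easy to eyeball.
long backoffMillis = TimeUnit.MILLISECONDS.convert(backoffNanoseconds, TimeUnit.NANOSECONDS);
log.info(
        "Retrying call after failure",
        SafeArg.of("failures", failures),
        SafeArg.of("maxRetries", maxRetries),
        SafeArg.of("backoffMillis", backoffMillis));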

{"traceId":"3af31cbecda2b46c","parentSpanId":"697e878d5bfc5628","spanId":"28686cc7c2695fb2","type":"LOCAL","operation":"Dialogue-request-attempt","startTimeMicroSeconds":1582562924534905,"durationNanoSeconds":577068,"metadata":{}}
{"traceId":"3af31cbecda2b46c","parentSpanId":"efaf41a21cc7c6a9","spanId":"697e878d5bfc5628","type":"LOCAL","operation":"Dialogue-request initial","startTimeMicroSeconds":1582562924525312,"durationNanoSeconds":13461016,"metadata":{}}
{"traceId":"3af31cbecda2b46c","parentSpanId":null,"spanId":"efaf41a21cc7c6a9","type":"LOCAL","operation":"Dialogue-request","startTimeMicroSeconds":1582562924525264,"durationNanoSeconds":13530208,"metadata":{}}
{"traceId":"09b26a1be498af5c","parentSpanId":"49c88b0d4a7b609a","spanId":"802d8f9c0d0333a7","type":"LOCAL","operation":"Dialogue-http-request initial","startTimeMicroSeconds":1582681721129000,"durationNanoSeconds":7754255,"metadata":{}}
Contributor:
There is a lot of empty space in these - can we get a span containing the backoff duration in there somehow? (Could be as a FLUP)

Contributor (Author):
Ya, I'd prefer to do this in a follow-up

live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=82.4% client_mean=PT4.5664688S server_cpu=PT1H52M14.26S client_received=2500/2500 server_resps=2500 codes={200=2061, 500=439}
live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=59.0% client_mean=PT3.5693656S server_cpu=PT1H58M42.9S client_received=2500/2500 server_resps=2500 codes={200=1476, 500=1024}
live_reloading[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=58.6% client_mean=PT3.5376608S server_cpu=PT1H58M19S client_received=2500/2500 server_resps=2500 codes={200=1466, 500=1034}
live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=59.3% client_mean=PT2.717022131S server_cpu=PT1H21M18.049029538S client_received=2500/2500 server_resps=1865 codes={200=1483, 500=382, Failed to make a request=635}
@iamdanfox (Contributor), Feb 26, 2020:
I think the 82% -> 59% success makes sense: a bunch of requests are no longer hanging out on the queue, and instead a chunk of them come back as "Failed to make a request".
https://github.com/palantir/dialogue/blob/ckozak/no_queue/simulation/src/test/resources/report.md#live_reloadingconcurrency_limiter_blacklist_round_robin

// Avoid method reference allocations
@SuppressWarnings("UnnecessaryLambda")
private static final Supplier<ListenableFuture<Response>> limitedResultSupplier =
        () -> Futures.immediateFailedFuture(new SafeRuntimeException("Failed to make a request"));
Contributor:
I think we're going to need to give people more information in this exception message (or possibly a comment here), otherwise we're going to be fielding questions in #dev-foundry-infra from people asking why Dialogue is "breaking them" and telling them they can't make requests ;)

When people hit this, we probably don't want them to do their own retrying, right? Perhaps we should even put some instrumentation on this, so we know exactly how many times our client refused to send a request out the door.

Contributor (Author):
Agreed. I like the idea of replacing the limited-channel result of Optional<Future<Response>> with a union of either a limited-reason or a Future<Response>. I also like the idea of forcefully attempting a request through the limiter when all nodes have limited us, to avoid purely client-side badness.
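
As a purely hypothetical sketch of that union (not the merged LimitedChannel API; names are illustrative, and Response is assumed to be Dialogue's Response interface), the result type could carry either the response future or the reason the request was not attempted:

import com.google.common.util.concurrent.ListenableFuture;
import java.util.Optional;

// Hypothetical sketch only: either a response future, or the reason the channel refused to try.
final class LimitedResultSketch {
    private final Optional<ListenableFuture<Response>> response;
    private final Optional<String> limitedReason;

    private LimitedResultSketch(Optional<ListenableFuture<Response>> response, Optional<String> limitedReason) {
        this.response = response;
        this.limitedReason = limitedReason;
    }

    static LimitedResultSketch response(ListenableFuture<Response> future) {
        return new LimitedResultSketch(Optional.of(future), Optional.empty());
    }

    static LimitedResultSketch limited(String reason) {
        return new LimitedResultSketch(Optional.empty(), Optional.of(reason));
    }

    Optional<ListenableFuture<Response>> response() {
        return response;
    }

    Optional<String> limitedReason() {
        return limitedReason;
    }
}

Carrying a reason would let the eventual exception (or a log line) explain why the client refused to send the request, rather than the generic "Failed to make a request".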

@iamdanfox (Contributor) left a comment:

This is clearly a lot closer to the client-behaviour people are used to, so should be less scary to roll out. Excited that it gives us the possibility of bringing back per-endpoint smartness.

I think we might want to come back to the idea of immediate failover again at some point, but given that there's so much c-j-r pain right now, I'd be happy to get this out first.

@bulldozer-bot bulldozer-bot bot merged commit 7d32799 into develop Feb 26, 2020
@bulldozer-bot bulldozer-bot bot deleted the ckozak/no_queue branch February 26, 2020 16:44
@svc-autorelease (Collaborator):

Released 0.10.2

Executors.newSingleThreadScheduledExecutor(new ThreadFactoryBuilder()
        .setNameFormat("dialogue-RetryingChannel-scheduler-%d")
        .setDaemon(false)
        .build()))));
Contributor:
One more flup please: can we have instrumentation on this, e.g. how many things are currently scheduled on it? Or will there be a way that I can pass a WC scheduled executor to this? We also possibly want an uncaught exception handler, given the recent spiciness we discovered.

Contributor (Author):
Good idea, it would be nice to get metrics on this and tracing identical to other components.
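
A rough sketch of what that follow-up could look like, assuming Guava's ThreadFactoryBuilder and a plain ScheduledThreadPoolExecutor; the exception handler, logger, and gauge suggestion are illustrative, not what ultimately shipped.

import com.google.common.util.concurrent.ThreadFactoryBuilder;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only: a named retry scheduler with an uncaught exception handler,
// whose queue size could back a gauge answering "how many retries are currently waiting?".
final class RetrySchedulerSketch {
    private static final Logger log = LoggerFactory.getLogger(RetrySchedulerSketch.class);

    private RetrySchedulerSketch() {}

    static ScheduledThreadPoolExecutor create() {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(
                1,
                new ThreadFactoryBuilder()
                        .setNameFormat("dialogue-RetryingChannel-scheduler-%d")
                        .setDaemon(false)
                        .setUncaughtExceptionHandler((thread, throwable) ->
                                log.error("Uncaught exception in retry scheduler", throwable))
                        .build());
        // scheduler.getQueue().size() reports pending retries; registering it as a metrics
        // gauge would provide the requested instrumentation.
        return scheduler;
    }
}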
