
fix #351 fix #302 fix #312 Replace QueuedChannel with a backoff based retryer #432

Merged: 17 commits merged into develop from ckozak/no_queue on Feb 26, 2020

Conversation

carterkozak (Contributor, Author):

==COMMIT_MSG==
Replace QueuedChannel with a backoff based retryer
==COMMIT_MSG==

@changelog-app (bot) commented Feb 25, 2020:

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Replace QueuedChannel with a backoff based retryer

Check the box to generate changelog(s)

  • Generate changelog entry


ExponentialBackoffStrategy(Duration backoffSlotSize, DoubleSupplier random) {
    this.backoffSlotSize = backoffSlotSize;
    this.random = random;
Contributor:
s/random/jitter/ - it's fine for this to always return 1 in testing, and be a random implementation in prod?

Contributor (Author):
It's important to get jitter in our simulations as well; otherwise we may cause waves of requests. In tests a constant is fine, but ideally we don't measure wall-clock time in tests and instead watch the request rate, etc.

        return 0L;
    }
    int upperBound = (int) Math.pow(2, failures - 1);
    return Math.round(backoffSlotSize.toNanos() * random.getAsDouble() * upperBound);
Contributor:
We're gonna plumb the BackoffStrategy thing in here right?

Contributor (Author):
We could, but I don't like it as written in CJR; I find it confusing that getting the duration mutates state and increments failures. I had an implementation that took a BackoffStrategy as a function from failure count to backoff duration, which worked fine, but without a need for alternative implementations it didn't add anything.
Thoughts?

Contributor:
Yeah, I'd definitely like to avoid doing the mutation as part of the getter. Fair enough that if we're always going to use the exponential strategy, then maybe factoring it out is unnecessary!
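
To make the shape being discussed concrete, here is a minimal sketch of a stateless exponential-backoff strategy expressed as a pure function from failure count to backoff duration, with jitter supplied externally (constant in tests, random in production). The class and method names are illustrative only, not the merged implementation.

import java.time.Duration;
import java.util.function.DoubleSupplier;

// Illustrative sketch only: full-jitter exponential backoff as a pure function of the
// failure count, so computing a duration never mutates strategy state.
final class ExponentialBackoffSketch {
    private final Duration backoffSlotSize;
    private final DoubleSupplier jitter; // value in [0, 1); constant in tests, random in prod

    ExponentialBackoffSketch(Duration backoffSlotSize, DoubleSupplier jitter) {
        this.backoffSlotSize = backoffSlotSize;
        this.jitter = jitter;
    }

    // Backoff in nanoseconds after the given number of consecutive failures.
    long backoffNanos(int failures) {
        if (failures <= 0) {
            return 0L;
        }
        int upperBound = (int) Math.pow(2, failures - 1);
        return Math.round(backoffSlotSize.toNanos() * jitter.getAsDouble() * upperBound);
    }
}

With this shape the retrying channel owns the failure counter, and the strategy stays trivially testable with a constant jitter supplier.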

@iamdanfox (Contributor):
Since this is quite a big change in behaviour, I'd like to try and articulate the pros and cons of this switch.

  • An upside of the old 'queued channel & instant retry' approach was that if at least one channel was not currently limited, you could get a request out the door instantly. Now, however, if the one node you were pinned to limits you, you may get the "Failed to make a request" exception, which triggers a backoff.

@carterkozak (Contributor, Author):
Upsides:

  • This allows us to implement per-endpoint limiting, because we don't need to worry about a queue. (We could make LimitedChannel extend Channel and try to find a non-limited channel before attempting to force our way through with a channel+endpoint combination that is likely broken.)
  • Easier to reason about, because a single limited request doesn't cause jitter for other requests triggered from separate threads slightly later.

@carterkozak (Contributor, Author):
(sorry to split these, poking at this between cleaning/dishes/etc)
Downsides:

  • Potentially lower throughput due to backoff rather than edge-triggering. In practice the BlacklistingChannel resulted in an approximation of backoff without the upsides, and it was too easy to introduce regressions.

@carterkozak carterkozak changed the title Replace QueuedChannel with a backoff based retryer fix #351 fix #302 fix #312 Replace QueuedChannel with a backoff based retryer Feb 26, 2020
if (config.maxNumRetries() > 0) {
    channel =
            new RetryingChannel(channel, config.maxNumRetries(), config.backoffSlotSize(), config.serverQoS());
}
Contributor:
cute little optimization :)

                return false;
            }
        }
    }
Contributor:
so much spiciness into the bin 🌶

Contributor (Author):
🥛:-)

if (log.isInfoEnabled()) {
    log.info(
            "Retrying call after failure",
            SafeArg.of("failures", failures),
            SafeArg.of("maxRetries", maxRetries),
            SafeArg.of("backoffNanoseconds", backoffNanoseconds),
Contributor:
could we do a TimeUnit.MILLISECONDS.convert here and log backoffMillis instead? It's really hard to eyeball such big numbers when they appear in the logs

Contributor (Author):
good call
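
For reference, a sketch of that tweak, assuming the same surrounding RetryingChannel context as the snippet above (log, failures, maxRetries, and backoffNanoseconds are the names already in scope there); only the backoffMillis conversion is new, and it requires java.util.concurrent.TimeUnit.

// Sketch of the suggested change: convert before logging so the value is easy to eyeball.
long backoffMillis = TimeUnit.MILLISECONDS.convert(backoffNanoseconds, TimeUnit.NANOSECONDS);
log.info(
        "Retrying call after failure",
        SafeArg.of("failures", failures),
        SafeArg.of("maxRetries", maxRetries),
        SafeArg.of("backoffMillis", backoffMillis));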

{"traceId":"3af31cbecda2b46c","parentSpanId":"697e878d5bfc5628","spanId":"28686cc7c2695fb2","type":"LOCAL","operation":"Dialogue-request-attempt","startTimeMicroSeconds":1582562924534905,"durationNanoSeconds":577068,"metadata":{}}
{"traceId":"3af31cbecda2b46c","parentSpanId":"efaf41a21cc7c6a9","spanId":"697e878d5bfc5628","type":"LOCAL","operation":"Dialogue-request initial","startTimeMicroSeconds":1582562924525312,"durationNanoSeconds":13461016,"metadata":{}}
{"traceId":"3af31cbecda2b46c","parentSpanId":null,"spanId":"efaf41a21cc7c6a9","type":"LOCAL","operation":"Dialogue-request","startTimeMicroSeconds":1582562924525264,"durationNanoSeconds":13530208,"metadata":{}}
{"traceId":"09b26a1be498af5c","parentSpanId":"49c88b0d4a7b609a","spanId":"802d8f9c0d0333a7","type":"LOCAL","operation":"Dialogue-http-request initial","startTimeMicroSeconds":1582681721129000,"durationNanoSeconds":7754255,"metadata":{}}
Contributor:
There is a lot of empty space in these - can we get a span containing the backoff duration in there somehow? (Could be as a FLUP)

Contributor (Author):
Ya, I'd prefer to do this in a follow-up

live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=82.4% client_mean=PT4.5664688S server_cpu=PT1H52M14.26S client_received=2500/2500 server_resps=2500 codes={200=2061, 500=439}
live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=59.0% client_mean=PT3.5693656S server_cpu=PT1H58M42.9S client_received=2500/2500 server_resps=2500 codes={200=1476, 500=1024}
live_reloading[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=58.6% client_mean=PT3.5376608S server_cpu=PT1H58M19S client_received=2500/2500 server_resps=2500 codes={200=1466, 500=1034}
live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=59.3% client_mean=PT2.717022131S server_cpu=PT1H21M18.049029538S client_received=2500/2500 server_resps=1865 codes={200=1483, 500=382, Failed to make a request=635}
@iamdanfox (Contributor), Feb 26, 2020:
I think the 82% -> 59% success makes sense: a bunch of requests are no longer hanging out on the queue, and instead a chunk of them come back as "Failed to make a request".
https://github.com/palantir/dialogue/blob/ckozak/no_queue/simulation/src/test/resources/report.md#live_reloadingconcurrency_limiter_blacklist_round_robin

// Avoid method reference allocations
@SuppressWarnings("UnnecessaryLambda")
private static final Supplier<ListenableFuture<Response>> limitedResultSupplier =
        () -> Futures.immediateFailedFuture(new SafeRuntimeException("Failed to make a request"));
Contributor:
I think we're going to need to give people more information in this exception message (or possibly a comment here), otherwise we're going to be fielding questions in #dev-foundry-infra from people asking why Dialogue is "breaking them" and telling them they can't make requests ;)

When people hit this, we probably don't want them to do their own retrying, right? Perhaps we should even put some instrumentation on this, so we know exactly how many times our client refused to send a request out the door.

Contributor (Author):
Agreed. I like the idea of replacing the limited-channel result of Optional<Future<Response>> with a union of either a limited-reason or a Future<Response>. I also like the idea of forcefully attempting a request through the limiter when all nodes have limited us, to avoid purely client-side badness.
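
As a purely hypothetical sketch of that union (not the merged LimitedChannel API; names are illustrative, and Response is assumed to be Dialogue's Response interface), the result type could carry either the response future or the reason the request was not attempted:

import com.google.common.util.concurrent.ListenableFuture;
import java.util.Optional;

// Hypothetical sketch only: either a response future, or the reason the channel refused to try.
final class LimitedResultSketch {
    private final Optional<ListenableFuture<Response>> response;
    private final Optional<String> limitedReason;

    private LimitedResultSketch(Optional<ListenableFuture<Response>> response, Optional<String> limitedReason) {
        this.response = response;
        this.limitedReason = limitedReason;
    }

    static LimitedResultSketch response(ListenableFuture<Response> future) {
        return new LimitedResultSketch(Optional.of(future), Optional.empty());
    }

    static LimitedResultSketch limited(String reason) {
        return new LimitedResultSketch(Optional.empty(), Optional.of(reason));
    }

    Optional<ListenableFuture<Response>> response() {
        return response;
    }

    Optional<String> limitedReason() {
        return limitedReason;
    }
}

Carrying a reason would let the eventual exception (or a log line) explain why the client refused to send the request, rather than the generic "Failed to make a request".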

@iamdanfox (Contributor) left a comment:

This is clearly a lot closer to the client-behaviour people are used to, so should be less scary to roll out. Excited that it gives us the possibility of bringing back per-endpoint smartness.

I think we might want to come back to the idea of immediate failover again at some point, but given that there's so much c-j-r pain right now, I'd be happy to get this out first.

@bulldozer-bot bulldozer-bot bot merged commit 7d32799 into develop Feb 26, 2020
@bulldozer-bot bulldozer-bot bot deleted the ckozak/no_queue branch February 26, 2020 16:44
@svc-autorelease (Collaborator):

Released 0.10.2

Executors.newSingleThreadScheduledExecutor(new ThreadFactoryBuilder()
        .setNameFormat("dialogue-RetryingChannel-scheduler-%d")
        .setDaemon(false)
        .build()))));
Contributor:
One more flup please: can we have instrumentation on this, e.g. how many things are currently scheduled on it? Or will there be a way that I can pass a WC scheduled executor to this? We also possibly want an uncaught exception handler, given the recent spiciness we discovered.

Contributor (Author):
Good idea, it would be nice to get metrics on this and tracing identical to other components.
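
A rough sketch of what that follow-up could look like, assuming Guava's ThreadFactoryBuilder and a plain ScheduledThreadPoolExecutor; the exception handler, logger, and gauge suggestion are illustrative, not what ultimately shipped.

import com.google.common.util.concurrent.ThreadFactoryBuilder;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only: a named retry scheduler with an uncaught exception handler,
// whose queue size could back a gauge answering "how many retries are currently waiting?".
final class RetrySchedulerSketch {
    private static final Logger log = LoggerFactory.getLogger(RetrySchedulerSketch.class);

    private RetrySchedulerSketch() {}

    static ScheduledThreadPoolExecutor create() {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(
                1,
                new ThreadFactoryBuilder()
                        .setNameFormat("dialogue-RetryingChannel-scheduler-%d")
                        .setDaemon(false)
                        .setUncaughtExceptionHandler((thread, throwable) ->
                                log.error("Uncaught exception in retry scheduler", throwable))
                        .build());
        // scheduler.getQueue().size() reports pending retries; registering it as a metrics
        // gauge would provide the requested instrumentation.
        return scheduler;
    }
}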
