fix #351 fix #302 fix #312 Replace QueuedChannel with a backoff based retryer #432
Conversation
ExponentialBackoffStrategy(Duration backoffSlotSize, DoubleSupplier random) {
    this.backoffSlotSize = backoffSlotSize;
    this.random = random;
s/random/jitter/ - it's fine for this to always return 1 in testing, and be a random implementation in prod?
It's important to get jitter in our simulations as well, otherwise we may cause waves of requests. In tests using a constant is fine, but ideally we don't measure wall clock time in tests and instead watch the request rate etc.
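As a sketch of that split, assuming the constructor shown above (the helper class name and the 250ms slot size are illustrative, not from this PR):

import java.time.Duration;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

final class BackoffWiring {
    // Production: a fresh uniform sample per backoff computation (full jitter).
    private static final DoubleSupplier PRODUCTION_JITTER =
            () -> ThreadLocalRandom.current().nextDouble();

    static ExponentialBackoffStrategy production() {
        return new ExponentialBackoffStrategy(Duration.ofMillis(250), PRODUCTION_JITTER);
    }

    // Simulations keep jitter but stay reproducible by drawing from a seeded Random,
    // so assertions can watch request rates rather than wall-clock time.
    static ExponentialBackoffStrategy seeded(Random random) {
        return new ExponentialBackoffStrategy(Duration.ofMillis(250), random::nextDouble);
    }
}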
dialogue-core/src/main/java/com/palantir/dialogue/core/RetryingChannel.java (resolved)
    return 0L;
}
int upperBound = (int) Math.pow(2, failures - 1);
return Math.round(backoffSlotSize.toNanos() * random.getAsDouble() * upperBound);
We're gonna plumb the BackoffStrategy thing in here right?
We could, but I don't like it as written in CJR; I find it confusing that getting the duration mutates state and increments failures. I had an implementation that took a BackoffStrategy as a function from failure count to backoff duration, and it worked fine, but without needing alternative implementations it didn't add anything.
Thoughts?
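For reference, a rough sketch of that shape: a pure function from failure count to backoff duration, with the caller owning the counter (the interface name and factory here are hypothetical, not code from this PR):

import java.time.Duration;
import java.util.Optional;

// Pure strategy: computing a backoff never mutates state, so the failure
// counter lives with the caller and incrementing it is explicit.
@FunctionalInterface
interface BackoffStrategy {
    // Called with failures >= 1; empty means "give up".
    Optional<Duration> backoff(int failures);
}

final class Backoffs {
    static BackoffStrategy exponential(Duration slot, int maxRetries) {
        return failures -> failures > maxRetries
                ? Optional.empty()
                : Optional.of(slot.multipliedBy(1L << (failures - 1)));
    }
}

Jitter is omitted here for brevity; it would multiply into the duration exactly as in the snippet above.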
Yeah, definitely would like to not do the mutation as part of the getter. Fair enough that if we're always going to use the exponential strategy, then maybe factoring it out is unnecessary!
Since this is quite a big change in behaviour, I'd like to try and articulate the pros and cons of this switch.
Upsides:
(sorry to split these, poking at this between cleaning/dishes/etc)
if (config.maxNumRetries() > 0) {
    channel =
            new RetryingChannel(channel, config.maxNumRetries(), config.backoffSlotSize(), config.serverQoS());
}
cute little optimization :)
dialogue-core/src/main/java/com/palantir/dialogue/core/ExponentialBackoffStrategy.java (resolved)
            return false;
        }
    }
}
so much spiciness into the bin 🌶
🥛:-)
if (log.isInfoEnabled()) {
    log.info(
            "Retrying call after failure",
            SafeArg.of("failures", failures),
            SafeArg.of("maxRetries", maxRetries),
            SafeArg.of("backoffNanoseconds", backoffNanoseconds),
could we do a TimeUnit.MILLISECONDS.convert here and present backoffMillis instead? It's really hard to eyeball such big numbers when they appear in the logs
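For reference, the conversion being suggested, as a fragment using the field names from the snippet above:

// Log a human-readable millisecond value; TimeUnit.convert truncates, which is fine here.
SafeArg.of("backoffMillis", TimeUnit.MILLISECONDS.convert(backoffNanoseconds, TimeUnit.NANOSECONDS))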
good call
{"traceId":"3af31cbecda2b46c","parentSpanId":"697e878d5bfc5628","spanId":"28686cc7c2695fb2","type":"LOCAL","operation":"Dialogue-request-attempt","startTimeMicroSeconds":1582562924534905,"durationNanoSeconds":577068,"metadata":{}} | ||
{"traceId":"3af31cbecda2b46c","parentSpanId":"efaf41a21cc7c6a9","spanId":"697e878d5bfc5628","type":"LOCAL","operation":"Dialogue-request initial","startTimeMicroSeconds":1582562924525312,"durationNanoSeconds":13461016,"metadata":{}} | ||
{"traceId":"3af31cbecda2b46c","parentSpanId":null,"spanId":"efaf41a21cc7c6a9","type":"LOCAL","operation":"Dialogue-request","startTimeMicroSeconds":1582562924525264,"durationNanoSeconds":13530208,"metadata":{}} | ||
{"traceId":"09b26a1be498af5c","parentSpanId":"49c88b0d4a7b609a","spanId":"802d8f9c0d0333a7","type":"LOCAL","operation":"Dialogue-http-request initial","startTimeMicroSeconds":1582681721129000,"durationNanoSeconds":7754255,"metadata":{}} |
There is a lot of empty space in these - can we get a span containing the backoff duration in there somehow? (Could be as a FLUP)
Ya, I'd prefer to do this in a follow-up
simulation/src/test/java/com/palantir/dialogue/core/Strategy.java (resolved)
live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=82.4% client_mean=PT4.5664688S server_cpu=PT1H52M14.26S client_received=2500/2500 server_resps=2500 codes={200=2061, 500=439}
live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=59.0% client_mean=PT3.5693656S server_cpu=PT1H58M42.9S client_received=2500/2500 server_resps=2500 codes={200=1476, 500=1024}
live_reloading[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=58.6% client_mean=PT3.5376608S server_cpu=PT1H58M19S client_received=2500/2500 server_resps=2500 codes={200=1466, 500=1034}
live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=59.3% client_mean=PT2.717022131S server_cpu=PT1H21M18.049029538S client_received=2500/2500 server_resps=1865 codes={200=1483, 500=382, Failed to make a request=635}
I think the 82% -> 59% success drop makes sense: a bunch of requests are no longer hanging out on the queue, and instead a chunk of them come back as "Failed to make a request".
https://github.com/palantir/dialogue/blob/ckozak/no_queue/simulation/src/test/resources/report.md#live_reloadingconcurrency_limiter_blacklist_round_robin
// Avoid method reference allocations
@SuppressWarnings("UnnecessaryLambda")
private static final Supplier<ListenableFuture<Response>> limitedResultSupplier =
        () -> Futures.immediateFailedFuture(new SafeRuntimeException("Failed to make a request"));
I think we're going to need to give people more information in this exception message (or possibly a comment here), otherwise we're going to be fielding questions in #dev-foundry-infra from people asking why Dialogue is breaking them and telling them they can't make requests ;)
When people hit this, we probably don't want them to do their own retrying, right? Perhaps we should even put some instrumentation on this, so we know exactly how many times our client refused to send a request out the door?
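Something like the following could cover both the clearer message and the counter. This is only a sketch: the metric name, message text, wrapper class, and the Dropwizard MetricRegistry wiring are all assumptions, not the actual implementation.

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.palantir.dialogue.Response;
import com.palantir.logsafe.exceptions.SafeRuntimeException;
import java.util.function.Supplier;

final class LimitedResults {
    private final Meter refusals;

    LimitedResults(MetricRegistry registry) {
        // Counts requests the client refused to send, so operators can tell
        // client-side limiting apart from server-side failures. (Name is made up.)
        this.refusals = registry.meter("dialogue.client.refused");
    }

    Supplier<ListenableFuture<Response>> limitedResultSupplier() {
        return () -> {
            refusals.mark();
            return Futures.immediateFailedFuture(new SafeRuntimeException(
                    "Failed to make a request: all channels are limited and the client refused "
                            + "to send it; do not add caller-side retries on top of this"));
        };
    }
}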
Agreed, I like the idea of replacing the limited-channel result of Optional<Future<Response>> with a union of limited-reason or Future<Response>. Also like the idea of forcefully attempting a request through the limiter if all nodes have limited our result, to avoid purely client-side badness.
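A sketch of that union, replacing Optional<Future<Response>> with an explicit limited-reason-or-response result (all names here are hypothetical):

import com.google.common.util.concurrent.ListenableFuture;
import com.palantir.dialogue.Response;

// Either a request that was actually started, or an explicit reason the channel
// declined to start one, instead of an information-free Optional.empty().
interface LimitedChannelResult {
    <T> T match(Cases<T> cases);

    interface Cases<T> {
        T submitted(ListenableFuture<Response> response);
        T limited(LimitedReason reason);
    }

    enum LimitedReason {
        CLIENT_CONCURRENCY_LIMIT,
        SERVER_QOS
    }
}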
This is clearly a lot closer to the client-behaviour people are used to, so should be less scary to roll out. Excited that it gives us the possibility of bringing back per-endpoint smartness.
I think we might want to come back to the idea of immediate failover again at some point, but given that there's so much c-j-r pain right now, I'd be happy to get this out first.
Released 0.10.2
Executors.newSingleThreadScheduledExecutor(new ThreadFactoryBuilder()
        .setNameFormat("dialogue-RetryingChannel-scheduler-%d")
        .setDaemon(false)
        .build()))));
one more flup please: can we have instrumentation on this? e.g. how many things are currently on it? Or will there be a way for me to pass a WC scheduled executor to this? Also possibly want an uncaught exception handler, given the recent spiciness we discovered.
good idea, would be nice to get metrics on this and identical tracing to other components.
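A sketch of that wiring, building on the snippet above. ThreadFactoryBuilder.setUncaughtExceptionHandler is real Guava API; the class name and logging are illustrative:

import com.google.common.util.concurrent.ThreadFactoryBuilder;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class RetryScheduler {
    private static final Logger log = LoggerFactory.getLogger(RetryScheduler.class);

    static ScheduledExecutorService create() {
        return Executors.newSingleThreadScheduledExecutor(new ThreadFactoryBuilder()
                .setNameFormat("dialogue-RetryingChannel-scheduler-%d")
                .setDaemon(false)
                // Backstop only: exceptions thrown inside scheduled tasks are captured
                // by their Future rather than reaching this handler, so scheduled work
                // still needs its own failure logging.
                .setUncaughtExceptionHandler((thread, throwable) ->
                        log.error("Uncaught exception in {}", thread.getName(), throwable))
                .build());
    }
}

For queue-depth metrics, the executor would need to be exposed as a ScheduledThreadPoolExecutor so getQueue().size() can be registered as a gauge.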
==COMMIT_MSG==
Replace QueuedChannel with a backoff based retryer
==COMMIT_MSG==