Differentiate between limited and blacklisted requests #422

ferozco · 2020-02-24T20:58:44Z

Before this PR

With blacklisting enabled, we could fail the overall request without ever making a remote call if we hit a blacklisted channel on each retry.

After this PR

==COMMIT_MSG==
Enable channel blacklisting by default and do not count requests to blacklisted hosts against retry limit
==COMMIT_MSG==

Possible downsides?


changelog-app · 2020-02-24T21:17:18Z

Generate changelog in `changelog/@unreleased`

Type

Description

Enable channel blacklisting by default and do not count requests to blacklisted hosts against retry limit


carterkozak · 2020-02-24T21:15:10Z

dialogue-core/src/main/java/com/palantir/dialogue/core/Channels.java

                .map(concurrencyLimiter(config, clientMetrics))
+                .map(channel -> new BlacklistingChannel(channel, config.failedUrlCooldown(), queueListener))


after any failure, this will put the channel into probation mode by default (failedUrlCooldown == 0 using the default configuration).
This could be problematic for timelock, perhaps we should check if clientQos is disabled for this?

ya good catch

carterkozak · 2020-02-24T21:22:14Z

dialogue-core/src/main/java/com/palantir/dialogue/core/RoundRobinChannel.java

+            if (maybeCall.matches(isBlacklisted)) {
+                continue;
+            } else if (maybeCall.matches(isLimited)) {
+                return Optional.empty();


If a response is limited by one channel (too many concurrent requests?) we don't attempt to rotate to the next channel?

ya it seems correct to skip over the limited channel

But this stops iterating and doesn't attempt any more channels unless I'm reading it incorrectly.

Yes, I think it's right to stop iterating if you hit a limited channel, to give time for the limit to be released. Otherwise you'll end up just spinning until a limit is released.

Each delegate has its own limit, the next channel may not be limited in the same way

For example when a new host is discovered we want to allow most requests to rotate to that channel even if the rest are some combination of saturated/blacklisted.
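To make the alternative in this thread concrete, here is a minimal sketch of a round-robin selector that keeps rotating past both limited and blacklisted delegates instead of stopping at the first limited one. It is not the code in this PR: `SketchOutcome`, `SketchChannel`, and `RoundRobinSketch` are hypothetical stand-ins, and a delegate simply reports an outcome rather than making a real call.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the three outcomes discussed in this review. */
enum SketchOutcome { SUCCESS, LIMITED, BLACKLISTED }

/** Hypothetical simplified delegate: reports an outcome instead of making a call. */
interface SketchChannel {
    SketchOutcome maybeExecute();
}

final class RoundRobinSketch {
    private final List<SketchChannel> delegates;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(List<SketchChannel> delegates) {
        this.delegates = delegates;
    }

    /** Returns the index of the delegate that accepted the request, if any did. */
    Optional<Integer> maybeExecute() {
        if (delegates.isEmpty()) {
            return Optional.empty();
        }
        int start = Math.floorMod(next.getAndIncrement(), delegates.size());
        // Visit each delegate at most once, so the loop cannot spin forever.
        for (int i = 0; i < delegates.size(); i++) {
            int index = (start + i) % delegates.size();
            if (delegates.get(index).maybeExecute() == SketchOutcome.SUCCESS) {
                return Optional.of(index);
            }
            // LIMITED and BLACKLISTED both fall through to the next delegate,
            // because each delegate has its own limit.
        }
        return Optional.empty();
    }
}
```

Because the loop is bounded by the number of delegates, a fully saturated set of hosts still yields `Optional.empty()` after one pass, which addresses the spinning concern without giving up on a newly discovered, unsaturated host.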

carterkozak · 2020-02-24T21:34:34Z

dialogue-core/src/main/java/com/palantir/dialogue/core/CompositeLimitedChannel.java

+import com.palantir.dialogue.Endpoint;
+import com.palantir.dialogue.Request;
+
+public interface CompositeLimitedChannel {


package private

Is there any reason we should have both LimitedChannel and CompositeLimitedChannel?

It was to better separate concerns. Node selection strategies should know about blacklisting and limiting while queueing/retrying only needs to know about whether a request occurred or not

carterkozak · 2020-02-24T21:36:12Z

dialogue-core/src/main/java/com/palantir/dialogue/core/LimitedResponse.java

+@Data
+public interface LimitedResponse {
+    interface Cases<T> {
+        T blacklisted();


Is this only set by blacklisting channel? As a consumer how do I handle this differently from a limited response? The blacklisted channel will become un-blacklisted eventually, much like a limited channel will open up.

I think docs would help me a lot.

I'll add some docs

carterkozak · 2020-02-24T21:39:01Z

dialogue-core/src/main/java/com/palantir/dialogue/core/LimitedResponse.java

+import org.derive4j.Data;
+
+@Data
+public interface LimitedResponse {


This appears to only be used by RoundRobinChannel, is that correct?

No, all selection strategies convert from LimitedResponses to Optional<Response>. The idea is that selection strategies need more information to properly select which channel to make a request to.
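As a rough illustration of the conversion described here, the sketch below hand-writes a three-case response with a visitor and collapses it to an `Optional`. The PR itself generates this with derive4j's `@Data`; `LimitedResponseSketch`, the `success` case name, and `toOptional` are assumptions made for the example, not names confirmed by the diff.

```java
import java.util.Optional;

/** Hypothetical hand-written equivalent of a visitor-based, three-case response. */
abstract class LimitedResponseSketch<R> {
    interface Cases<R, T> {
        T blacklisted();
        T limited();
        T success(R response); // case name assumed for this sketch
    }

    abstract <T> T match(Cases<R, T> cases);

    static <R> LimitedResponseSketch<R> blacklisted() {
        return new LimitedResponseSketch<R>() {
            @Override
            <T> T match(Cases<R, T> cases) {
                return cases.blacklisted();
            }
        };
    }

    static <R> LimitedResponseSketch<R> limited() {
        return new LimitedResponseSketch<R>() {
            @Override
            <T> T match(Cases<R, T> cases) {
                return cases.limited();
            }
        };
    }

    static <R> LimitedResponseSketch<R> success(final R response) {
        return new LimitedResponseSketch<R>() {
            @Override
            <T> T match(Cases<R, T> cases) {
                return cases.success(response);
            }
        };
    }

    /** What queueing/retrying sees: only whether a request actually happened. */
    Optional<R> toOptional() {
        return match(new Cases<R, Optional<R>>() {
            @Override
            public Optional<R> blacklisted() {
                return Optional.empty();
            }

            @Override
            public Optional<R> limited() {
                return Optional.empty();
            }

            @Override
            public Optional<R> success(R response) {
                return Optional.of(response);
            }
        });
    }
}
```

A selection strategy can call `match` to tell the blacklisted and limited cases apart, while callers further up only need `toOptional()`.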

iamdanfox · 2020-02-24T23:28:09Z

simulation/src/test/resources/report.md

-               live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt:	success=58.9%	client_mean=PT3.5763136S   	server_cpu=PT1H58M42.9S   	client_received=2500/2500	server_resps=2500	codes={200=1473, 500=1027}
-                   live_reloading[CONCURRENCY_LIMITER_ROUND_ROBIN].txt:	success=58.6%	client_mean=PT3.5376608S   	server_cpu=PT1H58M19S     	client_received=2500/2500	server_resps=2500	codes={200=1466, 500=1034}
+         live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt:	success=79.6%	client_mean=PT4.5579616S   	server_cpu=PT1H54M0.29S   	client_received=2500/2500	server_resps=2500	codes={200=1990, 500=510}
+               live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt:	success=93.0%	client_mean=PT7.1876512S   	server_cpu=PT2H48.6S      	client_received=2500/2500	server_resps=2500	codes={200=2326, 500=174}


is it expected that live pin_until_error now seems to be not touching the newly live-reloaded server? https://github.com/palantir/dialogue/blob/fo/differentiate-limits-blacklist/simulation/src/test/resources/report.md#live_reloadingconcurrency_limiter_pin_until_error

iamdanfox · 2020-02-24T23:30:32Z

simulation/src/test/resources/report.md

-slowdown_and_error_thresholds[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt:	success=1.8%	client_mean=PT20.068609533S	server_cpu=PT10H35M7.200666646S	client_received=10000/10000	server_resps=10000	codes={200=176, 500=9824}
-    slowdown_and_error_thresholds[CONCURRENCY_LIMITER_ROUND_ROBIN].txt:	success=1.2%	client_mean=PT16.859225466S	server_cpu=PT10H30M49.207333306S	client_received=10000/10000	server_resps=10000	codes={200=120, 500=9880}
+slowdown_and_error_thresholds[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt:	success=1.8%	client_mean=PT1M8.848058466S	server_cpu=PT10H40M56.53999998S	client_received=10000/10000	server_resps=10000	codes={200=183, 500=9817}
+slowdown_and_error_thresholds[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt:	success=9.4%	client_mean=PT1M4.588908199S	server_cpu=PT10H49M0.733999957S	client_received=10000/10000	server_resps=10000	codes={200=936, 500=9064}


as expected, by not consuming some retries in our limits, this means requests can hang on for longer (so take longer from a user-perspective), but ultimately more succeed. 1.8% -> 9.4%

iamdanfox · 2020-02-24T23:33:57Z

dialogue-core/src/main/java/com/palantir/dialogue/core/BlacklistingChannel.java

@@ -49,7 +50,7 @@
 * unblacklisted. Without this functionality, hundreds of requests could be sent to a still-broken
 * server before the first of them returns and tells us it's still broken.
 */
-final class BlacklistingChannel implements LimitedChannel {
+final class BlacklistingChannel implements CompositeLimitedChannel {


can we tweak the names so that your CompositeLimitedChannel just becomes LimitedChannel and then rename LimitedChannel -> NodeSelectionChannel (because that's our PinUntilError, RoundRobin implementations)?

iamdanfox · 2020-02-24T23:35:12Z

dialogue-core/src/main/java/com/palantir/dialogue/core/BlacklistingChannel.java

        BlacklistState state = channelBlacklistState.get();
        if (state != null) {
            BlacklistStage stage = state.maybeProgressAndGet();
            if (stage instanceof BlacklistUntil) {
-                return Optional.empty();
+                return LimitedResponses.blacklisted();


Rather than LimitedResponses.blacklisted, what do you think of LimitedResponses.backOff ?

iamdanfox · 2020-02-24T23:37:58Z

dialogue-core/src/main/java/com/palantir/dialogue/core/ConcurrencyLimitedChannel.java

-            if (result.isPresent()) {
-                DialogueFutures.addDirectCallback(result.get(), new LimiterCallback(listener));
-            } else {
-                listener.onIgnore();


why are we no longer calling listener.onIgnore in the other two cases? I think using a visitor on the limitedResponse would be more reassuring to me here
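One way to read the visitor suggestion: if every case is spelled out, it is visible at a glance which branches release the limiter permit. The sketch below assumes, as in the removed `else` branch, that `onIgnore` is the right call whenever no request is sent; `SketchResult`, `IgnoreAwareCases`, `RecordingListener`, and `ConcurrencyLimitedSketch` are hypothetical names, and the listener only records events rather than talking to a real limiter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical visitor: every case must be handled explicitly. */
interface IgnoreAwareCases<T> {
    T blacklisted();
    T limited();
    T success();
}

/** Hypothetical stand-in for the limited response. */
enum SketchResult {
    BLACKLISTED, LIMITED, SUCCESS;

    <T> T match(IgnoreAwareCases<T> cases) {
        switch (this) {
            case BLACKLISTED:
                return cases.blacklisted();
            case LIMITED:
                return cases.limited();
            default:
                return cases.success();
        }
    }
}

/** Hypothetical listener that records what was called on it. */
final class RecordingListener {
    final List<String> events = new ArrayList<>();

    void onIgnore() {
        events.add("ignore");
    }

    void callbackAttached() {
        events.add("callback");
    }
}

final class ConcurrencyLimitedSketch {
    static void handle(SketchResult result, final RecordingListener listener) {
        result.match(new IgnoreAwareCases<Void>() {
            @Override
            public Void blacklisted() {
                listener.onIgnore(); // no request was sent, so release the permit
                return null;
            }

            @Override
            public Void limited() {
                listener.onIgnore(); // likewise, nothing to measure
                return null;
            }

            @Override
            public Void success() {
                listener.callbackAttached(); // the real code would attach a callback here
                return null;
            }
        });
    }
}
```

Adding a fourth case to the visitor interface would then fail compilation here until the permit handling for it is decided.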

iamdanfox · 2020-02-24T23:41:41Z

dialogue-core/src/main/java/com/palantir/dialogue/core/PinUntilErrorChannel.java

-                }
+            public Optional<ListenableFuture<Response>> limited() {
+                debugLogLimitedRequest(currentIndex, channel);
+                return Optional.empty();


what does it semantically mean for the PinUntilErrorChannel to return Optional.empty? If an inner concurrency limiter decided this host was overloaded, why are we not selecting another host here?

I think this also explains why https://github.com/palantir/dialogue/blob/fo/differentiate-limits-blacklist/simulation/src/test/resources/report.md#drastic_slowdownconcurrency_limiter_pin_until_error doesn't look good after the slow node was reverted.

ferozco · 2020-02-25T23:22:52Z

Closing in favour of #432

forozco added 2 commits February 24, 2020 15:56

differentiate between requests to limited and blacklisted hosts

bfe54ce

update simulations

34e16e8

ferozco requested review from iamdanfox and carterkozak February 24, 2020 20:58

forozco added 2 commits February 24, 2020 16:07

Merge remote-tracking branch 'origin/develop' into fo/differentiate-l…

de03f68

…imits-blacklist

Add generated changelog entries

162ded7

ferozco added no changelog and removed no changelog labels Feb 24, 2020

carterkozak reviewed Feb 24, 2020


iamdanfox reviewed Feb 24, 2020


ferozco mentioned this pull request Feb 25, 2020

[WIP] Limited channel is completely async #429

Closed

ferozco closed this Feb 25, 2020

ferozco deleted the fo/differentiate-limits-blacklist branch February 25, 2020 23:23
