Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differentiate between limited and blacklisted requests #422

Closed
wants to merge 4 commits into from

Conversation

ferozco
Copy link
Contributor

@ferozco ferozco commented Feb 24, 2020

Before this PR

With blacklisting enabled we would could fail to the over all request without actually ever making a remote call if we hit a blacklisted channel on each retry

After this PR

==COMMIT_MSG==
Enable channel blacklisting by default and do not count requests to blacklisted hosts against retry limit
==COMMIT_MSG==

Possible downsides?

@changelog-app
Copy link

changelog-app bot commented Feb 24, 2020

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Enable channel blacklisting by default and do not count requests to blacklisted hosts against retry limit

Check the box to generate changelog(s)

  • Generate changelog entry

.map(concurrencyLimiter(config, clientMetrics))
.map(channel -> new BlacklistingChannel(channel, config.failedUrlCooldown(), queueListener))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after any failure, this will put the channel into probation mode by default (failedUrlCooldown == 0 using the default configuration).
This could be problematic for timelock, perhaps we should check if clientQos is disabled for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya good catch

if (maybeCall.matches(isBlacklisted)) {
continue;
} else if (maybeCall.matches(isLimited)) {
return Optional.empty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a response is limited by one channel (too many concurrent requests?) we don't attempt to rotate to the next channel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya it seems correct to skip over the the limited channel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this stops iterating and doesn't attempt any more channels unless I'm reading it incorrectly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think its right to stop iterating if you hit a limited channel to give time for the limit to be released. Otherwise you'll end up just spinning until a limit is released

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each delegate has its own limit, the next channel may not be limited in the same way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example when a new host is discovered we want to allow most requests to rotate to that channel even if the rest are some combination of saturated/blacklisted.

import com.palantir.dialogue.Endpoint;
import com.palantir.dialogue.Request;

public interface CompositeLimitedChannel {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package private

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we should have both LimitedChannel and CompositeLimitedChannel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to better separate concerns. Node selection strategies should know about blacklisting and limiting while queueing/retrying only needs to know about whether a request occurred or not

@Data
public interface LimitedResponse {
interface Cases<T> {
T blacklisted();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only set by blacklisting channel? As a consumer how do I handle this differently from a limited response? The blacklisted channel will become un-blacklisted eventually, much like a limited channel will open up.

I think docs would help me a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add some docs

import org.derive4j.Data;

@Data
public interface LimitedResponse {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to only be used by RoundRobinChannel, is that correct?

Copy link
Contributor Author

@ferozco ferozco Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, all selection strategies convert from LimitedReponses to Optional<Response>. The idea is that selection strategies need more information to properly select which channel to make a request to

live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=58.9% client_mean=PT3.5763136S server_cpu=PT1H58M42.9S client_received=2500/2500 server_resps=2500 codes={200=1473, 500=1027}
live_reloading[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=58.6% client_mean=PT3.5376608S server_cpu=PT1H58M19S client_received=2500/2500 server_resps=2500 codes={200=1466, 500=1034}
live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=79.6% client_mean=PT4.5579616S server_cpu=PT1H54M0.29S client_received=2500/2500 server_resps=2500 codes={200=1990, 500=510}
live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=93.0% client_mean=PT7.1876512S server_cpu=PT2H48.6S client_received=2500/2500 server_resps=2500 codes={200=2326, 500=174}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slowdown_and_error_thresholds[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=1.8% client_mean=PT20.068609533S server_cpu=PT10H35M7.200666646S client_received=10000/10000 server_resps=10000 codes={200=176, 500=9824}
slowdown_and_error_thresholds[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=1.2% client_mean=PT16.859225466S server_cpu=PT10H30M49.207333306S client_received=10000/10000 server_resps=10000 codes={200=120, 500=9880}
slowdown_and_error_thresholds[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=1.8% client_mean=PT1M8.848058466S server_cpu=PT10H40M56.53999998S client_received=10000/10000 server_resps=10000 codes={200=183, 500=9817}
slowdown_and_error_thresholds[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=9.4% client_mean=PT1M4.588908199S server_cpu=PT10H49M0.733999957S client_received=10000/10000 server_resps=10000 codes={200=936, 500=9064}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as expected, by not consuming some retries in our limits, this means requests can hang on for longer (so take longer from a user-perspective), but ultimately more succeed. 1.8% -> 9.4%

@@ -49,7 +50,7 @@
* unblacklisted. Without this functionality, hundreds of requests could be sent to a still-broken
* server before the first of them returns and tells us it's still broken.
*/
final class BlacklistingChannel implements LimitedChannel {
final class BlacklistingChannel implements CompositeLimitedChannel {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we tweak the names so that your CompositeLimitedChannel just becomes LimitedChannel and then rename LimitedChannel -> NodeSelectionChannel (because that's our PinUntilError, RoundRobin implementations)?

BlacklistState state = channelBlacklistState.get();
if (state != null) {
BlacklistStage stage = state.maybeProgressAndGet();
if (stage instanceof BlacklistUntil) {
return Optional.empty();
return LimitedResponses.blacklisted();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than LimitedResponses.blacklisted, what do you think of LimitedResponses.backOff ?

if (result.isPresent()) {
DialogueFutures.addDirectCallback(result.get(), new LimiterCallback(listener));
} else {
listener.onIgnore();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we no longer calling listener.onIgnore in the other two cases? I think using a visitor on the limitedResponse would be more reassuring to me here

}
public Optional<ListenableFuture<Response>> limited() {
debugLogLimitedRequest(currentIndex, channel);
return Optional.empty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does it semantically mean for the PinUntilErrorChannel to return Optional.empty? If an inner concurrency limiter decided this host was overloaded, why are we not selecting another host here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ferozco
Copy link
Contributor Author

ferozco commented Feb 25, 2020

Closing in favour of #432

@ferozco ferozco closed this Feb 25, 2020
@ferozco ferozco deleted the fo/differentiate-limits-blacklist branch February 25, 2020 23:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants