-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Differentiate between limited and blacklisted requests #422
Conversation
Generate changelog in
|
.map(concurrencyLimiter(config, clientMetrics)) | ||
.map(channel -> new BlacklistingChannel(channel, config.failedUrlCooldown(), queueListener)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after any failure, this will put the channel into probation mode by default (failedUrlCooldown == 0 using the default configuration).
This could be problematic for timelock, perhaps we should check if clientQos is disabled for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya good catch
if (maybeCall.matches(isBlacklisted)) { | ||
continue; | ||
} else if (maybeCall.matches(isLimited)) { | ||
return Optional.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a response is limited by one channel (too many concurrent requests?) we don't attempt to rotate to the next channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya it seems correct to skip over the the limited channel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this stops iterating and doesn't attempt any more channels unless I'm reading it incorrectly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think its right to stop iterating if you hit a limited channel to give time for the limit to be released. Otherwise you'll end up just spinning until a limit is released
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each delegate has its own limit, the next channel may not be limited in the same way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example when a new host is discovered we want to allow most requests to rotate to that channel even if the rest are some combination of saturated/blacklisted.
import com.palantir.dialogue.Endpoint; | ||
import com.palantir.dialogue.Request; | ||
|
||
public interface CompositeLimitedChannel { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
package private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason we should have both LimitedChannel and CompositeLimitedChannel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to better separate concerns. Node selection strategies should know about blacklisting and limiting while queueing/retrying only needs to know about whether a request occurred or not
@Data | ||
public interface LimitedResponse { | ||
interface Cases<T> { | ||
T blacklisted(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only set by blacklisting channel? As a consumer how do I handle this differently from a limited response? The blacklisted channel will become un-blacklisted eventually, much like a limited channel will open up.
I think docs would help me a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add some docs
import org.derive4j.Data; | ||
|
||
@Data | ||
public interface LimitedResponse { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to only be used by RoundRobinChannel, is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, all selection strategies convert from LimitedReponses to Optional<Response>
. The idea is that selection strategies need more information to properly select which channel to make a request to
live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=58.9% client_mean=PT3.5763136S server_cpu=PT1H58M42.9S client_received=2500/2500 server_resps=2500 codes={200=1473, 500=1027} | ||
live_reloading[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=58.6% client_mean=PT3.5376608S server_cpu=PT1H58M19S client_received=2500/2500 server_resps=2500 codes={200=1466, 500=1034} | ||
live_reloading[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=79.6% client_mean=PT4.5579616S server_cpu=PT1H54M0.29S client_received=2500/2500 server_resps=2500 codes={200=1990, 500=510} | ||
live_reloading[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=93.0% client_mean=PT7.1876512S server_cpu=PT2H48.6S client_received=2500/2500 server_resps=2500 codes={200=2326, 500=174} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it expected that live pin_until_error now seems to be not touching the newly live-reloaded server? https://github.com/palantir/dialogue/blob/fo/differentiate-limits-blacklist/simulation/src/test/resources/report.md#live_reloadingconcurrency_limiter_pin_until_error
slowdown_and_error_thresholds[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=1.8% client_mean=PT20.068609533S server_cpu=PT10H35M7.200666646S client_received=10000/10000 server_resps=10000 codes={200=176, 500=9824} | ||
slowdown_and_error_thresholds[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=1.2% client_mean=PT16.859225466S server_cpu=PT10H30M49.207333306S client_received=10000/10000 server_resps=10000 codes={200=120, 500=9880} | ||
slowdown_and_error_thresholds[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=1.8% client_mean=PT1M8.848058466S server_cpu=PT10H40M56.53999998S client_received=10000/10000 server_resps=10000 codes={200=183, 500=9817} | ||
slowdown_and_error_thresholds[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=9.4% client_mean=PT1M4.588908199S server_cpu=PT10H49M0.733999957S client_received=10000/10000 server_resps=10000 codes={200=936, 500=9064} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as expected, by not consuming some retries in our limits, this means requests can hang on for longer (so take longer from a user-perspective), but ultimately more succeed. 1.8% -> 9.4%
@@ -49,7 +50,7 @@ | |||
* unblacklisted. Without this functionality, hundreds of requests could be sent to a still-broken | |||
* server before the first of them returns and tells us it's still broken. | |||
*/ | |||
final class BlacklistingChannel implements LimitedChannel { | |||
final class BlacklistingChannel implements CompositeLimitedChannel { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we tweak the names so that your CompositeLimitedChannel just becomes LimitedChannel and then rename LimitedChannel -> NodeSelectionChannel (because that's our PinUntilError, RoundRobin implementations)?
BlacklistState state = channelBlacklistState.get(); | ||
if (state != null) { | ||
BlacklistStage stage = state.maybeProgressAndGet(); | ||
if (stage instanceof BlacklistUntil) { | ||
return Optional.empty(); | ||
return LimitedResponses.blacklisted(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than LimitedResponses.blacklisted
, what do you think of LimitedResponses.backOff
?
if (result.isPresent()) { | ||
DialogueFutures.addDirectCallback(result.get(), new LimiterCallback(listener)); | ||
} else { | ||
listener.onIgnore(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we no longer calling listener.onIgnore
in the other two cases? I think using a visitor on the limitedResponse would be more reassuring to me here
} | ||
public Optional<ListenableFuture<Response>> limited() { | ||
debugLogLimitedRequest(currentIndex, channel); | ||
return Optional.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does it semantically mean for the PinUntilErrorChannel to return Optional.empty? If an inner concurrency limiter decided this host was overloaded, why are we not selecting another host here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this also explains why https://github.com/palantir/dialogue/blob/fo/differentiate-limits-blacklist/simulation/src/test/resources/report.md#drastic_slowdownconcurrency_limiter_pin_until_error doesn't look good after the slow node was reverted.
Closing in favour of #432 |
Before this PR
With blacklisting enabled we would could fail to the over all request without actually ever making a remote call if we hit a blacklisted channel on each retry
After this PR
==COMMIT_MSG==
Enable channel blacklisting by default and do not count requests to blacklisted hosts against retry limit
==COMMIT_MSG==
Possible downsides?