common: jittered backoff implementation #3791
Conversation
Signed-off-by: Rama <rama.rao@salesforce.com>
Awesome @ramaraochavali, a few comments and we can merge. Would be good to make this the default backoff for everything as per @mattklein123's earlier comment.
uint64_t JitteredBackOffStrategy::computeNextInterval() {
  current_retry_++;
  uint32_t multiplier = (1 << current_retry_) - 1;
  uint64_t new_interval = random_.random() % (base_interval_ * multiplier);
What are the behavioral tradeoffs between doing what you have here and something more like (1 << current_retry_) * base_interval + random.random() % some_fixed_interval? I.e. you still do exponential backoff, but just add a small jitter around the point that you back off to?
I think what you have is fine actually, reading https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, but it would be worth exploring the tradeoff in the comments here and in review.
I have compared both approaches; it looks like the one I have has more randomness. Here is the output:
(1 << current_retry_) * base_interval + random.random() % some_fixed_interval (fixed interval as 100)
1082
2069
4079
8054
16010
30000
random_.random() % (base_interval_ * (1 << current_retry_) - 1)
102
1065
954
1084
7196
28940
23291
30000
9979
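For reference, here is a minimal standalone sketch (not the Envoy code itself) that reproduces this kind of comparison. The base interval of 1000 ms, the 30000 ms cap, and the 100 ms fixed jitter window are assumptions inferred from the sample numbers above, not values taken from the PR.

// Sketch only: compares the two jitter strategies discussed above.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>

int main() {
  std::mt19937_64 rng{std::random_device{}()};
  const uint64_t base = 1000, cap = 30000, fixed_jitter = 100; // assumed values

  std::cout << "exponential backoff + small fixed jitter:\n";
  for (uint32_t retry = 0; retry < 6; ++retry) {
    // Deterministic exponential point plus a small random offset.
    const uint64_t interval = (1ULL << retry) * base + rng() % fixed_jitter;
    std::cout << std::min(interval, cap) << "\n";
  }

  std::cout << "full jitter (uniform draw over the whole exponential window):\n";
  for (uint32_t retry = 1; retry <= 9; ++retry) {
    const uint64_t window = base * ((1ULL << retry) - 1);
    const uint64_t interval = rng() % window;
    std::cout << std::min(interval, cap) << "\n";
  }
  return 0;
}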
uint64_t JitteredBackOffStrategy::computeNextInterval() {
  current_retry_++;
  uint32_t multiplier = (1 << current_retry_) - 1;
Nit: please use const for these intermediate variables.
public:
  /**
   * Use this constructor if max_interval need not be enforced.
How come we need this? I think it is always sensible to add some bound.
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
  current_retry_++;
  uint32_t multiplier = (1 << current_retry_) - 1;
  // for retries that take longer, multiplier may overflow and become zero.
  if (multiplier == 0) {
@htuch, for retries that take longer, I noticed that the multiplier becomes zero because of overflow, so I added this check to reset it. Also added a test to verify this. LMK if there are any better options here.
I don't think this should be possible if we bound the maximum backoff. When we hit the cap, let's just immediately return that, rather than doing math that can overflow. I.e. once (1 << current_retry_) * base_interval_ > max_interval_, stop incrementing current_retry_.
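To illustrate the suggestion, one possible overflow-free shape could look like the following sketch. It reuses the member names from the quoted diff (current_retry_, base_interval_, max_interval_, random_) and assumes base_interval_ <= max_interval_; it is not the code that was merged.

uint64_t JitteredBackOffStrategy::computeNextInterval() {
  // Stop growing the exponent once the exponential window already covers the
  // cap; this keeps (1 << current_retry_) from ever overflowing.
  if ((1ULL << current_retry_) * base_interval_ <= max_interval_) {
    current_retry_++;
  }
  const uint64_t window = base_interval_ * ((1ULL << current_retry_) - 1);
  // Clamp the window to the cap before drawing the jittered interval.
  return random_.random() % (window > max_interval_ ? max_interval_ : window);
}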
@@ -71,23 +71,19 @@ RetryStateImpl::RetryStateImpl(const RetryPolicy& route_policy, Http::HeaderMap&
// Merge in the route policy.
retry_on_ |= route_policy.retryOn();
retries_remaining_ = std::max(retries_remaining_, route_policy.numRetries());
uint32_t base = runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25);
backoff_strategy_ptr_ = std::make_unique<JitteredBackOffStrategy>(base, base * 10, random_);
Here I have added base * 10 as the maximum interval (bound). I think it should be sufficient. LMK if you think otherwise.
Seems reasonable, maybe add a comment or constant.
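For example, the multiplier could be pulled out into a named constant along these lines (a sketch only; the constant name is made up, not from the PR):

// Hypothetical constant; the name is illustrative, the value 10 matches the PR.
static constexpr uint32_t kDefaultMaxBackoffMultiplier = 10;

const uint32_t base = runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25);
// Cap the jittered backoff at 10x the base retry interval.
backoff_strategy_ptr_ = std::make_unique<JitteredBackOffStrategy>(
    base, base * kDefaultMaxBackoffMultiplier, random_);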
@htuch made the JitterBackOff default for management server connections also and removed the earlier ExponentialBackOff. Left a couple of questions, PTAL.
  current_interval_ = new_interval > max_interval_ ? max_interval_ : new_interval;
uint64_t JitteredBackOffStrategy::nextBackOffMs() {
  current_retry_++;
  uint32_t multiplier = (1 << current_retry_) - 1;
Why - 1?
Good point, I just took it from the existing implementation. I'm not sure why it was like that, but one thing that comes to mind is that 1 << current_retry_ always returns a power of two - maybe the - 1 makes it a bit more random? I left it as is since it's not a big deal; let me know if you want me to change it.
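A small numeric illustration of what the - 1 changes (values chosen arbitrarily for the example), assuming base_interval_ = 1000 and current_retry_ = 3:

const uint64_t with_minus_one = 1000 * ((1ULL << 3) - 1); // modulo window is [0, 7000)
const uint64_t without_it = 1000 * (1ULL << 3);           // modulo window is [0, 8000)
// Either way, random_.random() % window is (effectively) uniform over the
// window; the "- 1" only makes the window slightly smaller, it does not add
// randomness.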
DoRetryCallback callback_;
Event::TimerPtr retry_timer_;
Upstream::ResourcePriority priority_;
BackOffStrategyPtr backoff_strategy_ptr_;
FWIW, I'm not a fan of putting _ptr in the variable name unless it is really needed to make things clearer (e.g. to disambiguate). Here, I would just call this backoff_strategy_. I think you'll find this consistent with other Envoy code (e.g. retry_timer_ above).
@@ -71,23 +71,19 @@ RetryStateImpl::RetryStateImpl(const RetryPolicy& route_policy, Http::HeaderMap&
// Merge in the route policy.
retry_on_ |= route_policy.retryOn();
retries_remaining_ = std::max(retries_remaining_, route_policy.numRetries());
uint32_t base = runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25);
Nit: const
Signed-off-by: Rama <rama.rao@salesforce.com>
@htuch made changes. PTAL.
* origin/master:
  config: making v2-config-only a boolean flag (envoyproxy#3847)
  lc trie: add exclusive flag. (envoyproxy#3825)
  upstream: introduce PriorityStateManager, refactor EDS (envoyproxy#3783)
  test: deflaking header_integration_test (envoyproxy#3849)
  http: new style WebSockets, where headers and data are processed by the filter chain. (envoyproxy#3776)
  common: minor doc updates (envoyproxy#3845)
  fix master build (envoyproxy#3844)
  logging: Requiring details for RELEASE_ASSERT (envoyproxy#3842)
  test: add test for consistency of RawStatData internal memory representation (envoyproxy#3843)
  common: jittered backoff implementation (envoyproxy#3791)
  format: run buildifier on .bzl files. (envoyproxy#3824)
  Support mutable metadata for endpoints (envoyproxy#3814)
  test: deflaking a test, improving debugability (envoyproxy#3829)
  Update ApiConfigSource docs with grpc_services only for GRPC configs (envoyproxy#3834)
  Add hard-coded /hot_restart_version test (envoyproxy#3832)
  healthchecks: Add interval_jitter_percent healthcheck option (envoyproxy#3816)
Signed-off-by: Snow Pettersen <snowp@squareup.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Description:
Implements a fully jittered exponential backoff algorithm and refactors the router retry timer to use it.
Risk Level: Low
Testing: Added automated tests
Docs Changes: N/A
Release Notes: N/A