-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
config: backoff strategy implementation #3758
Conversation
Signed-off-by: Rama <rama.rao@salesforce.com>
@htuch @mattklein123 can you PTAL? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ramaraochavali for adding this!
return current_interval_; | ||
} | ||
|
||
uint64_t ExponentialBackOffStrategy::multiplyInterval() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a standalone method? Seems simple enough to inline. OTOH, if you want to later make this more sophisticated (see comment below), it should be standalone (just not named multiple..
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will inline it for now and refactor it based on how random exponential is implemented in my follow-up PR
uint64_t ExponentialBackOffStrategy::multiplyInterval() { | ||
uint64_t temp_interval = current_interval_; | ||
temp_interval *= multiplier_; | ||
return (temp_interval > max_interval_ ? max_interval_ : temp_interval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want any randomness in the backoff to desynchronize retries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will follow-up with a PR to introduce randomness based backoff. Then I will see if new class makes sense or refactor this. Till that time I will keep this as is.
const std::uint32_t multiplier) | ||
: initial_interval_(initial_interval), max_interval_(max_interval), multiplier_(multiplier), | ||
current_interval_(0) { | ||
ASSERT(!(multiplier_ <= 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should multiplier be a double
? I.e. do we want to force the minimum backoff to be 2x?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could change it to double. If we change it to double, I think minimum backoff can be 1.5 so I will change this like that. Is that fine?
/** | ||
* Generic interface for all backoff strategy implementations. | ||
*/ | ||
class BackOffStrategy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably live in include/envoy
since it is a pure interface.
/** | ||
* Returns the next backoff interval. | ||
*/ | ||
virtual std::uint64_t nextBackOff() PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just uint64_t
rather than std::uint64_t
.
namespace Envoy { | ||
|
||
ExponentialBackOffStrategy::ExponentialBackOffStrategy(const std::uint64_t initial_interval, | ||
const std::uint64_t max_interval, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to remove the std::
prefix on all the numeric types in this PR.
source/common/config/grpc_mux_impl.h
Outdated
@@ -42,7 +43,9 @@ class GrpcMuxImpl : public GrpcMux, | |||
void onRemoteClose(Grpc::Status::GrpcStatus status, const std::string& message) override; | |||
|
|||
// TODO(htuch): Make this configurable or some static. | |||
const uint32_t RETRY_DELAY_MS = 5000; | |||
const uint32_t RETRY_INITIAL_DELAY_MS = 5000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have sensible backoff, I think you should shrink this down to something very small for the initial retry, e.g. 500ms. This will fix the problem that Istio are seeing (@costinm) in some cases where Pilot isn't ready. The exponential backoff will take us to a large number pretty quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I will change
@@ -76,6 +76,7 @@ RetryStateImpl::RetryStateImpl(const RetryPolicy& route_policy, Http::HeaderMap& | |||
RetryStateImpl::~RetryStateImpl() { resetRetry(); } | |||
|
|||
void RetryStateImpl::enableBackoffTimer() { | |||
// TODO(ramaraochavali): Implement JitteredExponentialBackOff and refactor this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see, you have thought about jitter here..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, and, honestly, you should be using jittered backoff for the management connection also. I would not support any backoff policy that is not jittered...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123 i am ok with that. Let us review the PR for jittered backoff #3791 and I can move management connection to use this. @htuch thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think have jitter as default (and only built-in) scheme makes sense, best not to provide sharp objects that aren't needed.
source/common/config/grpc_mux_impl.h
Outdated
@@ -42,7 +43,9 @@ class GrpcMuxImpl : public GrpcMux, | |||
void onRemoteClose(Grpc::Status::GrpcStatus status, const std::string& message) override; | |||
|
|||
// TODO(htuch): Make this configurable or some static. | |||
const uint32_t RETRY_DELAY_MS = 5000; | |||
const uint32_t RETRY_INITIAL_DELAY_MS = 5000; | |||
const uint32_t RETRY_MAX_DELAY_MS = 120000; // Do not cross more than 2 mins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems too large IMHO for default, 2 minutes is a long time.. How about 30s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine. Just put some number so that we can discuss
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@htuch addressed all the comments. PTAL. |
@htuch when you get chance can you PTAL? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks, a few more comments/questions.
const double multiplier) | ||
: initial_interval_(initial_interval), max_interval_(max_interval), multiplier_(multiplier), | ||
current_interval_(0) { | ||
ASSERT(multiplier_ >= 1.5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want it to be more than 2? I thought people can use 1.5 as multiplier as well. that is why I modified this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest >= 1.0
; 1.125 would be a valid value, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. I think we discussed earlier that should be more than 1.0. So would validate against >1.0
|
||
namespace Envoy { | ||
|
||
ExponentialBackOffStrategy::ExponentialBackOffStrategy(const uint64_t initial_interval, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally don't make scalar args consts in Envoy (although it's not a terrible idea)..
} | ||
|
||
TEST(BackOffStrategyTest, ExponentialBackOfReset) { | ||
ExponentialBackOffStrategy exponential_back_off(10, 100, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do some tests with a fractional backoff multiplier?
@@ -16,6 +16,8 @@ GrpcMuxImpl::GrpcMuxImpl(const envoy::api::v2::core::Node& node, Grpc::AsyncClie | |||
: node_(node), async_client_(std::move(async_client)), service_method_(service_method), | |||
time_source_(time_source) { | |||
retry_timer_ = dispatcher.createTimer([this]() -> void { establishNewStream(); }); | |||
backoff_strategy_ptr_ = std::make_unique<ExponentialBackOffStrategy>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this affect any existing tests? If not, how come?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It did not actually.
Signed-off-by: Rama <rama.rao@salesforce.com>
@htuch addressed the changes. PTAL.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good modulo final comments.
|
||
uint64_t initial_interval_; | ||
uint64_t max_interval_; | ||
double multiplier_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: a bunch of these should be const as they are only set in the constructor.
virtual ~BackOffStrategy() {} | ||
|
||
/** | ||
* Returns the next backoff interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use Doxygen @return
here. Also, please express the units.
/** | ||
* Returns the next backoff interval. | ||
*/ | ||
virtual uint64_t nextBackOff() PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest renaming to nextBackoffMilliseconds
or nextBackoffMs
.
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@htuch addressed. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rad, thanks!
Signed-off-by: Rama rama.rao@salesforce.com
Description:
Implements an
ExponentialBackOffStrategy
that is used when Envoy retries management server connection.Risk Level: Low
Testing: Added automated tests
Docs Changes: N/A
Release Notes: N/A
Fixes #3737