Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

config: backoff strategy implementation #3758

Merged
merged 7 commits into from
Jul 4, 2018

Conversation

ramaraochavali
Copy link
Contributor

Signed-off-by: Rama rama.rao@salesforce.com
Description:
Implements an ExponentialBackOffStrategy that is used when Envoy retries management server connection.
Risk Level: Low
Testing: Added automated tests
Docs Changes: N/A
Release Notes: N/A
Fixes #3737

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Contributor Author

@htuch @mattklein123 can you PTAL?

@htuch htuch self-assigned this Jun 29, 2018
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ramaraochavali for adding this!

return current_interval_;
}

uint64_t ExponentialBackOffStrategy::multiplyInterval() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a standalone method? Seems simple enough to inline. OTOH, if you want to later make this more sophisticated (see comment below), it should be standalone (just not named multiple..).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will inline it for now and refactor it based on how random exponential is implemented in my follow-up PR

uint64_t ExponentialBackOffStrategy::multiplyInterval() {
uint64_t temp_interval = current_interval_;
temp_interval *= multiplier_;
return (temp_interval > max_interval_ ? max_interval_ : temp_interval);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want any randomness in the backoff to desynchronize retries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will follow-up with a PR to introduce randomness based backoff. Then I will see if new class makes sense or refactor this. Till that time I will keep this as is.

const std::uint32_t multiplier)
: initial_interval_(initial_interval), max_interval_(max_interval), multiplier_(multiplier),
current_interval_(0) {
ASSERT(!(multiplier_ <= 0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should multiplier be a double? I.e. do we want to force the minimum backoff to be 2x?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could change it to double. If we change it to double, I think minimum backoff can be 1.5 so I will change this like that. Is that fine?

/**
* Generic interface for all backoff strategy implementations.
*/
class BackOffStrategy {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably live in include/envoy since it is a pure interface.

/**
* Returns the next backoff interval.
*/
virtual std::uint64_t nextBackOff() PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just uint64_t rather than std::uint64_t.

namespace Envoy {

ExponentialBackOffStrategy::ExponentialBackOffStrategy(const std::uint64_t initial_interval,
const std::uint64_t max_interval,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to remove the std:: prefix on all the numeric types in this PR.

@@ -42,7 +43,9 @@ class GrpcMuxImpl : public GrpcMux,
void onRemoteClose(Grpc::Status::GrpcStatus status, const std::string& message) override;

// TODO(htuch): Make this configurable or some static.
const uint32_t RETRY_DELAY_MS = 5000;
const uint32_t RETRY_INITIAL_DELAY_MS = 5000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have sensible backoff, I think you should shrink this down to something very small for the initial retry, e.g. 500ms. This will fix the problem that Istio are seeing (@costinm) in some cases where Pilot isn't ready. The exponential backoff will take us to a large number pretty quickly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I will change

@@ -76,6 +76,7 @@ RetryStateImpl::RetryStateImpl(const RetryPolicy& route_policy, Http::HeaderMap&
RetryStateImpl::~RetryStateImpl() { resetRetry(); }

void RetryStateImpl::enableBackoffTimer() {
// TODO(ramaraochavali): Implement JitteredExponentialBackOff and refactor this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, you have thought about jitter here..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, and, honestly, you should be using jittered backoff for the management connection also. I would not support any backoff policy that is not jittered...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 i am ok with that. Let us review the PR for jittered backoff #3791 and I can move management connection to use this. @htuch thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think have jitter as default (and only built-in) scheme makes sense, best not to provide sharp objects that aren't needed.

@@ -42,7 +43,9 @@ class GrpcMuxImpl : public GrpcMux,
void onRemoteClose(Grpc::Status::GrpcStatus status, const std::string& message) override;

// TODO(htuch): Make this configurable or some static.
const uint32_t RETRY_DELAY_MS = 5000;
const uint32_t RETRY_INITIAL_DELAY_MS = 5000;
const uint32_t RETRY_MAX_DELAY_MS = 120000; // Do not cross more than 2 mins
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems too large IMHO for default, 2 minutes is a long time.. How about 30s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine. Just put some number so that we can discuss

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Contributor Author

@htuch addressed all the comments. PTAL.

@ramaraochavali
Copy link
Contributor Author

@htuch when you get chance can you PTAL?

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks, a few more comments/questions.

const double multiplier)
: initial_interval_(initial_interval), max_interval_(max_interval), multiplier_(multiplier),
current_interval_(0) {
ASSERT(multiplier_ >= 1.5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want it to be more than 2? I thought people can use 1.5 as multiplier as well. that is why I modified this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest >= 1.0; 1.125 would be a valid value, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I think we discussed earlier that should be more than 1.0. So would validate against >1.0


namespace Envoy {

ExponentialBackOffStrategy::ExponentialBackOffStrategy(const uint64_t initial_interval,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally don't make scalar args consts in Envoy (although it's not a terrible idea)..

}

TEST(BackOffStrategyTest, ExponentialBackOfReset) {
ExponentialBackOffStrategy exponential_back_off(10, 100, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do some tests with a fractional backoff multiplier?

@@ -16,6 +16,8 @@ GrpcMuxImpl::GrpcMuxImpl(const envoy::api::v2::core::Node& node, Grpc::AsyncClie
: node_(node), async_client_(std::move(async_client)), service_method_(service_method),
time_source_(time_source) {
retry_timer_ = dispatcher.createTimer([this]() -> void { establishNewStream(); });
backoff_strategy_ptr_ = std::make_unique<ExponentialBackOffStrategy>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this affect any existing tests? If not, how come?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did not actually.

Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Contributor Author

@htuch addressed the changes. PTAL..

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good modulo final comments.


uint64_t initial_interval_;
uint64_t max_interval_;
double multiplier_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: a bunch of these should be const as they are only set in the constructor.

virtual ~BackOffStrategy() {}

/**
* Returns the next backoff interval.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Doxygen @return here. Also, please express the units.

/**
* Returns the next backoff interval.
*/
virtual uint64_t nextBackOff() PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest renaming to nextBackoffMilliseconds or nextBackoffMs.

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Contributor Author

@htuch addressed. PTAL.

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rad, thanks!

@htuch htuch merged commit 07b13c0 into envoyproxy:master Jul 4, 2018
@ramaraochavali ramaraochavali deleted the feature/ads_backoff branch July 4, 2018 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants