-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
healthchecks: Add interval_jitter_percent healthcheck option #3816
healthchecks: Add interval_jitter_percent healthcheck option #3816
Conversation
Signed-off-by: Julia Evans <julia@stripe.com>
838a508
to
0f8d8d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a useful feature; the direction looks good, just a few minor things to cleanup.
source/common/json/config_schemas.cc
Outdated
@@ -1403,6 +1403,11 @@ const std::string Json::Schema::CLUSTER_HEALTH_CHECK_SCHEMA(R"EOF( | |||
"minimum" : 0, | |||
"exclusiveMinimum" : true | |||
}, | |||
"interval_jitter_percent" : { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v1 config support is deprecated, and we're not adding new features to it; please revert this file (and the related test).
// An optional jitter amount as a percentage of interval_ms. If specified, | ||
// during every interval Envoy will add 0 to interval_ms * | ||
// interval_jitter_percent / 100 to the wait time. | ||
google.protobuf.UInt32Value interval_jitter_percent = 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be can be uint32 instead of google.protobuf.UInt32Value; the default of 0 has the desired meaning of "do nothing"
google.protobuf.Duration interval_jitter = 3; | ||
|
||
// An optional jitter amount as a percentage of interval_ms. If specified, | ||
// during every interval Envoy will add 0 to interval_ms * | ||
// interval_jitter_percent / 100 to the wait time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document what happens if both this and interval_jitter are set. Or only allow one to be set and document that (and validate that somewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
@@ -89,6 +90,10 @@ std::chrono::milliseconds HealthCheckerImplBase::interval(HealthState state, | |||
base_time_ms += (random_.random() % interval_jitter_.count()); | |||
} | |||
|
|||
if (interval_jitter_percent_ > 0) { | |||
base_time_ms += (random_.random() * interval_jitter_percent_ / 100.0 * interval_.count()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid floating point division here?
Signed-off-by: Julia Evans <julia@stripe.com>
Signed-off-by: Julia Evans <julia@stripe.com>
6a8c5ea
to
4c642d1
Compare
Signed-off-by: Julia Evans <julia@stripe.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also need to add tests for the new functionality. They'll probably be similar to the tests for 'interval_jitter'.
internal Envoy will add 0 to *interval_jitter_ms* milliseconds to the wait time. | ||
interval Envoy will add 0 to *interval_jitter_ms* milliseconds to the wait time. | ||
|
||
interval_jitter_percent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's revert these docs also; these are the v1 config docs.
Signed-off-by: Julia Evans <julia@stripe.com>
Signed-off-by: Julia Evans <julia@stripe.com>
Signed-off-by: Julia Evans <julia@stripe.com>
can you give me a pointer to the |
It turns out the tests for jitter aren't very good. They're implicit in other tests (for instance Please add some better tests for this new functionality, and optionally (this would be hugely appreciated!) also add them for the existing |
thank you! I've never used gtest/gmock before so any pointers to docs / good examples to copy would be super helpful. Will poke around and see what I can figure out on my own. |
A lot of the basics are covered very well in the docs. A few bits that will be useful for these tests:
Feel free to ask any more questions you have; gtest/gmock takes some getting used to. Or feel free to ask in slack in envoy-dev. |
Signed-off-by: Julia Evans <julia@stripe.com>
added a couple of basic tests that test Can try to figure out how to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good start on the tests, but I think we should test with more "random" values, and for each one, validate that the param to enableTimer() is within the expected range.
One possible approach: put a chunk of the test in a loop over numbers 0-N, with N being at least as big as the divisor in the modulo operation in the code. Have random() return n, then validate the value passed to enableTimer(). N can be lower if you change the configuration in the test to have shorter time intervals (in case the test is taking a long time to run).
For the jitterPercent test, let's add separate cases for the no-traffic-interval vs the traffic-interval, to directly validate that the thing you're originally fixing is fixed.
And let's add another test for when both types of jitter are configured, to validate that the documented behavior matches the actual behavior.
Thanks for writing tests for the under-tested original jitter code.
That's one option, but I think a loop will be good enough and then you don't have to fight gtest in figuring out how to invoke INSTANTIATE_TEST_CASE_P, but either is fine. |
Signed-off-by: Julia Evans <julia@stripe.com>
5d7aea8
to
fa2a4fd
Compare
Signed-off-by: Julia Evans <julia@stripe.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. One minor naming thing, but otherwise LGTM.
} | ||
} | ||
|
||
TEST_F(HttpHealthCheckerImplTest, SuccessIntervalJitterPercentNoInterval) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SuccessIntervalJitterPercentNoTraffic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks.
* origin/master: config: making v2-config-only a boolean flag (envoyproxy#3847) lc trie: add exclusive flag. (envoyproxy#3825) upstream: introduce PriorityStateManager, refactor EDS (envoyproxy#3783) test: deflaking header_integration_test (envoyproxy#3849) http: new style WebSockets, where headers and data are processed by the filter chain. (envoyproxy#3776) common: minor doc updates (envoyproxy#3845) fix master build (envoyproxy#3844) logging: Requiring details for RELEASE_ASSERT (envoyproxy#3842) test: add test for consistency of RawStatData internal memory representation (envoyproxy#3843) common: jittered backoff implementation (envoyproxy#3791) format: run buildifier on .bzl files. (envoyproxy#3824) Support mutable metadata for endpoints (envoyproxy#3814) test: deflaking a test, improving debugability (envoyproxy#3829) Update ApiConfigSource docs with grpc_services only for GRPC configs (envoyproxy#3834) Add hard-coded /hot_restart_version test (envoyproxy#3832) healthchecks: Add interval_jitter_percent healthcheck option (envoyproxy#3816) Signed-off-by: Snow Pettersen <snowp@squareup.com>
The contribution guidelines say to open an issue when suggesting a user-facing feature, but I thought it would be simpler to open a PR since it's a very small feature. Would very much welcome feedback!
Description:
This adds a new option to specify the jitter on a healthcheck as a percentage of the healthcheck interval instead of as an absolute number.
Motivation:
Right now the only way to add jitter to a healthcheck is to add a fixed amount of jitter with
interval_jitter_ms
.We want to add 30s of jitter to a healthcheck when it's using the
no_traffic_interval
interval (60s), and 2s of jitter when it's using its regular healthcheck healthcheck interval (4s). So we'd like to set the healthcheck jitter as a percentage instead of a fixed number.This is important for us because we're hoping it'll help us mitigate some thundering herd issues with healthchecks.
Risk Level:
Low
Testing:
TODO
Docs Changes:
Document the
interval_jitter_percent
option. (TODO: update version_history.rst)