Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

healthchecks: Add interval_jitter_percent healthcheck option #3816

Merged
merged 11 commits into from
Jul 10, 2018

Conversation

julia-stripe
Copy link
Contributor

The contribution guidelines say to open an issue when suggesting a user-facing feature, but I thought it would be simpler to open a PR since it's a very small feature. Would very much welcome feedback!

Description:

This adds a new option to specify the jitter on a healthcheck as a percentage of the healthcheck interval instead of as an absolute number.

Motivation:

Right now the only way to add jitter to a healthcheck is to add a fixed amount of jitter with interval_jitter_ms.

We want to add 30s of jitter to a healthcheck when it's using the no_traffic_interval interval (60s), and 2s of jitter when it's using its regular healthcheck healthcheck interval (4s). So we'd like to set the healthcheck jitter as a percentage instead of a fixed number.

This is important for us because we're hoping it'll help us mitigate some thundering herd issues with healthchecks.

Risk Level:

Low

Testing:

TODO

Docs Changes:

Document the interval_jitter_percent option. (TODO: update version_history.rst)

Signed-off-by: Julia Evans <julia@stripe.com>
@julia-stripe julia-stripe force-pushed the interval-jitter-percent branch from 838a508 to 0f8d8d5 Compare July 9, 2018 19:40
Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a useful feature; the direction looks good, just a few minor things to cleanup.

@@ -1403,6 +1403,11 @@ const std::string Json::Schema::CLUSTER_HEALTH_CHECK_SCHEMA(R"EOF(
"minimum" : 0,
"exclusiveMinimum" : true
},
"interval_jitter_percent" : {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1 config support is deprecated, and we're not adding new features to it; please revert this file (and the related test).

// An optional jitter amount as a percentage of interval_ms. If specified,
// during every interval Envoy will add 0 to interval_ms *
// interval_jitter_percent / 100 to the wait time.
google.protobuf.UInt32Value interval_jitter_percent = 18;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be can be uint32 instead of google.protobuf.UInt32Value; the default of 0 has the desired meaning of "do nothing"

google.protobuf.Duration interval_jitter = 3;

// An optional jitter amount as a percentage of interval_ms. If specified,
// during every interval Envoy will add 0 to interval_ms *
// interval_jitter_percent / 100 to the wait time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document what happens if both this and interval_jitter are set. Or only allow one to be set and document that (and validate that somewhere).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@@ -89,6 +90,10 @@ std::chrono::milliseconds HealthCheckerImplBase::interval(HealthState state,
base_time_ms += (random_.random() % interval_jitter_.count());
}

if (interval_jitter_percent_ > 0) {
base_time_ms += (random_.random() * interval_jitter_percent_ / 100.0 * interval_.count());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid floating point division here?

Signed-off-by: Julia Evans <julia@stripe.com>
Signed-off-by: Julia Evans <julia@stripe.com>
@julia-stripe julia-stripe force-pushed the interval-jitter-percent branch from 6a8c5ea to 4c642d1 Compare July 9, 2018 20:05
Signed-off-by: Julia Evans <julia@stripe.com>
Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also need to add tests for the new functionality. They'll probably be similar to the tests for 'interval_jitter'.

internal Envoy will add 0 to *interval_jitter_ms* milliseconds to the wait time.
interval Envoy will add 0 to *interval_jitter_ms* milliseconds to the wait time.

interval_jitter_percent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert these docs also; these are the v1 config docs.

Signed-off-by: Julia Evans <julia@stripe.com>
Signed-off-by: Julia Evans <julia@stripe.com>
Signed-off-by: Julia Evans <julia@stripe.com>
@julia-stripe
Copy link
Contributor Author

can you give me a pointer to the interval_jitter tests? I'm having trouble finding them.

@ggreenway
Copy link
Contributor

can you give me a pointer to the interval_jitter tests? I'm having trouble finding them.

It turns out the tests for jitter aren't very good. They're implicit in other tests (for instance (HttpHealthCheckerImplTest, Success)), and they leave the MockRandomGenerator in the default state of returning 0 for all random numbers.

Please add some better tests for this new functionality, and optionally (this would be hugely appreciated!) also add them for the existing interval_jitter. I think the basic structure will be setting an expectation on the MockRandomGenerator to return some pre-defined values for random-ness, and test the time parameter of enableTimer() is always within the expected range. If you're unfamiliar with gtest/gmock, feel free to ask for more details on how to write and structure the tests.

@julia-stripe
Copy link
Contributor Author

thank you! I've never used gtest/gmock before so any pointers to docs / good examples to copy would be super helpful. Will poke around and see what I can figure out on my own.

@ggreenway
Copy link
Contributor

A lot of the basics are covered very well in the docs. A few bits that will be useful for these tests:

  • You'll want to specify the "random" numbers that are used, so that you can cover corner cases, and so that the tests are deterministic. To do that, most of the test classes for health checks have a member random or random_. You can set the next "random" number with EXPECT_CALL(random_, random()).WillOnce(Return(chosen_value));

  • The effect of the calculated interval is that enableTimer() is called with the calculated interval. For cases where you know exactly what the value should be, you can set the expectation with EXPECT_CALL(*test_sessions_[0]->interval_timer_, enableTimer(std::chrono::milliseconds(calculated_value)); For cases where, for instance, you're iterating over a bunch of pre-chosen random values and just want to verify that all intervals are within the correct range, you can save the interval and then do further comparisons on it. Note that _ means match any value:

std::chrono::milliseconds timer_value;
EXPECT_CALL(*test_sessions_[0]->interval_timer_, enableTimer(_)).WillOnce(SaveArg<0>(&timer_value));
EXPECT_LT(timer_value, std::chrono::milliseconds(500));
EXPECT_GT(timer_value, std::chrono::milliseconds(250));

Feel free to ask any more questions you have; gtest/gmock takes some getting used to. Or feel free to ask in slack in envoy-dev.

Signed-off-by: Julia Evans <julia@stripe.com>
@julia-stripe
Copy link
Contributor Author

added a couple of basic tests that test interval_jitter and interval_jitter_percent. The interval_jitter_percent test revealed a bug :)

Can try to figure out how to use INSTANTIATE_TEST_CASE_P to call those tests with an array of different parameters if you think that would be useful!

Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start on the tests, but I think we should test with more "random" values, and for each one, validate that the param to enableTimer() is within the expected range.

One possible approach: put a chunk of the test in a loop over numbers 0-N, with N being at least as big as the divisor in the modulo operation in the code. Have random() return n, then validate the value passed to enableTimer(). N can be lower if you change the configuration in the test to have shorter time intervals (in case the test is taking a long time to run).

For the jitterPercent test, let's add separate cases for the no-traffic-interval vs the traffic-interval, to directly validate that the thing you're originally fixing is fixed.

And let's add another test for when both types of jitter are configured, to validate that the documented behavior matches the actual behavior.

Thanks for writing tests for the under-tested original jitter code.

@ggreenway
Copy link
Contributor

Can try to figure out how to use INSTANTIATE_TEST_CASE_P to call those tests with an array of different parameters if you think that would be useful!

That's one option, but I think a loop will be good enough and then you don't have to fight gtest in figuring out how to invoke INSTANTIATE_TEST_CASE_P, but either is fine.

Signed-off-by: Julia Evans <julia@stripe.com>
@julia-stripe julia-stripe force-pushed the interval-jitter-percent branch from 5d7aea8 to fa2a4fd Compare July 10, 2018 15:02
Signed-off-by: Julia Evans <julia@stripe.com>
Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. One minor naming thing, but otherwise LGTM.

}
}

TEST_F(HttpHealthCheckerImplTest, SuccessIntervalJitterPercentNoInterval) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SuccessIntervalJitterPercentNoTraffic?

Signed-off-by: Julia Evans <julia@stripe.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks.

@mattklein123 mattklein123 merged commit 03cfb87 into envoyproxy:master Jul 10, 2018
snowp added a commit to snowp/envoy that referenced this pull request Jul 12, 2018
* origin/master:
  config: making v2-config-only a boolean flag (envoyproxy#3847)
  lc trie: add exclusive flag. (envoyproxy#3825)
  upstream: introduce PriorityStateManager, refactor EDS (envoyproxy#3783)
  test: deflaking header_integration_test (envoyproxy#3849)
  http: new style WebSockets, where headers and data are processed by the filter chain. (envoyproxy#3776)
  common: minor doc updates (envoyproxy#3845)
  fix master build (envoyproxy#3844)
  logging: Requiring details for RELEASE_ASSERT (envoyproxy#3842)
  test: add test for consistency of RawStatData internal memory representation (envoyproxy#3843)
  common: jittered backoff implementation (envoyproxy#3791)
  format: run buildifier on .bzl files. (envoyproxy#3824)
  Support mutable metadata for endpoints (envoyproxy#3814)
  test: deflaking a test, improving debugability (envoyproxy#3829)
  Update ApiConfigSource docs with grpc_services only for GRPC configs (envoyproxy#3834)
  Add hard-coded /hot_restart_version test (envoyproxy#3832)
  healthchecks: Add interval_jitter_percent healthcheck option (envoyproxy#3816)

Signed-off-by: Snow Pettersen <snowp@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants