Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WatchDog Extension hook #12416

Merged
merged 16 commits into from
Aug 13, 2020
Merged

WatchDog Extension hook #12416

merged 16 commits into from
Aug 13, 2020

Conversation

KBaichoo
Copy link
Contributor

@KBaichoo KBaichoo commented Jul 31, 2020

Commit Message: Establish an extension point for actions to run based on Watch Dog Events.
Additional Description:
Risk Level: Medium
Testing: unit tests
Docs Changes: N/A
Release Notes: Included.
Fixes #11388

KBaichoo added 3 commits July 31, 2020 17:44
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

WIP, added calls to callbacks and structures to store registered callbacks.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Exposed watchdog actions from configuration, hooked up guarddog to be able to check the registry for callbacks.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…tion object (instead of a plain cb) for the registered actions in case we need internal state, set up scaffolding for test.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Slight cleanup.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Added an additional GuardDogActionFactory to clarify different action behaviors, and use RELEASE_ASSERT with messages instead of different signals.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #12416 was opened by KBaichoo.

see: more, trace.

@KBaichoo KBaichoo mentioned this pull request Jul 31, 2020
@@ -339,6 +355,9 @@ message Watchdog {
// nonresponsive threads required for the *multikill_timeout*.
// If not specified the default is 0.
type.v3.Percent multikill_threshold = 5;

// Register actions that will fire on given WatchDog events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may want to say something about any guarantees about ordering that you intend to provide. They are mostly relevant for the case of KILL/MULTIKILL where it may be useful to specify several debug related actions followed by an alternate FATAL action used to replace the default guarddog initiated fatal action.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good. Some minor comments below.

include/envoy/server/guarddog_config.h Show resolved Hide resolved
@@ -124,6 +133,9 @@ class GuardDogImpl : public GuardDog {
const std::chrono::milliseconds loop_interval_;
Stats::Counter& watchdog_miss_counter_;
Stats::Counter& watchdog_megamiss_counter_;
using EventToActionsMap = std::unordered_map<WatchDogAction::WatchdogEvent,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use absl::flat_hash_map instead of std::unordered_map as requested by clang-tidy checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

const auto& actions = config.wdActions();
for (const auto& action : actions) {
if (action.event() == WatchDogAction::UNKNOWN) {
ENVOY_LOG_MISC(error, "WatchDogAction specified with UNKNOWN event");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw an exception on unknown/unspecified event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will throw a ProtoValidationException() as that seemed to be the EnvoyException class that suits this scenario.

@@ -375,27 +394,31 @@ TEST_P(GuardDogMissTest, MissCountTest) {

TEST_P(GuardDogTestBase, StartStopTest) {
NiceMock<Stats::MockStore> stats;
NiceMock<Configuration::MockMain> config(0, 0, 0, 0, 0);
std::vector<std::string> actions;
NiceMock<Configuration::MockMain> config(0, 0, 0, 0, 0, actions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also write:
NiceMockConfiguration::MockMain config(0, 0, 0, 0, 0, {});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up going with

NiceMockConfiguration::MockMain config(0, 0, 0, 0, 0, std::vector<std::string>{});

as with plain {} the compiler was struggling with the type deduction.

// the events vector passed to it.
// Instances of this class will be registered for GuardDogEvent through
// TestGuardDogActionFactory.
class LogGuardDogAction : public Configuration::GuardDogAction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: RecordGuardDogAction may be a more accurate name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Contributor Author

KBaichoo commented Aug 4, 2020

/ retest

@repokitteh-read-only
Copy link

🐴 hold your horses - no failures detected, yet.

🐱

Caused by: a #12416 (comment) was created by @KBaichoo.

see: more, trace.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
antoniovicente
antoniovicente previously approved these changes Aug 4, 2020
#include "envoy/server/watchdog.h"
#include "envoy/thread/thread.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Are envoy/config/bootstrap/v3/bootstrap.pb.h and/or envoy/common/time.h used by this header?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, must have happened at some point in development (but it's unnecessary)

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Contributor Author

KBaichoo commented Aug 5, 2020

/assign @dio

I think you're the on-call maintainer, so I'm supposed to assign it to you. PTAL.

Thanks!

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
antoniovicente
antoniovicente previously approved these changes Aug 7, 2020
@KBaichoo
Copy link
Contributor Author

KBaichoo commented Aug 7, 2020

IIUC now, the on-call maintainer just triages and figures out who to assign the bug.

I think I'm suppose to ask: @envoyproxy/senior-maintainers to take a look. PTAL.

@mattklein123 mattklein123 self-assigned this Aug 7, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with small API/doc comments. Thank you!

/wait

message Watchdog {
option (udpa.annotations.versioning).previous_message_type = "envoy.config.bootstrap.v2.Watchdog";

message WatchdogAction {
// In priority events are fired from KILL, MULTIKILL, MEGAMISS and MISS.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is an "in priority event" ? Typo? Clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The events are fired in this order: KILL, MULTIKILL, MEGAMISS, MISS.

Changed it to the above to clarify.

api/envoy/config/bootstrap/v3/bootstrap.proto Show resolved Hide resolved
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the doc nits. Don't feel you have to spend a lot of time on that. We can come back and fix them later so feel free to revert part of what you changed if it's not obvious. Thank you!

/wait

* Transport sockets
* BoringSSL private key methods
* :ref:`Transport sockets <envoy_v3_api_field_extensions.transport_sockets.tls.v3.CommonTlsContext.CertificateProvider.typed_config>`
* :ref:`BoringSSL FIPS <arch_overview_ssl_fips>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an extension point? Did you mean to ref link to the private key methods or have a different title?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will revert the BoringSSL FIPs since I'm not entirely sure. I was basing this on pr #6326 but might be missing some context.

* :ref:`Transport sockets <envoy_v3_api_field_extensions.transport_sockets.tls.v3.CommonTlsContext.CertificateProvider.typed_config>`
* :ref:`BoringSSL FIPS <arch_overview_ssl_fips>`
* :ref:`Watchdog action <envoy_v3_api_msg_config.bootstrap.v3.Watchdog.WatchdogAction>`
* :ref:`HTTP internal redirects <arch_overview_internal_redirects>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an extension point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was based off of pr #10908.

In https://github.com/envoyproxy/envoy/blob/master/api/envoy/config/route/v3/route_components.proto#L1722 there's a core.v3.TypedExtensionConfig. I linked arch_overview_internal_redirects as in the PR where the change was made http_connection_management.rst was modified primarily in that section it seems to accommodate.

I will revert here, but if you think this makes sense I'm happy to add this in another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, OK. I guess I would mildly prefer that we ref link directly to the TypedExtensionConfig field to make it more obvious along with any arch docs. In any case, I don't want to hold up your PR, but any follow ups to improve the docs would be appreciated!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a separate PR for this: #12620

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! Just a few small comments and we can ship.

/wait

Comment on lines 315 to 316
// For KILL/MULTIKILL there is a default PANIC that will kill the process,
// but it might be useful to specify several debug actions, and possibly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry can you clarify this a bit? Does specifying any config for KILL/MULTIKILL override the default PANIC action? It's not clear from the doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to specify the PANIC() call runs after the registered actions for that case if the process wasn't yet killed.

Comment on lines 64 to 65
if (action.event() == WatchDogAction::UNKNOWN) {
throw ProtoValidationException("WatchDogAction specified with UNKNOWN event", action);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please a PGV annotation on the proto for enum defined only. That will make this impossible and checked by the schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, thanks for pointing it out. Done.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
jmarantz pushed a commit that referenced this pull request Aug 13, 2020
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>

Commit Message: Updated the extending envoy documentation to include a missing extension point.
Additional Description: See PR #12416 for additional context of this.
Risk Level: low
Testing: built the documents to ensure they generate correctly.
Docs Changes:
Release Notes:
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 8c312f2 into envoyproxy:master Aug 13, 2020
mpuncel added a commit to mpuncel/envoy that referenced this pull request Aug 14, 2020
* master: (67 commits)
  logger: support log control in admin interface and command line option for Fancy Logger (envoyproxy#12369)
  test: fix http_timeout_integration_test flake (envoyproxy#12654)
  [fuzz]added an input check in writefilter fuzzer and added test cases (envoyproxy#12628)
  add 'explicit' restriction. (envoyproxy#12643)
  scoped_rds_integration_test migrate from api v2 to api v3. (envoyproxy#12633)
  fuzz: added fuzz test for listener filter tls_inspector (envoyproxy#12617)
  testing: fix multiple race conditions in simulated time tests (envoyproxy#12527)
  [tls] Move handshaking behavior into SslSocketInfo. (envoyproxy#12571)
  header: getting rid of exception-throwing behaviors in header files [the rest] (envoyproxy#12611)
  router: add new ratelimited retry backoff strategy (envoyproxy#12202)
  [redis_proxy] added a constraint for route.prefix().size() (envoyproxy#12637)
  network: add tcp listener backlog config (envoyproxy#12625)
  runtime: debug log that condition is always true when fractionalPercent numerator > denominator (envoyproxy#12068)
  WatchDog Extension hook (envoyproxy#12416)
  router: add dynamic metadata header formatter (envoyproxy#11858)
  statsd: revert visibility to public (envoyproxy#12621)
  Fix regression of /build_* in gitignore (envoyproxy#12630)
  Added a missing extension point to documentation. (envoyproxy#12620)
  Reverts proxy protocol test on windows (envoyproxy#12619)
  caching: Improved the tests and coverage of the CacheFilter tree (envoyproxy#12544)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[watchdog] Provide additional watchdog actions and/or extension points
4 participants