-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WatchDog Extension hook #12416
WatchDog Extension hook #12416
Conversation
Signed-off-by: Kevin Baichoo <kbaichoo@google.com> WIP, added calls to callbacks and structures to store registered callbacks. Signed-off-by: Kevin Baichoo <kbaichoo@google.com> Exposed watchdog actions from configuration, hooked up guarddog to be able to check the registry for callbacks. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…tion object (instead of a plain cb) for the registered actions in case we need internal state, set up scaffolding for test. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com> Slight cleanup. Signed-off-by: Kevin Baichoo <kbaichoo@google.com> Added an additional GuardDogActionFactory to clarify different action behaviors, and use RELEASE_ASSERT with messages instead of different signals. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@@ -339,6 +355,9 @@ message Watchdog { | |||
// nonresponsive threads required for the *multikill_timeout*. | |||
// If not specified the default is 0. | |||
type.v3.Percent multikill_threshold = 5; | |||
|
|||
// Register actions that will fire on given WatchDog events. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may want to say something about any guarantees about ordering that you intend to provide. They are mostly relevant for the case of KILL/MULTIKILL where it may be useful to specify several debug related actions followed by an alternate FATAL action used to replace the default guarddog initiated fatal action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty good. Some minor comments below.
source/server/guarddog_impl.h
Outdated
@@ -124,6 +133,9 @@ class GuardDogImpl : public GuardDog { | |||
const std::chrono::milliseconds loop_interval_; | |||
Stats::Counter& watchdog_miss_counter_; | |||
Stats::Counter& watchdog_megamiss_counter_; | |||
using EventToActionsMap = std::unordered_map<WatchDogAction::WatchdogEvent, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use absl::flat_hash_map instead of std::unordered_map as requested by clang-tidy checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
source/server/guarddog_impl.cc
Outdated
const auto& actions = config.wdActions(); | ||
for (const auto& action : actions) { | ||
if (action.event() == WatchDogAction::UNKNOWN) { | ||
ENVOY_LOG_MISC(error, "WatchDogAction specified with UNKNOWN event"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we throw an exception on unknown/unspecified event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will throw a ProtoValidationException()
as that seemed to be the EnvoyException
class that suits this scenario.
test/server/guarddog_impl_test.cc
Outdated
@@ -375,27 +394,31 @@ TEST_P(GuardDogMissTest, MissCountTest) { | |||
|
|||
TEST_P(GuardDogTestBase, StartStopTest) { | |||
NiceMock<Stats::MockStore> stats; | |||
NiceMock<Configuration::MockMain> config(0, 0, 0, 0, 0); | |||
std::vector<std::string> actions; | |||
NiceMock<Configuration::MockMain> config(0, 0, 0, 0, 0, actions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also write:
NiceMockConfiguration::MockMain config(0, 0, 0, 0, 0, {});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up going with
NiceMockConfiguration::MockMain config(0, 0, 0, 0, 0, std::vector<std::string>{});
as with plain {}
the compiler was struggling with the type deduction.
test/server/guarddog_impl_test.cc
Outdated
// the events vector passed to it. | ||
// Instances of this class will be registered for GuardDogEvent through | ||
// TestGuardDogActionFactory. | ||
class LogGuardDogAction : public Configuration::GuardDogAction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: RecordGuardDogAction may be a more accurate name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
/ retest |
🐴 hold your horses - no failures detected, yet. |
include/envoy/server/guarddog.h
Outdated
#include "envoy/server/watchdog.h" | ||
#include "envoy/thread/thread.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Are envoy/config/bootstrap/v3/bootstrap.pb.h and/or envoy/common/time.h used by this header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, must have happened at some point in development (but it's unnecessary)
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
/assign @dio I think you're the on-call maintainer, so I'm supposed to assign it to you. PTAL. Thanks! |
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
IIUC now, the on-call maintainer just triages and figures out who to assign the bug. I think I'm suppose to ask: @envoyproxy/senior-maintainers to take a look. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with small API/doc comments. Thank you!
/wait
message Watchdog { | ||
option (udpa.annotations.versioning).previous_message_type = "envoy.config.bootstrap.v2.Watchdog"; | ||
|
||
message WatchdogAction { | ||
// In priority events are fired from KILL, MULTIKILL, MEGAMISS and MISS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is an "in priority event" ? Typo? Clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The events are fired in this order: KILL, MULTIKILL, MEGAMISS, MISS.
Changed it to the above to clarify.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo the doc nits. Don't feel you have to spend a lot of time on that. We can come back and fix them later so feel free to revert part of what you changed if it's not obvious. Thank you!
/wait
docs/root/extending/extending.rst
Outdated
* Transport sockets | ||
* BoringSSL private key methods | ||
* :ref:`Transport sockets <envoy_v3_api_field_extensions.transport_sockets.tls.v3.CommonTlsContext.CertificateProvider.typed_config>` | ||
* :ref:`BoringSSL FIPS <arch_overview_ssl_fips>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an extension point? Did you mean to ref link to the private key methods or have a different title?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will revert the BoringSSL FIPs since I'm not entirely sure. I was basing this on pr #6326 but might be missing some context.
docs/root/extending/extending.rst
Outdated
* :ref:`Transport sockets <envoy_v3_api_field_extensions.transport_sockets.tls.v3.CommonTlsContext.CertificateProvider.typed_config>` | ||
* :ref:`BoringSSL FIPS <arch_overview_ssl_fips>` | ||
* :ref:`Watchdog action <envoy_v3_api_msg_config.bootstrap.v3.Watchdog.WatchdogAction>` | ||
* :ref:`HTTP internal redirects <arch_overview_internal_redirects>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an extension point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was based off of pr #10908.
In https://github.com/envoyproxy/envoy/blob/master/api/envoy/config/route/v3/route_components.proto#L1722 there's a core.v3.TypedExtensionConfig
. I linked arch_overview_internal_redirects
as in the PR where the change was made http_connection_management.rst
was modified primarily in that section it seems to accommodate.
I will revert here, but if you think this makes sense I'm happy to add this in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, OK. I guess I would mildly prefer that we ref link directly to the TypedExtensionConfig field to make it more obvious along with any arch docs. In any case, I don't want to hold up your PR, but any follow ups to improve the docs would be appreciated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a separate PR for this: #12620
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work! Just a few small comments and we can ship.
/wait
// For KILL/MULTIKILL there is a default PANIC that will kill the process, | ||
// but it might be useful to specify several debug actions, and possibly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry can you clarify this a bit? Does specifying any config for KILL/MULTIKILL override the default PANIC action? It's not clear from the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to specify the PANIC() call runs after the registered actions for that case if the process wasn't yet killed.
source/server/guarddog_impl.cc
Outdated
if (action.event() == WatchDogAction::UNKNOWN) { | ||
throw ProtoValidationException("WatchDogAction specified with UNKNOWN event", action); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please a PGV annotation on the proto for enum defined only. That will make this impossible and checked by the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, thanks for pointing it out. Done.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com> Commit Message: Updated the extending envoy documentation to include a missing extension point. Additional Description: See PR #12416 for additional context of this. Risk Level: low Testing: built the documents to ensure they generate correctly. Docs Changes: Release Notes:
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
* master: (67 commits) logger: support log control in admin interface and command line option for Fancy Logger (envoyproxy#12369) test: fix http_timeout_integration_test flake (envoyproxy#12654) [fuzz]added an input check in writefilter fuzzer and added test cases (envoyproxy#12628) add 'explicit' restriction. (envoyproxy#12643) scoped_rds_integration_test migrate from api v2 to api v3. (envoyproxy#12633) fuzz: added fuzz test for listener filter tls_inspector (envoyproxy#12617) testing: fix multiple race conditions in simulated time tests (envoyproxy#12527) [tls] Move handshaking behavior into SslSocketInfo. (envoyproxy#12571) header: getting rid of exception-throwing behaviors in header files [the rest] (envoyproxy#12611) router: add new ratelimited retry backoff strategy (envoyproxy#12202) [redis_proxy] added a constraint for route.prefix().size() (envoyproxy#12637) network: add tcp listener backlog config (envoyproxy#12625) runtime: debug log that condition is always true when fractionalPercent numerator > denominator (envoyproxy#12068) WatchDog Extension hook (envoyproxy#12416) router: add dynamic metadata header formatter (envoyproxy#11858) statsd: revert visibility to public (envoyproxy#12621) Fix regression of /build_* in gitignore (envoyproxy#12630) Added a missing extension point to documentation. (envoyproxy#12620) Reverts proxy protocol test on windows (envoyproxy#12619) caching: Improved the tests and coverage of the CacheFilter tree (envoyproxy#12544) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Commit Message: Establish an extension point for actions to run based on Watch Dog Events.
Additional Description:
Risk Level: Medium
Testing: unit tests
Docs Changes: N/A
Release Notes: Included.
Fixes #11388