Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WatchDog Extension hook #12416

Merged
merged 16 commits into from
Aug 13, 2020
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 25 additions & 1 deletion api/envoy/config/bootstrap/v3/bootstrap.proto
Original file line number Diff line number Diff line change
Expand Up @@ -305,10 +305,34 @@ message ClusterManager {
// Envoy process watchdog configuration. When configured, this monitors for
// nonresponsive threads and kills the process after the configured thresholds.
// See the :ref:`watchdog documentation <operations_performance_watchdog>` for more information.
// [#next-free-field: 7]
// [#next-free-field: 8]
message Watchdog {
option (udpa.annotations.versioning).previous_message_type = "envoy.config.bootstrap.v2.Watchdog";

message WatchdogAction {
// The events are fired in this order: KILL, MULTIKILL, MEGAMISS, MISS.
// Within an event type, actions execute in the order they are configured.
// For KILL/MULTIKILL there is a default PANIC that will kill the process,
// but it might be useful to specify several debug actions, and possibly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry can you clarify this a bit? Does specifying any config for KILL/MULTIKILL override the default PANIC action? It's not clear from the doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to specify the PANIC() call runs after the registered actions for that case if the process wasn't yet killed.

// an alternate FATAL action.
mattklein123 marked this conversation as resolved.
Show resolved Hide resolved
enum WatchdogEvent {
UNKNOWN = 0;
KILL = 1;
MULTIKILL = 2;
MEGAMISS = 3;
MISS = 4;
}

// Extension specific configuration for the action.
core.v3.TypedExtensionConfig config = 1;
mattklein123 marked this conversation as resolved.
Show resolved Hide resolved

WatchdogEvent event = 2;
}

// Register actions that will fire on given WatchDog events.
// See *WatchDogAction* for priority of events.
repeated WatchdogAction actions = 7;

// The duration after which Envoy counts a nonresponsive thread in the
// *watchdog_miss* statistic. If not specified the default is 200ms.
google.protobuf.Duration miss_timeout = 1;
Expand Down
29 changes: 28 additions & 1 deletion api/envoy/config/bootstrap/v4alpha/bootstrap.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

7 changes: 5 additions & 2 deletions docs/root/extending/extending.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,11 @@ types including:
* :ref:`Stat sinks <arch_overview_statistics>`
* :ref:`Tracers <arch_overview_tracing>`
* :ref:`Request ID <arch_overview_tracing>`
* Transport sockets
* BoringSSL private key methods
* :ref:`Transport sockets <envoy_v3_api_field_extensions.transport_sockets.tls.v3.CommonTlsContext.CertificateProvider.typed_config>`
* :ref:`BoringSSL FIPS <arch_overview_ssl_fips>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an extension point? Did you mean to ref link to the private key methods or have a different title?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will revert the BoringSSL FIPs since I'm not entirely sure. I was basing this on pr #6326 but might be missing some context.

* :ref:`Watchdog action <envoy_v3_api_msg_config.bootstrap.v3.Watchdog.WatchdogAction>`
* :ref:`HTTP internal redirects <arch_overview_internal_redirects>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an extension point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was based off of pr #10908.

In https://github.com/envoyproxy/envoy/blob/master/api/envoy/config/route/v3/route_components.proto#L1722 there's a core.v3.TypedExtensionConfig. I linked arch_overview_internal_redirects as in the PR where the change was made http_connection_management.rst was modified primarily in that section it seems to accommodate.

I will revert here, but if you think this makes sense I'm happy to add this in another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, OK. I guess I would mildly prefer that we ref link directly to the TypedExtensionConfig field to make it more obvious along with any arch docs. In any case, I don't want to hold up your PR, but any follow ups to improve the docs would be appreciated!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a separate PR for this: #12620



As of this writing there is no high level extension developer documentation. The
:repo:`existing extensions <source/extensions>` are a good way to learn what is possible.
Expand Down
11 changes: 7 additions & 4 deletions docs/root/operations/performance.rst
Original file line number Diff line number Diff line change
Expand Up @@ -56,10 +56,13 @@ Watchdog
--------

In addition to event loop statistics, Envoy also include a configurable
:ref:`watchdog <envoy_v3_api_field_config.bootstrap.v3.Bootstrap.watchdog>` system that can increment
statistics when Envoy is not responsive and optionally kill the server. The statistics are useful
for understanding at a high level whether Envoy's event loop is not responsive either because it is
doing too much work, blocking, or not being scheduled by the OS.
:ref:`watchdog <envoy_v3_api_field_config.bootstrap.v3.Bootstrap.watchdog>`
system that can increment statistics when Envoy is not responsive and
optionally kill the server. The system also has an extension point allowing for
custom actions to be taken based on watchdog events. The statistics are
useful for understanding at a high level whether Envoy's event loop is not
responsive either because it is doing too much work, blocking, or not being
scheduled by the OS.

The watchdog emits statistics in both the *server.* and *server.<thread_name>.* trees.
*<thread_name>* is equal to *main_thread*, *worker_0*, *worker_1*, etc.
Expand Down
1 change: 1 addition & 0 deletions docs/root/version_history/current.rst
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@ New Features
* tap: added :ref:`generic body matcher<envoy_v3_api_msg_config.tap.v3.HttpGenericBodyMatch>` to scan http requests and responses for text or hex patterns.
* tcp: switched the TCP connection pool to the new "shared" connection pool, sharing a common code base with HTTP and HTTP/2. Any unexpected behavioral changes can be temporarily reverted by setting `envoy.reloadable_features.new_tcp_connection_pool` to false.
* watchdog: support randomizing the watchdog's kill timeout to prevent synchronized kills via a maximium jitter parameter :ref:`max_kill_timeout_jitter<envoy_v3_api_field_config.bootstrap.v3.Watchdog.max_kill_timeout_jitter>`.
* watchdog: supports an extension point where actions can be registered to fire on watchdog events such as miss, megamiss, kill and multikill. See ref:`watchdog actions<envoy_v3_api_field_config.bootstrap.v3.Watchdog.actions>`.
* xds: added :ref:`extension config discovery<envoy_v3_api_msg_config.core.v3.ExtensionConfigSource>` support for HTTP filters.

Deprecated
Expand Down
26 changes: 25 additions & 1 deletion generated_api_shadow/envoy/config/bootstrap/v3/bootstrap.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

11 changes: 11 additions & 0 deletions include/envoy/server/BUILD
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,17 @@ envoy_cc_library(
],
)

envoy_cc_library(
name = "guarddog_config_interface",
hdrs = ["guarddog_config.h"],
deps = [
":guarddog_interface",
"//include/envoy/api:api_interface",
"//include/envoy/protobuf:message_validator_interface",
"@envoy_api//envoy/config/bootstrap/v3:pkg_cc_proto",
],
)

envoy_cc_library(
name = "health_checker_config_interface",
hdrs = ["health_checker_config.h"],
Expand Down
7 changes: 7 additions & 0 deletions include/envoy/server/configuration.h
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,13 @@ class Main {
* for at least MultiKillTimeout before we kill the process.
*/
virtual double wdMultiKillThreshold() const PURE;

/**
* @return Protobuf::RepeatedPtrField<envoy::config::bootstrap::v3::Watchdog::WatchdogAction>
* the WatchDog Actions that trigger on WatchDog Events.
*/
virtual Protobuf::RepeatedPtrField<envoy::config::bootstrap::v3::Watchdog::WatchdogAction>
wdActions() const PURE;
};

/**
Expand Down
65 changes: 65 additions & 0 deletions include/envoy/server/guarddog_config.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
#pragma once

#include <memory>

#include "envoy/api/api.h"
#include "envoy/common/pure.h"
#include "envoy/config/bootstrap/v3/bootstrap.pb.h"
#include "envoy/config/typed_config.h"
#include "envoy/event/dispatcher.h"
#include "envoy/protobuf/message_validator.h"
#include "envoy/server/guarddog.h"

#include "common/protobuf/protobuf.h"

namespace Envoy {
namespace Server {
namespace Configuration {

struct GuardDogActionFactoryContext {
Api::Api& api_;
Event::Dispatcher& dispatcher_; // not owned (this is the guard dog's dispatcher)
};

class GuardDogAction {
public:
virtual ~GuardDogAction() = default;
/**
* Callback function for when the GuardDog observes an event.
* @param event the event the GuardDog observes.
* @param thread_ltt_pairs pairs of the relevant thread to the event, and the
* last time touched (LTT) of those threads with their watchdog.
* @param now the current time.
*/
virtual void run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::WatchdogEvent event,
std::vector<std::pair<Thread::ThreadId, MonotonicTime>> thread_ltt_pairs,
MonotonicTime now) PURE;
};

using GuardDogActionPtr = std::unique_ptr<GuardDogAction>;

/**
* Implemented by each custom GuardDogAction and registered via Registry::registerFactory()
* or the convenience class RegisterFactory.
*/
class GuardDogActionFactory : public Config::TypedFactory {
public:
~GuardDogActionFactory() override = default;

/**
* Creates a particular GuardDog Action factory implementation.
*
* @param config supplies the configuration for the action.
* @param context supplies the GuardDog Action's context.
* @return GuardDogActionPtr the GuardDogAction object.
*/
virtual GuardDogActionPtr createGuardDogActionFromProto(
const envoy::config::bootstrap::v3::Watchdog::WatchdogAction& config,
mattklein123 marked this conversation as resolved.
Show resolved Hide resolved
GuardDogActionFactoryContext& context) PURE;

std::string category() const override { return "envoy.guarddog_actions"; }
};

} // namespace Configuration
} // namespace Server
} // namespace Envoy
4 changes: 4 additions & 0 deletions source/server/BUILD
Original file line number Diff line number Diff line change
Expand Up @@ -114,15 +114,19 @@ envoy_cc_library(
"//include/envoy/common:time_interface",
"//include/envoy/event:timer_interface",
"//include/envoy/server:configuration_interface",
"//include/envoy/server:guarddog_config_interface",
"//include/envoy/server:guarddog_interface",
"//include/envoy/server:watchdog_interface",
"//include/envoy/stats:stats_interface",
"//include/envoy/thread:thread_interface",
"//source/common/common:assert_lib",
"//source/common/common:minimal_logger_lib",
"//source/common/common:thread_lib",
"//source/common/config:utility_lib",
"//source/common/event:libevent_lib",
"//source/common/protobuf:utility_lib",
"//source/common/stats:symbol_table_lib",
"@envoy_api//envoy/config/bootstrap/v3:pkg_cc_proto",
],
)

Expand Down
1 change: 1 addition & 0 deletions source/server/configuration_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,7 @@ void MainImpl::initialize(const envoy::config::bootstrap::v3::Bootstrap& bootstr
std::chrono::milliseconds(PROTOBUF_GET_MS_OR_DEFAULT(watchdog, multikill_timeout, 0));
watchdog_multikill_threshold_ =
PROTOBUF_PERCENT_TO_DOUBLE_OR_DEFAULT(watchdog, multikill_threshold, 0.0);
watchdog_actions_ = bootstrap.watchdog().actions();

initializeStatsSinks(bootstrap, server);
}
Expand Down
6 changes: 6 additions & 0 deletions source/server/configuration_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,10 @@ class MainImpl : Logger::Loggable<Logger::Id::config>, public Main {
}

double wdMultiKillThreshold() const override { return watchdog_multikill_threshold_; }
Protobuf::RepeatedPtrField<envoy::config::bootstrap::v3::Watchdog::WatchdogAction>
wdActions() const override {
return watchdog_actions_;
}

private:
/**
Expand All @@ -129,6 +133,8 @@ class MainImpl : Logger::Loggable<Logger::Id::config>, public Main {
std::chrono::milliseconds watchdog_kill_timeout_;
std::chrono::milliseconds watchdog_multikill_timeout_;
double watchdog_multikill_threshold_;
Protobuf::RepeatedPtrField<envoy::config::bootstrap::v3::Watchdog::WatchdogAction>
watchdog_actions_;
};

/**
Expand Down
Loading