-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Watchdog action that will signal a particular thread to abort. #12860
Conversation
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing this, it should make it easier to get the stack of the stuck thread when the guarddog triggers.
api/envoy/extensions/watchdog/abort_action/v3alpha/abort_action.proto
Outdated
Show resolved
Hide resolved
api/envoy/extensions/watchdog/abort_action/v3alpha/abort_action.proto
Outdated
Show resolved
Hide resolved
// Signal to test thread that tid has been set. | ||
{ | ||
absl::MutexLock lock(&mutex_); | ||
outstanding_notifies_ += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using absl::Notification instead of implementing your own version.
Here you could call n.Notify();
You would call n.WaitForNotification() before code that depends on tid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::KILL, tid_ltt_pairs, now); | ||
}; | ||
|
||
EXPECT_DEATH(die_function(), ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume there's no specific string that we can assert in the EXPECT_DEATH
Does Envoy install a failure signal handler? Should we consider testing by installing a signal handler for SIGABRT that records which thread received the SIGABRT signal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Envoy installs failure handlers; it does seem that under different compilation modes this test will report different failure strings -- specifically under ASAN
it seems to trigger some asan specific code that isn't triggered in the other test below or when running this test in other models (ASAN:DEADLYSIGNAL
is part of the print out)
I can try to add some extra logic in if you think it'd be important to figure out the exact string that caused death.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that right now you can't distinguish between death due to the kill signal vs the call to panic. The kill signal code could be removed and this test would still pass.
|
||
class AbortActionTest : public testing::Test { | ||
protected: | ||
AbortActionTest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to restore signal handlers in the test destructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since sigaction
s are done within EXPECT_DEATH
and we end up forking, it doesn't affect the test suite process but rather the child process which will end up dying, so we shouldn't need to restore signal handlers.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the code review
|
||
// Abort from the action since the signaled thread hasn't yet crashed the process. | ||
PANIC( | ||
fmt::format("Failed to kill thread with id {}, aborting from AbortAction instead.", raw_tid)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 3 reasons I think this would be handy (vs deferring to the underlying watchdog default actions):
- Easier to see where the failure occurred, and that signaling failed without having to jump around as much to understand the behavior.
- It's also keeps this self-contained if there were changes on the underlying system behavior in watchdog kill / multikill default
- Allows flexibility for using this in places where there isn't a default PANIC afterwards
// Assume POSIX-compatible system and signal to the thread. | ||
ENVOY_LOG_MISC(error, "AbortAction sending abort signal to thread with tid {}.", raw_tid); | ||
|
||
if (kill(toPlatformTid(raw_tid), SIGABRT) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I think I'll keep it simple for now given that the default envoy handlers views SIGABRT as a fatal signal.
* This is currently only implemented for systems that support kill to send | ||
* signals. | ||
*/ | ||
class AbortAction : public Server::Configuration::GuardDogAction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The terminology of these two get a bit messy; in the docs and at a high level we talk about WatchDog
and the Watch dog system
, while in the actual implementation we have a watchdog
per thread and a guarddog
that manages the watchdogs and will actually be executing these functions.
I went with Watchdog
since it seemed more friendly to folks who aren't digging down into the implementation details.
|
||
class AbortActionTest : public testing::Test { | ||
protected: | ||
AbortActionTest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since sigaction
s are done within EXPECT_DEATH
and we end up forking, it doesn't affect the test suite process but rather the child process which will end up dying, so we shouldn't need to restore signal handlers.
// Signal to test thread that tid has been set. | ||
{ | ||
absl::MutexLock lock(&mutex_); | ||
outstanding_notifies_ += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::KILL, tid_ltt_pairs, now); | ||
}; | ||
|
||
EXPECT_DEATH(die_function(), ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Envoy installs failure handlers; it does seem that under different compilation modes this test will report different failure strings -- specifically under ASAN
it seems to trigger some asan specific code that isn't triggered in the other test below or when running this test in other models (ASAN:DEADLYSIGNAL
is part of the print out)
I can try to add some extra logic in if you think it'd be important to figure out the exact string that caused death.
/retest |
Retrying Azure Pipelines, to retry CircleCI checks, use |
|
||
void handler(int sig, siginfo_t* /*siginfo*/, void* /*context*/) { | ||
std::cout << "Eating signal :" << std::to_string(sig) << ". will ignore it." << std::endl; | ||
signal(SIGABRT, SIG_IGN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this call to signal(SIGABRT, SIG_IGN); required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly, but it prevents subsequent printouts of Eating signal:
PTAL @envoyproxy/senior-maintainers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/wait
// [#protodoc-title: Watchdog Action that sends a SIGABRT to kill the process.] | ||
// [#extension: envoy.watchdog.abort_action] | ||
|
||
// Configuration for the profile watchdog action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment seems wrong/out of date. Note that this comment is the primary documentation for this in the generated docs. It should include some information about what this does, when/why use it, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Added additional information from the action's c++ implementation as you suggested.
namespace AbortAction { | ||
|
||
/** | ||
* A GuardDogAction that will terminate the process by sending SIGABRT to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps some/all of this comment block should be included with the proto config, so it gets included in generated docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM, unless we want to pursue making this the default.
// more useful than the default watchdog kill behaviors since those PANIC | ||
// from the watchdog's thread. | ||
|
||
// This is currently only implemented for systems that support kill to send |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be the default, on supported platforms? Is there any downside to this? It seems to me that it gives better information (or a chance of it), and the same net behavior (process terminated). @envoyproxy/maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with that being default in a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also fine here if we want this, seems generally useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I can make it be a default action in a future PR.
The main downside I see of making it a default now would be different behaviors across platforms due to a lack of support on Windows. Should we wait for parity before we make it a default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok to make it the default, because the only platform difference should be diagnostic output. The process is terminated on all platforms.
I'm also fine with making it the default in a future PR. It's your choice, @KBaichoo .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I'll submit a follow up to this adding this action to the watchdog actions by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to make this the default.
PTAL @envoyproxy/api-shepherds |
/lgtm api |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, cool stuff. LGTM with some small comments.
/wait-any
|
||
#ifdef WIN32 | ||
// TODO(kbaichoo): add support for this with windows. | ||
ENVOY_LOG_MISC(error, "Watchdog AbortAction is unimplemented for Windows."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually needed? I think Windows has its own extension bzl file so I'm not sure this is even compiled? Could we actually avoid the ifdef in here now or just replace with an proprocessor error if this extension gets compiled on windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know this is a possibility, thanks for pointing it out.
I've added the extension to WINDOWS_SKIP_TARGETS
which IIUC will let it be skipped by windows so I can remove the ifdef #Win32
from code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out you also need to add tags = ["skip_on_windows"]
on tests since excluding the extension in the .bzl
doesn't effect the tests.
// Abort from the action since the signaled thread hasn't yet crashed the process. | ||
PANIC(fmt::format("Failed to kill thread with id {}, aborting from Watchdog AbortAction instead.", | ||
raw_tid)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't kept fully up to date on this, but isn't the default action at the end of the action chain to terminate the process? Do we need this or would the action chain just end up aborting anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. I discussed this with antonio up above. I've copied the reasoning below (copied from #12860 (comment))
There are 3 reasons I think this [having a panic in the action] would be handy (vs deferring to the underlying watchdog default actions[its panic there]):
- Easier to see where the failure occurred, and that signaling failed without having to jump around as much to understand the behavior.
- It's also keeps this self-contained if there were changes on the underlying system behavior in watchdog kill / multikill default
- Allows flexibility for using this in places where there isn't a default PANIC afterwards
Given that we're planning to make this also a default action on platform where it's supported I'd likely make the following change in the next PR:
- If we're doing kill or multikill, then install the abort action as the final watchdog action for either of those events.
I could possibly see when this has full support across platforms that we'd remove the watchdog's default panic, since it'd end up being dead code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK that make sense. Can you summarize this comment in the code? Thank you.
/wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Can you merge main? It should fix coverage. |
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Sorry coverage failure looks legit. Can you check? /wait |
the LCOV from the CI is: https://storage.googleapis.com/envoy-pr/12860/coverage/source/extensions/watchdog/abort_action/abort_action.cc.gcov.html But if we look at the tests, we do "cover" those lines, but they end up being wrapped in I will do a per extension percentage override to work around it. |
…H tests. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think death tests sometimes do count for coverage, but I don't know when/how that happens. It would be nice to fix this at some point but LGTM for now!
Still broken. :( /wait |
Ah, 🤦 the comment lines aren't "covered"... so the percentages have dropped slightly. |
… line coverage percent. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
947bd06
Signed-off-by: Kevin Baichoo kbaichoo@google.com
For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md
Commit Message: Watchdog action that will signal a particular thread to abort.
Additional Description:
Risk Level: medium
Testing: Unit tests
Docs Changes: Included
Release Notes: Included
Issue: #11388