Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watchdog action that will signal a particular thread to abort. #12860

Merged
merged 14 commits into from
Sep 16, 2020

Conversation

KBaichoo
Copy link
Contributor

@KBaichoo KBaichoo commented Aug 27, 2020

Signed-off-by: Kevin Baichoo kbaichoo@google.com

For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md

Commit Message: Watchdog action that will signal a particular thread to abort.
Additional Description:
Risk Level: medium
Testing: Unit tests
Docs Changes: Included
Release Notes: Included
Issue: #11388

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #12860 was opened by KBaichoo.

see: more, trace.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing this, it should make it easier to get the stack of the stuck thread when the guarddog triggers.

docs/root/version_history/current.rst Outdated Show resolved Hide resolved
source/extensions/watchdog/abort_action/BUILD Outdated Show resolved Hide resolved
source/extensions/watchdog/abort_action/abort_action.cc Outdated Show resolved Hide resolved
source/extensions/watchdog/abort_action/config.h Outdated Show resolved Hide resolved
// Signal to test thread that tid has been set.
{
absl::MutexLock lock(&mutex_);
outstanding_notifies_ += 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using absl::Notification instead of implementing your own version.

Here you could call n.Notify();
You would call n.WaitForNotification() before code that depends on tid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::KILL, tid_ltt_pairs, now);
};

EXPECT_DEATH(die_function(), "");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there's no specific string that we can assert in the EXPECT_DEATH

Does Envoy install a failure signal handler? Should we consider testing by installing a signal handler for SIGABRT that records which thread received the SIGABRT signal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Envoy installs failure handlers; it does seem that under different compilation modes this test will report different failure strings -- specifically under ASAN it seems to trigger some asan specific code that isn't triggered in the other test below or when running this test in other models (ASAN:DEADLYSIGNAL is part of the print out)

I can try to add some extra logic in if you think it'd be important to figure out the exact string that caused death.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that right now you can't distinguish between death due to the kill signal vs the call to panic. The kill signal code could be removed and this test would still pass.


class AbortActionTest : public testing::Test {
protected:
AbortActionTest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to restore signal handlers in the test destructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since sigactions are done within EXPECT_DEATH and we end up forking, it doesn't affect the test suite process but rather the child process which will end up dying, so we shouldn't need to restore signal handlers.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Contributor Author

@KBaichoo KBaichoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the code review


// Abort from the action since the signaled thread hasn't yet crashed the process.
PANIC(
fmt::format("Failed to kill thread with id {}, aborting from AbortAction instead.", raw_tid));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3 reasons I think this would be handy (vs deferring to the underlying watchdog default actions):

  • Easier to see where the failure occurred, and that signaling failed without having to jump around as much to understand the behavior.
  • It's also keeps this self-contained if there were changes on the underlying system behavior in watchdog kill / multikill default
  • Allows flexibility for using this in places where there isn't a default PANIC afterwards

// Assume POSIX-compatible system and signal to the thread.
ENVOY_LOG_MISC(error, "AbortAction sending abort signal to thread with tid {}.", raw_tid);

if (kill(toPlatformTid(raw_tid), SIGABRT) == 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think I'll keep it simple for now given that the default envoy handlers views SIGABRT as a fatal signal.

* This is currently only implemented for systems that support kill to send
* signals.
*/
class AbortAction : public Server::Configuration::GuardDogAction {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The terminology of these two get a bit messy; in the docs and at a high level we talk about WatchDog and the Watch dog system, while in the actual implementation we have a watchdog per thread and a guarddog that manages the watchdogs and will actually be executing these functions.

I went with Watchdog since it seemed more friendly to folks who aren't digging down into the implementation details.


class AbortActionTest : public testing::Test {
protected:
AbortActionTest()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since sigactions are done within EXPECT_DEATH and we end up forking, it doesn't affect the test suite process but rather the child process which will end up dying, so we shouldn't need to restore signal handlers.

// Signal to test thread that tid has been set.
{
absl::MutexLock lock(&mutex_);
outstanding_notifies_ += 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::KILL, tid_ltt_pairs, now);
};

EXPECT_DEATH(die_function(), "");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Envoy installs failure handlers; it does seem that under different compilation modes this test will report different failure strings -- specifically under ASAN it seems to trigger some asan specific code that isn't triggered in the other test below or when running this test in other models (ASAN:DEADLYSIGNAL is part of the print out)

I can try to add some extra logic in if you think it'd be important to figure out the exact string that caused death.

@KBaichoo
Copy link
Contributor Author

KBaichoo commented Sep 1, 2020

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines, to retry CircleCI checks, use /retest-circle.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #12860 (comment) was created by @KBaichoo.

see: more, trace.

antoniovicente
antoniovicente previously approved these changes Sep 3, 2020

void handler(int sig, siginfo_t* /*siginfo*/, void* /*context*/) {
std::cout << "Eating signal :" << std::to_string(sig) << ". will ignore it." << std::endl;
signal(SIGABRT, SIG_IGN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this call to signal(SIGABRT, SIG_IGN); required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly, but it prevents subsequent printouts of Eating signal:

@KBaichoo
Copy link
Contributor Author

KBaichoo commented Sep 3, 2020

PTAL @envoyproxy/senior-maintainers

Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

// [#protodoc-title: Watchdog Action that sends a SIGABRT to kill the process.]
// [#extension: envoy.watchdog.abort_action]

// Configuration for the profile watchdog action.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment seems wrong/out of date. Note that this comment is the primary documentation for this in the generated docs. It should include some information about what this does, when/why use it, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added additional information from the action's c++ implementation as you suggested.

source/extensions/watchdog/abort_action/abort_action.cc Outdated Show resolved Hide resolved
namespace AbortAction {

/**
* A GuardDogAction that will terminate the process by sending SIGABRT to the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps some/all of this comment block should be included with the proto config, so it gets included in generated docs.

Copy link
Contributor Author

@KBaichoo KBaichoo Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
ggreenway
ggreenway previously approved these changes Sep 10, 2020
Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, unless we want to pursue making this the default.

// more useful than the default watchdog kill behaviors since those PANIC
// from the watchdog's thread.

// This is currently only implemented for systems that support kill to send
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be the default, on supported platforms? Is there any downside to this? It seems to me that it gives better information (or a chance of it), and the same net behavior (process terminated). @envoyproxy/maintainers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with that being default in a future PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also fine here if we want this, seems generally useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I can make it be a default action in a future PR.

The main downside I see of making it a default now would be different behaviors across platforms due to a lack of support on Windows. Should we wait for parity before we make it a default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to make it the default, because the only platform difference should be diagnostic output. The process is terminated on all platforms.

I'm also fine with making it the default in a future PR. It's your choice, @KBaichoo .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'll submit a follow up to this adding this action to the watchdog actions by default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to make this the default.

@KBaichoo
Copy link
Contributor Author

PTAL @envoyproxy/api-shepherds

@htuch
Copy link
Member

htuch commented Sep 11, 2020

/lgtm api

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, cool stuff. LGTM with some small comments.

/wait-any


#ifdef WIN32
// TODO(kbaichoo): add support for this with windows.
ENVOY_LOG_MISC(error, "Watchdog AbortAction is unimplemented for Windows.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually needed? I think Windows has its own extension bzl file so I'm not sure this is even compiled? Could we actually avoid the ifdef in here now or just replace with an proprocessor error if this extension gets compiled on windows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know this is a possibility, thanks for pointing it out.

I've added the extension to WINDOWS_SKIP_TARGETS which IIUC will let it be skipped by windows so I can remove the ifdef #Win32 from code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out you also need to add tags = ["skip_on_windows"] on tests since excluding the extension in the .bzl doesn't effect the tests.

Comment on lines 60 to 62
// Abort from the action since the signaled thread hasn't yet crashed the process.
PANIC(fmt::format("Failed to kill thread with id {}, aborting from Watchdog AbortAction instead.",
raw_tid));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't kept fully up to date on this, but isn't the default action at the end of the action chain to terminate the process? Do we need this or would the action chain just end up aborting anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. I discussed this with antonio up above. I've copied the reasoning below (copied from #12860 (comment))

There are 3 reasons I think this [having a panic in the action] would be handy (vs deferring to the underlying watchdog default actions[its panic there]):

  • Easier to see where the failure occurred, and that signaling failed without having to jump around as much to understand the behavior.
  • It's also keeps this self-contained if there were changes on the underlying system behavior in watchdog kill / multikill default
  • Allows flexibility for using this in places where there isn't a default PANIC afterwards

Given that we're planning to make this also a default action on platform where it's supported I'd likely make the following change in the next PR:

  • If we're doing kill or multikill, then install the abort action as the final watchdog action for either of those events.

I could possibly see when this has full support across platforms that we'd remove the watchdog's default panic, since it'd end up being dead code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK that make sense. Can you summarize this comment in the code? Thank you.

/wait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@mattklein123
Copy link
Member

Can you merge main? It should fix coverage.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
mattklein123
mattklein123 previously approved these changes Sep 15, 2020
@mattklein123
Copy link
Member

Sorry coverage failure looks legit.

Can you check?

/wait

@KBaichoo
Copy link
Contributor Author

the LCOV from the CI is: https://storage.googleapis.com/envoy-pr/12860/coverage/source/extensions/watchdog/abort_action/abort_action.cc.gcov.html

But if we look at the tests, we do "cover" those lines, but they end up being wrapped in EXPECT_DEATH since it's going to kill the process. EXPECT_DEATH code paths are run in a sub-process of the main unit testing process. The code coverage tools will never report full coverage for most of the test then.

I will do a per extension percentage override to work around it.

…H tests.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
antoniovicente
antoniovicente previously approved these changes Sep 15, 2020
mattklein123
mattklein123 previously approved these changes Sep 15, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think death tests sometimes do count for coverage, but I don't know when/how that happens. It would be nice to fix this at some point but LGTM for now!

@mattklein123
Copy link
Member

Still broken. :(

/wait

@KBaichoo
Copy link
Contributor Author

Ah, 🤦 the comment lines aren't "covered"... so the percentages have dropped slightly.

… line coverage percent.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@mattklein123 mattklein123 merged commit cd5bdc4 into envoyproxy:master Sep 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants