Clean Shutdown on Debug Builds (Valgrind-friendly) #2535

Closed

Conversation

AE1020
Contributor

@AE1020 AE1020 commented May 1, 2021

#119 expresses the desire to use Valgrind. However, Firecracker's only way of exiting the process was to call unsafe { libc::_exit(...) } from within nested stack frames on the main thread. This means destructors never run for anything on other threads, or for anything higher up the main thread's stack. It also means there may be many shutdown paths with faulty logic that are never tested. (Such was the case with Vmm's Drop and VcpuHandle's Drop, which could never run due to reference cycles.)
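
(For illustration only, not Firecracker code: a minimal sketch, assuming the libc crate, of why leaving via libc::_exit() defeats destructor-based accounting, while unwinding back out of main() lets Drop run.)

    // A Drop type whose cleanup is silently skipped when the process leaves
    // via libc::_exit().
    struct Tracked(Vec<u8>);

    impl Drop for Tracked {
        fn drop(&mut self) {
            println!("released {} bytes", self.0.len()); // never runs on the _exit() path
        }
    }

    fn main() {
        let _t = Tracked(vec![0u8; 4096]);
        if std::env::args().any(|a| a == "--abrupt") {
            // Leaves immediately: no Drop, no Rust exit handlers, and Valgrind
            // sees the allocation graph exactly as it stood here.
            unsafe { libc::_exit(0) };
        }
        // Falling off the end of main() unwinds the stack and runs Drop.
    }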

However, abrupt exits are faster in release builds, where one is not looking for leaks. Also, adding clean exit mechanics means creating new code paths which may deadlock and fail to quit as reliably.

This PR attempts to bring the best of both approaches by running a Valgrind-friendly clean exit mode in debug builds while keeping the abrupt exit as the default elsewhere.

The result of running Valgrind on the Alpine Linux image through boot and shutdown is:

==52247== LEAK SUMMARY:
==52247==    definitely lost: 0 bytes in 0 blocks
==52247==    indirectly lost: 0 bytes in 0 blocks
==52247==      possibly lost: 0 bytes in 0 blocks
==52247==    still reachable: 8,594 bytes in 10 blocks
==52247==         suppressed: 0 bytes in 0 blocks

Here are backtraces of the still reachable allocations. They may not represent actual leaks; if they don't, they should go in a suppression list. As examples get larger, having a baseline gives much better odds of discerning what is a leak and what is not.

Description of Changes

The changes are broken into 5 parts, each with a commit message detailing what it does:

  • Basic methodology for bubbling an ExitCode up from the point where an exit condition is detected all the way to main(), so that all destructors on the main thread can run. (Although they could run in theory, some didn't in practice, due to reference cycles which needed resolving.) A minimal sketch of this shape follows the list.
  • Graceful exit mechanics for the API server thread.
  • Graceful exit mechanics for the VCPU threads.
  • Addition of a --cleanup-level switch, reintroducing the original exit semantics as the default behavior.
  • Limiting the clean exit behavior to the debug build via #[cfg(debug_assertions)] annotations.
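
(A minimal sketch, not the actual PR diff, of the shape of the first and last items above: run_event_loop() is a placeholder name, the event loop returns an ExitCode instead of exiting from deep inside the stack, and main() picks the final exit path. The PR gates this with #[cfg(debug_assertions)] attributes; the sketch uses the cfg! macro for brevity and assumes the libc crate, which is already a Firecracker dependency.)

    // Sketch only; names are placeholders, not the actual Firecracker code.
    type ExitCode = u8;

    fn run_event_loop() -> ExitCode {
        // ...dispatch events; a subscriber eventually reports an exit code
        // instead of calling libc::_exit() from deep inside the stack...
        0
    }

    fn main() {
        // By the time this returns, the nested scopes have unwound and their
        // destructors have run.
        let exit_code = run_event_loop();

        if cfg!(debug_assertions) {
            // Debug builds: normal Rust exit, friendly to Valgrind.
            std::process::exit(i32::from(exit_code));
        } else {
            // Release builds keep the original abrupt exit semantics.
            unsafe { libc::_exit(i32::from(exit_code)) };
        }
    }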

Open questions and comments are inline in the code.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • [N/A] Any API changes are reflected in firecracker/swagger.yaml.
  • Any user-facing changes are mentioned in CHANGELOG.md.

Regarding testing, my VirtualBox setup is very bad at running it... sending this PR for discussion and so that CI runs it.

@AE1020 AE1020 mentioned this pull request May 1, 2021
@AE1020 AE1020 force-pushed the valgrind-clean-shutdown branch 3 times, most recently from 4434d2c to deb7b04 on May 1, 2021 21:07
AE1020 added 3 commits May 1, 2021 17:37
The design of the existing main() function was such that Firecracker
never unwound the stack to the top level in order to end the program.
Instead, it would run some Firecracker-specific shutdown code and
then call `unsafe { libc::_exit() }`.

That skips some Rust-specific shutdown done by std::process::exit().
But even using that does not call destructors for the stack objects,
so it is not a "clean shutdown":

  https://doc.rust-lang.org/std/process/fn.exit.html

While a "dirty shutdown" can be a bit faster, the benefit of having
non-panic shutdowns run their finalization is that it helps with the
accounting of Valgrind and sanitizers.  It's a good sanity check to
make sure the program could exit cleanly if it wanted to.

This commit attempts a somewhat minimally-invasive means of bubbling
up an "ExitCode".  Exiting was managed uniquely by the Vmm event
subscriber during the process() callback, and required running the
vmm.stop() method.  So this extends the Subscriber trait with an
extended form of the callback, named process_exitable(), that
returns an optional exit code.

The event loop doesn't have a clear (to me) way to get at the VMM
to call stop() if a non-Vmm subscriber requests an exit.  So the
interface means any other subscriber that overrode process_exitable()
would miss the required stop code.

So it suggests the need for a redesign by someone more familiar with
the architecture.  However, it does accomplish a proof of concept by
getting closer to a clean bill of health from a Valgrind analysis.
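
(A rough sketch, not the actual polly code, of the trait extension described in this commit; event and event-manager parameters are omitted for brevity. Only the Vmm subscriber needs to override the new callback.)

    // Sketch of the idea, not the real polly Subscriber trait.
    type ExitCode = u8;

    trait Subscriber {
        /// Existing event callback.
        fn process(&mut self);

        /// Extended form of the callback: does the same work, but may report
        /// that the event loop should wind down with the given exit code.
        fn process_exitable(&mut self) -> Option<ExitCode> {
            self.process();
            None
        }
    }

    struct Vmm;

    impl Subscriber for Vmm {
        fn process(&mut self) { /* handle events */ }

        fn process_exitable(&mut self) -> Option<ExitCode> {
            // On an exit event, the Vmm runs its stop() logic and bubbles the
            // exit code up to the event loop, which returns it toward main().
            Some(0)
        }
    }
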
The general presumption for Firecracker was that it would be exited
via an exit() call.  This meant there was no way to signal threads
to exit their loops so they could be join()'d to release their
resources.

Not being able to exit the stack normally (if one wants to) misses
a sanity check that shutting down cleanly provides.  The output of
Valgrind and other tools is also less useful.

Because the API Server thread is in a loop that does a blocking
wait on a socket, the only way to tell it to stop looping is via
information on that socket.  This uses an internal HTTP GET signal
to request that shutdown.  Rather than forget the thread handle
and the socket bind location, the main thread now sends the
shutdown request and then waits while the thread joins.

The request is for internal use only, however--because it is
supposed to happen after the Vmm has been cleaned up and its
contents destructed.

A different approach for signaling the thread for this purpose
could be used, e.g. to have the API server thread waiting on
multiple events (the socket or this internal signal).  But since
this is only needed for clean shutdown, it is probably good enough.
A faster shutdown using exit() would be a better default--this
will likely only be used for debugging and Valgrind/sanitization.
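
(A sketch of the shutdown handshake described in this commit, using hypothetical names: shutdown_api_server(), api_socket_path, and api_thread_handle are placeholders, and "/shutdown-internal" is the internal path discussed later in this thread. The main thread wakes the blocking server loop by sending it one last request, then joins the thread.)

    use std::io::Write;
    use std::os::unix::net::UnixStream;
    use std::thread::JoinHandle;

    // Hypothetical helper, not the PR's actual code.
    fn shutdown_api_server(api_socket_path: &str, api_thread_handle: JoinHandle<()>) {
        // The server thread blocks waiting for a request on the socket, so the
        // only way to wake it is to send one.  This request is internal-only:
        // it is issued after the Vmm has already been cleaned up.
        if let Ok(mut sock) = UnixStream::connect(api_socket_path) {
            let _ = sock.write_all(b"GET /shutdown-internal HTTP/1.1\r\n\r\n");
        }

        // With the loop told to stop, the thread can be joined so that its
        // resources are released before the process exits.
        let _ = api_thread_handle.join();
    }
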
This continues the work from the prior commit toward the goal of
releasing all resources with a clean shutdown.  The VCPU threads in
non-test configurations were running with an unbreakable wait, so
the only way to get past them was with exit():

    // Wait indefinitely
    // The VMM thread will kill the entire process
    let barrier = Barrier::new(2);
    barrier.wait();

Here that is changed, along with several other entangled issues.
One was that Drop methods were used to try and clean up the Vmm
and the VcpuHandles, but these were not being called.  The reason
is that there was a cyclic dependency: threads held onto the Vmm,
so the Vmm could not exit VCPU threads in its Drop method (as it
wouldn't be dropped so long as the VCPU threads were around).

Another complexity surrounds defining the difference between an
exited thread (from which an exit event can be read out of the
channel) and a finished thread (which tears down the channel).
Trying to send a message to a thread which has finished will
panic, and some CPUs might be in the exit state while others may
not...and reading the "exit state" by taking the message out of
the channel removes it.  It's a bit complicated.

This commit tries to move toward simplification by being explicit
about the Vmm.stop() call being responsible for exiting the VCPUs,
and making the VcpuHandle's Drop method only do the simple task
of joining an exited thread.
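
(A compressed sketch of the arrangement this commit moves toward; not the PR's actual types, which carry more state. Vmm::stop() is explicitly responsible for telling each VCPU thread to exit, and VcpuHandle's Drop only joins a thread that has already been asked to finish.)

    use std::sync::mpsc::Sender;
    use std::thread::JoinHandle;

    enum VcpuEvent {
        Exit,
    }

    struct VcpuHandle {
        event_sender: Sender<VcpuEvent>,
        thread: Option<JoinHandle<()>>,
    }

    impl Drop for VcpuHandle {
        fn drop(&mut self) {
            // Only joins; by the time Drop runs, Vmm::stop() has already asked
            // the thread to exit, so this should not block indefinitely.
            if let Some(handle) = self.thread.take() {
                let _ = handle.join();
            }
        }
    }

    struct Vmm {
        vcpu_handles: Vec<VcpuHandle>,
    }

    impl Vmm {
        fn stop(&mut self) {
            // Explicitly responsible for exiting the VCPUs, which breaks the
            // former cycle where Drop was expected to do it but never ran.
            for handle in &self.vcpu_handles {
                let _ = handle.event_sender.send(VcpuEvent::Exit);
            }
            self.vcpu_handles.clear(); // drops each handle, joining its thread
        }
    }
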
@AE1020 AE1020 force-pushed the valgrind-clean-shutdown branch 3 times, most recently from fc1c9bf to c63c22c on May 1, 2021 22:32
@acatangiu
Contributor

acatangiu commented May 4, 2021

I looked at the proposed changes and I like the consistent and structured teardown mechanisms added here 👍

However, there is a part of the story which this PR doesn't consider - it is also the reason for the current abrupt shutdown mechanism:

Seccomp/syscall filtering: Firecracker has multiple layers of security hardening and workload isolation [1] - one of these layers is syscall filtering through seccomp functionality. The default/recommended filter for production workloads is a tight allow-list that doesn't include some syscalls made by Rust during thread exit (I believe rt_sigprocmask is first to hit the filter) - meaning that when the default seccomp filter is installed, clean teardown will result in a seccomp violation panic.
The PR often references teardown speed/latency as a factor in the design - it is not; seccomp is. Process teardown should not be on any critical path, afaict.

Having a runtime knob for enabling/disabling the clean teardown mechanism introduces compatibility issues with the seccomp knob and filters, resulting in a complicated matrix of behavior depending on configuration.

Ideally we want to avoid that!

My recommendation on approaches to investigate (in order of preference):

  1. Take another hard look at the missing syscalls from our default/recommended seccomp allow-list, that prevent proper thread cleanup and exiting. If we consider them safe enough to be included, drop the abrupt exit path and do clean exiting along the lines of this PR. (For example, rt_sigprocmask looks safe to me for allow-list inclusion - it only changes signal mask for current thread, does not affect system or other processes.)
  2. Don't change seccomp filters, keep abrupt exit to be compatible with seccomp filters, add clean exit as a compile-time option and only use it for valgrind and other testing tools. (This option might result in very ugly code.)
  3. Even if we do none of the above and just forget about running valgrind, some of the cleanup work proposed by this PR (like fixing the Vmm<->Vcpu cyclic dependency) is still valuable and could go in regardless.

[1] - https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md#threat-containment

P.S. looks like unit-tests are hanging in the current PR - do they pass for you locally?

In order to mitigate the risks of deadlock during shutdown in
production builds, this adds back the behavior of exiting the
process abruptly after writing metrics files.  This also offers
better performance by letting the OS deal with cleaning up
memory and resources.

This limits the behavior of unwinding all the way up to main()
and joining the threads to debug builds.  The policy could be
replaced by an independent flag (if someone thought it important
to run Valgrind on release builds, e.g. because it was too slow
using it with debug ones).
@AE1020
Contributor Author

AE1020 commented May 4, 2021

Even if we do none of the above and just forget about running valgrind, some of the cleanup work proposed by this PR (like fixing the Vmm<->Vcpu cyclic dependency) is still valuable and could go in regardless.

The Vmm<->Vcpu issue suggests what I think is kind of a bigger point about the value of clean shutdown. It doesn't just let you check Valgrind leaks; it's a "systemic check of assumptions". Knowing you can unwind everything successfully helps build confidence that nothing got entangled while winding it up. It also means you can shut down the system fully and start it back up within the same process when you have a reason to do so.

when default seccomp filter is installed the clean teardown will result in a seccomp violation panic.

Ah. Valgrind cannot run at all with seccomp filtering. So whenever I was running --cleanup-level 2, I happened to also be running it with --seccomp-level 0, and thus did not get bitten by this.

On the bright side: just changing that final moment of clean shutdown from std::process::exit() to unsafe { libc::_exit() } only makes a tiny difference... 16 "still reachable" blocks instead of 10. That's not bad. This particular statistic will probably never go to zero, and hopefully won't grow on larger cases. So once you've read through them and deemed them ok, you can make a suppression list.

Hence I've changed it so that whether std::process::exit() is called depends on whether seccomp filtering was enabled.
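
(A sketch of that change; final_exit() and seccomp_enabled are placeholder names standing in for however the configured filter level is visible at this point, and the libc crate is assumed.)

    // Sketch only, not the PR's actual code.
    fn final_exit(exit_code: u8, seccomp_enabled: bool) -> ! {
        if seccomp_enabled {
            // The allow-list blocks syscalls made by Rust's normal exit path,
            // so leave abruptly, as before, to avoid a seccomp violation.
            unsafe { libc::_exit(i32::from(exit_code)) }
        } else {
            // No filter installed (e.g. a Valgrind run): the normal exit only
            // leaves a handful of "still reachable" blocks behind.
            std::process::exit(i32::from(exit_code))
        }
    }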

Having runtime knob for (dis/en)abling the clean teardown mechanism introduces compatibility issues with the seccomp knob and filters, resulting in a complicated matrix of behavior depending on configuration.

There already was such a matrix...but in a way that created mechanics that only existed in tests, with no way of occurring in the actual program. See the #[cfg(test)] exit state for the VCPU and the #[cfg(not(test))] exit state for the VCPU. That's good to get rid of.

Of course no matrix at all is even better. But not defaulting to forcing an exit creates risks of running into deadlocks if there are bugs in the shutdown process...which could be bad in deployment. I was trying to think of how to mitigate that.

It sounds like you would prefer this being a compile-time flag. So I've removed --cleanup-level and changed it so that whether exit() runs abruptly or not is controlled by #[cfg(debug_assertions)].

What you lose in this exchange is an easy way to do "valgrind on a release build", which could be a separate flag for those who want it. But making all debug builds shut down cleanly regardless means the behavior is part of a release/debug distinction that's already tested.

looks like unit-tests are hanging in current PR - do they pass for you locally?

I don't have a bare metal Linux machine at present, and am running inside of VirtualBox. Hence tests have tons of failures on the mainline branch, and I need the CI output generated by the PR to see the "real" test failures. (Additionally, my laptop may have been good when I got it, but it's 5 years old now, and it's a laptop running VMs inside of VMs...so using your CI gives significant turnaround benefit; I'm not sure how long running the test would take, as I've never finished it. Hours?)

The hanging failure is a very generic message of "test runner failed" that points to the top level of something. I don't know how to translate that into an invocation I can run myself and debug. If you can provide a step-by-step of how to turn an error like that into something actionable, I could look at it.

Adding the shutdown code to the debug build seems to have caused a performance regression. Are debug builds being tested for their shutdown times?

@AE1020 AE1020 changed the title from "--cleanup-level switch for Valgrind-friendly exit mode" to "Clean Shutdown on Debug Builds (Valgrind-friendly)" May 4, 2021
@acatangiu
Contributor

acatangiu commented May 6, 2021

Let me take a look at this:

Take another hard look at the missing syscalls from our default/recommended seccomp allow-list, that prevent proper thread cleanup and exiting. If we consider them safe enough to be included, drop the abrupt exit path and do clean exiting along the lines of this PR. (For example, rt_sigprocmask looks safe to me for allow-list inclusion - it only changes signal mask for current thread, does not affect system or other processes.)

The ideal state would be to have clean teardown on release builds as well and not juggle multiple binaries. Let's aim for the ideal scenario first 😄

In the meantime, @firecracker-microvm/compute-capsule please take a look at the proposed changes for bubbling up errors, retcodes, exit reasons, etc.

@andreeaflorescu if you get a chance, please take a look at the polly changes. Would they be compatible with https://github.com/rust-vmm/event-manager/ ?

@acatangiu
Contributor

acatangiu commented May 12, 2021

@AE1020 I've looked into filters and found that we can safely allow-list the missing few required for clean teardown.
Please see commit acatangiu@e40b1e2 which is on top of your changes and allows clean teardown even with seccomp filtering enabled.

In this context, I suggest we keep your clean teardown changes but remove all branching between debug and release builds. All builds can just do clean teardown. Wdyt?

@andreeaflorescu
Member

@andreeaflorescu if you get a chance, please take a look at the polly changes. Would they be compatible with https://github.com/rust-vmm/event-manager/ ?

This PR introduces a big change, and I would need more time to provide a helpful review.

A few points that I've observed looking quickly through the code:

  • I noticed some asserts in the polly implementation; those should be avoided in library crates
  • We're using ExitCode, which is a u8; that is atypical for library crates in Rust, where I was expecting some sort of error to be returned instead

We're implementing clean exit in rust-vmm/vmm-reference (it still has some loose ends that we need to sort out), but we're doing that using the existing event-manager interface. We didn't encounter any problems so far. I am not sure if this is the same problem that you're trying to solve here, but we needed a way to exit from the event manager run loop. This was implemented through an ExitHandler which we check before re-entering the loop. The code is here: https://github.com/rust-vmm/vmm-reference/blob/master/src/vmm/src/lib.rs#L282
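
(Not the vmm-reference code itself, just a sketch of the shape of the idea: an exit handler backed by an atomic flag that the run loop checks before each iteration.)

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    // Conceptual sketch; the real vmm-reference ExitHandler is a trait.
    #[derive(Clone, Default)]
    struct ExitFlag(Arc<AtomicBool>);

    impl ExitFlag {
        fn request_exit(&self) {
            self.0.store(true, Ordering::Release);
        }

        fn should_exit(&self) -> bool {
            self.0.load(Ordering::Acquire)
        }
    }

    fn run_loop(exit: &ExitFlag) {
        // Checked before re-entering the event-dispatch loop, as described
        // for vmm-reference above.
        while !exit.should_exit() {
            // event_manager.run() would go here; a subscriber that wants to
            // stop the VMM calls exit.request_exit() from its handler.
        }
    }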

@acatangiu
Contributor

@andreeaflorescu I didn't formulate my question properly, sorry.

Without going into detailed changes of polly in this PR, I am more interested in whether you think event-manager could conceptually have user-provided errors/break-conditions. That way a Subscriber can also bubble up errors and/or special return values originating in their event handling.

Doing so would avoid needing side-channel communication like this: https://github.com/rust-vmm/vmm-reference/blob/master/src/vmm/src/lib.rs#L282 ...

Anyway, I think we should move this conversation to a feature-request issue on https://github.com/rust-vmm/event-manager or similar.

FYI @AE1020 question above is not a blocker for current PR, it's us thinking ahead to when we switch polly for rust-vmm/event-manager.

@andreeaflorescu
Member

It seems like a fair request to be able to break out of the event run loop. We can discuss about how to design this in rust-vmm/event-manager as you suggested @acatangiu.

@AE1020
Contributor Author

AE1020 commented May 16, 2021

I'm certainly fine with however you see fit to change it. (This was just my first attempt to program in Rust, which I wanted to do on a project where the security aspects were relevant to why the language was chosen. And I wanted it to be something real instead of just make-work. It has thus served the educational purpose I had...I don't mind if you want to take over and do it a totally different way.)

As the comments mention, the HTTP "/shutdown-internal" request I added is really just a hack because there was no other way to stop the API thread. If there were some other channel of communication and shutdown could be negotiated without that, a real user-exposed endpoint to ask for shutdown seems a useful thing in general.

I'm actually in the middle of moving this week 🚚, and have some other coding commitments.

But if you have something specific you'd like me to look at let me know. As I mentioned, I don't really know how to create a command line corresponding to a test failure to debug. And it would all be different if the shutdown mechanism is changed.

@sandreim
Contributor

sandreim commented May 20, 2021

I agree with @andreeaflorescu that breaking the event loop is a fair feature request. Clean shutdown is something that has clear value outside the scope of Valgrind friendliness, but I think it requires more design work as it does touch external components like EventManager. We should have that discussion in https://github.com/rust-vmm/event-manager as we plan to use it as a Polly replacement.

@AE1020 @acatangiu WDYT about a less invasive approach here, like panicking (panic!) all threads so destructors are still called? We should check how memory usage is registered in Valgrind if we exit the VMM this way.

@acatangiu
Contributor

@AE1020 @acatangiu WDYT about a less invasive approach here, like panicking (panic!) all threads so destructors are still called? We should check how memory usage is registered in Valgrind if we exit the VMM this way.

I am working on a variation of this PR that does clean shutdown [1]. The actual end result should be an improvement in Firecracker code complexity, not a degradation, since things then naturally fit the RAII principle and are easier to model mentally.

Before publishing the PR I want to also add Rust sanitizer tests which would be available for us to run when doing clean teardown; there is still some work left to do, and it's currently on the back-burner. I'm estimating to have an RFC by the end of next week.

[1] https://github.com/acatangiu/firecracker/tree/clean-shutdown

@andreeaflorescu
Member

@acatangiu it looks like we are trying to achieve the same things in Firecracker, so I was wondering if you got the chance to look at how clean shutdown is implemented in vmm-reference? At a first glance it looks like it is fairly easy to port the changes to Firecracker, and we wouldn't need to implement it from scratch. The vmm-reference is not using all the constructs that Firecracker does (i.e. it does not yet implement a way of sending messages between vCPUs), but we are planning to add this support when upstreaming vm-vcpu. Since it looks like clean shutdown is something that we want to merge in Firecracker faster than we would consume something like vm-vcpu, I wonder if we could still align the implementations, or maybe clearly state why they're incompatible.

@acatangiu
Contributor

Closing this in favor of #2599

@acatangiu acatangiu closed this May 28, 2021
@acatangiu
Contributor

acatangiu commented May 28, 2021

Since it looks like clean shutdown is something that we want to merge in Firecracker faster than we would consume something like vm-vcpu, I wonder if we could still align the implementations, or maybe clearly state why they're incompatible.

@andreeaflorescu please see #2599: I have used the same model of breaking the main event loop as used by vmm-reference, so the two implementations are now aligned from that perspective. Regarding the Vcpu exit handler, I've chosen to keep a simple EventFd instead of the complicated generic construct vmm-reference is using that allows customizable exit methods. Even so, the two implementations are equivalent, so we can easily consume the generic mechanism in upstream vm-vcpu when it becomes available.

Later edit: everything else is Firecracker-specific IMO (api logic and dedicated thread, rpc-interface, vmm-vcpu messaging, integration tests, seccomp filters, application glue code, etc).
