-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean Shutdown on Debug Builds (Valgrind-friendly) #2535
Conversation
4434d2c
to
deb7b04
Compare
The design of the existing main() function was such that Firecracker never unwound the stack to the top level in order to end the program. Instead, it would run some Firecracker-specific shutdown code and then call `unsafe { libc::_exit() }`. That skips some Rust-specific shutdown done by std::process::exit(). But even using that does not call destructors for the stack objects, so it is not a "clean shutdown": https://doc.rust-lang.org/std/process/fn.exit.html While a "dirty shutdown" can be a bit faster, the benefit of having non-panic shutdowns run their finalization is that it helps with the accounting of Valgrind and sanitizers. It's a good sanity check to make sure the program could exit cleanly if it wanted to. This commit attempts a somewhat minimally-invasive means of bubbling up an "ExitCode". Exiting was managed uniquely by the Vmm event subscriber during the process() callback, and requires running the vmm.stop() method. So this extends the Subscriber trait with an extended form of callback named process_exitable() that returns an optional exit code. The event loop doesn't have a clear (to me) way to get at the VMM to call stop() if a non-Vmm subscriber requests an exit. So the interface means any other subscriber that overrode process_exitable() would miss the required stop code. So it suggests the need for a redesign by someone more familiar with the architecture. However, it does accomplish a proof of concept by getting closer to a clean bill of health from a Valgrind analysis.
The general presumption for Firecracker was that it would be exited via an exit() call. This meant there was no way to signal threads to exit their loops so they could be join()'d to release their resources. Not being able to exit the stack normally (if one wants to) misses a sanity check that shutting down cleanly provides. The output of Valgrind and other tools is also less useful. Because the API Server thread is in a loop that does a blocking wait on a socket, the only way to tell it to stop looping is via information on that socket. This uses an internal HTTP GET signal to request that shutdown. Rather than forget the thread handle and the socket bind location, the main thread now sends the shutdown request and then waits while the thread joins. The request is for internal use only, however--because it is supposed to happen after the Vmm has been cleaned up and its contents destructed. A different approach for signaling the thread for this purpose could be used, e.g. to have the API server thread waiting on multiple events (the socket or this internal signal). But since this is only needed for clean shutdown, it is probably good enough. A faster shutdown using exit() would be a better default--this will likely only be used for debugging and Valgrind/sanitization.
This continues the work from the prior commit toward the goal of releasing all resources with a clean shutdown. The VCPU threads in non-test configurations were running with an unbreakable wait, so they only way to get past them was with exit(): // Wait indefinitely // The VMM thread will kill the entire process let barrier = Barrier::new(2); barrier.wait(); Here that is changed, along with several other entangled issues. One was that Drop methods were used to try and clean up the Vmm and the VcpuHandles, but these were not being called. The reason is that there was a cyclic dependency: threads held onto the Vmm, so the Vmm could not exit VCPU threads in its Drop method (as it wouldn't be dropped so long as the VCPU threads were around). Another complexity surrounds defining the difference between an exited thread (from which an exit event can be read out of the channel) and a finished thread (which tears down the channel). Trying to send a message to a thread which has finished will panic, and some CPUs might be in the exit state while others may not...and reading the "exit state" by taking the message out of the channel removes it. It's a bit complicated. This commit tries to move toward simplification by being explicit about the Vmm.stop() call being responsible for exiting the VCPUs, and making the VcpuHandle's Drop method only do the simple task of joining an exited thread.
fc1c9bf
to
c63c22c
Compare
I looked at the proposed changes and I like the consistent and structured teardown mechanisms added here 👍 However, there is a part of the story which this PR doesn't consider - it is also the reason for the current abrupt shutdown mechanism: Seccomp/syscall filtering: Firecracker has multiple layers of security hardening and workload isolation [1] - one of these layers is syscall filtering through seccomp functionality. The default/recommended filter for production workloads is a tight allow-list that doesn't include some syscalls done by Rust during thread exit (I believe Having runtime knob for (dis/en)abling the clean teardown mechanism introduces compatibility issues with the seccomp knob and filters, resulting in a complicated matrix of behavior depending on configuration. Ideally we want to avoid that! My recommendation on approaches to investigate (in order of preference):
P.S. looks like unit-tests are hanging in current PR - do they pass for you locally? |
In order to mitigate the risks of deadlock during shutdown in production builds, this adds back the behavior of exiting the process abruptly after writing metrics files. This also offers better performance by letting the OS deal with cleaning up memory and resources. This limits the behavior of unwinding all the way up to main() and joining the threads to debug builds. The policy could be replaced by an independent flag (if someone thought it important to run Valgrind on release builds, e.g. because it was too slow using it with debug ones).
c63c22c
to
314da6c
Compare
The Vmm<->Vcpu issue suggests what I think is kind of a bigger point about the value of clean shutdown. It doesn't just let you check Valgrind leaks, it's a "systemic check of assumptions". Knowing you can unwind everything successfully helps build confidence that nothing got entangled while winding it up. It also means you can shutdown the system fully and start it back up within the same process when you have a reason to do so.
Ah. Valgrind cannot run at all with seccomp filtering. So whenever I was running On the bright side: just changing that final moment during clean shutdown from
There already was such a matrix...but in a way that created mechanics that only existed in tests, with no way of happening in the actual program. See the #[cfg(test)] exit() state for VCPU and the #[cfg(not(test))] exit state for VCPU. That's good to get rid of. Of course no matrix at all is even better. But not defaulting to forcing an exit creates risks of running into deadlocks if there are bugs in the shutdown process...which could be bad in deployment. I was trying to think of how to mitigate that. It sounds like you would prefer this being a compile-time flag. So I've removed What you lose in this exchange is an easy way to do "valgrind on release build". Which could be a separate flag for those who want it. But making all debug builds shut down cleanly regardless means the behavior is part of a release/debug distinction that's already tested.
I don't have a bare metal Linux machine at present, and am running inside of VirtualBox. Hence tests have tons of failures on the mainline branch, and I need the CI output generated by the PR to see the "real" test failures. (Additionally, my laptop may have been good when I got it, but it's 5 years old now, and it's a laptop running VMs inside of VMs...so using your CI gives significant turnaround benefit; I'm not sure how long running the test would take, as I've never finished it. Hours?) The hanging failure is a very generic message of "test runner failed" that points to the top level of something. I don't know how to translate that into an invocation I can run myself and debug. If you can provide a step-by-step of how to turn an error like that into something actionable, I could look at it. Adding the shutdown code to the debug build seems to have caused performance regression. Are debug builds being tested for their shutdown times? |
Let me take a look at this:
Ideal state would be to have clean teardown on release builds as well and not juggle multiple binaries. Let's aim for ideal scenario first 😄 In the meantime, @firecracker-microvm/compute-capsule please take a look at the proposed changes for bubbling up errors, retcodes, exit reasons, etc. @andreeaflorescu if you get a chance, please take a look at the |
@AE1020 I've looked into filters and found that we can safely allow-list the missing few required for clean teardown. In this context, I suggest we keep your clean teardown changes but remove all branching between |
This PR introduces a big chance, and I would need more time to provide a helpful review. A few points that I've observed looking quickly through the code:
We're implementing clean exit in rust-vmm/vmm-reference (it still has some loose ends that we need to sort out), but we're doing that using the event-manager existing interface. We didn't encounter any problems so far. I am not sure if this is the same problem that you're trying to solve here, but we needed a way to exit from the event manager run loop. This was implemented through an |
@andreeaflorescu I didn't formulate my question properly, sorry. Without going into detailed changes of Doing so would avoid needing side-channels communication like this: https://github.com/rust-vmm/vmm-reference/blob/master/src/vmm/src/lib.rs#L282 ... Anyway, I think we should move this conversation to a feature-request issue on https://github.com/rust-vmm/event-manager or similar. FYI @AE1020 question above is not a blocker for current PR, it's us thinking ahead to when we switch |
It seems like a fair request to be able to break out of the event run loop. We can discuss about how to design this in rust-vmm/event-manager as you suggested @acatangiu. |
I'm certainly fine with however you see fit to change it. (This was just my first attempt to program in Rust, which I wanted to do on a project where the security aspects were relevant to why the language was chosen. And I wanted it to be something real instead of just make-work. It has thus served the educational purpose I had...I don't mind if you want to take over and do it a totally different way.) As the comments mention, the HTTP "/shutdown-internal" request I added is really just a hack because there was no other way to stop the API thread. If there were some other channel of communication and shutdown could be negotiated without that, a real user-exposed endpoint to ask for shutdown seems a useful thing in general. I'm actually in the middle of moving this week 🚚, and have some other coding commitments. But if you have something specific you'd like me to look at let me know. As I mentioned, I don't really know how to create a command line corresponding to a test failure to debug. And it would all be different if the shutdown mechanism is changed. |
I agree with @andreeaflorescu that breaking the eventloop is a fair feature request. Clean shutdown is something that has clear value outside the scope of Valgrind friendlyness, but I think requires more design work as it does touch external components like EventManager. We should have that discussion in https://github.com/rust-vmm/event-manager as we plan to use it as a Polly replacement. @AE1020 @acatangiu WDYT about a less invasive approach here like panicing! all threads so destructors are still called. We should check how memory usage is registered in Valgrind if we exit the VMM this way. |
I am working on a variation of this PR that does clean shutdown [1]. The actual end result should be an improvement in firecracker code complexity not a degradation since things then naturally fit the RAII principle and are easier to mentally model. Before publishing the PR I want to also add rust sanitizer tests which would be available for us to run when doing clean teardown; there is still some work left to do, and it's currently on the back-burner, I'm estimating to have an RFC end of next week. [1] https://github.com/acatangiu/firecracker/tree/clean-shutdown |
@acatangiu it looks like we are trying to achieve the same things in Firecracker, so I was wondering if you got the chance to look at how clean shut down is implemented in vmm-reference? At a first glance it looks like it is fairly easy to port the changes in Firecracker, and we wouldn't need to implement it from scratch. The vmm-reference is not using all the constructs that Firecracker does (i.e. it does not yet implement a way of sending messages between vCPUs), but we are planning to add this support when upstreaming vm-vcpu. Since it looks like clean shut down is something that we want to merge in Firecracker faster than we would consume something like vm-vcpu, I wonder if we could still align the implementations, or maybe clearly state why they're incompatible. |
Closing this in favor of #2599 |
@andreeaflorescu please see #2599 : I have used the same model of breaking the main event loop as used by vmm-reference, so the two implementations are now aligned from that perspective. Regarding Vcpu exit handler, I've chosen to keep a simple EventFd instead of the complicated generic construct vmm-reference is using that allows customizable exit methods - even so, the two implementations are equivalent so we can easily consume the generic mechanism in upstream vm-vcpu when it will be available. L.E. everything else is Firecracker-specific IMO (api logic and dedicated thread, rpc-interface, vmm-vcpu messaging, integration tests, seccomp filters, application glue code, etc). |
#119 expresses the desire to use Valgrind. However, Firecracker's only modality of exiting the process was from within nested stacks using
unsafe { libc::_exit(...) }
from the main thread. This would mean not running destructors for anything on other threads, or above the stack on the main thread. It also means that there may be many shutdown paths which have faulty logic and are never being tested. (Such was the case withVmm's Drop
andVcpuHandle's Drop
, which could never run due to reference cycles.)However, abrupt exits are faster in release build situations, where one is not looking for leaks. Also, adding clean exit mechanics means creating new code paths which may deadlock and not quit as reliably.
This PR attempts to bring the best of both approaches
by adding aby running a Valgrind-friendly exit mode in debug builds.--cleanup-level
switchThe result of Valgrind on running the Alpine Linux image through boot and shutdown is:
Here are backtraces of the still reachable allocations. They may not represent leaks, and should be in a suppression list if they don't. As examples get larger, having a baseline gives much better odds of discerning what's a leak and what is not.
Description of Changes
The changes are broken into 4 parts, each with a commit description detailing what they do:
Addition of a--cleanup-level
switch, reintroducing the original exit semantics as the default behavior.#[cfg(debug_assertions)]
annotations.Open questions and comments are inline in the code.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.
PR Checklist
git commit -s
).unsafe
code is properly documented.firecracker/swagger.yaml
.CHANGELOG.md
.Regarding testing, my VirtualBox is very bad at that... sending PR for discussion and so that CI runs it.