Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify microvm builder code and make uffd test failures easier to debug #4991

Merged
merged 5 commits into from
Jan 16, 2025

Conversation

roypat
Copy link
Contributor

@roypat roypat commented Jan 10, 2025

This PR is an assortment of 5 small changes that are only connected by the fact that I ran into 4 of them while playing around with guest_memfd in Firecracker (the odd one out is the UFFD one, which is related to #4601). Roughly, we can group these changes into 3 categories:

  • Remove an unneeded .clone() operation
  • Ensure that a panic in the example UFFD handlers does not cause pytest to hang for 5 minutes because suddenly it's waiting indefinitely for page faults to be served, by having it take down Firecracker with it (and improvement of error messages in this case)
  • Refactor some handling behind the /machine-config endpoint because the duplication of the same data/fields across VmConfig/MachineConfig was already confusing me when working on huge pages, and now that I had to touch this endpoint again I've reached my breaking point 😂

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

Copy link

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 92.24138% with 9 lines in your changes missing coverage. Please review.

Project coverage is 83.06%. Comparing base (c182544) to head (4131a3d).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
src/vmm/src/resources.rs 78.26% 5 Missing ⚠️
src/vmm/src/builder.rs 88.88% 2 Missing ⚠️
src/vmm/src/persist.rs 87.50% 1 Missing ⚠️
src/vmm/src/vmm_config/machine_config.rs 97.82% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4991      +/-   ##
==========================================
- Coverage   83.07%   83.06%   -0.01%     
==========================================
  Files         244      244              
  Lines       26634    26658      +24     
==========================================
+ Hits        22125    22144      +19     
- Misses       4509     4514       +5     
Flag Coverage Δ
5.10-c5n.metal 83.58% <92.24%> (-0.02%) ⬇️
5.10-m5n.metal 83.56% <92.24%> (-0.02%) ⬇️
5.10-m6a.metal 82.77% <92.24%> (-0.02%) ⬇️
5.10-m6g.metal 79.43% <91.37%> (-0.02%) ⬇️
5.10-m6i.metal 83.55% <92.24%> (-0.02%) ⬇️
5.10-m7g.metal 79.43% <91.37%> (-0.02%) ⬇️
6.1-c5n.metal 83.57% <92.24%> (-0.02%) ⬇️
6.1-m5n.metal 83.55% <92.24%> (-0.02%) ⬇️
6.1-m6a.metal 82.77% <92.24%> (-0.02%) ⬇️
6.1-m6g.metal 79.43% <91.37%> (-0.01%) ⬇️
6.1-m6i.metal 83.55% <92.24%> (-0.02%) ⬇️
6.1-m7g.metal 79.42% <91.37%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@roypat roypat force-pushed the misc-changes-7 branch 2 times, most recently from 60e6a40 to 8e83347 Compare January 10, 2025 15:11
@roypat roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Jan 10, 2025
@roypat roypat changed the title Assorted bits and bops Simplify microvm builder code and make uffd test failures easier to debug Jan 15, 2025
@roypat roypat force-pushed the misc-changes-7 branch 3 times, most recently from ce01d8a to b0eecf3 Compare January 16, 2025 08:04
kalyazin
kalyazin previously approved these changes Jan 16, 2025
@roypat roypat requested a review from bchalios January 16, 2025 10:14
Copy link
Contributor

@bchalios bchalios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions regarding backwards compatibility of the swagger changes. The rest of comments are mostly nits.

Also, codecov is complaining about the new functions that help us handle the messiness of static CPU templates. Anything we can do about these?

@roypat roypat force-pushed the misc-changes-7 branch 2 times, most recently from f56bf06 to b7e03e5 Compare January 16, 2025 11:19
@roypat roypat requested review from bchalios and kalyazin January 16, 2025 11:20
kalyazin
kalyazin previously approved these changes Jan 16, 2025
We never make us of this, and I do not see where this ever _could_ be
useful.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
This makes these fields optional in PATCH /machine-config requests. The
comment on this structure says that all fields should be optional, and I
dont quite see why these two should be different. Thus, add
`serde(default)` to avoid forcing customers to explicitly set them to
`null` if they do not want to update these parts of the machine config.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
This struct is almost a 1:1 duplication of `struct MachineConfig`, with
only a single deviation when it comes to CPU template handling (see
below). This makes it very annoying to add new fields to the
/machine-config endpoint, because counter-intuitively we have to
hand-edit at least 3 structs to add new fields to.

We can get rid of this duplication by merging VmConfig and MachineConfig
into just `MachineConfig` (that's what the endpoint is called, so having
the struct be the same makes sense). We now need to handle a bit of
nasty-ness when it comes to CPU templates, because /machine-config can
only be used for specifying static cpu templates, while a VmConfig is
used to hold whatever CPU template is stored, in whatever way. However,
we can handle this at the serde layer, by making serialize/deserialize
ignore the field if it doesnt contain a static template. This is ugly,
but since static templates are deprecated, we have line of sight to
getting rid of this weirdness when we release 2.0.

While we're at it, opportunistically rename functions etc to uniformly
call this thing a "machine config" instead of a "vm config".

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
If the UFFD handler exits abnormaly for some reason, have it take down
Firecracker as well by SIGKILL-ing it from a panic hook. For this,
reintroduce the "get peer creds" logic. We have to use SIGKILL because
Firecracker could be inside the handler for a KVM-originated page fault
that is not marked as interruptible, in which case all signals but
SIGKILL are ignored (happens for example during KVM_SET_MSRS when it
triggers the initialization of a gfn_to_pfn_cache for the kvm-clock
page, which uses GUP without FOLL_INTERRUPTIBLE).

While we're at it, add a hint to the generic "process not found" error
message to indicate that potentially Firecracker died, and that the
cause of this could be the UFFD handler crashing (for example, in firecracker-microvm#4601
the cause of the mystery hang is the UFFD handler crashing, but we were
stumped by what's going on for over half a year. Let's avoid that going
forward).

We can't enable this by default because it interferes with unittests,
and also the "malicious_handler", so expose a function on `Runtime` to
enable it only in valid_handler and fault_all_handler.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
There is no need to clone the GuestMemoryMmap here, as
create_vmm_and_vcpus returns it again (as part of the Vmm object), and
since later code in build_microvm_from_snapshot doesn't need to take
ownership of the GuestMemoryMmap, we can just use references to this
stored object, avoiding the clone.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
@roypat roypat requested a review from kalyazin January 16, 2025 13:51
@roypat roypat merged commit 3fb06e9 into firecracker-microvm:main Jan 16, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants