Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align device events registration after device activation #1687

Merged
merged 6 commits into from
Apr 1, 2020

Conversation

acatangiu
Copy link
Contributor

@acatangiu acatangiu commented Mar 19, 2020

Reason for This PR

Fixes #1639

Description of Changes

Device access to guest memory on/after activation

Passing memory during device activation instead of during construction enables the following story: #1702 #1708 #1709 and finally #1713

The changes in this PR are in line with the model currently defined in https://github.com/rust-vmm/vm-virtio.
They are also compatible with the rust-vmm/vm-virtio#10 proposal where memory is also passed to the device during activation:

    /// Associated guest memory
    type M: GuestMemory;

    ...

    /// Activates this device for real usage.
    /// The ownership of the VirtioDeviceConfig object moves into the VirtioDevice object if
    /// activate succeeds, otherwise it should return the ownership back.
    fn activate(&mut self, config: VirtioDeviceConfig<M>) -> ActivateResult<VirtioDeviceConfig<M>>;

Events registration during activate

Postpone external events registration to device activation time.
This makes the block and net devices unaware of external events prior to their activation.

During creation register a dedicated activation event which will notify the device when it's time to register the other external events sources.

This activation event is unregistered after successful device activation.

Coverage slightly decreased because untestable EventFd error cases.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • Any API changes are reflected in firecracker/swagger.yaml.
  • Any user-facing changes are mentioned in CHANGELOG.md.

@acatangiu acatangiu self-assigned this Mar 19, 2020
@acatangiu acatangiu added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Mar 19, 2020
@acatangiu
Copy link
Contributor Author

Still need to add unit tests for the new code.

@acatangiu acatangiu added Status: Author Status: Awaiting review Indicates that a pull request is ready to be reviewed and removed Status: Awaiting review Indicates that a pull request is ready to be reviewed labels Mar 19, 2020
@acatangiu
Copy link
Contributor Author

Tests added, PR is complete.

src/devices/src/virtio/vsock/device.rs Show resolved Hide resolved
self_subscriber.clone(),
)
.unwrap_or_else(|e| {
error!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to continue the execution if we cannot register the events needed for this device? It might lead to silent failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that once a guest is booted (customer workload has possibly started) we should always avoid crashing.

A malfunctioning device has the blast-radius limited to operations on that device. The guest workload might or might not be affected and data might be recoverable if the malfunction is detected.

Comment on lines 81 to 84
DeviceState::Inactive => warn!(
"Block: The device is not yet activated. Spurious event received: {:?}",
source
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error case can be tested and should probably be tested. We can check that we do not trigger process_*_event functions when the device is not activated. This applies to both block and net devices. It looks like the activate for vsock device is not tested at all, so maybe we should open an issue for that one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Will add a test for this error case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for the vsock device event handler.

Enhanced handler tests for all devices to validate correct events handling for both pre- and post-activation.

use vm_memory::{Bytes, GuestAddress};

#[test]
fn test_interest_list() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not bring much value because interest_list just returns a static vector created in the function body. To test this basically means to duplicate the code in the test. This is typically a bad engineering practice referred to as WET. You can read more about it here: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with that, but we should be consistent. We have a lot of tests doing exactly this (think errors formatting).

I am in favor of removing all such tests to be honest, but when they were added the logic was test brings an extra layer of redundancy that would catch unintentional changes.

@firecracker-microvm/compute-capsule I'd really like for us to align here and maybe even formalize it in a best-practices doc in the codebase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should be consistent. If people from the team agree with this, we shouldn't introduce this test in this PR just for the sake of consistency though :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the WET tests.

src/devices/src/virtio/device.rs Show resolved Hide resolved
@serban300 serban300 self-requested a review March 25, 2020 16:00
@alexandruag
Copy link
Contributor

Hi everyone! Taking a step back, it seems a lot complexity stems from the multi-step Firecracker configuration process. Is there any reason for devices, memory, mmds, etc. to be configured via separate API calls at this point? If the API only allowed the "one call" configuration we talked about a while ago (or, to push things even further, Firecracker is always started using a config file and only keeps the API for runtime PATCHes, etc), quite a lot of logic gets eliminated and we get configuration information upfront. This is a breaking change, but it seems worthwhile to seriously consider sooner rather than later.

Also wanted to mention the VirtioDevice interface from rust-vmm is still mostly based on the old crosvm/Firecracker code, so unfortunately it's unclear at this point how it will end up looking like. FWIW, not immediately going back to tying device memory configuration to activation might be better in terms of keeping options open.

@acatangiu
Copy link
Contributor Author

I agree with @alexandruag that the current multi-step configuration process introduces a complexity cost and imposes limitations on the Firecracker design that far outweigh any benefits it brings in a production scenario.

Unfortunately, like Alex mentioned, removing the multi-step configuration is a breaking change in terms of API and usage patterns and needs research and buy-in from customers/users of Firecracker before we can do it.

In the meantime, we should move forward with the changes in this PR as their blast-radius is very small and reverting the way memory is passed to the device can be easily done in the future. By that time we might even have an agreed-upon interface in rust-vmm that we can also adhere to.

Copy link

@iulianbarbu iulianbarbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Please add unit tests in */event_handler.rs which verifies that when the device is inactive, there is no event handling.

@acatangiu acatangiu force-pushed the device_activate branch 2 times, most recently from 8945700 to 1707be1 Compare March 30, 2020 19:52

// Push a queue event
// - the driver has something to send (there's data in the TX queue); and
// - the backend has no pending RX data.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment says that the backend has no pending RX data, but in the code block below, you do:
device.backend.set_pending_rx(true).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I had updated the test but forgot to update the comment.
Fixed.

_ if source == evq => raise_irq = self.handle_evq_event(event),
_ if source == backend => {
raise_irq = self.notify_backend(event);
match self.device_state {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be using is_activated function instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I've made the changes so all devices now use is_activated() in their event handler.

@@ -161,11 +160,16 @@ impl Block {
}

pub(crate) fn process_queue(&mut self, queue_index: usize) -> bool {
let mem = match self.device_state {
DeviceState::Activated(ref mem) => mem,
// This should never happen, it's been already validated in the event handler.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this should never happen, should we use a panic instead to avoid programming errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm usually very apprehensive of adding crash conditions because of the risk of crashing in production.

In this area however, we have a lot of unit-tests so any future programming errors will be better caught with a panic as you say. Also because of the high-degree of coverage, I believe it is safe to say that we shouldn't crash in prod while tests are passing.

Per your suggestion, I've modified all instances of this check to have unreachable!() on the unreachable path.

"Block: The device is not yet activated. Spurious event received: {:?}",
source
),
};
}

// Returns the rate_limiter and queue event fds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Returns the rate_limiter and queue event fds.
// Returns the activate event fd.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also remove the comment, as it's not essential and requires maintenance with each change of the function's meaning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Instead of passing memory at device creation, bring it in
during device activation.

This enables future scenarios where devices can be created
prior to guest memory configuration.

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
Postpone block external events registration to device activation time.
This makes the block device unaware of external events prior to
its activation.

During creation register a dedicated activation event which will notify
the device when it's time to register the other external events sources.

This activation event is unregistered after successful device
activation.

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
Postpone net external events registration to device activation time.
This makes the net device unaware of external events prior to
its activation.

During creation register a dedicated activation event which will notify
the device when it's time to register the other external events sources.

This activation event is unregistered after successful device
activation.

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
Allow Vsock creation without guest memory. Access to memory will
be given to the device during its activation.

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
Make sure all events are ignored prior to device activation.

Added test for vsock event handling through EventManager.

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
@iulianbarbu iulianbarbu merged commit ce7a3d9 into firecracker-microvm:master Apr 1, 2020
@acatangiu acatangiu deleted the device_activate branch April 1, 2020 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Align device events registration after device activation
5 participants