Align device events registration after device activation #1687
Still need to add unit tests for the new code.
ccd1383 to 249d674
Tests added, PR is complete.
249d674 to 9bfb07e
04f3317 to ad12fa5
    self_subscriber.clone(),
)
.unwrap_or_else(|e| {
    error!(
Do we want to continue the execution if we cannot register the events needed for this device? It might lead to silent failures.
I believe that once a guest is booted (and the customer workload has possibly started) we should always avoid crashing.
The blast radius of a malfunctioning device is limited to operations on that device. The guest workload might or might not be affected, and data might be recoverable if the malfunction is detected.
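The log-and-continue approach argued for here can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `RegisterError`, `register_event`, and `register_all` are hypothetical names, and `eprintln!` stands in for the `error!` logging macro used in the diff.

```rust
use std::fmt;

/// Hypothetical error type standing in for an event-registration failure.
#[derive(Debug)]
struct RegisterError(String);

impl fmt::Display for RegisterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Hypothetical registration call; `fail` simulates a registration error.
fn register_event(name: &str, fail: bool) -> Result<(), RegisterError> {
    if fail {
        Err(RegisterError(format!("cannot register {}", name)))
    } else {
        Ok(())
    }
}

/// Tries to register every event. Failures are logged and counted,
/// but never panic, so the rest of the VMM keeps running.
fn register_all(events: &[(&str, bool)]) -> usize {
    let mut failures = 0;
    for (name, fail) in events {
        register_event(name, *fail).unwrap_or_else(|e| {
            // `eprintln!` is a stand-in for the `error!` macro in the diff.
            eprintln!("Failed to register device events: {}", e);
            failures += 1;
        });
    }
    failures
}
```

Returning the failure count is one possible way to keep the malfunction detectable without crashing; the PR itself only logs the error.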
DeviceState::Inactive => warn!(
    "Block: The device is not yet activated. Spurious event received: {:?}",
    source
),
This error case can and probably should be tested. We can check that we do not trigger the process_*_event functions when the device is not activated. This applies to both block and net devices. It looks like activation of the vsock device is not tested at all, so maybe we should open an issue for that one.
Sounds good. Will add a test for this error case.
Added a test for the vsock device event handler.
Enhanced the handler tests for all devices to validate correct event handling both before and after activation.
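As a rough sketch of what such a pre- and post-activation check might look like, the following uses a deliberately simplified device. The `Device` type, its `handled_events` counter, and `process_event` are assumptions made for illustration; the real devices carry guest memory in the activated state and dispatch to several process_*_event functions.

```rust
/// Simplified stand-in for the device state discussed in the review.
#[derive(Debug, PartialEq)]
enum DeviceState {
    Inactive,
    Activated,
}

/// Hypothetical device that only counts the events it actually handles.
struct Device {
    device_state: DeviceState,
    handled_events: u32,
}

impl Device {
    fn new() -> Self {
        Device {
            device_state: DeviceState::Inactive,
            handled_events: 0,
        }
    }

    fn is_activated(&self) -> bool {
        self.device_state == DeviceState::Activated
    }

    fn activate(&mut self) {
        self.device_state = DeviceState::Activated;
    }

    /// Returns true if the event was handled, false if it was ignored
    /// because the device is not activated yet.
    fn process_event(&mut self, source: i32) -> bool {
        if !self.is_activated() {
            // Stand-in for the `warn!` call shown in the diff.
            eprintln!(
                "The device is not yet activated. Spurious event received: {:?}",
                source
            );
            return false;
        }
        self.handled_events += 1;
        true
    }
}
```

A test can then assert that an event delivered before `activate()` leaves `handled_events` at zero, and that the same event is handled afterwards.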
use vm_memory::{Bytes, GuestAddress};

#[test]
fn test_interest_list() {
This test does not bring much value because interest_list just returns a static vector created in the function body. Testing it basically means duplicating the code in the test. This is typically considered a bad engineering practice, referred to as WET. You can read more about it here: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
I'm fine with that, but we should be consistent. We have a lot of tests doing exactly this (think error formatting).
To be honest, I am in favor of removing all such tests, but when they were added the logic was that the test brings an extra layer of redundancy that would catch unintentional changes.
@firecracker-microvm/compute-capsule I'd really like for us to align here and maybe even formalize it in a best-practices doc in the codebase.
I agree we should be consistent. If people from the team agree with this, we shouldn't introduce this test in this PR just for the sake of consistency though :D
I've removed the WET tests.
Hi everyone! Taking a step back, it seems a lot of complexity stems from the multi-step Firecracker configuration process. Is there any reason for devices, memory, mmds, etc. to be configured via separate API calls at this point? If the API only allowed the "one call" configuration we talked about a while ago (or, to push things even further, Firecracker is always started using a config file and only keeps the API for runtime PATCHes, etc.), quite a lot of logic gets eliminated and we get configuration information upfront. This is a breaking change, but it seems worthwhile to seriously consider sooner rather than later. Also wanted to mention the
I agree with @alexandruag that the current multi-step configuration process introduces a complexity cost and imposes limitations on the Firecracker design that far outweigh any benefits it brings in a production scenario. Unfortunately, as Alex mentioned, removing the multi-step configuration is a breaking change in terms of API and usage patterns, and it needs research and buy-in from customers/users of Firecracker before we can do it. In the meantime, we should move forward with the changes in this PR, as their blast radius is very small and reverting the way memory is passed to the device can easily be done in the future. By that time we might even have an agreed-upon interface in rust-vmm that we can also adhere to.
Overall LGTM. Please add unit tests in */event_handler.rs that verify that no events are handled while the device is inactive.
8945700 to 1707be1
// Push a queue event
// - the driver has something to send (there's data in the TX queue); and
// - the backend has no pending RX data.
This comment says that the backend has no pending RX data, but in the code block below you call device.backend.set_pending_rx(true).
You're right, I had updated the test but forgot to update the comment.
Fixed.
_ if source == evq => raise_irq = self.handle_evq_event(event),
_ if source == backend => {
    raise_irq = self.notify_backend(event);
    match self.device_state {
Shouldn't this be using the is_activated function instead?
Sure. I've made the changes so all devices now use is_activated() in their event handler.
@@ -161,11 +160,16 @@ impl Block {
    }

    pub(crate) fn process_queue(&mut self, queue_index: usize) -> bool {
        let mem = match self.device_state {
            DeviceState::Activated(ref mem) => mem,
            // This should never happen, it's been already validated in the event handler.
If this should never happen, should we use a panic instead, so that programming errors get caught?
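One way to picture this suggestion is the sketch below, in which the event handler validates the state and the inner function treats the inactive branch as a programming error. `GuestMemory` here is a hypothetical placeholder struct rather than the vm_memory type, `handle_event` is an assumed name for the validating caller, and the queue processing is reduced to a print statement.

```rust
/// Hypothetical placeholder for the guest memory handle held by the device.
struct GuestMemory {
    size: usize,
}

enum DeviceState {
    Inactive,
    Activated(GuestMemory),
}

struct Block {
    device_state: DeviceState,
}

impl Block {
    fn is_activated(&self) -> bool {
        matches!(self.device_state, DeviceState::Activated(_))
    }

    /// Must only be called after the caller has checked `is_activated()`.
    fn process_queue(&mut self, queue_index: usize) -> bool {
        let mem = match self.device_state {
            DeviceState::Activated(ref mem) => mem,
            // Reaching this branch means the caller skipped the state check.
            DeviceState::Inactive => unreachable!("Device is not activated"),
        };
        // Toy stand-in for real queue processing.
        println!(
            "processing queue {} with {} bytes of guest memory",
            queue_index, mem.size
        );
        true
    }

    /// Assumed event-handler entry point: validates state before dispatching.
    fn handle_event(&mut self, queue_index: usize) -> bool {
        if self.is_activated() {
            self.process_queue(queue_index)
        } else {
            false
        }
    }
}
```

With this split, a unit test that calls `process_queue` on an inactive device fails loudly, while production traffic only reaches it through the validated path.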
I'm usually very apprehensive about adding crash conditions because of the risk of crashing in production.
In this area, however, we have a lot of unit tests, so any future programming errors will be better caught with a panic, as you say. Also, because of the high degree of coverage, I believe it is safe to say that we shouldn't crash in prod while tests are passing.
Per your suggestion, I've modified all instances of this check to have unreachable!() on the unreachable path.
            "Block: The device is not yet activated. Spurious event received: {:?}",
            source
        ),
    };
}

// Returns the rate_limiter and queue event fds.
- // Returns the rate_limiter and queue event fds.
+ // Returns the activate event fd.
Could also remove the comment, as it's not essential and requires maintenance with each change of the function's meaning.
Done.
ded4139 to 88ee2bb
Instead of passing memory at device creation, bring it in during device activation. This enables future scenarios where devices can be created prior to guest memory configuration.
Signed-off-by: Adrian Catangiu <acatan@amazon.com>

Postpone block external events registration to device activation time. This makes the block device unaware of external events prior to its activation. During creation register a dedicated activation event which will notify the device when it's time to register the other external events sources. This activation event is unregistered after successful device activation.
Signed-off-by: Adrian Catangiu <acatan@amazon.com>

Postpone net external events registration to device activation time. This makes the net device unaware of external events prior to its activation. During creation register a dedicated activation event which will notify the device when it's time to register the other external events sources. This activation event is unregistered after successful device activation.
Signed-off-by: Adrian Catangiu <acatan@amazon.com>

Allow Vsock creation without guest memory. Access to memory will be given to the device during its activation.
Signed-off-by: Adrian Catangiu <acatan@amazon.com>

Make sure all events are ignored prior to device activation. Added test for vsock event handling through EventManager.
Signed-off-by: Adrian Catangiu <acatan@amazon.com>
88ee2bb to be73e7c
Reason for This PR
Fixes #1639
Description of Changes
Device access to guest memory on/after activation
Passing memory during device activation instead of during construction enables the following story: #1702, #1708, #1709, and finally #1713.
The changes in this PR are in line with the model currently defined in https://github.com/rust-vmm/vm-virtio.
They are also compatible with the rust-vmm/vm-virtio#10 proposal, where memory is also passed to the device during activation.
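To make the idea concrete, here is a minimal sketch of a device that is constructed without guest memory and receives it only on activation. The `Activate` trait, the `GuestMemory` placeholder, and the `String` error type are all assumptions for illustration; they are not the rust-vmm trait or the signature from the proposal.

```rust
use std::sync::Arc;

/// Hypothetical placeholder for a guest memory handle.
#[derive(Debug)]
struct GuestMemory {
    size: usize,
}

/// Hypothetical activation interface: memory arrives here, not in `new()`.
trait Activate {
    fn activate(&mut self, mem: Arc<GuestMemory>) -> Result<(), String>;
}

struct Net {
    /// `None` until the device is activated.
    mem: Option<Arc<GuestMemory>>,
}

impl Net {
    /// The device can be created before guest memory is configured.
    fn new() -> Self {
        Net { mem: None }
    }
}

impl Activate for Net {
    fn activate(&mut self, mem: Arc<GuestMemory>) -> Result<(), String> {
        if self.mem.is_some() {
            return Err("device already activated".to_string());
        }
        self.mem = Some(mem);
        Ok(())
    }
}
```

Because construction no longer depends on memory, devices can be created in any order relative to memory configuration, which is the scenario the follow-up issues build on.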
Events registration during activate
Postpone external events registration to device activation time.
This makes the block and net devices unaware of external events prior to their activation.
During creation, register a dedicated activation event which will notify the device when it's time to register the other external event sources.
This activation event is unregistered after successful device activation.
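The registration lifecycle described above can be sketched with a toy event manager that tracks registered event names in a set. Everything here is hypothetical scaffolding (`EventManager`, `ACTIVATE_EVT`, `RUNTIME_EVTS`, `on_activate`); the real implementation is epoll-based and works with event file descriptors rather than strings.

```rust
use std::collections::HashSet;

/// Toy event manager: a set of registered event names.
#[derive(Default)]
struct EventManager {
    registered: HashSet<&'static str>,
}

impl EventManager {
    fn register(&mut self, ev: &'static str) {
        self.registered.insert(ev);
    }

    fn unregister(&mut self, ev: &'static str) {
        self.registered.remove(ev);
    }

    fn is_registered(&self, ev: &str) -> bool {
        self.registered.contains(ev)
    }
}

/// Hypothetical event names, used for illustration only.
const ACTIVATE_EVT: &str = "activate";
const RUNTIME_EVTS: [&str; 2] = ["queue", "rate_limiter"];

struct Device {
    activated: bool,
}

impl Device {
    /// At creation, only the dedicated activation event is registered.
    fn new(mgr: &mut EventManager) -> Self {
        mgr.register(ACTIVATE_EVT);
        Device { activated: false }
    }

    /// Called when the activation event fires: register the runtime
    /// event sources, then drop the activation event.
    fn on_activate(&mut self, mgr: &mut EventManager) {
        for ev in RUNTIME_EVTS.iter() {
            mgr.register(*ev);
        }
        mgr.unregister(ACTIVATE_EVT);
        self.activated = true;
    }
}
```

Before activation the manager knows only about the activation event, so queue or rate-limiter events cannot reach the device at all.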
Coverage slightly decreased because of untestable EventFd error cases.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.
PR Checklist
[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]
(git commit -s).
unsafe code is properly documented.
firecracker/swagger.yaml.
CHANGELOG.md.