[RFC] Use RX mergeable buffers for net device #2672
Conversation
This allows the guest driver to manage RX virtqueue memory efficiently, removing the need to allocate descriptor chains that can hold ~65 KB of packet data. The device will now use multiple descriptor chains to write the packet to the guest. Signed-off-by: Alexandru-Cezar Sardan <alsardan@amazon.com>
This triggers the driver to process and replenish the available ring once the device has used half of the ring. This helps in high traffic scenarios to avoid getting an EmptyQueue error early. Signed-off-by: Alexandru-Cezar Sardan <alsardan@amazon.com>
Somewhat orthogonal question: Why are our tests a mix of ...? I see the regression is only on a few testcases that run ...
@acatangiu to answer your second question, there were some changes added to virtio-net in kernel 4.12 that improve performance on the mergeable RX buffers path (see torvalds/linux@680557c). There's also XDP support, which I don't think our tests use anyway.
Really interesting PR and the results look great!
I only took a superficial look at it, leaving some nits and surface-level comments; I will come back with an in-depth review.
}

pub(crate) fn vnet_hdr_len() -> usize {
    mem::size_of::<virtio_net_hdr_v1>()
}

pub const VNET_HDR_NUM_BUFFERS_OFFSET: usize = 10;
please move this to a definitions file if there is one for virtio or virtio-net, or at least move it to the top of this file and add a comment noting where it's defined (either the virtio standard or a Linux header)
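For illustration, a minimal sketch of what this could look like, assuming the constant stays in this crate; the comment text and spec reference are mine, not the PR's:

// Offset of the num_buffers field inside struct virtio_net_hdr_v1
// (see include/uapi/linux/virtio_net.h and section 5.1.6 "Device Operation"
// of the virtio 1.1 specification).
pub const VNET_HDR_NUM_BUFFERS_OFFSET: usize = 10;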
@@ -228,6 +245,11 @@ impl Net {
    self.mmds_ns.is_some()
}

// Check if driver uses mergeable RX buffers.
pub fn use_mrg_rx_buffs(&self) -> bool {
nit: does this need to be pub?
I would define this more generically as a VirtioDevice method. Something like has_feature(feature).
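A rough sketch of the suggested helper, assuming the VirtioDevice trait exposes the negotiated feature bits through something like acked_features() (that method name is an assumption, not confirmed by the PR):

pub trait VirtioDevice {
    // Already-negotiated (acked) feature bits; assumed to exist on the trait.
    fn acked_features(&self) -> u64;

    // Generic feature check usable by any virtio device; `feature` is the bit
    // position defined by the virtio spec (e.g. VIRTIO_NET_F_MRG_RXBUF = 15).
    fn has_feature(&self, feature: u64) -> bool {
        (self.acked_features() & (1u64 << feature)) != 0
    }
}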
METRICS.net.rx_bytes_count.add(frame_len);
METRICS.net.rx_packets_count.inc();

self.num_deffered_irqs += used_buffs.len() as u16;
nit: rename num_deffered_irqs to something like used_bytes_pending_irq or similar.
// When using mergeable RX buffers, the driver usually adds one descriptor per chain.
// When using big buffers the driver must add MAX_BUFFER_SIZE / PAGE_SIZE buffers per chain.
let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(17) };
let's define a constant here derived from MAX_BUFFER_SIZE:

- let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(17) };
+ let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(MAX_BUFFER_PAGES) };
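One possible derivation of such a constant, assuming MAX_BUFFER_SIZE is the existing 65562-byte maximum buffer size and that the PAGE_SIZE constant is added here (both the name PAGE_SIZE and its placement are assumptions):

const PAGE_SIZE: usize = 4096;
// Number of 4 KiB pages needed to cover the largest RX buffer: 65562 bytes -> 17 pages.
const MAX_BUFFER_PAGES: usize = (MAX_BUFFER_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;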
queue.undo_multiple_pop(1 + buffs.len() as u16);
buffs.clear();
return Err(FrontendError::BrokenQueue);
nit: this cleanup sequence is repeated a couple of times; we could define a fn or closure for it
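A possible shape for that shared cleanup as a small closure; the names follow the surrounding snippet and the exact types are assumptions, not the PR's code:

// Give the popped descriptor chains back to the avail ring, drop whatever was
// reserved so far, and produce the error for the caller to propagate.
let undo_and_fail = |queue: &mut Queue, buffs: &mut Vec<(u16, Vec<(GuestAddress, u32)>)>| {
    queue.undo_multiple_pop(1 + buffs.len() as u16);
    buffs.clear();
    FrontendError::BrokenQueue
};
// Call sites then become: return Err(undo_and_fail(queue, buffs));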
fn reserve_guest_wbuffers(
    &mut self,
    buffs: &mut Vec<(u16, Vec<(GuestAddress, u32)>)>,
) -> std::result::Result<(), FrontendError> {
    let mem = match self.device_state {
nit: mem could come in as a param here to avoid matching state for it again
// Descriptor chain must be able to hold the vnet header at the very least.
if desc_chain_wlen >= vnet_hdr_len() {
    total_reserved_sz += desc_chain_wlen;
    buffs.push((head_descriptor.index, desc_chain_wbuffs));
nit: would be nice if we could somehow avoid the intermediate vec
fn signal_rx_used_queue_throttled(&mut self) -> result::Result<(), DeviceError> {
    // Signal the other side when we used half of the RX queue.
    if self.num_deffered_irqs > self.queues[RX_INDEX].actual_size() / 2 {
        return self.signal_used_queue();
    }

    Ok(())
}
It's possible that this will not be needed anymore after adding support for notification suppression. AFAIU, when notification suppression is negotiated, the guest net driver asks for notifications when 3/4 of the RX queue buffers have been used, which resembles what we do here. It would be worth testing this.
// Gets a list of writable buffers contained in this descriptor chain.
// Returns an error if the descriptor chain is malformed.
pub fn get_wo_buffs(
This logic seems specific to the net device. I wouldn't define it in the Queue implementation. An option would be to create a separate struct that contains the net queues and maybe the rate limiters or other components (VirtioFrontend for example) and define it there, although this change would be quite big. Or we can simply define this method in the net device.
@@ -173,6 +173,17 @@ impl<'a> VirtqDesc<'a> {
    .is_ok());
}

pub fn check_data_from_offset(&self, expected_data: &[u8], offset: usize) {
I wouldn't add a separate method for this. I would just extend check_data() and provide offset = 0 to all the old use cases.
@@ -381,6 +429,44 @@ impl Queue {
    .map_err(QueueError::UsedRing)
}

/// Puts multiple available descriptor heads into the used ring for use by the guest.
pub fn add_used_bulk(
Can we reimplement add_used() by calling add_used_bulk() with only 1 descriptor, in order to avoid duplicated code?
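A sketch of that suggestion, with add_used() delegating to the bulk variant inside impl Queue; the exact signatures of both methods are assumptions based on the snippet above:

pub fn add_used(
    &mut self,
    mem: &GuestMemoryMmap,
    desc_index: u16,
    len: u32,
) -> Result<(), QueueError> {
    // A single (descriptor index, written length) pair is just a bulk add of one.
    self.add_used_bulk(mem, &[(desc_index, len)])
}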
Will resume work on this PR in 2022.
This PR has been stale for a while, so I'm closing it for the moment. We'll reopen it after we refactor the changes on top of the current release.
Reason for This PR
This is an RFC for implementing mergeable RX buffers for the net device.
Snapshot tests fail since Virtio features need to be negotiated.
Addresses #1314
Description of Changes
This patch enables the net device to use mergeable RX buffers instead of big buffers.
Mergeable RX buffers is a virtio-net feature that allows the device to write an RX packet into multiple descriptor chains instead of just one. The number of chains used is specified in the virtio-net header attached to each packet.
The specification of this Virtio feature can be consulted here: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-2140004
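To make the "number of chains" bookkeeping concrete, here is a minimal device-side sketch, not the PR's actual code; the function name, error handling, and address parameters are placeholders (it also assumes a little-endian host, matching virtio's little-endian fields):

use vm_memory::{Bytes, GuestAddress, GuestMemoryError, GuestMemoryMmap};

// Offset of num_buffers inside struct virtio_net_hdr_v1 (same value the PR uses).
const VNET_HDR_NUM_BUFFERS_OFFSET: usize = 10;

// After a packet has been scattered across `chains_used` descriptor chains,
// the device patches that count into the virtio-net header placed at the
// start of the first chain.
fn write_num_buffers(
    mem: &GuestMemoryMmap,
    first_chain_addr: GuestAddress,
    chains_used: u16,
) -> Result<(), GuestMemoryError> {
    mem.write_obj(
        chains_used,
        first_chain_addr.unchecked_add(VNET_HDR_NUM_BUFFERS_OFFSET as u64),
    )
}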
Without this feature, the driver allocates 65562 bytes (17 4 KiB pages) in a single descriptor chain. This means that if the underlying tap device used to receive packets has a smaller MTU, the difference is always wasted.
With mergeable RX buffers turned on, Linux virtio-net drivers adjust the size of the allocated buffers based on the size of previously received packets in order to conserve memory. This allows the driver to use a different allocator (the skb_frag allocator) instead of reserving full pages.
Performance
Average CPU utilisation of the VMM increases by 7-12% on host-to-guest tests that use DEFAULT or 256K window sizes, while keeping the same baseline throughput.
Tests with improved throughput and vCPU utilisation:
Tests with throughput regression:
vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-ws256K-bd:
vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-ws256K-bd:
vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-h2g:
This functionality can be added in rust-vmm.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.
PR Checklist
[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not.]
All commits in this PR are signed (git commit -s).
Any newly added unsafe code is properly documented.
Any API changes are reflected in firecracker/swagger.yaml.
Any user-facing changes are mentioned in CHANGELOG.md.