
[RFC] Use RX mergeable buffers for net device #2672

Closed
wants to merge 2 commits

Conversation

@alsrdn commented on Jul 21, 2021

Reason for This PR

This is an RFC for implementing mergeable RX buffers for the net device.
Snapshot tests fail since Virtio features need to be negotiated.

Addresses #1314

Description of Changes

This patch enables the net device to use mergeable RX buffers instead of big buffers.
Mergeable RX buffers is a virtio-net feature (VIRTIO_NET_F_MRG_RXBUF) that allows the device to write an RX packet across multiple descriptor chains instead of requiring a single chain large enough for the whole packet. The number of chains used for each packet is specified in the virtio-net header attached to that packet.
The specification of this Virtio feature can be consulted here: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-2140004

Without this feature, the driver allocates 65562 bytes (17 4 KiB pages) for every descriptor chain. This means that if the underlying tap device used to receive packets has a smaller MTU, the difference is always wasted.
With mergeable RX buffers turned on, the Linux virtio-net driver adjusts the size of the buffers it allocates based on the size of previously received packets, in order to conserve memory. This also allows the driver to use a different allocator (the skb_frag allocator) instead of reserving full pages.
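For reference, the virtio-net header that prefixes every packet has the layout sketched below (per the virtio 1.1 spec and include/uapi/linux/virtio_net.h); with VIRTIO_NET_F_MRG_RXBUF negotiated, the device reports the number of descriptor chains it used in num_buffers at offset 10:

// virtio_net_hdr_v1 layout (illustration only; field names follow the Linux
// uapi header). Offsets are in bytes from the start of the header.
#[repr(C)]
struct VirtioNetHdrV1 {
    flags: u8,        // offset 0
    gso_type: u8,     // offset 1
    hdr_len: u16,     // offset 2
    gso_size: u16,    // offset 4
    csum_start: u16,  // offset 6
    csum_offset: u16, // offset 8
    num_buffers: u16, // offset 10: descriptor chains used for this packet
}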

Performance

Average CPU utilisation of the VMM increases by 7-12% in host-to-guest tests that use DEFAULT or 256K window sizes, while keeping the same baseline throughput.

Tests with improved throughput and vCPU utilisation:

  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-ws16K-g2h:
    • throughput/total': Target: '2794 +- 139.7' vs Actual: '2945.32'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-g2h:
    • throughput/total': Target: '27381 +- 3285.72' vs Actual: '30755.22'.
    • cpu_utilization_vcpus_total/Avg': Target: '117 +- 15.21' vs Actual: '94.09444444444443'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-ws16K-g2h:
    • throughput/total': Target: '2790 +- 139.5' vs Actual: '2933.7799999999997'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-g2h:
    • cpu_utilization_vcpus_total/Avg': Target: '121 +- 13.31' vs Actual: '93.31666666666666'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-bd:
    • cpu_utilization_vcpus_total/Avg': Target: '164 +- 11.48' vs Actual: '110.38333333333333'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-bd:
    • cpu_utilization_vcpus_total/Avg': Target: '179 +- 14.32' vs Actual: '110.36666666666667'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-g2h:
    • throughput/total': Target: '27186 +- 2446.74' vs Actual: '29761.190000000002'.
    • cpu_utilization_vcpus_total/Avg': Target: '119 +- 13.09' vs Actual: '88.22222222222223'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-bd:
    • cpu_utilization_vcpus_total/Avg': Target: '162 +- 11.34' vs Actual: '110.74444444444447'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-bd:
    • cpu_utilization_vcpus_total/Avg': Target: '173 +- 13.84' vs Actual: '113.32222222222225'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-g2h:
    • throughput/total': Target: '25480 +- 2038.4' vs Actual: '28951.84'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-g2h:
    • throughput/total': Target: '27140 +- 2442.6' vs Actual: '30825.17'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-ws16K-h2g:
    • throughput/total': Target: '2437 +- 97.48' vs Actual: '2540.68'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-pDEFAULT-ws16K-h2g:
    • throughput/total': Target: '2435 +- 97.4' vs Actual: '2539.07'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-g2h:
    • throughput/total': Target: '27587 +- 3034.57' vs Actual: '30905.07'.
    • cpu_utilization_vcpus_total/Avg': Target: '111 +- 17.76' vs Actual: '86.75'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-ws16K-h2g:
    • throughput/total': Target: '3448 +- 137.92' vs Actual: '3664.0099999999998'.
    • cpu_utilization_vmm/Avg': Target: '53 +- 4.24' vs Actual: '59.044444444444444'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-ws16K-h2g:
    • throughput/total': Target: '3446 +- 137.84' vs Actual: '3671.63'.
    • cpu_utilization_vmm/Avg': Target: '53 +- 4.24' vs Actual: '59.044444444444444'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-g2h:
    • throughput/total': Target: '26379 +- 2374.11' vs Actual: '30247.85'.
    • cpu_utilization_vmm/Avg': Target: '92 +- 6.44' vs Actual: '98.80000000000001'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-g2h:
    • throughput/total': Target: '26079 +- 2347.11' vs Actual: '29440.41'.
    • cpu_utilization_vmm/Avg': Target: '92 +- 5.52' vs Actual: '98.43333333333335'.
    • cpu_utilization_vcpus_total/Avg': Target: '98 +- 9.8' vs Actual: '82.60555555555557'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-h2g:
    • throughput/total': Target: '23609 +- 1888.72' vs Actual: '27195.14'.
    • cpu_utilization_vmm/Avg': Target: '81 +- 5.67' vs Actual: '97.7'.

Tests with throughput regression:

  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-ws256K-bd:
    • throughput/total': Target: '25683 +- 1797.81' vs Actual: '22397.14'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-ws256K-bd:
    • throughput/total': Target: '24634 +- 1724.38' vs Actual: '22106.489999999998'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-h2g:
    • throughput/total': Target: '33948 +- 2715.84' vs Actual: '30570.62'.
    • cpu_utilization_vmm/Avg': Target: '91 +- 6.37' vs Actual: '98.11111111111114'.

This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • Any API changes are reflected in firecracker/swagger.yaml.
  • Any user-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.

Alexandru-Cezar Sardan added 2 commits July 21, 2021 20:11
This allows the guest driver to manage RX virtqueue memory
efficiently, removing the need to allocate descriptor chains
that can hold ~65 KB of packet data.
The device will now use multiple descriptor chains to write
the packet to the guest.

Signed-off-by: Alexandru-Cezar Sardan <alsardan@amazon.com>
This triggers the driver to process and replenish the available
ring once the device has used half of the ring. This helps in
high-traffic scenarios to avoid getting an EmptyQueue error early.

Signed-off-by: Alexandru-Cezar Sardan <alsardan@amazon.com>
@acatangiu (Contributor) commented:

Somewhat orthogonal question: Why are our tests a mix of 4.9 and 4.14 kernels?

I see regressions only on a few test cases that run 4.9; would a newer guest driver perform better? 🤔

@alsrdn (Author) commented on Jul 22, 2021

Somewhat orthogonal question: Why are our tests a mix of 4.9 and 4.14 kernels?

I see regressions only on a few test cases that run 4.9; would a newer guest driver perform better? 🤔

@acatangiu to answer your second question, there were some changes added to virtio-net in kernel 4.12 that improve performance on the mergeable RX buffers path (see torvalds/linux@680557c). There's also XDP support, which I don't think our tests use anyway.
With even newer kernels, not much seems to have happened other than XDP improvements and occasional bug fixes, though I didn't look into them in much detail.

@acatangiu (Contributor) left a comment

Really interesting PR and the results look great!

I only took a superficial look at it and left some nits and surface-level comments; I will come back with an in-depth review.

}

pub(crate) fn vnet_hdr_len() -> usize {
    mem::size_of::<virtio_net_hdr_v1>()
}

pub const VNET_HDR_NUM_BUFFERS_OFFSET: usize = 10;
Contributor

pls move this to some definitions file if there is any for virtio or virtio-net, or at least move this to the top of this file and add a comment where it's defined (either virtio standard or linux header or smth)

@@ -228,6 +245,11 @@ impl Net {
self.mmds_ns.is_some()
}

// Check if driver uses mergeable RX buffers.
pub fn use_mrg_rx_buffs(&self) -> bool {
Contributor

nit:

does this need to be pub?

Contributor

I would define this more generically as a VirtioDevice method. Something like has_feature(feature).
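Something along these lines, as a sketch only (it assumes the device exposes its negotiated bits through an acked_features() -> u64 accessor, like the other virtio devices):

// Generic helper on the VirtioDevice trait; the net device's mergeable-RX
// check then becomes has_feature(VIRTIO_NET_F_MRG_RXBUF as u64).
fn has_feature(&self, feature_bit: u64) -> bool {
    (self.acked_features() & (1u64 << feature_bit)) != 0
}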

METRICS.net.rx_bytes_count.add(frame_len);
METRICS.net.rx_packets_count.inc();

self.num_deffered_irqs += used_buffs.len() as u16;
Contributor

nit: rename num_deffered_irqs to smth like used_bytes_pending_irq or similar.


// When using mergeable RX buffers, the driver usually adds one descriptor per chain.
// When using big buffers the driver must add MAX_BUFFER_SIZE / PAGE_SIZE buffers per chain.
let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(17) };
Contributor

let's define a constant here derived from MAX_BUFFER_SIZE:

Suggested change
let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(17) };
let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(MAX_BUFFER_PAGES) };
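For example, a sketch of how that constant could be derived (rounding up is an assumption, since MAX_BUFFER_SIZE is not page-aligned):

// 65562-byte maximum buffer (value already used by the net device),
// rounded up to whole 4 KiB pages; this evaluates to 17.
const PAGE_SIZE: usize = 4096;
const MAX_BUFFER_SIZE: usize = 65562;
const MAX_BUFFER_PAGES: usize = (MAX_BUFFER_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;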

Comment on lines +382 to +384
queue.undo_multiple_pop(1+buffs.len() as u16);
buffs.clear();
return Err(FrontendError::BrokenQueue);
Contributor

nit: this cleanup sequence is repeated a couple of times, we could define a fn or closure for it
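One possible shape for that helper, sketched with the names from the quoted diff (whether a free function, method, or closure fits best depends on the surrounding borrows):

// Roll back the descriptor heads popped so far (the current head plus the
// ones already collected in `buffs`) and drop the partial result.
fn undo_rx_pops(queue: &mut Queue, buffs: &mut Vec<(u16, Vec<(GuestAddress, u32)>)>) {
    queue.undo_multiple_pop(1 + buffs.len() as u16);
    buffs.clear();
}
// ...so each error path becomes:
// undo_rx_pops(queue, &mut buffs);
// return Err(FrontendError::BrokenQueue);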

fn reserve_guest_wbuffers(
    &mut self,
    buffs: &mut Vec<(u16, Vec<(GuestAddress, u32)>)>
) -> std::result::Result<(), FrontendError> {
    let mem = match self.device_state {
Contributor

nit: mem could come in as a param here to avoid matching state for it again
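A sketch of that change, reusing the types from the quoted diff (the body is elided):

// Guest memory is passed in by the caller, so the function no longer needs
// to match self.device_state to obtain it.
fn reserve_guest_wbuffers(
    &mut self,
    mem: &GuestMemoryMmap,
    buffs: &mut Vec<(u16, Vec<(GuestAddress, u32)>)>,
) -> std::result::Result<(), FrontendError> {
    // ... walk the avail ring using `mem` directly ...
    Ok(())
}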

// Descriptor chain must be able to hold the vnet header at the very least.
if desc_chain_wlen >= vnet_hdr_len() {
    total_reserved_sz += desc_chain_wlen;
    buffs.push((head_descriptor.index, desc_chain_wbuffs));
Contributor

nit: would be nice if we could somehow avoid the intermediate vec

Comment on lines +294 to +301
fn signal_rx_used_queue_throttled(&mut self) -> result::Result<(), DeviceError> {
    // Signal the other side when we used half of the RX queue
    if self.num_deffered_irqs > self.queues[RX_INDEX].actual_size() / 2 {
        return self.signal_used_queue();
    }

    Ok(())
}
Contributor

It's possible that this will not be needed anymore after adding support for notification suppression. AFAIU, when notification suppression is negotiated, the guest net driver asks for notifications once 3/4 of the RX queue buffers have been used, which resembles what we do here. It would be worth testing this.


// Gets a list of writable buffers contained in this descriptor chain.
// Returns an error if descriptor chain is malformed.
pub fn get_wo_buffs(
Contributor

This logic seems specific to the net device. I wouldn't define it in the Queue implementation. An option would be to create a separate struct that contains the net queues and maybe the rate limiters or other components (VirtioFrontend for example) and define it there. Although this change would be quite big. Or we can simply define this method in the net device.

@@ -173,6 +173,17 @@ impl<'a> VirtqDesc<'a> {
.is_ok());
}

pub fn check_data_from_offset(&self, expected_data: &[u8], offset: usize) {
Contributor

I wouldn't add a separate method for this. I would just extend check_data() and provide offset = 0 to all the old use cases.

@@ -381,6 +429,44 @@ impl Queue {
.map_err(QueueError::UsedRing)
}

/// Puts multiple available descriptor heads into the used ring for use by the guest.
pub fn add_used_bulk(
Contributor

Can we reimplement add_used() by calling add_used_bulk() with only 1 descriptor, in order to avoid duplicated code?
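For instance (sketch only; add_used_bulk()'s exact parameters aren't visible in this hunk, so the slice-of-(index, len)-pairs shape is an assumption):

/// Puts a single available descriptor head into the used ring.
pub fn add_used(
    &mut self,
    mem: &GuestMemoryMmap,
    desc_index: u16,
    len: u32,
) -> Result<(), QueueError> {
    // Delegate to the bulk variant with a one-element batch so the
    // used-ring bookkeeping lives in a single place.
    self.add_used_bulk(mem, &[(desc_index, len)])
}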

@dianpopa changed the title from "[RFC] Use RX mergeable buffers" to "[RFC] Use RX mergeable buffers for net device" on Sep 8, 2021
@alsrdn (Author) commented on Dec 21, 2021

Will resume work on this PR in 2022.

@alsrdn (Author) commented on Jan 25, 2022

This PR has been stale for a while, so I'm closing it for the moment. We'll reopen it after we refactor the changes on top of the current release.
