
[RFC] Use RX mergeable buffers for net device #2672

Closed
wants to merge 2 commits

Conversation

@alsrdn commented on Jul 21, 2021

Reason for This PR

This is an RFC for implementing mergeable RX buffers for the net device.
Snapshot tests fail since Virtio features need to be negotiated.

Addresses #1314

Description of Changes

This patch enables the net device to use mergeable RX buffers instead of big buffers.
Mergeable RX buffers is a virtio-net feature (VIRTIO_NET_F_MRG_RXBUF) that allows the device to write an RX packet across multiple descriptor chains instead of requiring a single chain large enough for the whole packet. The number of chains used for each packet is specified in the virtio-net header attached to that packet.
The specification of this Virtio feature can be consulted here: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-2140004

Without this feature, the driver allocates 65562 bytes (17 4 KiB pages) for every descriptor chain. This means that if the underlying tap device used to receive packets has a smaller MTU, the difference is always wasted.
With mergeable RX buffers turned on, the Linux virtio-net driver adjusts the size of the buffers it allocates based on the size of previously received packets, in order to conserve memory. This also allows the driver to use a different allocator (the skb_frag allocator) instead of reserving full pages.
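For reference, the virtio-net header that prefixes every packet has the layout sketched below (per the virtio 1.1 spec and include/uapi/linux/virtio_net.h); with VIRTIO_NET_F_MRG_RXBUF negotiated, the device reports the number of descriptor chains it used in num_buffers at offset 10:

// virtio_net_hdr_v1 layout (illustration only; field names follow the Linux
// uapi header). Offsets are in bytes from the start of the header.
#[repr(C)]
struct VirtioNetHdrV1 {
    flags: u8,        // offset 0
    gso_type: u8,     // offset 1
    hdr_len: u16,     // offset 2
    gso_size: u16,    // offset 4
    csum_start: u16,  // offset 6
    csum_offset: u16, // offset 8
    num_buffers: u16, // offset 10: descriptor chains used for this packet
}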

Performance

Average CPU utilisation of the VMM increases by 7-12% in host-to-guest tests that use DEFAULT or 256K window sizes, while keeping the same baseline throughput.

Tests with improved throughput and vCPU utilisation:

  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-ws16K-g2h:
    • throughput/total': Target: '2794 +- 139.7' vs Actual: '2945.32'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-g2h:
    • throughput/total': Target: '27381 +- 3285.72' vs Actual: '30755.22'.
    • cpu_utilization_vcpus_total/Avg': Target: '117 +- 15.21' vs Actual: '94.09444444444443'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-ws16K-g2h:
    • throughput/total': Target: '2790 +- 139.5' vs Actual: '2933.7799999999997'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-g2h:
    • cpu_utilization_vcpus_total/Avg': Target: '121 +- 13.31' vs Actual: '93.31666666666666'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-bd:
    • cpu_utilization_vcpus_total/Avg': Target: '164 +- 11.48' vs Actual: '110.38333333333333'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-bd:
    • cpu_utilization_vcpus_total/Avg': Target: '179 +- 14.32' vs Actual: '110.36666666666667'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-g2h:
    • throughput/total': Target: '27186 +- 2446.74' vs Actual: '29761.190000000002'.
    • cpu_utilization_vcpus_total/Avg': Target: '119 +- 13.09' vs Actual: '88.22222222222223'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-bd:
    • cpu_utilization_vcpus_total/Avg': Target: '162 +- 11.34' vs Actual: '110.74444444444447'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-bd:
    • cpu_utilization_vcpus_total/Avg': Target: '173 +- 13.84' vs Actual: '113.32222222222225'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-g2h:
    • throughput/total': Target: '25480 +- 2038.4' vs Actual: '28951.84'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-g2h:
    • throughput/total': Target: '27140 +- 2442.6' vs Actual: '30825.17'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-ws16K-h2g:
    • throughput/total': Target: '2437 +- 97.48' vs Actual: '2540.68'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-pDEFAULT-ws16K-h2g:
    • throughput/total': Target: '2435 +- 97.4' vs Actual: '2539.07'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-g2h:
    • throughput/total': Target: '27587 +- 3034.57' vs Actual: '30905.07'.
    • cpu_utilization_vcpus_total/Avg': Target: '111 +- 17.76' vs Actual: '86.75'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-ws16K-h2g:
    • throughput/total': Target: '3448 +- 137.92' vs Actual: '3664.0099999999998'.
    • cpu_utilization_vmm/Avg': Target: '53 +- 4.24' vs Actual: '59.044444444444444'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-ws16K-h2g:
    • throughput/total': Target: '3446 +- 137.84' vs Actual: '3671.63'.
    • cpu_utilization_vmm/Avg': Target: '53 +- 4.24' vs Actual: '59.044444444444444'.
  • vmlinux-4.14.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-g2h:
    • throughput/total': Target: '26379 +- 2374.11' vs Actual: '30247.85'.
    • cpu_utilization_vmm/Avg': Target: '92 +- 6.44' vs Actual: '98.80000000000001'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-g2h:
    • throughput/total': Target: '26079 +- 2347.11' vs Actual: '29440.41'.
    • cpu_utilization_vmm/Avg': Target: '92 +- 5.52' vs Actual: '98.43333333333335'.
    • cpu_utilization_vcpus_total/Avg': Target: '98 +- 9.8' vs Actual: '82.60555555555557'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-pDEFAULT-wsDEFAULT-h2g:
    • throughput/total': Target: '23609 +- 1888.72' vs Actual: '27195.14'.
    • cpu_utilization_vmm/Avg': Target: '81 +- 5.67' vs Actual: '97.7'.

Tests with throughput regression:

  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-p1024K-ws256K-bd:
    • throughput/total': Target: '25683 +- 1797.81' vs Actual: '22397.14'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/2vcpu_1024mb.json/tcp-pDEFAULT-ws256K-bd:
    • throughput/total': Target: '24634 +- 1724.38' vs Actual: '22106.489999999998'.
  • vmlinux-4.9.bin/ubuntu-18.04.ext4/1vcpu_1024mb.json/tcp-p1024K-wsDEFAULT-h2g:
    • throughput/total': Target: '33948 +- 2715.84' vs Actual: '30570.62'.
    • cpu_utilization_vmm/Avg': Target: '91 +- 6.37' vs Actual: '98.11111111111114'.

This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • Any API changes are reflected in firecracker/swagger.yaml.
  • Any user-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.

Alexandru-Cezar Sardan added 2 commits July 21, 2021 20:11
This allows the guest driver to manage RX virtqueue memory
efficiently, removing the need to allocate descriptor chains
that can hold ~65 KB of packet data.
The device will now use multiple descriptor chains to write
the packet to the guest.

Signed-off-by: Alexandru-Cezar Sardan <alsardan@amazon.com>
This triggers the driver to process and replenish the available
ring once the device has used half of the ring. This helps in
high-traffic scenarios to avoid getting an EmptyQueue error early.

Signed-off-by: Alexandru-Cezar Sardan <alsardan@amazon.com>
@acatangiu (Contributor) commented:

Somewhat orthogonal question: Why are our tests a mix of 4.9 and 4.14 kernels?

I see regressions only on a few test cases that run 4.9; would a newer guest driver perform better? 🤔

@alsrdn (Author) commented on Jul 22, 2021

Somewhat orthogonal question: Why are our tests a mix of 4.9 and 4.14 kernels?

I see regressions only on a few test cases that run 4.9; would a newer guest driver perform better? 🤔

@acatangiu to answer your second question, there were some changes added to virtio-net in kernel 4.12 that improve performance on the mergeable RX buffers path (see torvalds/linux@680557c). There's also XDP support, which I don't think our tests use anyway.
With even newer kernels, not much seems to have happened other than XDP improvements and occasional bug fixes, though I didn't look into them in much detail.

@acatangiu (Contributor) left a comment

Really interesting PR and the results look great!

I only took a superficial look at it and left some nits and surface-level comments; I will come back with an in-depth review.

}

pub(crate) fn vnet_hdr_len() -> usize {
    mem::size_of::<virtio_net_hdr_v1>()
}

pub const VNET_HDR_NUM_BUFFERS_OFFSET: usize = 10;
Contributor

pls move this to some definitions file if there is any for virtio or virtio-net, or at least move this to the top of this file and add a comment where it's defined (either virtio standard or linux header or smth)

@@ -228,6 +245,11 @@ impl Net {
self.mmds_ns.is_some()
}

// Check if driver uses mergeable RX buffers.
pub fn use_mrg_rx_buffs(&self) -> bool {
Contributor

nit:

does this need to be pub?

Contributor

I would define this more generically as a VirtioDevice method. Something like has_feature(feature).
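Something along these lines, as a sketch only (it assumes the device exposes its negotiated bits through an acked_features() -> u64 accessor, like the other virtio devices):

// Generic helper on the VirtioDevice trait; the net device's mergeable-RX
// check then becomes has_feature(VIRTIO_NET_F_MRG_RXBUF as u64).
fn has_feature(&self, feature_bit: u64) -> bool {
    (self.acked_features() & (1u64 << feature_bit)) != 0
}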

METRICS.net.rx_bytes_count.add(frame_len);
METRICS.net.rx_packets_count.inc();

self.num_deffered_irqs += used_buffs.len() as u16;
Contributor

nit: rename num_deffered_irqs to smth like used_bytes_pending_irq or similar.


// When using mergeable RX buffers, the driver usually adds one descriptor per chain.
// When using big buffers the driver must add MAX_BUFFER_SIZE / PAGE_SIZE buffers per chain.
let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(17) };
Contributor

let's define a constant here derived from MAX_BUFFER_SIZE:

Suggested change
let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(17) };
let mut desc_chain_wbuffs = if use_mrg { Vec::with_capacity(1) } else { Vec::with_capacity(MAX_BUFFER_PAGES) };
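For example, a sketch of how that constant could be derived (rounding up is an assumption, since MAX_BUFFER_SIZE is not page-aligned):

// 65562-byte maximum buffer (value already used by the net device),
// rounded up to whole 4 KiB pages; this evaluates to 17.
const PAGE_SIZE: usize = 4096;
const MAX_BUFFER_SIZE: usize = 65562;
const MAX_BUFFER_PAGES: usize = (MAX_BUFFER_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;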

Comment on lines +382 to +384
queue.undo_multiple_pop(1+buffs.len() as u16);
buffs.clear();
return Err(FrontendError::BrokenQueue);
Contributor

nit: this cleanup sequence is repeated a couple of times, we could define a fn or closure for it
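One possible shape for that helper, sketched with the names from the quoted diff (whether a free function, method, or closure fits best depends on the surrounding borrows):

// Roll back the descriptor heads popped so far (the current head plus the
// ones already collected in `buffs`) and drop the partial result.
fn undo_rx_pops(queue: &mut Queue, buffs: &mut Vec<(u16, Vec<(GuestAddress, u32)>)>) {
    queue.undo_multiple_pop(1 + buffs.len() as u16);
    buffs.clear();
}
// ...so each error path becomes:
// undo_rx_pops(queue, &mut buffs);
// return Err(FrontendError::BrokenQueue);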

fn reserve_guest_wbuffers(
    &mut self,
    buffs: &mut Vec<(u16, Vec<(GuestAddress, u32)>)>
) -> std::result::Result<(), FrontendError> {
    let mem = match self.device_state {
Contributor

nit: mem could come in as a param here to avoid matching state for it again
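A sketch of that change, reusing the types from the quoted diff (the body is elided):

// Guest memory is passed in by the caller, so the function no longer needs
// to match self.device_state to obtain it.
fn reserve_guest_wbuffers(
    &mut self,
    mem: &GuestMemoryMmap,
    buffs: &mut Vec<(u16, Vec<(GuestAddress, u32)>)>,
) -> std::result::Result<(), FrontendError> {
    // ... walk the avail ring using `mem` directly ...
    Ok(())
}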

// Descriptor chain must be able to hold the vnet header at the very least.
if desc_chain_wlen >= vnet_hdr_len() {
    total_reserved_sz += desc_chain_wlen;
    buffs.push((head_descriptor.index, desc_chain_wbuffs));
Contributor

nit: would be nice if we could somehow avoid the intermediate vec

Comment on lines +294 to +301
fn signal_rx_used_queue_throttled(&mut self) -> result::Result<(), DeviceError> {
    // Signal the other side when we used half of the RX queue
    if self.num_deffered_irqs > self.queues[RX_INDEX].actual_size() / 2 {
        return self.signal_used_queue();
    }

    Ok(())
}
Contributor

It's possible that this will not be needed anymore after adding support for notification suppression. AFAIU, when notification suppression is negotiated, the guest net driver asks for notifications once 3/4 of the RX queue buffers have been used, which resembles what we do here. It would be worth testing this.


// Gets a list of writable buffers contained in this descriptor chain.
// Returns an error if descriptor chain is malformed.
pub fn get_wo_buffs(
Contributor

This logic seems specific to the net device. I wouldn't define it in the Queue implementation. An option would be to create a separate struct that contains the net queues and maybe the rate limiters or other components (VirtioFrontend for example) and define it there. Although this change would be quite big. Or we can simply define this method in the net device.

@@ -173,6 +173,17 @@ impl<'a> VirtqDesc<'a> {
.is_ok());
}

pub fn check_data_from_offset(&self, expected_data: &[u8], offset: usize) {
Contributor

I wouldn't add a separate method for this. I would just extend check_data() and provide offset = 0 to all the old use cases.

@@ -381,6 +429,44 @@ impl Queue {
.map_err(QueueError::UsedRing)
}

/// Puts multiple available descriptor heads into the used ring for use by the guest.
pub fn add_used_bulk(
Contributor

Can we reimplement add_used() by calling add_used_bulk() with only 1 descriptor, in order to avoid duplicated code?
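For instance (sketch only; add_used_bulk()'s exact parameters aren't visible in this hunk, so the slice-of-(index, len)-pairs shape is an assumption):

/// Puts a single available descriptor head into the used ring.
pub fn add_used(
    &mut self,
    mem: &GuestMemoryMmap,
    desc_index: u16,
    len: u32,
) -> Result<(), QueueError> {
    // Delegate to the bulk variant with a one-element batch so the
    // used-ring bookkeeping lives in a single place.
    self.add_used_bulk(mem, &[(desc_index, len)])
}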

@dianpopa changed the title from "[RFC] Use RX mergeable buffers" to "[RFC] Use RX mergeable buffers for net device" on Sep 8, 2021
@alsrdn (Author) commented on Dec 21, 2021

Will resume work on this PR in 2022.

@alsrdn (Author) commented on Jan 25, 2022

This PR has been stale for a while, so I'm closing it for the moment. We'll reopen it after we refactor the changes on top of the current release.
