Less memory checks in the queue #4748

ShadowCurse · 2024-08-23T19:17:22Z

Changes

Replace accesses to the queue objects that reside in guest memory: instead of always going through GuestMemoryMmap, store pointers to them directly. We can do this because these objects do not move in guest memory, so storing direct pointers to them is safe.
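
To illustrate the difference, here is a minimal sketch that does not use Firecracker's or vm-memory's real types. `GuestMem`, `CheckedQueue`, `CachedQueue` and the `avail_idx` field are hypothetical names made up for this example. The first variant re-validates the address on every access; the second validates it once and caches a pointer.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for guest memory: a flat byte buffer.
struct GuestMem {
    bytes: Vec<u8>,
}

impl GuestMem {
    /// Checked read of a little-endian u16; the range is validated on every call.
    fn read_u16(&self, addr: usize) -> Option<u16> {
        let end = addr.checked_add(2)?;
        let s = self.bytes.get(addr..end)?;
        Some(u16::from_le_bytes([s[0], s[1]]))
    }
}

/// Variant 1: store only the address and pay for the check on each access.
struct CheckedQueue {
    avail_idx_addr: usize,
}

impl CheckedQueue {
    fn avail_idx(&self, mem: &GuestMem) -> Option<u16> {
        mem.read_u16(self.avail_idx_addr)
    }
}

/// Variant 2: validate the address once and cache a pointer to the object.
/// The lifetime ties the pointer to the borrowed memory so the sketch stays
/// sound; this is an assumption of the example, not a claim about the PR.
struct CachedQueue<'a> {
    avail_idx_ptr: *const u8,
    _mem: PhantomData<&'a GuestMem>,
}

impl<'a> CachedQueue<'a> {
    /// Performs the bounds check a single time, at setup.
    fn new(mem: &'a GuestMem, addr: usize) -> Option<Self> {
        let end = addr.checked_add(2)?;
        let s = mem.bytes.get(addr..end)?;
        Some(Self {
            avail_idx_ptr: s.as_ptr(),
            _mem: PhantomData,
        })
    }

    /// No bounds check here: the pointer was validated in `new`.
    fn avail_idx(&self) -> u16 {
        // SAFETY: the pointer covers 2 in-bounds bytes of memory that is
        // borrowed for 'a, so it cannot be freed or moved while `self` lives.
        let raw = unsafe { self.avail_idx_ptr.cast::<u16>().read_unaligned() };
        u16::from_le(raw)
    }
}
```

Both variants return the same value for a valid address. The cached variant only works because the underlying object never moves, which is the property the description above relies on. Code that reads memory a guest can modify concurrently would also need volatile or atomic accesses, which this sketch leaves out.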

Reason

This optimization avoids many memory checks when accessing guest memory and can thus improve the performance of all virtio devices.

Benchmarks

Added benchmarks for Queue::pop/add_used, DescriptorChain::next_descriptor and Request::parse (from virtio block). These are the results for x86:

queue_pop_1             time:   [5.2791 ns 5.2791 ns 5.2792 ns]
                        change: [-90.046% -90.033% -90.021%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 200 measurements (9.50%)
  3 (1.50%) low mild
  5 (2.50%) high mild
  11 (5.50%) high severe

queue_pop_4             time:   [18.621 ns 18.621 ns 18.622 ns]
                        change: [-89.364% -89.361% -89.358%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 200 measurements (4.00%)
  4 (2.00%) low mild
  2 (1.00%) high mild
  2 (1.00%) high severe

queue_pop_16            time:   [71.963 ns 71.964 ns 71.966 ns]
                        change: [-89.351% -89.347% -89.343%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 200 measurements (6.00%)
  5 (2.50%) high mild
  7 (3.50%) high severe

queue_add_used_1        time:   [1.9462 ns 1.9463 ns 1.9465 ns]
                        change: [-92.597% -92.595% -92.593%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 200 measurements (4.50%)
  8 (4.00%) high mild
  1 (0.50%) high severe

queue_add_used_16       time:   [55.649 ns 55.656 ns 55.666 ns]
                        change: [-86.365% -86.349% -86.327%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 200 measurements (3.50%)
  7 (3.50%) high severe

queue_add_used_256      time:   [720.92 ns 720.93 ns 720.94 ns]
                        change: [-87.653% -87.649% -87.646%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 200 measurements (4.50%)
  4 (2.00%) high mild
  5 (2.50%) high severe

next_descriptor_1       time:   [10.002 ns 10.003 ns 10.003 ns]
                        change: [-74.067% -74.046% -74.026%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 200 measurements (11.50%)
  8 (4.00%) low mild
  9 (4.50%) high mild
  6 (3.00%) high severe

next_descriptor_2       time:   [10.021 ns 10.023 ns 10.024 ns]
                        change: [-80.125% -80.113% -80.103%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 200 measurements (0.50%)
  1 (0.50%) high severe

next_descriptor_4       time:   [12.431 ns 12.459 ns 12.489 ns]
                        change: [-83.198% -83.167% -83.133%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 200 measurements (7.50%)
  12 (6.00%) high mild
  3 (1.50%) high severe

next_descriptor_16      time:   [60.834 ns 61.115 ns 61.393 ns]
                        change: [-73.133% -72.990% -72.853%] (p = 0.00 < 0.05)
                        Performance has improved.

request_parse           time:   [833.56 ps 833.57 ps 833.58 ps]
                        change: [-39.999% -39.997% -39.995%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 200 measurements (11.50%)
  10 (5.00%) low mild
  8 (4.00%) high mild
  5 (2.50%) high severe

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet contribution quality standards.

This functionality cannot be added in rust-vmm.

codecov · 2024-08-28T14:19:25Z

Codecov Report

Attention: Patch coverage is 92.78689% with 22 lines in your changes missing coverage. Please review.

Project coverage is 84.31%. Comparing base (ff5213e) to head (9b0a795).
Report is 21 commits behind head on main.

Files with missing lines	Patch %	Lines
src/vmm/src/devices/virtio/queue.rs	94.76%	9 Missing ⚠️
src/vmm/src/devices/virtio/device.rs	0.00%	5 Missing ⚠️
src/vmm/src/persist.rs	55.55%	4 Missing ⚠️
src/vmm/src/devices/virtio/vsock/device.rs	83.33%	2 Missing ⚠️
src/vmm/src/devices/virtio/balloon/device.rs	90.00%	1 Missing ⚠️
src/vmm/src/devices/virtio/persist.rs	97.05%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4748      +/-   ##
==========================================
- Coverage   84.34%   84.31%   -0.03%     
==========================================
  Files         249      249              
  Lines       27460    27522      +62     
==========================================
+ Hits        23160    23206      +46     
- Misses       4300     4316      +16

Flag	Coverage Δ
5.10-c5n.metal	`84.54% <92.78%> (-0.03%)`	⬇️
5.10-m5n.metal	`84.52% <92.78%> (-0.03%)`	⬇️
5.10-m6a.metal	`83.81% <92.78%> (-0.03%)`	⬇️
5.10-m6g.metal	`80.89% <92.78%> (-0.02%)`	⬇️
5.10-m6i.metal	`84.51% <92.78%> (-0.03%)`	⬇️
5.10-m7g.metal	`80.89% <92.78%> (-0.02%)`	⬇️
6.1-c5n.metal	`84.54% <92.78%> (-0.02%)`	⬇️
6.1-m5n.metal	`84.52% <92.78%> (-0.02%)`	⬇️
6.1-m6a.metal	`83.81% <92.78%> (-0.03%)`	⬇️
6.1-m6g.metal	`80.89% <92.78%> (-0.01%)`	⬇️
6.1-m6i.metal	`84.51% <92.78%> (-0.03%)`	⬇️
6.1-m7g.metal	`80.89% <92.78%> (-0.02%)`	⬇️


Right now, we perform two copies when writing a frame from the TAP device into guest memory. We first read the frame into an array held by the Net device and then copy that array into a DescriptorChain. To avoid the double copy, use the readv system call to read directly from the TAP device into the buffers described by the DescriptorChain.

The main challenge is that DescriptorChain objects describe memory that is at least 65562 bytes long when guest TSO4, TSO6 or UFO are enabled, or 1526 bytes otherwise, and parsing the chain incurs overhead that we pay even if the frame we are receiving is much smaller than these sizes. PR firecracker-microvm#4748 reduced the overhead of parsing DescriptorChain objects. To reduce it further, move the parsing of DescriptorChain objects out of the hot path of process_rx(), where we actually receive a frame, into process_rx_queue_event(), where we get the notification that the guest added new buffers for network RX.

Signed-off-by: Babis Chalios <bchalios@amazon.es>


Right now, we perform two copies when writing a frame from the TAP device into guest memory. We first read the frame into an array held by the Net device and then copy that array into a DescriptorChain. To avoid the double copy, use the readv system call to read directly from the TAP device into the buffers described by the DescriptorChain.

The main challenge is that DescriptorChain objects describe memory that is at least 65562 bytes long when guest TSO4, TSO6 or UFO are enabled, or 1526 bytes otherwise, and parsing the chain incurs overhead that we pay even if the frame we are receiving is much smaller than these sizes. PR firecracker-microvm#4748 reduced the overhead of parsing DescriptorChain objects. To reduce it further, move the parsing of DescriptorChain objects out of the hot path of process_rx(), where we actually receive a frame, into process_rx_queue_event(), where we get the notification that the guest added new buffers for network RX.

Co-authored-by: Babis Chalios <bchalios@amazon.es>
Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
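
The single-copy idea from the commit message can be sketched with the standard library's vectored read, which on Unix is implemented with readv for `File`. This is only an illustration under assumptions: `read_frame_scattered` is a hypothetical helper, and plain `Vec<u8>` buffers stand in for the guest buffers that a DescriptorChain would describe.

```rust
use std::io::{self, IoSliceMut, Read};

/// Reads from `src` directly into several separate buffers with one
/// vectored read and returns the number of bytes read. There is no
/// intermediate array that would need a second copy.
fn read_frame_scattered<R: Read>(src: &mut R, bufs: &mut [Vec<u8>]) -> io::Result<usize> {
    // Build the iovec-like list once; in the commit's design this kind of
    // preparation happens when buffers are announced, not per frame.
    let mut iov: Vec<IoSliceMut<'_>> = bufs
        .iter_mut()
        .map(|b| IoSliceMut::new(b.as_mut_slice()))
        .collect();
    src.read_vectored(&mut iov)
}
```

For example, reading a 6-byte source into two 4-byte buffers fills the first buffer completely and places the remaining 2 bytes at the start of the second. Note that readers which do not override `read_vectored` fall back to filling only the first non-empty buffer, so the benefit depends on the reader type.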
