Implement AWS ena driver #1283

Merged 14 commits on Jan 11, 2024

@wkozaczuk (Collaborator) commented Nov 29, 2023

This pull request implements the AWS ena driver by porting the FreeBSD version from https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena.

The objective of this porting exercise is to adapt the FreeBSD code to make it work in OSv and at the same time minimize changes so that we can backport any potential bug fixes or enhancements in the future. On top of it, we also reduce the code footprint by eliminating features that are either not relevant to OSv or not needed at this point (for example RSS). The resulting driver does NOT implement the following features:

  • LLQ (Low-latency Queue)
  • RSS (Receive-side Scaling) - we very much want to backport this part as soon as possible though
  • netmap framework
  • most of the sysctl and ioctl functionality
  • interrupt-driven handling of admin command completions - we default to poll mode

Even though this driver implements the stateless offloads TXCSUM, RXCSUM, TSO and LRO (just like the original FreeBSD one - https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#stateless-offloads), the underlying ENA device does NOT implement RXCSUM or TSO (see amzn/amzn-drivers#29). It also looks like the LRO logic never gets activated, judging by the observed values of the relevant tracepoints.

The details of the changes are explained in each commit that is part of this PR.

The design of the driver is 3-layered:

  • low-level implemented under the bsd/sys/contrib/ena_com part of the source tree
  • middle-level implemented under the bsd/sys/dev/ena part of the source tree
  • thin high-level implemented in drivers/ena.*

The resulting driver "costs" us ~7k lines of mostly C code and a ~56K increase in kernel binary size.

This implementation is functional, has been tested on an actual Nitro EC2 instance (t3 nano only for now) and seems to be stable. The preliminary stress tests suggest that an OSv instance running a simple hello-world Golang HTTP server can handle ~45-50K requests per second:

wrk --latency -t8 -d10s -c 100 http://x.x.x.x:9000/
Running 10s test @ http://x.x.x.x:9000/
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.39ms    3.09ms 100.82ms   93.87%
    Req/Sec     6.25k     1.12k    8.59k    71.38%
  Latency Distribution
     50%    1.73ms
     75%    2.65ms
     90%    4.15ms
     99%   16.94ms
  498574 requests in 10.02s, 77.50MB read
Requests/sec:  49739.42
Transfer/sec:      7.73MB

Connecting with the OSv CLI and executing top shows the following thread dump with 4 ena threads - one cleanup and one enqueue thread for each vCPU:

43 threads on 2 CPUs; 100% 100% 200%
ID CPU %CPU TIME  NAME            
26 1   0.1  15.44 rand_harvestq   
15 1   0.0  0.01  page_pool_l1_1  
43 0   0.0  0.01  >>/httpserver.s 
16 1   0.0  0.01  percpu1         
14 1   0.0  0.00  rcu1            
1  0   0.0  0.00  reclaimer       
2  0   0.0  0.61  kvm_wall_clock_ 
17 1   0.0  0.00  async_worker1   
3  0   0.0  0.01  page_pool_l2    
18 0   0.0  0.00  >init           
19 0   0.0  0.00  thread taskq    
20 0   0.0  8.19  callout         
25 0   0.0  0.00  kbd-input       
27 1   0.0  0.00  ena_tx_enque_0  
4  -   0.0  0.00  itimer-real     
24 0   0.0  0.00  isa-serial-inpu 
22 0   0.0  0.43  >init           
23 0   0.0  0.00  ena rstq        
21 0   0.0  0.00  netisr          
28 1   0.0  0.00  ena_tx_enque_1  
5  -   0.0  0.00  itimer-virt     
7  0   0.0  0.00  rcu0            
38 0   0.0  0.00  >/httpserver.so 
37 1   0.0  11.62 >/httpserver.so 
36 0   0.0  2.34  >/httpserver.so 
35 0   0.0  0.32  >/httpserver.so 
39 1   0.0  9.52  >/httpserver.so 
41 1   0.0  3.12  >>/httpserver.s 
40 0   0.0  9.19  >>/httpserver.s 
42 1   0.0  1.26  >>/httpserver.s 
6  0   0.0  3.21  balancer0       
34 0   0.0  5.45  /httpserver.so  
32 1   0.0  0.06  /libhttpserver- 
9  0   0.0  0.10  percpu0         
8  0   0.0  0.01  page_pool_l1_0  
33 1   0.0  0.00  timerfd         
10 0   0.0  0.01  async_worker0   
30 1   0.0  1.01  ena_clean_que_1 
13 0   0.0  0.00  >init           
31 1   0.0  0.03  dhcp            
29 0   0.0  3.80  ena_clean_que_0 

Compared to the initial version of the PR, this one adds new tracepoints and pins each cleanup worker thread and its corresponding MSI-X vector to the same CPU. Pinning the worker threads minimizes the number of IPIs and seems to improve performance by 5-10%, at least based on one of the tests conducted.

One can also connect to the serial console of the running OSv instance from the AWS web console.

To run OSv on a Nitro instance without NVMe, we build a ramfs-based image like so:

./scripts/build image=golang-pie-httpserver,httpserver-monitoring-api fs=ramfs -j$(nproc) fs_size_mb=32

It can then be deployed to AWS as an AMI by executing the following:

./scripts/deploy_to_aws.sh <name>

The script above also creates a stack with a single EC2 instance running the image. For more details please read the comments of this commit - 873cb55

Closes #1204

@wkozaczuk (Collaborator, Author) commented Nov 30, 2023

Some iperf3 results:

iperf3 -t 5 -c 172.31.89.244
Connecting to host 172.31.89.244, port 5201
[  5] local 172.31.90.167 port 58618 connected to 172.31.89.244 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   407 MBytes  3.42 Gbits/sec  2098    146 KBytes       
[  5]   1.00-2.00   sec   296 MBytes  2.49 Gbits/sec  2786    308 KBytes       
[  5]   2.00-3.00   sec   277 MBytes  2.33 Gbits/sec  1436    605 KBytes       
[  5]   3.00-4.00   sec   406 MBytes  3.41 Gbits/sec  1111    328 KBytes       
[  5]   4.00-5.00   sec   397 MBytes  3.33 Gbits/sec  1879    240 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.00   sec  1.74 GBytes  2.99 Gbits/sec  9310             sender
[  5]   0.00-5.04   sec  1.74 GBytes  2.97 Gbits/sec                  receiver

iperf Done.

and server:

Accepted connection from 172.31.90.167, port 58606
[  9] local 172.31.89.244 port 5201 connected to 172.31.90.167 port 58618
iperf3: getsockopt - Invalid argument
[ ID] Interval           Transfer     Bitrate
[  9]   0.00-1.00   sec   390 MBytes  3.27 Gbits/sec                  
iperf3: getsockopt - Invalid argument
[  9]   1.00-2.00   sec   308 MBytes  2.59 Gbits/sec                  
iperf3: getsockopt - Invalid argument
[  9]   2.00-3.00   sec   270 MBytes  2.27 Gbits/sec                  
iperf3: getsockopt - Invalid argument
[  9]   3.00-4.00   sec   403 MBytes  3.38 Gbits/sec                  
iperf3: getsockopt - Invalid argument
[  9]   4.00-5.00   sec   396 MBytes  3.32 Gbits/sec                  
iperf3: getsockopt - Invalid argument
[  9]   5.00-5.04   sec  15.1 MBytes  3.61 Gbits/sec                  
[E/29 bsd-log]: Limiting open port RST response from 240 to 200 packets/sec
- - - - - - - - - - - - - - - - - - - - - - - - -

[ ID] Interval           Transfer     Bitrate
[  9]   0.00-5.04   sec  1.74 GBytes  2.97 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

I wonder, what does [E/29 bsd-log]: Limiting open port RST response from 240 to 200 packets/sec indicate?

Bear in mind that the t3 networking limit is up to 5 Gigabit

@wkozaczuk (Collaborator, Author) commented Dec 1, 2023

And some results from netperf tests (OSv running the netserver):

....
[I/31 dhcp]: Configuring eth0: ip 172.31.84.22 subnet mask 255.255.240.0 gateway 172.31.80.1 MTU 9001
[I/31 dhcp]: Set hostname to: ip-172-31-84-22
Booted up in 868.94 ms
Cmdline: /tools/netserver.so -D -4 -f -L 0.0.0.0
Running from /init/30-auto-00: /libhttpserver-api.so --access-allow=true &!
Rest API server running on port 8000
Starting netserver with host '0.0.0.0' port '12865' and family AF_INET
netperf -H 172.31.84.22
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.31.84.22 () port 0 AF_INET : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  16384  16384    10.01    3725.82  

…-drivers

- release TAG - ena_linux_2.10.0
- commit - e715298d09c6a4c378d5178c71515c43c1a75a8e

Please note the C files are copied as *.cc to help review follow-up changes
to this code.

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
This adapts ena_com/ena_plat.h by replacing some unsupported FreeBSD
mechanisms with the OSv equivalent ones.

Specifically it:

- changes FreeBSD header include paths to match OSv source tree

- reimplements ENA_*SLEEP and ENA_UDELAY macros to use busy_sleep()
  function instead of pause_sbt(); these macros are used in ena_com.cc
  where we cannot use regular sleep mechanism

- reimplements ENA_SPINLOCK_* macros to use the new OSv irq_spinlock_*
  methods which are defined in a later patch

- removes ENA_WAIT_* macros which are not needed because we use
  the polling mode when submitting and processing admin commands
  (for example, when creating an I/O queue for RX or TX)

- removes FreeBSD bus_dma* functions and replaces where needed
  with OSv equivalent code

- replaces FreeBSD way of handling PCI by adapting code to
  use OSv pci::bar and reg_bar->readl() and reg_bar->writel()

- converts C casts to C++ ones

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
It turns out the ena driver code uses spinlocks (see ENA_SPINLOCK_*
macros) in relatively few places when submitting and processing admin
commands which happens during the ena device attach and detach stage.
The analysis of the FreeBSD version of mutex with type MTX_SPIN and
mtx_lock_spin() and mtx_unlock_spin() (see
https://man.freebsd.org/cgi/man.cgi?query=mtx_lock_spin) indicates
the interrupts should be disabled before spinning.

For that reason we add a new type of spinlock - irq_spinlock - which
is almost identical to the regular spinlock but uses irq_lock
to disable interrupts before acquiring the lock and to re-enable
them after releasing it.

At the same time, this commit also adjusts the spinning loop
to use the correct architecture-specific instruction.

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
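
The interrupt-disabling spinlock described in the commit message above can be illustrated with a small user-space sketch. This is not the OSv irq_spinlock code: the names irq_spinlock_sketch, fake_irq_disable(), fake_irq_enable() and cpu_relax() are hypothetical, interrupt masking is simulated with a plain counter, and the choice of pause/yield as the "architecture-specific instruction" is an assumption.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for interrupt control. A real kernel would mask
// and unmask interrupts on the current CPU; here we only count calls.
static int irq_disable_depth = 0;
static void fake_irq_disable() { ++irq_disable_depth; }
static void fake_irq_enable()  { --irq_disable_depth; }

// Spin-wait hint. Which instruction the actual commit uses is an assumption.
static inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#elif defined(__aarch64__)
    asm volatile("yield" ::: "memory");
#endif
}

class irq_spinlock_sketch {
public:
    void lock() {
        // Disable interrupts first, so an interrupt handler on this CPU
        // cannot run while we spin or hold the lock.
        fake_irq_disable();
        while (_locked.exchange(true, std::memory_order_acquire)) {
            // Spin on a plain load to avoid hammering the cache line.
            while (_locked.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
        }
    }
    void unlock() {
        _locked.store(false, std::memory_order_release);
        // Re-enable interrupts only after the lock has been released.
        fake_irq_enable();
    }
    bool is_locked() const { return _locked.load(std::memory_order_relaxed); }
private:
    std::atomic<bool> _locked{false};
};
```

The ordering is the point of the sketch: interrupts go off before the first attempt to take the lock and come back on only after the release, which mirrors the mtx_lock_spin() behavior the commit message refers to.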
The ena_eth_com.cc is one of the 2 source files that make up
a low-level ena_com API. This part is used in the intermediate level
to implement data path functionality.

This patch uses C++ constructs to apply type conversions where
necessary.

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
The ena_com.cc is the 2nd of the 2 source files that make up
a low-level ena_com API. This part is used in the intermediate level
mainly to implement the admin functionality, for example
creating I/O queues. See https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#ena-source-code-directory-structure
for more insight.

This patch:

- uses C++ constructs to apply type conversions where necessary.

- eliminates the MSI-X interrupt-based logic to handle completions
  of admin commands (see ena_com_wait_and_process_admin_cq_interrupts())
  and leaves the polling mode logic the default one

- eliminates the RSS (Receive-Side Scaling) related code for now

- implements busy_sleep() used by ENA_USLEEP and ENA_UDELAY macros

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
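
As a rough illustration of the polling mode and the busy-wait helper mentioned above, here is a minimal sketch using only the C++ standard library. The names busy_sleep_sketch() and poll_for_completion() and their signatures are hypothetical; the real busy_sleep() in ena_com.cc may look quite different.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// Spin (without yielding the CPU) until at least `usec` microseconds passed.
inline void busy_sleep_sketch(std::uint64_t usec) {
    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::microseconds(usec);
    while (std::chrono::steady_clock::now() < deadline) {
        // busy-wait
    }
}

// Poll `done` until it returns true or `timeout_usec` elapses, busy-sleeping
// `poll_usec` between checks. Returns true if the completion was observed.
inline bool poll_for_completion(const std::function<bool()>& done,
                                std::uint64_t timeout_usec,
                                std::uint64_t poll_usec) {
    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::microseconds(timeout_usec);
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // timed out waiting for the completion
        }
        busy_sleep_sketch(poll_usec);
    }
    return true;
}
```

In the driver, the `done` callback would correspond to checking whether the admin completion queue holds an entry for the submitted command; here it is just an arbitrary predicate.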
This patch imports a new lock-less structure - buf_ring - from the FreeBSD
source tree (see https://man.freebsd.org/cgi/man.cgi?query=buf_ring).
The buf_ring is used by the ENA driver as a multiple-producer,
single-consumer lockless ring for buffering extra mbufs coming from
the stack in case the Tx procedure is busy sending the packets or
the Tx ring is full.

OSv has its own lock-less single-producer single-consumer ring implementation
(see include/lockfree/ring.hh), but it is not clear if and how we could
adapt it, in a similar way to how unordered-queue-mpsc.hh does to implement
a multiple-producer single-consumer collection that does not preserve
insertion order. Given that, I have found it easier to import and use the
FreeBSD version of it as is.

Please note the original FreeBSD ena code uses drbr_* functions that
delegate to buf_ring_* or ALTQ if it is enabled (see
https://man.freebsd.org/cgi/man.cgi?query=drbr_enqueue_). Given OSv
does not implement ALTQ
(https://www.usenix.org/legacy/publications/library/proceedings/lisa97/failsafe/usenix98/full_papers/cho/cho_html/cho.html#ALTQ),
the adapted version of the ena driver ends up using the buf_ring_* functions
directly.

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
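
To make the multiple-producer, single-consumer ring idea more concrete, below is a minimal bounded ring sketch. It is NOT the FreeBSD buf_ring algorithm: it swaps in the well-known per-slot sequence number technique (in the style of Dmitry Vyukov's bounded queue), and the class name mpsc_ring_sketch and its methods are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

template <typename T>
class mpsc_ring_sketch {
public:
    // Capacity must be a power of two so that indices can be masked.
    explicit mpsc_ring_sketch(std::size_t capacity_pow2)
        : _mask(capacity_pow2 - 1), _slots(new slot[capacity_pow2]) {
        assert(capacity_pow2 >= 2 && (capacity_pow2 & _mask) == 0);
        for (std::size_t i = 0; i < capacity_pow2; ++i) {
            _slots[i].seq.store(i, std::memory_order_relaxed);
        }
    }

    // Safe to call from many producers. Returns false if the ring is full.
    bool enqueue(T value) {
        std::size_t pos = _tail.load(std::memory_order_relaxed);
        for (;;) {
            slot& s = _slots[pos & _mask];
            const std::size_t seq = s.seq.load(std::memory_order_acquire);
            const auto diff = static_cast<std::intptr_t>(seq) -
                              static_cast<std::intptr_t>(pos);
            if (diff == 0) {
                // Slot is free for this position; try to claim it.
                if (_tail.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    s.value = value;
                    s.seq.store(pos + 1, std::memory_order_release);
                    return true;
                }
                // CAS failure reloaded pos; retry with the new value.
            } else if (diff < 0) {
                return false;  // consumer has not freed this slot yet: full
            } else {
                pos = _tail.load(std::memory_order_relaxed);
            }
        }
    }

    // Must only be called from the single consumer. Returns false if empty.
    bool dequeue(T& out) {
        slot& s = _slots[_head & _mask];
        if (s.seq.load(std::memory_order_acquire) != _head + 1) {
            return false;  // nothing published at the head position yet
        }
        out = s.value;
        // Mark the slot as free for the producer one lap ahead.
        s.seq.store(_head + _mask + 1, std::memory_order_release);
        ++_head;
        return true;
    }

private:
    struct slot {
        std::atomic<std::size_t> seq;
        T value;
    };
    const std::size_t _mask;
    std::unique_ptr<slot[]> _slots;
    std::atomic<std::size_t> _tail{0};  // shared by producers
    std::size_t _head = 0;              // touched only by the consumer
};
```

In the driver's terms, producers would be the threads calling the transmit entry point and the consumer would be whoever drains the ring into the Tx queue; a full ring corresponds to enqueue() returning false.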
This patch adapts the middle layer of the data path handling logic
to make it work in OSv. For more details about it please see
https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#data-path-interface
and https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#data-path.

At a high level, the main entry point for the RX part is ena_cleanup(),
which delegates to ena_rx_cleanup() and eventually ends up calling the net
channel ifp->if_classifier.post_packet() (fast path) or ifp->if_input()
(slow path). ena_cleanup() is called by the cleanup_work thread, which
is woken every time the MSI-X vector for the given TX/RX queue is triggered.

Similarly, the main entry points for the TX part are ena_mq_start(), which
is what ifp->if_transmit is set to, and ena_deferred_mq_start(), which is
called by the enqueue_work thread that is woken in ena_mq_start() and
ena_tx_cleanup() (the other part of the ena_cleanup() routine).

Finally, ena_qflush is what ifp->if_qflush is set to.
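
The RX fast-path/slow-path split described above can be sketched as follows. All types here (packet_sketch, ifnet_sketch, rx_deliver_sketch) are hypothetical simplifications, and the assumption that the classifier reports whether it consumed the packet via a boolean return value is mine, not something stated in this PR.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for an mbuf.
struct packet_sketch {
    int flow_id;
};

// Hypothetical, heavily trimmed-down interface object with just the two
// delivery hooks mentioned in the text.
struct ifnet_sketch {
    // Fast path: returns true if a net channel accepted the packet.
    std::function<bool(packet_sketch&)> classifier_post_packet;
    // Slow path: regular network stack input.
    std::function<void(packet_sketch&)> if_input;
};

// Deliver one received packet; returns true if the fast path took it.
inline bool rx_deliver_sketch(ifnet_sketch& ifp, packet_sketch& pkt) {
    if (ifp.classifier_post_packet && ifp.classifier_post_packet(pkt)) {
        return true;
    }
    ifp.if_input(pkt);
    return false;
}
```

The sketch only shows the dispatch decision; in the driver this happens at the end of the RX cleanup routine, once a completed descriptor has been turned into an mbuf.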

The particular code changes to ena_datapath.* involve the following:

- implement critical_enter()/critical_exit() used by buf_ring (see
  https://man.freebsd.org/cgi/man.cgi?query=critical_enter)

- for now remove RSS and DEV_NETMAP related code

- replace the drbr_* functions with buf_ring_* equivalent ones

- replace taskqueue_enqueue() with OSv wake_with()

- adapt references to the mbuf fields to match the OSv version of
  it (please see the freebsd/freebsd-src@3d1a9ed
  commit that changed the layout of the mbuf struct a bit)

- simplify ena_tx_map_mbuf() given we hard-code to use ENA_ADMIN_PLACEMENT_POLICY_HOST
  TX queue type and do not use bus_dma API (see
  https://man.freebsd.org/cgi/man.cgi?query=bus_dma)

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
This patch adapts the admin/setup header ena.h to OSv.

In particular it addresses the following:

- import atomic bitset support from FreeBSD tree (see
  https://github.com/freebsd/freebsd-src/blob/main/sys/sys/_bitset.h)

- remove unnecessary fields from ena_adapter struct

- replace the IRQ related fields with OSv equivalent (see ena_irq)

- replace cleanup_task and cleanup_tq in ena_que struct with OSv
  equivalent cleanup_thread

- replace enqueue_task and enqueue_tq in ena_ring struct with OSv
  equivalent enqueue_thread

- remove RSS and DEV_NETMAP artifacts

- for now define counter_* macros to disable related functionality

- replace callout_reset_sbt() with equivalent callout_reset()

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
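
For readers unfamiliar with atomic bitsets, the idea behind the imported support can be sketched with std::atomic. This is not the FreeBSD _bitset.h code or its macro interface; the class name atomic_bitset_sketch and its methods are hypothetical and only illustrate atomically setting, clearing and testing individual flag bits.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

template <std::size_t NBITS>
class atomic_bitset_sketch {
public:
    atomic_bitset_sketch() {
        for (auto& w : _words) {
            w.store(0, std::memory_order_relaxed);
        }
    }
    // Atomically set one bit without disturbing its neighbours.
    void set(std::size_t bit) {
        word(bit).fetch_or(mask(bit), std::memory_order_acq_rel);
    }
    // Atomically clear one bit.
    void clear(std::size_t bit) {
        word(bit).fetch_and(~mask(bit), std::memory_order_acq_rel);
    }
    bool test(std::size_t bit) const {
        return (_words[bit / k_word_bits].load(std::memory_order_acquire) &
                mask(bit)) != 0;
    }
    // Set the bit and report whether it was already set.
    bool test_and_set(std::size_t bit) {
        return (word(bit).fetch_or(mask(bit), std::memory_order_acq_rel) &
                mask(bit)) != 0;
    }

private:
    static constexpr std::size_t k_word_bits = 64;
    static std::uint64_t mask(std::size_t bit) {
        return std::uint64_t{1} << (bit % k_word_bits);
    }
    std::atomic<std::uint64_t>& word(std::size_t bit) {
        return _words[bit / k_word_bits];
    }
    std::atomic<std::uint64_t> _words[(NBITS + k_word_bits - 1) / k_word_bits];
};
```

A driver would typically use such a structure for adapter state flags that are read and updated from interrupt handlers and worker threads at the same time.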
This patch adapts the middle layer of the admin and device setup/teardown
handling logic to make it work in OSv. It is also the last patch
to complete the porting work of FreeBSD ena driver code to work in
OSv.

The code in ena.cc mostly implements the logic to probe, attach
and detach the device and involves interacting with lower-level admin
API of ena_com/ena_com.cc to submit commands to Admin Queue (AQ) and
receive and process completions from Admin Completion Queue (ACQ).
It also implements interrupt handlers and worker threads to process I/O.
For more details read
https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#management-interface.

In particular this patch addresses the following:

- change FreeBSD header include paths to match OSv source tree

- eliminate most DMA-related functions ena_*dma_*()

- eliminate metrics task code for now

- eliminate LLQ, RSS and DEV_NETMAP related code

- deactivate counters (aka statistics collection) code

- rewrite ena_dma_alloc() to use OSv memory::alloc_phys_contiguous_aligned()
  and mmu::virt_to_phys() (it probably should not have *dma* in name)

- rewrite the functions that set up MSI-X and implement other PCI-related
  functionality to use OSv PCI code from drivers/pci-* and arch/*/msi.** -
  ena_free_pci_resources(), ena_probe(), ena_enable_msix(),
  ena_setup_mgmnt_intr(), ena_setup_io_intr(), ena_request_mgmnt_irq(),
  ena_request_io_irq(), ena_free_io_irq(), ena_disable_msix()

- replace the calls to drbr_*() functions with buf_ring_*() equivalent ones

- implement the main function of the enqueue worker thread -
  enqueue_work(); this function is used when setting TX resource in
  ena_setup_tx_resources() and replaces FreeBSD version of it -
  enqueue_tq and enqueue_task

- simplify ena_alloc_rx_mbuf() by mostly not using the DMA-related code

- eliminate ena_update_buf_ring_size(), ena_update_queue_size(),
  ena_update_io_rings(), ena_update_io_queue_nb() which are not needed
  as OSv will not support changing ring and queue size (see for example
  https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#size-of-the-tx-buffer-ring-drbr)
  through ioctl()

- simplify ena_ioctl()

- implement the main function of the cleanup worker thread -
  cleanup_work(); this function is used when setting I/O queues
  in ena_create_io_queues() and replaces FreeBSD version of it -
  cleanup_tq and cleanup_task

- adjust CSUM_* constants to match the OSv version of the FreeBSD
  headers

- replace if_set*() function calls with equivalent code directly setting
  fields of if_t structure (for example if_settransmitfn(ifp,
  ena_mq_start) => ifp->if_transmit = ena_mq_start)

- hardcode TX queue memory type to ENA_ADMIN_PLACEMENT_POLICY_HOST (we
  do not support LLQ)

- eliminate LLQ-related code - ena_map_llq_mem_bar(),
  set_default_llq_configurations() and any ifs testing ENA_ADMIN_PLACEMENT_POLICY_DEV

- adapt code reading current boot time to use osv::clock::uptime::now()

- adapt ena_handle_msix() and other places to use OSv wake_with_irq_or_preemption_disabled()
  instead of taskqueue_enqueue()

- add remaining *cc files to the Makefile - everything should compile

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
This almost final patch implements the topmost "thin" layer in the form of
the aws::ena driver class that subclasses hw_driver.

The constructor, destructor and probe() merely delegate to functions
ena_attach(), ena_detach() and ena_probe() respectively implemented
in bsd/sys/dev/ena/ena.cc.

Please note that some of the statistics functionality (see fill_stats())
and if_getinfo are left unimplemented for now.

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
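
The delegation pattern of the thin top layer can be sketched like this. Everything below is hypothetical: the *_sketch functions stand in for ena_probe(), ena_attach() and ena_detach(), whose real signatures are not shown in this PR, and the PCI IDs are placeholders rather than the actual ENA identifiers.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the PCI device and the middle-layer adapter.
struct pci_device_sketch {
    unsigned vendor_id;
    unsigned device_id;
};
struct ena_adapter_sketch {
    bool attached = false;
};

// Placeholder IDs for illustration only (not the real ENA PCI IDs).
constexpr unsigned k_example_vendor_id = 0x1234;
constexpr unsigned k_example_device_id = 0x5678;

static int g_live_adapters = 0;

// Stand-ins for the middle-layer entry points the driver class delegates to.
static bool ena_probe_sketch(const pci_device_sketch& dev) {
    return dev.vendor_id == k_example_vendor_id &&
           dev.device_id == k_example_device_id;
}
static ena_adapter_sketch* ena_attach_sketch(const pci_device_sketch&) {
    auto* adapter = new ena_adapter_sketch;
    adapter->attached = true;
    ++g_live_adapters;
    return adapter;
}
static void ena_detach_sketch(ena_adapter_sketch* adapter) {
    --g_live_adapters;
    delete adapter;
}

// Thin wrapper: constructor attaches, destructor detaches, probe() checks
// the device and creates the driver object only on a match.
class ena_driver_sketch {
public:
    explicit ena_driver_sketch(const pci_device_sketch& dev)
        : _adapter(ena_attach_sketch(dev)) {}
    ~ena_driver_sketch() { ena_detach_sketch(_adapter); }
    ena_driver_sketch(const ena_driver_sketch&) = delete;
    ena_driver_sketch& operator=(const ena_driver_sketch&) = delete;

    static std::unique_ptr<ena_driver_sketch> probe(const pci_device_sketch& dev) {
        if (!ena_probe_sketch(dev)) {
            return nullptr;
        }
        return std::unique_ptr<ena_driver_sketch>(new ena_driver_sketch(dev));
    }
    bool attached() const { return _adapter->attached; }

private:
    ena_adapter_sketch* _adapter;
};
```

Keeping the class this thin means the ported FreeBSD code stays in one place, which fits the stated goal of minimizing changes so that upstream fixes can be backported.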
This last patch improves certain aspects of the driver implementation:
- completes LRO handling
- adds a number of tracepoints to help troubleshoot and analyze performance
- pins the cleanup worker thread and corresponding MSI-X vector to a CPU

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
@wkozaczuk merged commit cb7d180 into cloudius-systems:master on Jan 11, 2024