Implement AWS ena driver #1283
Merged
wkozaczuk force-pushed the aws_ena_pr branch from 5838411 to b6c22f8 on November 30, 2023 19:38
Some … and server:
I wonder, what does …
Bear in mind that the t3 networking limit is up to 5 Gigabit.
And some results from netperf tests (OSv running the netserver):
…-drivers, release tag ena_linux_2.10.0, commit e715298d09c6a4c378d5178c71515c43c1a75a8e. Please note the C files are copied as *.cc to help review follow-up changes to this code.
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
This adapts ena_com/ena_plat.h by replacing some unsupported FreeBSD mechanisms with their OSv equivalents. Specifically, it:
- changes FreeBSD header include paths to match the OSv source tree
- reimplements the ENA_*SLEEP and ENA_UDELAY macros to use the busy_sleep() function instead of pause_sbt(); these macros are used in ena_com.cc, where we cannot use the regular sleep mechanism
- reimplements the ENA_SPINLOCK_* macros to use the new OSv irq_spinlock_* methods, which are defined in a later patch
- removes the ENA_WAIT_* macros, which are not needed because we use polling mode when submitting and processing admin commands (for example, creating an I/O queue for RX or TX)
- removes the FreeBSD bus_dma* functions and replaces them, where needed, with equivalent OSv code
- replaces the FreeBSD way of handling PCI by adapting the code to use OSv pci::bar and reg_bar->readl() and reg_bar->writel()
- converts C casts to C++ ones
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
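A minimal sketch of how the delay macros can sit on top of a busy-wait helper. It assumes a hypothetical busy_sleep() that takes nanoseconds; the signature of the helper actually added by the port is not shown here. Busy-waiting keeps the calling thread on the CPU instead of blocking, which is why it can be used where the regular sleep mechanism cannot:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the busy_sleep() helper used by the port:
// spin until `ns` nanoseconds have elapsed instead of blocking the thread.
static void busy_sleep(std::uint64_t ns)
{
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + std::chrono::nanoseconds(ns);
    while (clock::now() < deadline) {
        // Intentionally empty: the thread keeps the CPU while waiting.
    }
}

// Delay macros expressed on top of busy_sleep(); the arguments keep the
// FreeBSD units (milliseconds for ENA_MSLEEP, microseconds otherwise).
#define ENA_MSLEEP(x) busy_sleep(static_cast<std::uint64_t>(x) * 1000000ULL)
#define ENA_USLEEP(x) busy_sleep(static_cast<std::uint64_t>(x) * 1000ULL)
#define ENA_UDELAY(x) busy_sleep(static_cast<std::uint64_t>(x) * 1000ULL)
```

The cost of this approach is that the CPU does no other work during the delay, which is acceptable for the short, infrequent waits of the admin path.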
It turns out the ena driver code uses spinlocks (see the ENA_SPINLOCK_* macros) in relatively few places, when submitting and processing admin commands, which happens during the ena device attach and detach stages. The analysis of the FreeBSD version of mutex with type MTX_SPIN and mtx_lock_spin() and mtx_unlock_spin() (see https://man.freebsd.org/cgi/man.cgi?query=mtx_lock_spin) indicates that interrupts should be disabled before spinning. For that reason we add a new type of spinlock, irq_spinlock, which is almost identical to the regular spinlock but uses irq_lock to disable interrupts before acquiring the lock and to re-enable them after releasing it. At the same time, this commit also adjusts the spinning loop to use the correct architecture-specific instruction.
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
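The ordering described above (disable interrupts, then spin; release, then re-enable) can be sketched in user space as follows. This is not OSv's code: fake_irq_lock is a hypothetical stand-in that only counts nesting instead of disabling interrupts, and the irq_spinlock_* names and signatures are assumptions:

```cpp
#include <atomic>

// Hypothetical stand-in for OSv's irq_lock. In the kernel, lock() would
// disable interrupts on the current CPU and unlock() would re-enable them;
// here it only tracks the nesting depth so the sketch runs in user space.
struct fake_irq_lock {
    int depth = 0;
    void lock() { ++depth; }
    void unlock() { --depth; }
};
static thread_local fake_irq_lock irq_lock;

// Architecture-specific hint for the body of a spin loop.
static inline void cpu_relax()
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();                 // x86 PAUSE
#elif defined(__aarch64__)
    asm volatile("yield" ::: "memory");     // AArch64 YIELD
#endif
}

struct irq_spinlock_t {
    std::atomic<bool> busy{false};
};

// Disable "interrupts" first, then spin until the lock is taken.
static void irq_spinlock_lock(irq_spinlock_t* sl)
{
    irq_lock.lock();
    while (sl->busy.exchange(true, std::memory_order_acquire)) {
        // Wait on plain loads so the cache line is not written while spinning.
        while (sl->busy.load(std::memory_order_relaxed)) {
            cpu_relax();
        }
    }
}

// Release the lock first, then re-enable "interrupts".
static void irq_spinlock_unlock(irq_spinlock_t* sl)
{
    sl->busy.store(false, std::memory_order_release);
    irq_lock.unlock();
}
```

Disabling interrupts before spinning prevents an interrupt handler on the same CPU from trying to take a lock that the interrupted code already holds, which would otherwise spin forever.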
The ena_eth_com.cc is one of the two source files that make up the low-level ena_com API. This part is used by the intermediate level to implement the data path functionality. This patch uses C++ constructs to apply type conversions where necessary.
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
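One reason the copied *.cc files need explicit conversions is that C converts void * to any object pointer type implicitly, while C++ rejects that. A small illustration of the cast style, using a hypothetical descriptor type rather than an actual ena_com structure:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor layout, used only to illustrate the cast style.
struct tx_desc {
    std::uint32_t len_ctrl;
    std::uint32_t meta_ctrl;
};

// In C:
//   struct tx_desc *desc = mem;         /* implicit void * conversion: fine in C, an error in C++ */
//   uint16_t req_id = (uint16_t)value;  /* C-style cast */

// C++ style: every conversion names its kind.
static tx_desc* desc_at(void* base, std::size_t offset)
{
    auto* bytes = static_cast<std::uint8_t*>(base);     // void * -> byte pointer
    return reinterpret_cast<tx_desc*>(bytes + offset);  // raw memory -> descriptor
}

static std::uint16_t to_req_id(std::uint32_t value)
{
    return static_cast<std::uint16_t>(value);           // deliberate narrowing
}
```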
The ena_com.cc is the second of the two source files that make up the low-level ena_com API. This part is used by the intermediate level mainly to implement the admin functionality, for example creating I/O queues. See https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#ena-source-code-directory-structure for more insight. This patch:
- uses C++ constructs to apply type conversions where necessary
- eliminates the MSI-X interrupt-based logic to handle completions of admin commands (see ena_com_wait_and_process_admin_cq_interrupts()) and leaves the polling mode logic as the default one
- eliminates the RSS (Receive-Side Scaling) related code for now
- implements busy_sleep() used by the ENA_USLEEP and ENA_UDELAY macros
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
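The polling mode kept here boils down to a loop that checks for a completion, gives up after a timeout, and busy-waits between checks. A generic sketch of that idea; poll_admin_completion(), the backoff constants and the callbacks are hypothetical and are not the actual ena_com functions:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>

// Hypothetical result type for this sketch.
enum class poll_status { completed, timed_out };

// Poll `is_complete` until it reports true or `timeout` expires. Between
// polls, call `delay_us` (e.g. a busy-wait such as ENA_USLEEP) with an
// exponentially growing delay capped at max_backoff_us.
static poll_status poll_admin_completion(const std::function<bool()>& is_complete,
                                         std::chrono::microseconds timeout,
                                         const std::function<void(std::uint32_t)>& delay_us)
{
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + timeout;
    const std::uint32_t max_backoff_us = 1000;   // assumed cap
    std::uint32_t backoff_us = 1;                // assumed initial delay

    while (!is_complete()) {
        if (clock::now() >= deadline) {
            return poll_status::timed_out;
        }
        delay_us(backoff_us);
        backoff_us = std::min(backoff_us * 2, max_backoff_us);
    }
    return poll_status::completed;
}
```

Polling avoids the need for an admin-queue interrupt handler at the cost of burning CPU while waiting, which matters little for commands issued only at attach and detach time.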
This patch imports a new lock-less structure, buf_ring, from the FreeBSD source tree (see https://man.freebsd.org/cgi/man.cgi?query=buf_ring). The buf_ring is used by the ENA driver as a multiple-producer, single-consumer lock-less ring for buffering extra mbufs coming from the stack in case the Tx procedure is busy sending packets or the Tx ring is full. OSv has its own lock-less single-producer single-consumer ring implementation (see include/lockfree/ring.hh), but it is not clear if and how we could adapt it, in a similar way to how unordered-queue-mpsc.hh implements a multiple-producer single-consumer collection that does not preserve insertion order. Given that, I have found it easier to import and use the FreeBSD version as is. Please note the original FreeBSD ena code uses drbr_* functions that delegate to buf_ring_* or to ALTQ if it is enabled (see https://man.freebsd.org/cgi/man.cgi?query=drbr_enqueue_). Given OSv does not implement ALTQ (https://www.usenix.org/legacy/publications/library/proceedings/lisa97/failsafe/usenix98/full_papers/cho/cho_html/cho.html#ALTQ), the adapted version of the ena driver ends up using the buf_ring_* functions directly.
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
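To make the multiple-producer, single-consumer idea concrete, here is a simplified C++ model of the buf_ring algorithm. Producers reserve a slot by advancing prod_head with compare-and-swap, fill it, and then publish it by advancing prod_tail in reservation order. The single consumer needs no atomic read-modify-write at all. It is a sketch using std::atomic, not the imported FreeBSD code, and it omits the drop counter and other details:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Simplified MPSC ring modelled on FreeBSD's buf_ring. `size` must be a
// power of two; one slot stays empty to tell "full" from "empty".
class mpsc_ring {
public:
    explicit mpsc_ring(std::uint32_t size) : mask_(size - 1), slots_(size, nullptr) {}

    // Safe to call from any number of threads concurrently.
    bool enqueue(void* buf)
    {
        std::uint32_t head, next;
        do {
            head = prod_head_.load(std::memory_order_relaxed);
            next = (head + 1) & mask_;
            if (next == cons_tail_.load(std::memory_order_acquire)) {
                return false;                          // full (ENOBUFS upstream)
            }
        } while (!prod_head_.compare_exchange_weak(head, next, std::memory_order_acquire,
                                                   std::memory_order_relaxed));

        slots_[head] = buf;                            // slot `head` is ours now

        // Publish in reservation order: wait for earlier producers first.
        while (prod_tail_.load(std::memory_order_acquire) != head) {
        }
        prod_tail_.store(next, std::memory_order_release);
        return true;
    }

    // Only one thread may call dequeue().
    void* dequeue()
    {
        const std::uint32_t head = cons_head_;
        if (head == prod_tail_.load(std::memory_order_acquire)) {
            return nullptr;                            // empty
        }
        void* buf = slots_[head];
        cons_head_ = (head + 1) & mask_;
        cons_tail_.store(cons_head_, std::memory_order_release);   // free the slot
        return buf;
    }

private:
    const std::uint32_t mask_;
    std::vector<void*> slots_;
    std::atomic<std::uint32_t> prod_head_{0};
    std::atomic<std::uint32_t> prod_tail_{0};
    std::uint32_t cons_head_ = 0;                      // consumer-private
    std::atomic<std::uint32_t> cons_tail_{0};
};
```

The spin on prod_tail is also why FreeBSD's buf_ring_enqueue() runs between critical_enter() and critical_exit(): a producer preempted between reserving and publishing its slot would keep the other producers spinning.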
This patch adapts the middle layer of the data path handling logic to make it work in OSv. For more details please see https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#data-path-interface and https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#data-path. At a high level, the main entry point for the RX part is ena_cleanup(), which delegates to ena_rx_cleanup() and eventually ends up calling the net channel ifp->if_classifier.post_packet() (fast path) or ifp->if_input() (slow path). The ena_cleanup() is called by the cleanup_work thread, which is woken every time the MSI-X vector for a given TX/RX queue fires. Similarly, the main entry points for the TX part are ena_mq_start(), which is what ifp->if_transmit is set to, and ena_deferred_mq_start(), which is called by the enqueue_work thread that is woken in ena_mq_start() and in ena_tx_cleanup() (the other part of the ena_cleanup() routine). Finally, ena_qflush is what ifp->if_qflush is set to. The particular code changes to ena_datapath.* involve the following:
- implement critical_enter()/critical_exit() used by buf_ring (see https://man.freebsd.org/cgi/man.cgi?query=critical_enter)
- for now, remove RSS and DEV_NETMAP related code
- replace the drbr_* functions with equivalent buf_ring_* ones
- replace taskqueue_enqueue() with OSv wake_with()
- adapt references to the mbuf fields to match the OSv version of it (please see the freebsd/freebsd-src@3d1a9ed commit that changed the layout of the mbuf struct a bit)
- simplify ena_tx_map_mbuf() given we hard-code the ENA_ADMIN_PLACEMENT_POLICY_HOST TX queue type and do not use the bus_dma API (see https://man.freebsd.org/cgi/man.cgi?query=bus_dma)
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
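The TX hand-off described above (enqueue first, transmit inline if the ring lock is free, otherwise wake the enqueue worker) can be modelled in plain C++. Everything here is hypothetical: a mutex-guarded std::deque replaces the lock-less buf_ring, an atomic flag with a try-lock plays the role of the TX ring lock, and a condition variable replaces OSv's wake_with(). The model also skips the upstream check of whether the ring was empty before enqueueing:

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical model of the TX entry points; not the driver's code.
struct tx_ring_model {
    std::atomic<bool> ring_busy{false};  // TX ring "lock" (taken before br_mtx)
    std::mutex br_mtx;                   // guards br/work_pending (buf_ring is lock-less)
    std::deque<int> br;                  // packets queued by the stack
    std::vector<int> sent;               // packets handed to the "device"
    std::condition_variable wake;        // stands in for waking enqueue_work
    bool work_pending = false;

    bool ring_try_lock() { return !ring_busy.exchange(true, std::memory_order_acquire); }
    void ring_unlock() { ring_busy.store(false, std::memory_order_release); }

    // Drain queued packets; caller holds the ring lock (cf. ena_start_xmit()).
    void start_xmit()
    {
        std::lock_guard<std::mutex> g(br_mtx);
        while (!br.empty()) {
            sent.push_back(br.front());
            br.pop_front();
        }
    }

    // cf. ena_mq_start(): always enqueue, then transmit inline if the ring
    // lock is free, otherwise defer the work to the worker thread.
    void mq_start(int pkt)
    {
        {
            std::lock_guard<std::mutex> g(br_mtx);
            br.push_back(pkt);
        }
        if (ring_try_lock()) {
            start_xmit();
            ring_unlock();
        } else {
            {
                std::lock_guard<std::mutex> g(br_mtx);
                work_pending = true;
            }
            wake.notify_one();
        }
    }

    // One iteration of the worker loop (cf. enqueue_work() calling
    // ena_deferred_mq_start()): sleep until woken, then drain under the lock.
    void worker_once()
    {
        {
            std::unique_lock<std::mutex> lk(br_mtx);
            wake.wait(lk, [this] { return work_pending; });
            work_pending = false;
        }
        while (!ring_try_lock()) {
            // spin until the current transmitter releases the ring
        }
        start_xmit();
        ring_unlock();
    }
};
```

The point of the try-lock is that a sender never blocks behind another sender: if the ring is busy, its packet is already queued and the worker picks it up.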
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
This patch adapts the admin/setup header ena.h to OSv. In particular, it addresses the following:
- import atomic bitset support from the FreeBSD tree (see https://github.com/freebsd/freebsd-src/blob/main/sys/sys/_bitset.h)
- remove unnecessary fields from the ena_adapter struct
- replace the IRQ related fields with OSv equivalents (see ena_irq)
- replace cleanup_task and cleanup_tq in the ena_que struct with the OSv equivalent cleanup_thread
- replace enqueue_task and enqueue_tq in the ena_ring struct with the OSv equivalent enqueue_thread
- remove RSS and DEV_NETMAP artifacts
- for now, define counter_* macros to disable related functionality
- replace callout_reset_sbt() with the equivalent callout_reset()
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
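In the FreeBSD driver the atomic bitset backs the adapter's state flags, so they can be set and cleared from several threads without a lock. A minimal C++ sketch of the same idea, loosely modelled on FreeBSD's BIT_SET_ATOMIC()/BIT_CLR_ATOMIC()/BIT_ISSET() rather than copied from _bitset.h; the flag names are hypothetical:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Fixed-size bitset whose set/clear operations are atomic per bit.
template <std::size_t NBITS>
class atomic_bitset {
    static constexpr std::size_t word_bits = 64;
    static constexpr std::size_t nwords = (NBITS + word_bits - 1) / word_bits;
    std::atomic<std::uint64_t> words_[nwords];

    static std::uint64_t mask(std::size_t bit) { return std::uint64_t{1} << (bit % word_bits); }

public:
    atomic_bitset()
    {
        for (auto& w : words_) {
            w.store(0, std::memory_order_relaxed);
        }
    }
    // Atomic read-modify-write, in the spirit of BIT_SET_ATOMIC / BIT_CLR_ATOMIC.
    void set(std::size_t bit) { words_[bit / word_bits].fetch_or(mask(bit)); }
    void clear(std::size_t bit) { words_[bit / word_bits].fetch_and(~mask(bit)); }
    bool isset(std::size_t bit) const { return (words_[bit / word_bits].load() & mask(bit)) != 0; }
};

// Hypothetical adapter state flags, for illustration only.
enum adapter_flag : std::size_t { FLAG_DEVICE_RUNNING, FLAG_DEV_UP, FLAG_LINK_UP, FLAG_COUNT };
```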
This patch adapts the middle layer of the admin and device setup/teardown handling logic to make it work in OSv. It is also the last patch needed to complete the porting of the FreeBSD ena driver code to OSv. The code in ena.cc mostly implements the logic to probe, attach and detach the device and involves interacting with the lower-level admin API of ena_com/ena_com.cc to submit commands to the Admin Queue (AQ) and to receive and process completions from the Admin Completion Queue (ACQ). It also implements interrupt handlers and worker threads to process I/O. For more details read https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#management-interface. In particular, this patch addresses the following:
- change FreeBSD header include paths to match the OSv source tree
- eliminate most DMA-related functions ena_*dma_*()
- eliminate the metrics task code for now
- eliminate LLQ, RSS and DEV_NETMAP related code
- deactivate the counters (aka statistics collection) code
- rewrite ena_dma_alloc() to use OSv memory::alloc_phys_contiguous_aligned() and mmu::virt_to_phys() (it probably should not have *dma* in its name)
- rewrite the functions that set up MSI-X and implement other PCI-related functionality to use OSv PCI code from drivers/pci-* and arch/*/msi.*: ena_free_pci_resources(), ena_probe(), ena_enable_msix(), ena_setup_mgmnt_intr(), ena_setup_io_intr(), ena_request_mgmnt_irq(), ena_request_io_irq(), ena_free_io_irq(), ena_disable_msix()
- replace the calls to drbr_*() functions with equivalent buf_ring_*() ones
- implement the main function of the enqueue worker thread, enqueue_work(); this function is used when setting up TX resources in ena_setup_tx_resources() and replaces the FreeBSD enqueue_tq and enqueue_task
- simplify ena_alloc_rx_mbuf() by mostly not using the DMA-related code
- eliminate ena_update_buf_ring_size(), ena_update_queue_size(), ena_update_io_rings() and ena_update_io_queue_nb(), which are not needed as OSv will not support changing the ring and queue sizes through ioctl() (see for example https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#size-of-the-tx-buffer-ring-drbr)
- simplify ena_ioctl()
- implement the main function of the cleanup worker thread, cleanup_work(); this function is used when setting up I/O queues in ena_create_io_queues() and replaces the FreeBSD cleanup_tq and cleanup_task
- adjust the CSUM_* constants to match the OSv version of the FreeBSD headers
- replace if_set*() function calls with equivalent code directly setting fields of the if_t structure (for example, if_settransmitfn(ifp, ena_mq_start) => ifp->if_transmit = ena_mq_start)
- hard-code the TX queue memory type to ENA_ADMIN_PLACEMENT_POLICY_HOST (we do not support LLQ)
- eliminate LLQ-related code: ena_map_llq_mem_bar(), set_default_llq_configurations() and any ifs testing ENA_ADMIN_PLACEMENT_POLICY_DEV
- adapt the code reading the current boot time to use osv::clock::uptime::now()
- adapt ena_handle_msix() and other places to use OSv wake_with_irq_or_preemption_disabled() instead of taskqueue_enqueue()
- add the remaining *.cc files to the Makefile; everything should compile
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
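The interrupt-to-worker hand-off between ena_handle_msix() and cleanup_work() can be pictured with the following user-space model. It is hypothetical: a condition variable and std::thread stand in for OSv's wake_with_irq_or_preemption_disabled() and the worker thread, and the ena_cleanup() here only counts invocations:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model of the MSI-X -> cleanup worker hand-off.
class cleanup_worker {
public:
    cleanup_worker() : thread_([this] { run(); }) {}
    ~cleanup_worker()
    {
        {
            std::lock_guard<std::mutex> g(mtx_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Called from the "interrupt": only record the event and wake the worker.
    void handle_msix()
    {
        {
            std::lock_guard<std::mutex> g(mtx_);
            pending_ = true;
        }
        cv_.notify_one();
    }

    int rounds() const { return rounds_.load(); }

private:
    void run()
    {
        std::unique_lock<std::mutex> lk(mtx_);
        for (;;) {
            cv_.wait(lk, [this] { return pending_ || stop_; });
            if (pending_) {
                pending_ = false;
                lk.unlock();
                ena_cleanup();   // process RX/TX completions outside the lock
                lk.lock();
            } else {
                return;          // stop_ requested and nothing pending
            }
        }
    }

    // Hypothetical stand-in for ena_cleanup(); just counts invocations.
    void ena_cleanup() { rounds_.fetch_add(1); }

    std::mutex mtx_;
    std::condition_variable cv_;
    bool pending_ = false;
    bool stop_ = false;
    std::atomic<int> rounds_{0};
    std::thread thread_;         // declared last so it starts after the members above
};
```

Keeping the handler down to "record and wake" lets the heavier completion processing run in thread context, and several interrupts arriving before the worker runs collapse into a single cleanup pass.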
This second-to-last patch implements a very thin upper layer in the form of the aws::ena driver class, which subclasses hw_driver. The constructor, destructor and probe() merely delegate to the functions ena_attach(), ena_detach() and ena_probe() respectively, implemented in bsd/sys/dev/ena/ena.cc. Please note that some of the statistics functionality (see fill_stats()) and if_getinfo are left unimplemented for now.
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
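The shape of that thin layer can be sketched as below. All types and helpers are hypothetical stand-ins (OSv's real hw_driver interface and the exact ena_attach()/ena_detach()/ena_probe() signatures are not shown in this PR description); only the delegation pattern is the point. The vendor and device IDs are taken from the upstream ENA PCI ID table and should be double-checked against the driver sources:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins so the sketch is self-contained.
struct hw_driver {
    virtual ~hw_driver() = default;
    virtual std::string get_name() const = 0;
};
struct pci_device_stub {
    std::uint16_t vendor_id;
    std::uint16_t device_id;
};

static int g_attached = 0;   // lets the example observe attach/detach

// Placeholders for the functions implemented in bsd/sys/dev/ena/ena.cc.
static bool ena_probe(const pci_device_stub& dev)
{
    constexpr std::uint16_t amazon_vendor = 0x1d0f;
    const std::uint16_t ena_ids[] = {0x0ec2, 0x1ec2, 0xec20, 0xec21};
    if (dev.vendor_id != amazon_vendor) {
        return false;
    }
    for (auto id : ena_ids) {
        if (dev.device_id == id) {
            return true;
        }
    }
    return false;
}
static int ena_attach(pci_device_stub&) { return ++g_attached; }
static void ena_detach(pci_device_stub&) { --g_attached; }

namespace aws {

// Thin driver class: all real work is delegated to the BSD-layer functions.
class ena : public hw_driver {
public:
    explicit ena(pci_device_stub& dev) : _dev(dev) { ena_attach(_dev); }
    ~ena() override { ena_detach(_dev); }
    std::string get_name() const override { return "ena"; }

    static hw_driver* probe(pci_device_stub& dev)
    {
        return ena_probe(dev) ? new ena(dev) : nullptr;
    }

private:
    pci_device_stub& _dev;
};

} // namespace aws
```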
This last patch improves certain aspects of the driver implementation:
- completes LRO handling
- adds a number of tracepoints to help troubleshoot and analyze performance
- pins each cleanup worker thread and the corresponding MSI-X vector to a CPU
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
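Pinning a worker thread is an OS-specific operation; OSv does it through its own scheduler API, which this PR description does not show. As a Linux user-space analogue only, the following starts a thread that pins itself to a given CPU with pthread_setaffinity_np() before running its work:

```cpp
#include <pthread.h>
#include <sched.h>
#include <functional>
#include <thread>
#include <utility>

// Linux analogue (not OSv code): start a thread that pins itself to `cpu`
// and then runs `work`, passing whether pinning succeeded.
static std::thread start_pinned(int cpu, std::function<void(bool pinned)> work)
{
    return std::thread([cpu, work = std::move(work)] {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        const bool pinned =
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
        work(pinned);
    });
}
```

Placing the worker on the CPU that also receives its MSI-X interrupt means the wakeup issued from the interrupt handler targets a thread on the local CPU, so no inter-processor interrupt is needed to schedule it; that is the IPI reduction the PR description refers to.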
wkozaczuk force-pushed the aws_ena_pr branch from b6c22f8 to 71f8036 on December 10, 2023 05:08
This pull request implements the AWS ena driver by porting the FreeBSD version from https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena.
The objective of this porting exercise is to adapt the FreeBSD code to make it work in OSv and at the same time minimize changes, so that we can backport any potential bug fixes or enhancements in the future. On top of that, we also reduce the code footprint by eliminating features that are either not relevant to OSv or not needed at this point (for example RSS). The resulting driver does NOT implement the following features:
- sysctl and ioctl functionality

Even though this driver implements stateless offloads (TXCSUM, RXCSUM, TSO and LRO) just like the original FreeBSD one (https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#stateless-offloads), the underlying ENA device implements neither RXCSUM nor TSO (see amzn/amzn-drivers#29). It also looks like the LRO logic never gets activated, based on the observed values of the relevant tracepoints.
The details of the changes are explained in each commit that is part of this PR.
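The design of the driver is 3-layered:
- the low-level ena_com API in the bsd/sys/contrib/ena_com part of the source tree
- the middle admin and data path layer in the bsd/sys/dev/ena part of the source tree
- the thin aws::ena driver class in drivers/ena.*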
The resulting driver "costs" us ~7K lines of mostly C code and a ~56K larger kernel binary.
This implementation is functional, has been tested on an actual Nitro EC2 instance (only t3.nano for now) and seems to be stable. The preliminary stress tests suggest that an OSv instance with a simple hello world Go HTTP server can handle ~45-50K requests per second:
Connecting with the OSv CLI and executing top shows the following thread dump with 4 ena threads (cleanup and enqueue for each vCPU):

Compared to the initial version of the PR, this one adds new tracepoints and pins the cleanup worker threads and the corresponding MSI-X vectors to the same CPU. Pinning the worker threads minimizes the number of IPIs and seems to improve performance by 5-10%, at least based on one of the tests conducted.
One can also connect to the serial console of the running OSv instance from the AWS web console.
To run OSv on a Nitro instance without NVMe, we build a ramfs-based image like so:
It can then be deployed to AWS as an AMI by executing the following:
The script above also creates a stack with a single EC2 instance running the image. For more details please read the comments of commit 873cb55.
Closes #1204