
Commit 3226e31

Author: Alexei Starovoitov (committed)
Merge branch 'xsk-multi-buffer-support'
Maciej Fijalkowski says:

====================
xsk: multi-buffer support

v6->v7:
- rebase...[Alexei]

v5->v6:
- update bpf_xdp_query_opts__last_field in patch 10 [Alexei]

v4->v5:
- align options argument size to match options from xdp_desc [Benjamin]
- cleanup skb from xdp_sock on socket termination [Toke]
- introduce new netlink attribute for letting user space know about Tx
  frag limit; this substitutes the xdp_features flag previously dedicated
  for setting ZC multi-buffer support [Toke, Jakub]
- include i40e ZC multi-buffer support
- enable TOO_MANY_FRAGS for ZC on xskxceiver; this is now possible due to
  the netlink attribute mentioned two bullets above

v3->v4:
- rely on ynl for adding new xdp_features flag [Jakub]
- move xskb_list to xsk_buff_pool

v2->v3:
- fix issue with the next valid packet getting dropped after an invalid
  packet with MAX_SKB_FRAGS + 1 frags [Magnus]
- query NETDEV_XDP_ACT_ZC_SG flag within xskxceiver and act on it
- remove redundant include in xsk.c [kernel test robot]
- s/NETDEV_XDP_ACT_NDO_ZC_SG/NETDEV_XDP_ACT_ZC_SG + kernel doc [Magnus, Simon]

v1->v2:
- fix spelling issues in commit messages [Simon]
- remove XSK_DESC_MAX_FRAGS, use MAX_SKB_FRAGS instead [Stan, Alexei]
- add documentation patch
- fix build error from kernel test robot on patch 10

This series of patches adds multi-buffer support for AF_XDP. XDP and
various NIC drivers already have support for multi-buffer packets. With
this patch set, programs using AF_XDP sockets can now also receive and
transmit multi-buffer packets, both in copy and zero-copy mode. The ZC
multi-buffer implementation is based on the ice driver.

Some definitions to put us all on the same page:

* A packet consists of one or more frames

* A descriptor in one of the AF_XDP rings always refers to a single
  frame. In the case the packet consists of a single frame, the
  descriptor refers to the whole packet.
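To make the frame/descriptor relationship concrete, here is a minimal, self-contained sketch (not part of the patch set) of how a packet larger than the umem frame size maps onto a chain of descriptors. The struct mirrors xdp_desc from include/uapi/linux/if_xdp.h; the XDP_PKT_CONTD continuation flag is the one this series introduces, and its value (bit 0) is assumed to match the kernel header:

```c
#include <stdint.h>

/* Continuation flag; value assumed to match include/uapi/linux/if_xdp.h. */
#define XDP_PKT_CONTD (1 << 0)

/* Mirror of struct xdp_desc from include/uapi/linux/if_xdp.h. */
struct xdp_desc {
	uint64_t addr;    /* offset of the frame within the umem */
	uint32_t len;     /* valid bytes in this frame */
	uint32_t options; /* XDP_PKT_CONTD on all but the last frame */
};

/* Split a pkt_len-byte packet at umem offset addr into descriptors of at
 * most frame_size bytes each. Returns the number of descriptors used. */
static int fill_descs(struct xdp_desc *descs, uint64_t addr,
		      uint32_t pkt_len, uint32_t frame_size)
{
	int n = 0;

	while (pkt_len > frame_size) {
		descs[n].addr = addr;
		descs[n].len = frame_size;
		descs[n].options = XDP_PKT_CONTD; /* more frames follow */
		addr += frame_size;
		pkt_len -= frame_size;
		n++;
	}
	descs[n].addr = addr;
	descs[n].len = pkt_len;
	descs[n].options = 0; /* last frame of the packet */
	return n + 1;
}
```

For instance, a 9000-byte jumbo frame with 4096-byte umem frames would occupy three descriptors, the first two carrying XDP_PKT_CONTD.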
To represent a packet consisting of multiple frames, we introduce a new
flag called XDP_PKT_CONTD in the options field of the Rx and Tx
descriptors. If it is true (1), the packet continues with the next
descriptor; if it is false (0), this is the last descriptor of the
packet. Why the reverse logic of the end-of-packet (eop) flag found in
many NICs? To preserve compatibility with non-multi-buffer applications,
which have this bit set to false for all packets on Rx and set the
options field to zero for Tx, as anything else will be treated as an
invalid descriptor.

These are the semantics for producing packets onto the XSK Tx ring
consisting of multiple frames:

* When an invalid descriptor is found, all the other descriptors/frames
  of this packet are marked as invalid and not completed. The next
  descriptor is treated as the start of a new packet, even if this was
  not the intent (because we cannot guess the intent). As before, if
  your program is producing invalid descriptors you have a bug that must
  be fixed.

* Zero length descriptors are treated as invalid descriptors.

* For copy mode, the maximum supported number of frames in a packet is
  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all descriptors
  accumulated so far are dropped and treated as invalid. To produce an
  application that will work on any system regardless of this config
  setting, limit the number of frags to 18, as the minimum value of the
  config is 17.

* For zero-copy mode, the limit is whatever the NIC HW supports. User
  space can discover it via the newly introduced
  NETDEV_A_DEV_XDP_ZC_MAX_SEGS netlink attribute.

Here is an example Tx path pseudo-code (using libxdp interfaces for
simplicity), ignoring that the umem is finite in size and that we will
eventually run out of packets to send. It also assumes pkts.addr points
to a valid location in the umem.
	void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
			int batch_size)
	{
		u32 idx, i, pkt_nb = 0;

		xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);

		for (i = 0; i < batch_size;) {
			u64 addr = pkts[pkt_nb].addr;
			u32 len = pkts[pkt_nb].size;

			do {
				struct xdp_desc *tx_desc;

				tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
				tx_desc->addr = addr;

				if (len > xsk_frame_size) {
					tx_desc->len = xsk_frame_size;
					tx_desc->options |= XDP_PKT_CONTD;
				} else {
					tx_desc->len = len;
					tx_desc->options = 0;
					pkt_nb++;
				}

				len -= tx_desc->len;
				addr += xsk_frame_size;

				if (i == batch_size) {
					/* Remember len, addr, pkt_nb for next
					 * iteration. Skipped for simplicity.
					 */
					break;
				}
			} while (len);
		}

		xsk_ring_prod__submit(&xsk->tx, i);
	}

On the Rx path in copy mode, the xsk core copies the XDP data into
multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
detailed before. Zero-copy mode, in order to avoid the copies, has to
maintain a chain of xdp_buff_xsk structs that represent the whole
packet. This is because what actually gets redirected is the xdp_buff,
and we currently have no mechanism equivalent to the one used for copy
mode (the skb_shared_info embedded in xdp_buff) to carry the frags. This
means xdp_buff_xsk grows in size, but these members are at the end and
should not be touched when the data path is not dealing with fragmented
packets. This solution kept us within the assumed performance impact,
hence we decided to proceed with it.

When the application gets a descriptor with the XDP_PKT_CONTD flag set
to one, it means that the packet consists of multiple buffers and
continues with the next buffer in the following descriptor. When a
descriptor with XDP_PKT_CONTD == 0 is received, this is the last buffer
of the packet. AF_XDP guarantees that only a complete packet (all frames
in the packet) is sent to the application. If the application reads a
batch of descriptors, using for example the libxdp interfaces, it is not
guaranteed that the batch will end with a full packet.
It might end in the middle of a packet, and the rest of the buffers of
that packet will arrive at the beginning of the next batch, since the
libxdp interface does not read the whole ring (unless you have an
enormous batch size or a very small ring size).

Here is a simple Rx path pseudo-code example (using libxdp interfaces
for simplicity). Error paths have been excluded for simplicity:

	void rx_packets(struct xsk_socket_info *xsk)
	{
		static bool new_packet = true;
		u32 idx_rx = 0, idx_fq = 0;
		static char *pkt;

		int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);

		xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);

		for (int i = 0; i < rcvd; i++) {
			struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
			char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
			bool eop = !(desc->options & XDP_PKT_CONTD);

			if (new_packet)
				pkt = frag;
			else
				add_frag_to_pkt(pkt, frag);

			if (eop)
				process_pkt(pkt);

			new_packet = eop;

			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
		}

		xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
		xsk_ring_cons__release(&xsk->rx, rcvd);
	}

We had to introduce a new bind flag (XDP_USE_SG) on the AF_XDP level to
enable multi-buffer support. The reason we need to differentiate between
non-multi-buffer and multi-buffer is the behaviour when the kernel gets
a packet that is larger than the frame size. Without multi-buffer, this
packet is dropped and marked in the stats. With multi-buffer on, we want
to split it up into multiple frames instead.

At the start, we thought that riding on the .frags section name of the
XDP program was a good idea. You do not have to introduce yet another
flag, and all AF_XDP users must load an XDP program anyway to get any
traffic up to the socket, so why not just say that the XDP program
decides if the AF_XDP socket should get multi-buffer packets or not? The
problem is that we can create an AF_XDP socket that is Tx only, and that
works without having to load an XDP program at all.
Another problem is that the XDP program might change during execution,
so we would have to check this for every single packet.

Here is the observed throughput compared to a codebase without any
multi-buffer changes, measured with xdpsock for 64B packets. Apparently
ZC Tx takes a hit from the explicit zero length descriptor validation.
Overall, in terms of ZC performance there is room for improvement, but
for now we think this work is in good shape in terms of correctness and
functionality. We were targeting up to 5% overhead, though. Note that
the ZC performance drops come from core + driver support combined,
whereas copy mode already had driver support in place.

	Mode     rxdrop  l2fwd  txonly
	ice-zc   -4%     -7%    -6%
	i40e-zc  -7%     -6%    -7%
	drv      -1.2%    0%    +2%
	skb      -0.6%   -1%    +2%

Thank you,
Tirthendu, Magnus and Maciej
====================

Link: https://lore.kernel.org/r/20230719132421.584801-1-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 parents 492e797 + 3666bcc commit 3226e31

32 files changed: +1505 −275 lines

Documentation/netlink/specs/netdev.yaml
Lines changed: 6 additions & 0 deletions

@@ -62,6 +62,12 @@ attribute-sets:
         type: u64
         enum: xdp-act
         enum-as-flags: true
+      -
+        name: xdp_zc_max_segs
+        doc: max fragment count supported by ZC driver
+        type: u32
+        checks:
+          min: 1
 
 operations:
   list:

Documentation/networking/af_xdp.rst
Lines changed: 210 additions & 1 deletion

@@ -462,8 +462,92 @@ XDP_OPTIONS getsockopt
 Gets options from an XDP socket. The only one supported so far is
 XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
 
+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+  frame. In the case the packet consists of a single frame, the
+  descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why the reverse logic of end-of-packet (eop) flag found
+in many NICs? Just to preserve compatibility with non-multi-buffer
+applications that have this bit set to false for all packets on Rx,
+and the apps set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+  descriptors/frames of this packet are marked as invalid and not
+  completed. The next descriptor is treated as the start of a new
+  packet, even if this was not the intent (because we cannot guess
+  the intent). As before, if your program is producing invalid
+  descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+  descriptors accumulated so far are dropped and treated as
+  invalid. To produce an application that will work on any system
+  regardless of this config setting, limit the number of frags to 18,
+  as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is up to what the NIC HW
+  supports. Usually at least five on the NICs we have checked. We
+  consciously chose to not enforce a rigid limit (such as
+  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+  resulted in copy actions under the hood to fit into what limit the
+  NIC supports. Kind of defeats the purpose of zero-copy mode. How to
+  probe for this limit is explained in the "probe for multi-buffer
+  support" section.
+
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
 Usage
-=====
+-----
 
 In order to use AF_XDP sockets two parts are needed. The
 user-space application and the XDP program. For a complete setup and

@@ -541,6 +625,131 @@ like this:
 But please use the libbpf functions as they are optimized and ready to
 use. Will make your life easier.
 
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+    void rx_packets(struct xsk_socket_info *xsk)
+    {
+        static bool new_packet = true;
+        u32 idx_rx = 0, idx_fq = 0;
+        static char *pkt;
+
+        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+        for (int i = 0; i < rcvd; i++) {
+            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+            bool eop = !(desc->options & XDP_PKT_CONTD);
+
+            if (new_packet)
+                pkt = frag;
+            else
+                add_frag_to_pkt(pkt, frag);
+
+            if (eop)
+                process_pkt(pkt);
+
+            new_packet = eop;
+
+            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+        }
+
+        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+    }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. Also assumes pkts.addr
+points to a valid location in the umem.
+
+.. code-block:: c
+
+    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                    int batch_size)
+    {
+        u32 idx, i, pkt_nb = 0;
+
+        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+        for (i = 0; i < batch_size;) {
+            u64 addr = pkts[pkt_nb].addr;
+            u32 len = pkts[pkt_nb].size;
+
+            do {
+                struct xdp_desc *tx_desc;
+
+                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                tx_desc->addr = addr;
+
+                if (len > xsk_frame_size) {
+                    tx_desc->len = xsk_frame_size;
+                    tx_desc->options = XDP_PKT_CONTD;
+                } else {
+                    tx_desc->len = len;
+                    tx_desc->options = 0;
+                    pkt_nb++;
+                }
+
+                len -= tx_desc->len;
+                addr += xsk_frame_size;
+
+                if (i == batch_size) {
+                    /* Remember len, addr, pkt_nb for next iteration.
+                     * Skipped for simplicity.
+                     */
+                    break;
+                }
+            } while (len);
+        }
+
+        xsk_ring_prod__submit(&xsk->tx, i);
+    }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+   one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+     returned number signifies the max number of frags supported.
+
+For an example on how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with full packet at the end. This
+to facilitate extending a zero-copy driver with multi-buffer support.
+
 Sample application
 ==================
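The probing recipe documented above boils down to a small decision, sketched here as self-contained C (not part of the patch; the flag value is assumed to match include/uapi/linux/netdev.h, so verify against your headers):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match NETDEV_XDP_ACT_XSK_ZEROCOPY in include/uapi/linux/netdev.h. */
#define NETDEV_XDP_ACT_XSK_ZEROCOPY (1 << 3)

/* feature_flags: the xdp-features bitmask reported for the device.
 * zc_max_segs: the NETDEV_A_DEV_XDP_ZC_MAX_SEGS attribute value.
 * Returns true when the device supports multi-buffer AF_XDP in zero-copy mode. */
static bool zc_multi_buffer_supported(uint64_t feature_flags, uint32_t zc_max_segs)
{
	if (!(feature_flags & NETDEV_XDP_ACT_XSK_ZEROCOPY))
		return false; /* no zero-copy support at all */

	return zc_max_segs >= 2; /* 1 means single fragment only */
}
```

In a real application, feature_flags and zc_max_segs would come from a netlink dev-get query (e.g. via libbpf/ynl); see tools/testing/selftests/bpf/xskxceiver.c as the document suggests.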

drivers/net/ethernet/intel/i40e/i40e_main.c
Lines changed: 1 addition & 5 deletions

@@ -3585,11 +3585,6 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	if (ring->xsk_pool) {
 		ring->rx_buf_len =
 			xsk_pool_get_rx_frame_size(ring->xsk_pool);
-		/* For AF_XDP ZC, we disallow packets to span on
-		 * multiple buffers, thus letting us skip that
-		 * handling in the fast-path.
-		 */
-		chain_len = 1;
 		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
 						 MEM_TYPE_XSK_BUFF_POOL,
 						 NULL);

@@ -13822,6 +13817,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 			       NETDEV_XDP_ACT_REDIRECT |
 			       NETDEV_XDP_ACT_XSK_ZEROCOPY |
 			       NETDEV_XDP_ACT_RX_SG;
+		netdev->xdp_zc_max_segs = I40E_MAX_BUFFER_TXD;
 	} else {
 		/* Relate the VSI_VMDQ name to the VSI_MAIN name. Note that we
 		 * are still limited by IFNAMSIZ, but we're adding 'v%d\0' to

drivers/net/ethernet/intel/i40e/i40e_txrx.c
Lines changed: 2 additions & 2 deletions

@@ -2284,8 +2284,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
  * If the buffer is an EOP buffer, this function exits returning false,
  * otherwise return true indicating that this is in fact a non-EOP buffer.
  */
-static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
-			    union i40e_rx_desc *rx_desc)
+bool i40e_is_non_eop(struct i40e_ring *rx_ring,
+		     union i40e_rx_desc *rx_desc)
 {
 	/* if we are the last buffer then there is nothing else to do */
 #define I40E_RXD_EOF BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)

drivers/net/ethernet/intel/i40e/i40e_txrx.h
Lines changed: 2 additions & 0 deletions

@@ -473,6 +473,8 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
 int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		  u32 flags);
+bool i40e_is_non_eop(struct i40e_ring *rx_ring,
+		     union i40e_rx_desc *rx_desc);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
