idpf: fix error handling in soft_reset for XDP #26

michalQb · 2024-08-19T16:38:29Z

The commit 35d653a ("idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq") uses soft_reset to perform a full vport reconfiguration after changes in XDP setup.
Unfortunately, the soft_reset may fail after attaching the XDP program to the vport. It can happen when the HW limits resources that can be allocated to fulfill XDP requirements.
In such a case, before we return an error from XDP_SETUP_PROG, we have to fully restore the previous vport state, including removing the XDP program.

In order to remove the already loaded XDP program in case of reset error, re-implement the error handling path and move some calls to the XDP callback.

Fixes: 35d653a ("idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq")
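
The recovery path described above can be pictured with a minimal user-space sketch. All names here (`vport_sketch`, `soft_reset_sketch()`, `xdp_setup_prog_sketch()`, the queue-limit field) are hypothetical stand-ins and not idpf code. The sketch only assumes that the reset can fail after the program pointer has already been attached, and that the callback must then put the old pointer back before returning the error.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the vport state touched by XDP setup. */
struct vport_sketch {
	const void *xdp_prog;		/* currently attached program, if any */
	unsigned int num_txq;		/* regular Tx queues */
	unsigned int num_xdp_txq;	/* extra Tx queues reserved for XDP */
	unsigned int hw_txq_limit;	/* assumed HW limit on all Tx queues */
};

/* Models a reset that fails when the HW cannot provide the XDP queues. */
static int soft_reset_sketch(struct vport_sketch *vport,
			     unsigned int want_xdp_txq)
{
	if (vport->num_txq + want_xdp_txq > vport->hw_txq_limit)
		return -ENOMEM;

	vport->num_xdp_txq = want_xdp_txq;
	return 0;
}

/* Attach (or detach, when prog is NULL) and roll back on reset failure. */
static int xdp_setup_prog_sketch(struct vport_sketch *vport,
				 const void *prog, unsigned int xdp_txq)
{
	const void *old_prog = vport->xdp_prog;
	int err;

	vport->xdp_prog = prog;

	err = soft_reset_sketch(vport, prog ? xdp_txq : 0);
	if (err) {
		/* Restore the previous state, including the old program. */
		vport->xdp_prog = old_prog;
		return err;
	}

	return 0;
}
```

With a limit of 10 queues and 8 regular ones, asking for 4 XDP queues fails and leaves `xdp_prog` untouched, while asking for 2 succeeds.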

Make dev->priv_flags `u32` again and define bits higher than 31 as bitfield booleans, as per Jakub's suggestion. This simplifies the code that accesses these bits with no optimization loss (testb both before and after) and removes the need to extend &netdev_priv_flags each time. It also scales better: bits > 63 in the future would only add a new u64 to the structure with no complications, whereas extending ::priv_flags would require converting it to a bitmap. Note that I picked `unsigned long :1` to not lose any potential optimizations compared to `bool :1` etc. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
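
A rough picture of the resulting layout, as a user-space sketch. The structure, the field names and the bit position are illustrative assumptions, not the real &net_device definitions. Note also that `unsigned long` bitfields are a compiler extension accepted by GCC and Clang, not strictly standard C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the field names are illustrative, not the real ones. */
struct netdev_sketch {
	uint32_t priv_flags;		/* classic flag word, bits 0..31 */

	/* Flags past bit 31 live here as one-bit fields. */
	unsigned long lltx:1;
	unsigned long netns_local:1;
	unsigned long fcoe_mtu:1;
};

#define IFF_SKETCH_NO_QUEUE	(1U << 21)	/* illustrative bit position */

static bool dev_is_lockless_tx(const struct netdev_sketch *dev)
{
	/* A bitfield read needs no mask or shift at the call site. */
	return dev->lltx;
}

static bool dev_has_no_queue(const struct netdev_sketch *dev)
{
	/* Flags that still fit in the u32 keep the usual mask test. */
	return dev->priv_flags & IFF_SKETCH_NO_QUEUE;
}
```

Adding one more flag in this scheme is a one-line change to the structure, with no enum to extend.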

NETIF_F_NO_CSUM was removed in 3.2-rc2 by commit 34324dc ("net: remove NETIF_F_NO_CSUM feature bit") and became __UNUSED_NETIF_F_1. It's not used anywhere in the code. Remove this bit waste. Renaming the flag instead of removing it wasn't necessary, as netdev features are not uAPI/ABI. Ethtool passes their names and values separately with no fixed positions and the userspace Ethtool code doesn't have any hardcoded feature names/bits, so that new Ethtool will work on older kernels and vice versa. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

NETIF_F_LLTX can't be changed via Ethtool and is not a feature, rather an attribute, very similar to IFF_NO_QUEUE (and hot). Free one netdev_features_t bit and make it a "hot" private flag. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

"Interface can't change network namespaces" is rather an attribute, not a feature, and it can't be changed via Ethtool. Make it a "cold" private flag instead of a netdev_feature and free one more bit. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Ability to handle maximum FCoE frames of 2158 bytes can never be changed and thus more of an attribute, not a toggleable feature. Move it from netdev_features_t to "cold" priv flags (bitfield bool) and free yet another feature bit. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

NETIF_F_ALL_FCOE is used only in vlan_dev.c, 2 times. Now that it's only 2 bits, open-code it and remove the definition from netdev_features.h. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

…pu() kthread_create_on_cpu() always requires the format string to contain one '%u' at the end, as it automatically adds the CPU ID when passing it to kthread_create_on_node(). The former isn't marked as __printf() as it's not printf-like itself, which effectively hides this from the compiler. If you convert this function to printf-like, you'll see the following: In file included from drivers/firmware/psci/psci_checker.c:15: drivers/firmware/psci/psci_checker.c: In function 'suspend_tests': drivers/firmware/psci/psci_checker.c:401:48: warning: too many arguments for format [-Wformat-extra-args] 401 | "psci_suspend_test"); | ^~~~~~~~~~~~~~~~~~~ drivers/firmware/psci/psci_checker.c:400:32: warning: data argument not used by format string [-Wformat-extra-args] 400 | (void *)(long)cpu, cpu, | ^ 401 | "psci_suspend_test"); | ~~~~~~~~~~~~~~~~~~~ Add the missing format literal to fix this. Now the corresponding kthread will be named "psci_suspend_test-<cpuid>", as it's meant by kthread_create_on_cpu(). Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202408141012.KhvKaxoh-lkp@intel.com Closes: https://lore.kernel.org/oe-kbuild-all/202408141243.eQiEOQQe-lkp@intel.com Fixes: ea8b1c4 ("drivers: psci: PSCI checker module") Cc: stable@vger.kernel.org # 4.10+ Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Currently, kthread_{create,run}_on_cpu() don't support varargs like kthread_create{,_on_node}() do, which makes them less convenient to use. Convert them to take varargs as the last argument. The only difference is that they always append the CPU ID at the end and require the format string to have an excess '%u' at the end due to that. That's still true; meanwhile, the compiler will now correctly point that out if it is missing. One more nice side effect is that you can now use the underscored __kthread_create_on_cpu() if you want to override that rule and not have the CPU ID at the end of the name. The current callers are not affected in any way. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
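
The "always append the CPU ID, require a trailing '%u'" rule can be modelled in user space as below. This is only a sketch under assumptions: `thread_name_fmt()` and `thread_name_on_cpu()` are hypothetical names, and the real kernel helpers may be structured differently. What it shows is that, once the core is printf-like, a wrapper macro can append the CPU ID as the last argument and the compiler will check the format string against it.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* printf-like core: the compiler checks fmt against the arguments. */
__attribute__((format(printf, 3, 4)))
static int thread_name_fmt(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);

	return n;
}

/*
 * Wrapper that always passes the CPU ID as the last argument, so fmt
 * must end with a '%u' to consume it. ##__VA_ARGS__ is a GNU extension.
 */
#define thread_name_on_cpu(buf, len, cpu, fmt, ...) \
	thread_name_fmt(buf, len, fmt, ##__VA_ARGS__, cpu)
```

Calling `thread_name_on_cpu(buf, sizeof(buf), 3U, "psci_suspend_test")` without the trailing '%u' compiles, but GCC and Clang then emit the same kind of extra-argument format warning that the psci_checker fix above addresses.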

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Add a function to get an array of skbs from the NAPI percpu cache. It's supposed to be a drop-in replacement for kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the requirement to call it only from the BH) is that it tries to use as many NAPI cache entries for skbs as possible, and allocate new ones only if needed. It can save significant amounts of CPU cycles if there are GRO cycles and/or Tx completion cycles (anything that descends to napi_skb_cache_put()) happening on this CPU. If the function is not able to provide the requested number of entries due to an allocation error, it returns as much as it got. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
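
The "cache first, allocate the rest, return what you got" behaviour can be sketched in user space as follows. `obj_cache_sketch` and `cache_get_bulk_sketch()` are hypothetical and use `malloc()` in place of the slab allocator; this is not the napi_skb_cache_get_bulk() implementation, only an illustration of the contract described above.

```c
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 64

/* Hypothetical stand-in for a per-CPU cache of free objects. */
struct obj_cache_sketch {
	void *slots[CACHE_SLOTS];
	size_t count;
};

/*
 * Fill out[] with up to n objects: drain the cache first, allocate the
 * remainder, and return how many entries were actually provided.
 */
static size_t cache_get_bulk_sketch(struct obj_cache_sketch *cache,
				    void **out, size_t n, size_t obj_size)
{
	size_t got = 0;

	while (got < n && cache->count)
		out[got++] = cache->slots[--cache->count];

	while (got < n) {
		void *obj = malloc(obj_size);

		if (!obj)
			break;	/* partial result on allocation failure */

		out[got++] = obj;
	}

	return got;
}
```

A caller has to cope with a short return value, the same way the description says the real function "returns as much as it got".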

Now that cpumap uses GRO, which drops unused skb heads to the NAPI cache, use napi_skb_cache_get_bulk() to try to reuse cached entries and lower the MM layer pressure. In the situation when all 8 skbs from the first cpumap batch go into one GRO skb and the remaining 7 go into the cache, there will now be only 1 skb to allocate next time instead of 8. 16 skbs will be allocated to the cache and there will be no allocations until the end of one NAPI polling loop. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

There are cases when we need to explicitly unroll loops. For example, cache operations, filling DMA descriptors at very high speeds etc. Add compiler-specific attribute macros to give the compiler a hint that we'd like to unroll a loop. Example usage: #define UNROLL_BATCH 8 unrolled_count(UNROLL_BATCH) for (u32 i = 0; i < UNROLL_BATCH; i++) op(priv, i); Note that sometimes the compilers won't unroll loops if they think this would have worse optimization and perf than without unrolling, and that unroll attributes are available only starting with GCC 8. For older compiler versions, no hints/attributes will be applied. For better unrolling/parallelization, don't have any variables that interfere between iterations except for the iterator itself. Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
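
For experimenting outside the kernel, a macro of this kind could be approximated as below. This is a sketch and an assumption about how such a hint can be built from `_Pragma`, not a copy of the kernel header. `STR()` and `double_batch()` are made-up helpers, and the macro expands to nothing on compilers without a known unroll pragma.

```c
#include <stdint.h>

#define STR_(x)	#x
#define STR(x)	STR_(x)

/* Emit a compiler-specific unroll hint, or nothing if unsupported. */
#if defined(__clang__)
#define unrolled_count(n)	_Pragma(STR(clang loop unroll_count(n)))
#elif defined(__GNUC__) && __GNUC__ >= 8
#define unrolled_count(n)	_Pragma(STR(GCC unroll n))
#else
#define unrolled_count(n)
#endif

#define UNROLL_BATCH 8

/* Iterations are independent: only the iterator is shared between them. */
static void double_batch(uint32_t *dst, const uint32_t *src)
{
	unrolled_count(UNROLL_BATCH)
	for (uint32_t i = 0; i < UNROLL_BATCH; i++)
		dst[i] = src[i] * 2;
}
```

The hint does not change the result of the loop, so the function behaves the same whether or not the compiler honours it.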

Define common structures, inline helpers and Ethtool helpers to collect, update and export the statistics (RQ, SQ, XDPSQ). Use u64_stats_t right from the start, as well as the corresponding helpers to ensure tear-free operations. For the NAPI parts of both Rx and Tx, also define small onstack containers to update them in polling loops and then sync the actual containers once a loop ends. In order to implement fully generic Netlink per-queue stats callbacks, &libeth_netdev_priv is introduced and is required to be embedded at the start of the driver's netdev_priv structure. Note on the stats types: * onstack stats are u32 and are local to one NAPI loop or sending function. At the end of processing, they get added to the "live" stats below; * "live" stats are u64_stats_t and they reflect the current session (interface up) only (Netlink queue stats). When an ifdown occurs, they get added to the "base" stats below; * "base" stats are u64 guarded by a mutex, they survive ifdowns and don't get updated when the interface is up. This corresponds to the Netlink base stats. Drivers are responsible for filling the onstack stats and calling stack -> live update functions; base stats are internal to libeth. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
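
The three stats tiers can be illustrated with a simplified single-threaded sketch. The structure and function names are hypothetical, plain `uint64_t` stands in for u64_stats_t, and the mutex around the base stats is omitted, so this shows only the data flow (onstack -> live -> base), not the synchronization.

```c
#include <stdint.h>

/* Per-poll counters: small, cheap, local to one NAPI loop. */
struct rq_stats_onstack {
	uint32_t packets;
	uint32_t bytes;
};

/* "Live" counters for the current session (u64_stats_t in real code). */
struct rq_stats_live {
	uint64_t packets;
	uint64_t bytes;
};

/* "Base" counters that survive ifdown (mutex-guarded in real code). */
struct rq_stats_base {
	uint64_t packets;
	uint64_t bytes;
};

/* End of a polling loop: add the onstack counters to the live ones. */
static void rq_stats_sync(struct rq_stats_live *live,
			  const struct rq_stats_onstack *ss)
{
	live->packets += ss->packets;
	live->bytes += ss->bytes;
}

/* ifdown: add the live counters to the base ones and start a new session. */
static void rq_stats_fold(struct rq_stats_base *base,
			  struct rq_stats_live *live)
{
	base->packets += live->packets;
	base->bytes += live->bytes;
	live->packets = 0;
	live->bytes = 0;
}
```

In this model a driver would only ever call the first function; the second one corresponds to the part that is internal to libeth.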

Software-side Tx buffers for storing DMA, frame size, skb pointers etc. are pretty much generic and every driver defines them the same way. The same can be said for software Tx completions -- same napi_consume_skb()s and all that... Add a couple simple wrappers for doing that to stop repeating the old tale at least within the Intel code. Drivers are free to use 'priv' member at the end of the structure. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

&idpf_tx_buffer is almost identical to the previous generations, as well as the way it's handled. Moreover, relying on dma_unmap_addr() and !!buf->skb instead of explicit defining of buffer's type was never good. Use the newly added libeth helpers to do it properly and reduce the copy-paste around the Tx code. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Add a shorthand similar to other net*_subqueue() helpers for resetting the queue by its index w/o obtaining &netdev_tx_queue beforehand manually. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Add a mechanism to guard against stashing partial packets into the hash table to make the driver more robust, with more efficient decision making when cleaning. Don't stash partial packets. This can happen when an RE (Report Event) completion is received in flow scheduling mode, or when an out of order RS (Report Status) completion is received. The first buffer with the skb is stashed, but some or all of its frags are not because the stack is out of reserve buffers. This leaves the ring in a weird state since the frags are still on the ring. Use the field libeth_sqe::nr_frags to track the number of fragments/tx_bufs representing the packet. The clean routines check to make sure there are enough reserve buffers on the stack before stashing any part of the packet. If there are not, next_to_clean is left pointing to the first buffer of the packet that failed to be stashed. This leaves the whole packet on the ring, and the next time around, cleaning will start from this packet. An RS completion is still expected for this packet in either case. So instead of being cleaned from the hash table, it will be cleaned from the ring directly. This should all still be fine since the DESC_UNUSED and BUFS_UNUSED will reflect the state of the ring. If we ever fall below the thresholds, the TxQ will still be stopped, giving the completion queue time to catch up. This may lead to stopping the queue more frequently, but it guarantees the Tx ring will always be in a good state. Also, always use the idpf_tx_splitq_clean function to clean descriptors, i.e. use it from clean_buf_ring as well. This way we avoid duplicating the logic and make sure we're using the same reserve buffers guard rail. This does require a switch from the s16 next_to_clean overflow descriptor ring wrap calculation to u16 and the normal ring size check. 
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
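
The all-or-nothing stashing rule can be sketched as follows. `tx_ring_sketch` and `stash_packet_sketch()` are hypothetical and reduce the ring to three counters; the sketch assumes `nr_bufs` (the packet's buffer count, as tracked via nr_frags) never exceeds the ring size. It shows only the guard on reserve buffers and the u16 index wrap with a plain ring-size check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of a Tx ring to the fields the guard needs. */
struct tx_ring_sketch {
	uint16_t desc_count;	/* ring size */
	uint16_t next_to_clean;	/* first buffer not yet cleaned */
	uint16_t reserve_bufs;	/* free reserve buffers for stashing */
};

/*
 * Stash a whole packet or nothing. On failure next_to_clean keeps
 * pointing at the packet's first buffer, so the packet stays on the ring.
 */
static bool stash_packet_sketch(struct tx_ring_sketch *ring, uint16_t nr_bufs)
{
	uint32_t ntc;

	if (ring->reserve_bufs < nr_bufs)
		return false;

	ring->reserve_bufs -= nr_bufs;

	/* Unsigned index with an explicit wrap against the ring size. */
	ntc = (uint32_t)ring->next_to_clean + nr_bufs;
	if (ntc >= ring->desc_count)
		ntc -= ring->desc_count;
	ring->next_to_clean = (uint16_t)ntc;

	return true;
}
```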

netif_txq_maybe_stop() returns -1, 0, or 1, while idpf_tx_maybe_stop_common() says it returns 0 or -EBUSY. As a result, there sometimes are Tx queue timeout warnings even though the queue is empty or there is at least enough space to restart it. Make idpf_tx_maybe_stop_common() inline and have it return true or false, handling the return of netif_txq_maybe_stop() properly. Use a correct goto in idpf_tx_maybe_stop_splitq() to avoid stopping the queue or incrementing the stops counter twice. Fixes: 6818c4d ("idpf: add splitq start_xmit") Fixes: a5ab9ee ("idpf: add singleq start_xmit and napi poll") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
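
A small model of collapsing the tri-state result into a boolean. The meaning attached to each value (1 = queue left enabled, 0 = queue stopped, -1 = stopped and then re-enabled after a race) is my reading of the stack helper and should be treated as an assumption. `txq_maybe_stop_model()` and `tx_maybe_stop_sketch()` are hypothetical and not the driver code.

```c
#include <stdbool.h>

/* Assumed outcomes of the stack helper. */
enum maybe_stop_ret {
	TXQ_REENABLED = -1,	/* stopped, then re-enabled after a race */
	TXQ_STOPPED = 0,	/* queue was stopped */
	TXQ_LEFT_ENABLED = 1,	/* enough room, nothing was done */
};

/*
 * Hypothetical model: free_before is the free descriptor count at the
 * first check, free_recheck the count seen again after stopping.
 */
static int txq_maybe_stop_model(unsigned int free_before,
				unsigned int free_recheck,
				unsigned int needed)
{
	if (free_before >= needed)
		return TXQ_LEFT_ENABLED;

	/* Queue stopped; completions may have freed space meanwhile. */
	if (free_recheck >= needed)
		return TXQ_REENABLED;

	return TXQ_STOPPED;
}

/* True only when the queue really is stopped and the caller must back off. */
static bool tx_maybe_stop_sketch(unsigned int free_before,
				 unsigned int free_recheck,
				 unsigned int needed)
{
	return txq_maybe_stop_model(free_before, free_recheck,
				    needed) == TXQ_STOPPED;
}
```

Treating every non-zero value as an error would misreport both "left enabled" and "re-enabled" as a busy queue, which is the kind of confusion the fix removes.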

Tell hardware to write back completed descriptors even when interrupts are disabled. Otherwise, descriptors might not be written back until the hardware can flush a full cacheline of descriptors. This can cause unnecessary delays when traffic is light (or even trigger Tx queue timeout). The example scenario to reproduce the Tx timeout if the fix is not applied: - configure at least 2 Tx queues to be assigned to the same q_vector, - generate heavy Tx traffic on the first Tx queue, - try to send a few packets using the second Tx queue. In such a case Tx timeout will appear on the second Tx queue because no completion descriptors are written back for that queue while interrupts are disabled due to NAPI polling. Fixes: c2d548c ("idpf: add TX splitq napi poll support") Fixes: a5ab9ee ("idpf: add singleq start_xmit and napi poll") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Fully reimplement idpf's per-queue stats using the libeth infra. Embed &libeth_netdev_priv to the beginning of &idpf_netdev_priv(), call the necessary init/deinit helpers and the corresponding Ethtool helpers. Update hotpath counters such as hsplit and tso/gso using the onstack containers instead of direct accesses to queue->stats. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Address at least some of them and add `const` attributes there. The object code didn't change, but that may prevent unwanted data modifications and also allow more helpers to have const arguments. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Lots of read-only helpers for &xdp_buff and &xdp_frame, such as getting the frame length, skb_shared_info etc., don't have their arguments marked with `const` for no reason. Add the missing annotations to leave less room for mistakes and more for optimization. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

One may need to register memory model separately from xdp_rxq_info. One simple example may be XDP test run code, but in general, it might be useful when memory model registering is managed by one layer and then XDP RxQ info by a different one. Allow such scenarios by adding a simple helper which "attaches" an already registered memory model to the desired xdp_rxq_info. As this is mostly needed for Page Pool, add a special function to do that for a &page_pool pointer. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

To make the system page pool usable as a source for allocating XDP frames, we need to register it with xdp_reg_mem_model(), so that page return works correctly. This is done in preparation for using the system page pool for the XDP live frame mode in BPF_TEST_RUN; for the same reason, make the per-cpu variable non-static so we can access it from the test_run code as well. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>

Currently, page_pool_put_page_bulk() indeed takes an array of pointers to the data, not pages, despite the name. As one side effect, when you're freeing frags from &skb_shared_info, xdp_return_frame_bulk() converts page pointers to virtual addresses and then page_pool_put_page_bulk() converts them back. Make page_pool_put_page_bulk() actually handle array of pages. Pass frags directly and use virt_to_page() when freeing xdpf->data, so that the PP core will then get the compound head and take care of the rest. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The main reason for this change was to allow mixing pages from different &page_pools within one &xdp_buff/&xdp_frame. Why not? Adjust xdp_return_frame_bulk() and page_pool_put_page_bulk(), so that they won't be tied to a particular pool. Let the latter splice the bulk when it encounters a page whose PP is different and flush it recursively. This greatly optimizes xdp_return_frame_bulk(): no more hashtable lookups. Also make xdp_flush_frame_bulk() inline, as it's just one if + function call + one u32 read, not worth extending the call ladder. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Initially, xdp_frame::mem.id was used to search for the corresponding &page_pool to return the page correctly. However, now that struct page contains a direct pointer to its PP, keeping this field any further makes no sense. xdp_return_frame_bulk() still uses it to do a lookup, but this is rather a leftover. Remove xdp_frame::mem and replace it with ::mem_type, as only memory type still matters and we need to know it to be able to free the frame correctly. As a cute side effect, we can now make every scalar field in &xdp_frame of 4 byte width, speeding up accesses to them. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The code piece which would attach a frag to &xdp_buff is almost identical across the drivers supporting XDP multi-buffer on Rx. Make it a generic elegant one-liner. Also, I see lots of drivers calculating frags_truesize as `xdp->frame_sz * nr_frags`. I can't say this is fully correct, since frags might be backed by chunks of different sizes, especially with stuff like the header split. Even page_pool_alloc() can give you two different truesizes on two subsequent requests to allocate the same buffer size. Add a field to &skb_shared_info (unionized as there's no free slot currently on x86_64) to track the "true" truesize. It can be used later when updating an skb. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Same as with converting &xdp_buff to skb on Rx, the code which allocates a new skb and copies the XSk frame there is identical across the drivers, so make it generic. This includes copying all the frags if they are present in the original buff. System percpu Page Pools help here a lot: when available, allocate pages from there instead of the MM layer. This greatly improves XDP_PASS performance on XSk: instead of page_alloc() + page_free(), the net core recycles the same pages, so the only overhead left is memcpy()s. Note that the passed buff gets freed if the conversion is done w/o any error, assuming you don't need this buffer after you convert it to an skb. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Currently, when you send an XSk frame without metadata, you need to do the following: * call external xsk_buff_raw_get_dma(); * call inline xsk_buff_get_metadata(), which calls external xsk_buff_raw_get_data() and then do some inline checks. This effectively means that the following piece: addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr; is done twice per frame, plus you have 2 external calls per frame, plus this: meta = pool->addrs + addr - pool->tx_metadata_len; if (unlikely(!xsk_buff_valid_tx_metadata(meta))) is always inlined, even if there's no meta or it's invalid. Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that in one go. It returns a small structure with 2 fields: DMA address, filled unconditionally, and metadata pointer, valid only if it's present. The address correction is performed only once and you also have only 1 external call per XSk frame, which does all the calculations and checks outside of your hotpath. You only need to check `if (ctx.meta)` for the metadata presence. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag), DMA_TO_DEVICE) is repeated across dozens of drivers and really wants a shorthand. Add a macro which will count args and handle every possible number of arguments from 2 to 5. Semantics: skb_frag_dma_map(dev, frag) -> __skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag), DMA_TO_DEVICE) skb_frag_dma_map(dev, frag, offset) -> __skb_frag_dma_map(dev, frag, offset, skb_frag_size(frag) - offset, DMA_TO_DEVICE) skb_frag_dma_map(dev, frag, offset, size) -> __skb_frag_dma_map(dev, frag, offset, size, DMA_TO_DEVICE) skb_frag_dma_map(dev, frag, offset, size, dir) -> __skb_frag_dma_map(dev, frag, offset, size, dir) No object code size changes for the existing callers. Users passing fewer arguments also won't have bigger size compared to the full equivalent call. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
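
The argument-counting trick behind such a shorthand can be tried out in plain C. Everything below is a hypothetical model: `frag_map()`, `frag_map_full()` and the `*_sketch` types are made up, and the "mapping" merely records the parameters it was called with, so that the defaults listed above can be observed.

```c
#include <stddef.h>

enum dma_dir_sketch { DIR_TO_DEVICE = 1, DIR_FROM_DEVICE = 2 };

struct frag_sketch {
	size_t size;
};

/* Records what a full five-argument call would have been asked to map. */
struct map_req_sketch {
	size_t offset;
	size_t size;
	enum dma_dir_sketch dir;
};

static struct map_req_sketch frag_map_full(const void *dev,
					   const struct frag_sketch *frag,
					   size_t offset, size_t len,
					   enum dma_dir_sketch dir)
{
	(void)dev;
	(void)frag;

	return (struct map_req_sketch){
		.offset = offset, .size = len, .dir = dir,
	};
}

/* One variant per argument count, each filling in the missing defaults. */
#define FRAG_MAP_2(dev, frag) \
	frag_map_full(dev, frag, 0, (frag)->size, DIR_TO_DEVICE)
#define FRAG_MAP_3(dev, frag, off) \
	frag_map_full(dev, frag, off, (frag)->size - (off), DIR_TO_DEVICE)
#define FRAG_MAP_4(dev, frag, off, len) \
	frag_map_full(dev, frag, off, len, DIR_TO_DEVICE)
#define FRAG_MAP_5(dev, frag, off, len, dir) \
	frag_map_full(dev, frag, off, len, dir)

/* Count 2..5 arguments and dispatch to the matching variant. */
#define PICK_6TH(_1, _2, _3, _4, _5, n, ...)	n
#define COUNT_ARGS(...)	PICK_6TH(__VA_ARGS__, 5, 4, 3, 2, 1)
#define CONCAT_(a, b)	a##b
#define CONCAT(a, b)	CONCAT_(a, b)
#define frag_map(...) \
	CONCAT(FRAG_MAP_, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
```

Since the dispatch happens entirely in the preprocessor, the short forms expand to the same call as the full form, which is consistent with the "no object code size changes" note.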

Sometimes, there's a need to modify a lot of static keys or modify the same key multiple times in a loop. In that case, it seems more optimal to take cpus_read_lock() once and then call the _cpuslocked() variants. The enable/disable functions are already exported, the refcounted counterparts however are not. Fix that to allow modules to save some cycles. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Expand libeth's Page Pool functionality by adding native XDP support. This means picking the appropriate headroom and DMA direction. Also, register all the created &page_pools as XDP memory models. A driver then can call xdp_rxq_info_attach_page_pool() when registering its RxQ info. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

"Couple" is a bit humbly... Add the following functionality to libeth: * XDP shared queues managing * XDP_TX bulk sending infra * .ndo_xdp_xmit() infra * adding buffers to &xdp_buff * running XDP prog and managing its verdict * completing XDP Tx buffers * ^ repeat everything for XSk Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # lots of stuff Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Extend completion queue cleaning function to support queue-based scheduling mode needed for XDP queues. Add 4-byte descriptor for queue-based scheduling mode and perform some refactoring to extract the common code for both scheduling modes. Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

SW marker descriptors on completion queues are used only when a queue is about to be destroyed. It's far from hotpath and handling it in the hotpath NAPI poll makes no sense. Instead, run a simple poller after a virtchnl message for destroying the queue is sent and wait for the replies. If replies for all of the queues are received, this means the synchronization is done correctly and we can go forth with stopping the link. Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Extend basic structures of the driver (e.g. 'idpf_vport', 'idpf_*_queue', 'idpf_vport_user_config_data') by adding members necessary to support XDP. Add extra XDP Tx queues needed to support XDP_TX and XDP_REDIRECT actions without interfering with regular Tx traffic. Also add functions dedicated to support XDP initialization for Rx and Tx queues and call those functions from the existing algorithms of queues configuration. Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Implement loading the XDP program using ndo_bpf callback for splitq and XDP_SETUP_PROG parameter. Add functions for stopping, reconfiguring and restarting all queues when needed. Also, implement the XDP hot swap mechanism when the existing XDP program is replaced by another one (without a necessity of reconfiguring anything). Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

In preparation for XDP support, move from having skb as the main frame container during the Rx polling to &xdp_buff. This allows using generic and libie helpers for building an XDP buffer and changes the logic: now we try to allocate an skb only after we have processed all the descriptors related to the frame. Store &libeth_xdp_stash instead of the skb pointer on the Rx queue. It's only 8 bytes wider and there's a place to fit it in. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Use libeth XDP infra to support running XDP program on Rx polling. This includes all of the possible verdicts/actions. XDP Tx queues are cleaned only in "lazy" mode when there are less than 1/4 free descriptors left on the ring. libeth helper macros to define driver-specific XDP functions make sure the compiler could uninline them when needed. Use __LIBETH_WORD_ACCESS to parse descriptors more efficiently when applicable. It really gives some good boosts and code size reduction on x86_64. Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Use libeth XDP infra to implement .ndo_xdp_xmit() in idpf. The Tx callbacks are reused from XDP_TX code. XDP redirect target feature is set/cleared depending on the XDP prog presence, as for now we still don't allocate XDP Tx queues when there's no program. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Add &xdp_metadata_ops with a callback to get RSS hash hint from the descriptor. Declare the splitq 32-byte descriptor as 4 u64s to parse them more efficiently when possible. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Implement VC functions dedicated to enabling, disabling and configuring randomly selected queues. Also, refactor the existing implementation to make the code more modular. Introduce new generic functions for sending VC messages consisting of chunks, in order to isolate the sending algorithm and its implementation for specific VC messages. Finally, rewrite the function for mapping queues to q_vectors using the new modular approach to avoid copying the code that implements the VC message sending algorithm. Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Add functionality to setup an XSk buffer pool, including ability to stop, reconfig and start only selected queues, not the whole device. Pool DMA mapping is managed by libeth. Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Implement Tx handling for AF_XDP feature in zero-copy mode using the libeth (libeth_xdp) XSk infra. When the NAPI poll is called, XSk Tx queues are polled first, before regular Tx and Rx. They're generally faster to serve and have higher priority compared to regular traffic. Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Implement Rx packet processing specific to AF_XDP ZC using the libeth XSk infra. Initialize queue registers before allocating buffers to avoid redundant ifs when updating the queue tail. Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Now that AF_XDP functionality is fully implemented, advertise XSk XDP feature and add .ndo_xsk_wakeup() callback to be able to use it with this driver. Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The commit 35d653a ("idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq") uses soft_reset to perform a full vport reconfiguration after changes in XDP setup. Unfortunately, the soft_reset may fail after attaching the XDP program to the vport. It can happen when the HW limits resources that can be allocated to fulfill XDP requirements. In such a case, before we return an error from XDP_SETUP_PROG, we have to fully restore the previous vport state, including removing the XDP program. In order to remove the already loaded XDP program in case of reset error, re-implement the error handling path and move some calls to the XDP callback. Fixes: 35d653a ("idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq") Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>

Syzkaller reported this warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0 Modules linked in: CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:inet_sock_destruct+0x1c5/0x1e0 Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00 RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206 RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007 RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00 RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007 R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00 R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78 FS: 0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn+0x88/0x130 ? inet_sock_destruct+0x1c5/0x1e0 ? report_bug+0x18e/0x1a0 ? handle_bug+0x53/0x90 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? inet_sock_destruct+0x1c5/0x1e0 __sk_destruct+0x2a/0x200 rcu_do_batch+0x1aa/0x530 ? rcu_do_batch+0x13b/0x530 rcu_core+0x159/0x2f0 handle_softirqs+0xd3/0x2b0 ? __pfx_smpboot_thread_fn+0x10/0x10 run_ksoftirqd+0x25/0x30 smpboot_thread_fn+0xdd/0x1d0 kthread+0xd3/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? 
__pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- It's possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add() concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked, which triggers a data-race around sk->sk_forward_alloc: tcp_v6_rcv tcp_v6_do_rcv skb_clone_and_charge_r sk_rmem_schedule __sk_mem_schedule sk_forward_alloc_add() skb_set_owner_r sk_mem_charge sk_forward_alloc_add() __kfree_skb skb_release_all skb_release_head_state sock_rfree sk_mem_uncharge sk_forward_alloc_add() sk_mem_reclaim // set local var reclaimable __sk_mem_reclaim sk_forward_alloc_add() In this syzkaller testcase, two threads call tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like this: (cpu 1) | (cpu 2) | sk_forward_alloc ... | ... | 0 __sk_mem_schedule() | | +4096 = 4096 | __sk_mem_schedule() | +4096 = 8192 sk_mem_charge() | | -768 = 7424 | sk_mem_charge() | -768 = 6656 ... | ... | sk_mem_uncharge() | | +768 = 7424 reclaimable=7424 | | | sk_mem_uncharge() | +768 = 8192 | reclaimable=8192 | __sk_mem_reclaim() | | -4096 = 4096 | __sk_mem_reclaim() | -8192 = -4096 != 0 The skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when sk->sk_state is TCP_LISTEN, it happens later in tcp_v6_syn_recv_sock(). Fix the same issue in dccp_v6_do_rcv(). Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Fixes: e994b2f ("tcp: do not lock listener to process SYN packets") Signed-off-by: Wang Liang <wangliang74@huawei.com> Link: https://patch.msgid.link/20241107023405.889239-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
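
The arithmetic in the table above can be replayed deterministically. The sketch below is single-threaded and simply performs the listed steps in the listed order; it is not kernel code. The one assumption, inferred from the table itself (7424 leads to -4096, 8192 leads to -8192), is that a reclaim subtracts the sampled `reclaimable` value rounded down to whole 4096-byte pages.

```c
#define PAGE_SIZE_SKETCH	4096
#define TRUESIZE_SKETCH		768

/* Amount a reclaim gives back: the sampled value in whole pages. */
static int reclaim_amount(int reclaimable)
{
	return reclaimable / PAGE_SIZE_SKETCH * PAGE_SIZE_SKETCH;
}

/* Replays the interleaving from the table; returns the final counter. */
static int replay_racy(void)
{
	int fwd = 0;
	int reclaimable1, reclaimable2;

	fwd += PAGE_SIZE_SKETCH;	/* cpu 1: schedule  -> 4096 */
	fwd += PAGE_SIZE_SKETCH;	/* cpu 2: schedule  -> 8192 */
	fwd -= TRUESIZE_SKETCH;		/* cpu 1: charge    -> 7424 */
	fwd -= TRUESIZE_SKETCH;		/* cpu 2: charge    -> 6656 */

	fwd += TRUESIZE_SKETCH;		/* cpu 1: uncharge  -> 7424 */
	reclaimable1 = fwd;		/* cpu 1 samples 7424 */
	fwd += TRUESIZE_SKETCH;		/* cpu 2: uncharge  -> 8192 */
	reclaimable2 = fwd;		/* cpu 2 samples 8192 */

	fwd -= reclaim_amount(reclaimable1);	/* cpu 1: -4096 */
	fwd -= reclaim_amount(reclaimable2);	/* cpu 2: -8192 */

	return fwd;
}

/* Same work, but each CPU finishes before the other one starts. */
static int replay_serialized(void)
{
	int fwd = 0;

	for (int cpu = 0; cpu < 2; cpu++) {
		fwd += PAGE_SIZE_SKETCH;
		fwd -= TRUESIZE_SKETCH;
		fwd += TRUESIZE_SKETCH;
		fwd -= reclaim_amount(fwd);
	}

	return fwd;
}
```

The racy replay ends at -4096, matching the last row of the table, while the serialized one returns to 0. That non-zero leftover is what the WARNING in inet_sock_destruct() reports.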

This fixes the circular locking dependency warning below by reworking iso_sock_recvmsg to ensure that the socket lock is always released before calling a function that locks hdev.

[ 561.670344] ======================================================
[ 561.670346] WARNING: possible circular locking dependency detected
[ 561.670349] 6.12.0-rc6+ #26 Not tainted
[ 561.670351] ------------------------------------------------------
[ 561.670353] iso-tester/3289 is trying to acquire lock:
[ 561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3}, at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[ 561.670405] but task is already holding lock:
[ 561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}, at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[ 561.670450] which lock already depends on the new lock.
[ 561.670452] the existing dependency chain (in reverse order) is:
[ 561.670453] -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}:
[ 561.670458]        lock_acquire+0x7c/0xc0
[ 561.670463]        lock_sock_nested+0x3b/0xf0
[ 561.670467]        bt_accept_dequeue+0x1a5/0x4d0 [bluetooth]
[ 561.670510]        iso_sock_accept+0x271/0x830 [bluetooth]
[ 561.670547]        do_accept+0x3dd/0x610
[ 561.670550]        __sys_accept4+0xd8/0x170
[ 561.670553]        __x64_sys_accept+0x74/0xc0
[ 561.670556]        x64_sys_call+0x17d6/0x25f0
[ 561.670559]        do_syscall_64+0x87/0x150
[ 561.670563]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 561.670567] -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[ 561.670571]        lock_acquire+0x7c/0xc0
[ 561.670574]        lock_sock_nested+0x3b/0xf0
[ 561.670577]        iso_sock_listen+0x2de/0xf30 [bluetooth]
[ 561.670617]        __sys_listen_socket+0xef/0x130
[ 561.670620]        __x64_sys_listen+0xe1/0x190
[ 561.670623]        x64_sys_call+0x2517/0x25f0
[ 561.670626]        do_syscall_64+0x87/0x150
[ 561.670629]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 561.670632] -> #0 (&hdev->lock){+.+.}-{3:3}:
[ 561.670636]        __lock_acquire+0x32ad/0x6ab0
[ 561.670639]        lock_acquire.part.0+0x118/0x360
[ 561.670642]        lock_acquire+0x7c/0xc0
[ 561.670644]        __mutex_lock+0x18d/0x12f0
[ 561.670647]        mutex_lock_nested+0x1b/0x30
[ 561.670651]        iso_conn_big_sync+0x73/0x260 [bluetooth]
[ 561.670687]        iso_sock_recvmsg+0x3e9/0x500 [bluetooth]
[ 561.670722]        sock_recvmsg+0x1d5/0x240
[ 561.670725]        sock_read_iter+0x27d/0x470
[ 561.670727]        vfs_read+0x9a0/0xd30
[ 561.670731]        ksys_read+0x1a8/0x250
[ 561.670733]        __x64_sys_read+0x72/0xc0
[ 561.670736]        x64_sys_call+0x1b12/0x25f0
[ 561.670738]        do_syscall_64+0x87/0x150
[ 561.670741]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 561.670744] other info that might help us debug this:
[ 561.670745] Chain exists of:
                 &hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH
[ 561.670751] Possible unsafe locking scenario:
[ 561.670753]        CPU0                    CPU1
[ 561.670754]        ----                    ----
[ 561.670756]   lock(sk_lock-AF_BLUETOOTH);
[ 561.670758]                                lock(sk_lock AF_BLUETOOTH-BTPROTO_ISO);
[ 561.670761]                                lock(sk_lock-AF_BLUETOOTH);
[ 561.670764]   lock(&hdev->lock);
[ 561.670767] *** DEADLOCK ***

Fixes: 07a9342 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
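
The ordering rule behind this fix can be illustrated with a small userspace model. This is a sketch only: `struct demo_task`, `demo_take()`, `demo_drop()`, and the two `demo_recvmsg_*` functions are hypothetical and do not reproduce the actual Bluetooth code. It assumes the rule implied by the dependency chain above, namely that the hdev lock must never be acquired while the socket lock is held, and it checks that rule the way a very reduced lockdep might.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_lock { DEMO_LOCK_HDEV, DEMO_LOCK_SK, DEMO_LOCK_COUNT };

/* Hypothetical per-task record of held locks and detected inversions. */
struct demo_task {
	bool held[DEMO_LOCK_COUNT];
	bool inversion;
};

/* Record an acquisition; flag hdev taken while sk is held (assumed rule). */
static void demo_take(struct demo_task *t, enum demo_lock id)
{
	assert(id < DEMO_LOCK_COUNT);
	if (id == DEMO_LOCK_HDEV && t->held[DEMO_LOCK_SK])
		t->inversion = true;
	t->held[id] = true;
}

static void demo_drop(struct demo_task *t, enum demo_lock id)
{
	assert(t->held[id]);
	t->held[id] = false;
}

/* Stand-in for "a function that locks hdev". */
static void demo_big_sync(struct demo_task *t)
{
	demo_take(t, DEMO_LOCK_HDEV);
	demo_drop(t, DEMO_LOCK_HDEV);
}

/* Problematic shape: the hdev-locking call runs under the socket lock. */
bool demo_recvmsg_nested(void)
{
	struct demo_task t = { { false, false }, false };

	demo_take(&t, DEMO_LOCK_SK);
	demo_big_sync(&t);
	demo_drop(&t, DEMO_LOCK_SK);
	return t.inversion;
}

/* Reworked shape: decide under the socket lock, release it, then call. */
bool demo_recvmsg_released(void)
{
	struct demo_task t = { { false, false }, false };
	bool need_sync;

	demo_take(&t, DEMO_LOCK_SK);
	need_sync = true;            /* state inspected while locked */
	demo_drop(&t, DEMO_LOCK_SK);
	if (need_sync)
		demo_big_sync(&t);
	return t.inversion;
}
```

In this model `demo_recvmsg_nested()` reports an inversion and `demo_recvmsg_released()` does not. Releasing the socket lock first means any state read under it may have changed by the time the hdev-locking call runs, so the callee would need to revalidate whatever it depends on.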

alobakin and others added 30 commits August 19, 2024 15:42

net: napi: add ability to create CPU-pinned threaded NAPI

16f6271

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

bpf: cpumap: use CPU-pinned threaded NAPI instead of kthread

4f13a3d

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

bpf: cpumap: reuse skb array instead of a linked list to chain skbs

562765d

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

alobakin and others added 20 commits August 19, 2024 17:04

idpf: add XDP RSS hash hint

acb41d4

Add &xdp_metadata_ops with a callback to get RSS hash hint from the descriptor. Declare the splitq 32-byte descriptor as 4 u64s to parse them more efficiently when possible. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

alobakin force-pushed the idpf-libie-new branch 3 times, most recently from afb0ef4 to 448e1cc Compare August 20, 2024 13:34

alobakin force-pushed the idpf-libie-new branch 5 times, most recently from 205aad8 to 07f2c7b Compare September 3, 2024 14:35
