
Commit 9610bd9

mfijalko authored and anguy11 committed
ice: optimize XDP_TX workloads
Optimize Tx descriptor cleaning for XDP. The current approach doesn't really scale and chokes when multiple flows are handled.

Introduce two ring fields, @next_dd and @next_rs, that will keep track of the descriptor that should be looked at when the need for cleaning arises, and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable.

First thing is to get away from setting the RS bit on each descriptor. Let's do this only once NTU is higher than the current @next_rs value. In that case, grab tx_desc[next_rs], set the RS bit in the descriptor and advance @next_rs by 32.

Second thing is to clean the Tx ring only when there are fewer than 32 free entries. For that case, check tx_desc[next_dd] for the DD bit. This bit is written back by HW to let the driver know that the xmit was successful. It will happen only for those descriptors that had the RS bit set. Clean only 32 descriptors and advance the DD bit.

The actual cleaning routine is moved from ice_napi_poll() down to ice_xmit_xdp_ring(). It is safe to do so as the XDP ring will not get any SKBs that would rely on interrupts for cleaning. A nice side effect is that for the rare Tx fallback path (which the next patch is going to introduce) we don't have to trigger the SW irq to clean the ring.

With those two concepts, the ring is kept almost full, but it is guaranteed that the driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced one by one.

Test was conducted with the ice HW bombarded with packets from a HW generator, configured to generate 30 flows.
The xdp2 sample yields the following results:

<snip>
proto 17:   79973066 pkt/s
proto 17:   80018911 pkt/s
proto 17:   80004654 pkt/s
proto 17:   79992395 pkt/s
proto 17:   79975162 pkt/s
proto 17:   79955054 pkt/s
proto 17:   79869168 pkt/s
proto 17:   79823947 pkt/s
proto 17:   79636971 pkt/s
</snip>

As that sample reports the Rx'ed frames, let's look at the sar output. It shows that what we Rx'ed we actually Tx'ed, with no noticeable drops:

Average: IFACE      rxpck/s     txpck/s     rxkB/s     txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38    0.00    0.00     0.00   38.32

with tx_busy staying calm.

Compared to the state before:

Average: IFACE      rxpck/s     txpck/s     rxkB/s     txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04    0.00    0.00     0.00   43.64

it can be observed that txpck/s almost doubles, meaning the performance is improved by around 90%. All of this is due to drops in the driver; previously the tx_busy stat was bumped at a 7 Mpps rate.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
1 parent eb087cd commit 9610bd9

File tree

4 files changed: +88 −25 lines

drivers/net/ethernet/intel/ice/ice_main.c

Lines changed: 8 additions & 1 deletion
@@ -2372,7 +2372,8 @@ static int ice_vsi_req_irq_msix(struct ice_vsi *vsi, char *basename)
 static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
 {
 	struct device *dev = ice_pf_to_dev(vsi->back);
-	int i;
+	struct ice_tx_desc *tx_desc;
+	int i, j;
 
 	for (i = 0; i < vsi->num_xdp_txq; i++) {
 		u16 xdp_q_idx = vsi->alloc_txq + i;
@@ -2387,13 +2388,19 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
 		xdp_ring->reg_idx = vsi->txq_map[xdp_q_idx];
 		xdp_ring->vsi = vsi;
 		xdp_ring->netdev = NULL;
+		xdp_ring->next_dd = ICE_TX_THRESH - 1;
+		xdp_ring->next_rs = ICE_TX_THRESH - 1;
 		xdp_ring->dev = dev;
 		xdp_ring->count = vsi->num_tx_desc;
 		WRITE_ONCE(vsi->xdp_rings[i], xdp_ring);
 		if (ice_setup_tx_ring(xdp_ring))
 			goto free_xdp_rings;
 		ice_set_ring_xdp(xdp_ring);
 		xdp_ring->xsk_pool = ice_tx_xsk_pool(xdp_ring);
+		for (j = 0; j < xdp_ring->count; j++) {
+			tx_desc = ICE_TX_DESC(xdp_ring, j);
+			tx_desc->cmd_type_offset_bsz = cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE);
+		}
 	}
 
 	ice_for_each_rxq(vsi, i)

drivers/net/ethernet/intel/ice/ice_txrx.c

Lines changed: 10 additions & 11 deletions
@@ -247,11 +247,8 @@ static bool ice_clean_tx_irq(struct ice_tx_ring *tx_ring, int napi_budget)
 		total_bytes += tx_buf->bytecount;
 		total_pkts += tx_buf->gso_segs;
 
-		if (ice_ring_is_xdp(tx_ring))
-			page_frag_free(tx_buf->raw_buf);
-		else
-			/* free the skb */
-			napi_consume_skb(tx_buf->skb, napi_budget);
+		/* free the skb */
+		napi_consume_skb(tx_buf->skb, napi_budget);
 
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
@@ -307,9 +304,6 @@ static bool ice_clean_tx_irq(struct ice_tx_ring *tx_ring, int napi_budget)
 
 	ice_update_tx_ring_stats(tx_ring, total_pkts, total_bytes);
 
-	if (ice_ring_is_xdp(tx_ring))
-		return !!budget;
-
 	netdev_tx_completed_queue(txring_txq(tx_ring), total_pkts,
 				  total_bytes);
 
@@ -1418,9 +1412,14 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	ice_for_each_tx_ring(tx_ring, q_vector->tx) {
-		bool wd = tx_ring->xsk_pool ?
-			  ice_clean_tx_irq_zc(tx_ring, budget) :
-			  ice_clean_tx_irq(tx_ring, budget);
+		bool wd;
+
+		if (tx_ring->xsk_pool)
+			wd = ice_clean_tx_irq_zc(tx_ring, budget);
+		else if (ice_ring_is_xdp(tx_ring))
+			wd = true;
+		else
+			wd = ice_clean_tx_irq(tx_ring, budget);
 
 		if (!wd)
 			clean_complete = false;

drivers/net/ethernet/intel/ice/ice_txrx.h

Lines changed: 6 additions & 4 deletions
@@ -13,6 +13,7 @@
 #define ICE_MAX_CHAINED_RX_BUFS	5
 #define ICE_MAX_BUF_TXD	8
 #define ICE_MIN_TX_LEN	17
+#define ICE_TX_THRESH	32
 
 /* The size limit for a transmit buffer in a descriptor is (16K - 1).
  * In order to align with the read requests we will align the value to
@@ -310,12 +311,15 @@ struct ice_tx_ring {
 	struct ice_vsi *vsi;		/* Backreference to associated VSI */
 	/* CL2 - 2nd cacheline starts here */
 	dma_addr_t dma;			/* physical address of ring */
+	struct xsk_buff_pool *xsk_pool;
 	u16 next_to_use;
 	u16 next_to_clean;
+	u16 next_rs;
+	u16 next_dd;
+	u16 q_handle;			/* Queue handle per TC */
+	u16 reg_idx;			/* HW register index of the ring */
 	u16 count;			/* Number of descriptors */
 	u16 q_index;			/* Queue number of ring */
-	struct xsk_buff_pool *xsk_pool;
-
 	/* stats structs */
 	struct ice_q_stats stats;
 	struct u64_stats_sync syncp;
@@ -326,8 +330,6 @@ struct ice_tx_ring {
 	DECLARE_BITMAP(xps_state, ICE_TX_NBITS);	/* XPS Config State */
 	struct ice_ptp_tx *tx_tstamps;
 	u32 txq_teid;			/* Added Tx queue TEID */
-	u16 q_handle;			/* Queue handle per TC */
-	u16 reg_idx;			/* HW register index of the ring */
 #define ICE_TX_FLAGS_RING_XDP	BIT(0)
 	u8 flags;
 	u8 dcb_tc;			/* Traffic class of ring */

drivers/net/ethernet/intel/ice/ice_txrx_lib.c

Lines changed: 64 additions & 9 deletions
@@ -3,6 +3,7 @@
 
 #include "ice_txrx_lib.h"
 #include "ice_eswitch.h"
+#include "ice_lib.h"
 
 /**
  * ice_release_rx_desc - Store the new tail and head values
@@ -213,6 +214,52 @@ ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag)
 	napi_gro_receive(&rx_ring->q_vector->napi, skb);
 }
 
+/**
+ * ice_clean_xdp_irq - Reclaim resources after transmit completes on XDP ring
+ * @xdp_ring: XDP ring to clean
+ */
+static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
+{
+	unsigned int total_bytes = 0, total_pkts = 0;
+	u16 ntc = xdp_ring->next_to_clean;
+	struct ice_tx_desc *next_dd_desc;
+	u16 next_dd = xdp_ring->next_dd;
+	struct ice_tx_buf *tx_buf;
+	int i;
+
+	next_dd_desc = ICE_TX_DESC(xdp_ring, next_dd);
+	if (!(next_dd_desc->cmd_type_offset_bsz &
+	    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
+		return;
+
+	for (i = 0; i < ICE_TX_THRESH; i++) {
+		tx_buf = &xdp_ring->tx_buf[ntc];
+
+		total_bytes += tx_buf->bytecount;
+		/* normally tx_buf->gso_segs was taken but at this point
+		 * it's always 1 for us
+		 */
+		total_pkts++;
+
+		page_frag_free(tx_buf->raw_buf);
+		dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
+				 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
+		dma_unmap_len_set(tx_buf, len, 0);
+		tx_buf->raw_buf = NULL;
+
+		ntc++;
+		if (ntc >= xdp_ring->count)
+			ntc = 0;
+	}
+
+	next_dd_desc->cmd_type_offset_bsz = 0;
+	xdp_ring->next_dd = xdp_ring->next_dd + ICE_TX_THRESH;
+	if (xdp_ring->next_dd > xdp_ring->count)
+		xdp_ring->next_dd = ICE_TX_THRESH - 1;
+	xdp_ring->next_to_clean = ntc;
+	ice_update_tx_ring_stats(xdp_ring, total_pkts, total_bytes);
+}
+
 /**
  * ice_xmit_xdp_ring - submit single packet to XDP ring for transmission
  * @data: packet data pointer
@@ -226,6 +273,9 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
 	struct ice_tx_buf *tx_buf;
 	dma_addr_t dma;
 
+	if (ICE_DESC_UNUSED(xdp_ring) < ICE_TX_THRESH)
+		ice_clean_xdp_irq(xdp_ring);
+
 	if (!unlikely(ICE_DESC_UNUSED(xdp_ring))) {
 		xdp_ring->tx_stats.tx_busy++;
 		return ICE_XDP_CONSUMED;
@@ -246,21 +296,26 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
 
 	tx_desc = ICE_TX_DESC(xdp_ring, i);
 	tx_desc->buf_addr = cpu_to_le64(dma);
-	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TXD_LAST_DESC_CMD, 0,
+	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP, 0,
						      size, 0);
 
-	/* Make certain all of the status bits have been updated
-	 * before next_to_watch is written.
-	 */
-	smp_wmb();
-
 	i++;
-	if (i == xdp_ring->count)
+	if (i == xdp_ring->count) {
 		i = 0;
-
-	tx_buf->next_to_watch = tx_desc;
+		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
+		tx_desc->cmd_type_offset_bsz |=
+			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+		xdp_ring->next_rs = ICE_TX_THRESH - 1;
+	}
 	xdp_ring->next_to_use = i;
 
+	if (i > xdp_ring->next_rs) {
+		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
+		tx_desc->cmd_type_offset_bsz |=
+			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+		xdp_ring->next_rs += ICE_TX_THRESH;
+	}
+
	return ICE_XDP_TX;
 }
