
Commit bbff2f3

Björn Töpel authored and borkmann committed
xsk: new descriptor addressing scheme
Currently, AF_XDP only supports a fixed frame-size memory scheme where each frame is referenced via an index (idx). A user passes the frame index to the kernel, and the kernel acts upon the data. Some NICs, however, do not have a fixed frame-size model; instead they have a model where a memory window is passed to the hardware and multiple frames are filled into that window (referred to as the "type-writer" model).

By changing the descriptor format from the current frame index addressing scheme, AF_XDP can in the future be extended to support these kinds of NICs.

In the index-based model, an idx refers to a frame of size frame_size. Addressing a frame in the UMEM is done by offsetting the UMEM starting address by a global offset, idx * frame_size + offset. Communicating via the fill- and completion-rings is done by means of idx.

In this commit, the idx is removed in favor of an address (addr), which is a relative address ranging over the UMEM. Converting an idx-based address to the new addr is simply: addr = idx * frame_size + offset.

We also stop referring to the UMEM "frame" as a frame. Instead it is simply called a chunk.

To transfer ownership of a chunk to the kernel, the addr of the chunk is passed in the fill-ring. Note that the kernel will mask addr to make it chunk aligned, so there is no need for userspace to do that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or 3000 to the fill-ring will refer to the same chunk.

On the completion-ring, the addr will match that of the Tx descriptor passed to the kernel.

Changing the descriptor format to use chunks/addr will allow for future changes to move to a type-writer based model, where multiple frames can reside in one chunk. In this model, passing one single chunk into the fill-ring would potentially result in multiple Rx descriptors.

This commit changes the uapi of AF_XDP sockets and updates the documentation.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
1 parent a509a95 commit bbff2f3
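
For reference, a minimal C sketch (not part of this commit; the helper names are made up) of the idx-to-addr conversion and the chunk-alignment masking described in the commit message:

    /* Illustrative only: converting an old-style (idx, offset) reference
     * into the new UMEM-relative addr, and the masking the kernel applies
     * to fill-ring addrs.
     */
    #include <stdint.h>

    /* addr = idx * frame_size + offset */
    static uint64_t idx_to_addr(uint32_t idx, uint32_t frame_size, uint16_t offset)
    {
        return (uint64_t)idx * frame_size + offset;
    }

    /* The kernel masks fill-ring addrs down to the chunk boundary, so for
     * a 2k chunk size, 2048, 2050 and 3000 all refer to the same chunk.
     * chunk_size must be a power of two.
     */
    static uint64_t chunk_align(uint64_t addr, uint32_t chunk_size)
    {
        return addr & ~((uint64_t)chunk_size - 1);
    }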

File tree

8 files changed: +123 −129 lines changed


Documentation/networking/af_xdp.rst

Lines changed: 58 additions & 43 deletions
@@ -12,7 +12,7 @@ packet processing.
 
 This document assumes that the reader is familiar with BPF and XDP. If
 not, the Cilium project has an excellent reference guide at
-http://cilium.readthedocs.io/en/doc-1.0/bpf/.
+http://cilium.readthedocs.io/en/latest/bpf/.
 
 Using the XDP_REDIRECT action from an XDP program, the program can
 redirect ingress frames to other XDP enabled netdevs, using the
@@ -33,22 +33,22 @@ for a while due to a possible retransmit, the descriptor that points
 to that packet can be changed to point to another and reused right
 away. This again avoids copying data.
 
-The UMEM consists of a number of equally size frames and each frame
-has a unique frame id. A descriptor in one of the rings references a
-frame by referencing its frame id. The user space allocates memory for
-this UMEM using whatever means it feels is most appropriate (malloc,
-mmap, huge pages, etc). This memory area is then registered with the
-kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two
-rings: the FILL ring and the COMPLETION ring. The fill ring is used by
-the application to send down frame ids for the kernel to fill in with
-RX packet data. References to these frames will then appear in the RX
-ring once each packet has been received. The completion ring, on the
-other hand, contains frame ids that the kernel has transmitted
-completely and can now be used again by user space, for either TX or
-RX. Thus, the frame ids appearing in the completion ring are ids that
-were previously transmitted using the TX ring. In summary, the RX and
-FILL rings are used for the RX path and the TX and COMPLETION rings
-are used for the TX path.
+The UMEM consists of a number of equally sized chunks. A descriptor in
+one of the rings references a frame by referencing its addr. The addr
+is simply an offset within the entire UMEM region. The user space
+allocates memory for this UMEM using whatever means it feels is most
+appropriate (malloc, mmap, huge pages, etc). This memory area is then
+registered with the kernel using the new setsockopt XDP_UMEM_REG. The
+UMEM also has two rings: the FILL ring and the COMPLETION ring. The
+fill ring is used by the application to send down addr for the kernel
+to fill in with RX packet data. References to these frames will then
+appear in the RX ring once each packet has been received. The
+completion ring, on the other hand, contains frame addr that the
+kernel has transmitted completely and can now be used again by user
+space, for either TX or RX. Thus, the frame addrs appearing in the
+completion ring are addrs that were previously transmitted using the
+TX ring. In summary, the RX and FILL rings are used for the RX path
+and the TX and COMPLETION rings are used for the TX path.
 
 The socket is then finally bound with a bind() call to a device and a
 specific queue id on that device, and it is not until bind is
@@ -59,13 +59,13 @@ wants to do this, it simply skips the registration of the UMEM and its
 corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
 call and submits the XSK of the process it would like to share UMEM
 with as well as its own newly created XSK socket. The new process will
-then receive frame id references in its own RX ring that point to this
-shared UMEM. Note that since the ring structures are single-consumer /
-single-producer (for performance reasons), the new process has to
-create its own socket with associated RX and TX rings, since it cannot
-share this with the other process. This is also the reason that there
-is only one set of FILL and COMPLETION rings per UMEM. It is the
-responsibility of a single process to handle the UMEM.
+then receive frame addr references in its own RX ring that point to
+this shared UMEM. Note that since the ring structures are
+single-consumer / single-producer (for performance reasons), the new
+process has to create its own socket with associated RX and TX rings,
+since it cannot share this with the other process. This is also the
+reason that there is only one set of FILL and COMPLETION rings per
+UMEM. It is the responsibility of a single process to handle the UMEM.
 
 How is then packets distributed from an XDP program to the XSKs? There
 is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
@@ -102,10 +102,10 @@ UMEM
 
 UMEM is a region of virtual contiguous memory, divided into
 equal-sized frames. An UMEM is associated to a netdev and a specific
-queue id of that netdev. It is created and configured (frame size,
-frame headroom, start address and size) by using the XDP_UMEM_REG
-setsockopt system call. A UMEM is bound to a netdev and queue id, via
-the bind() system call.
+queue id of that netdev. It is created and configured (chunk size,
+headroom, start address and size) by using the XDP_UMEM_REG setsockopt
+system call. A UMEM is bound to a netdev and queue id, via the bind()
+system call.
 
 An AF_XDP is socket linked to a single UMEM, but one UMEM can have
 multiple AF_XDP sockets. To share an UMEM created via one socket A,
@@ -147,13 +147,17 @@ UMEM Fill Ring
 ~~~~~~~~~~~~~~
 
 The Fill ring is used to transfer ownership of UMEM frames from
-user-space to kernel-space. The UMEM indicies are passed in the
-ring. As an example, if the UMEM is 64k and each frame is 4k, then the
-UMEM has 16 frames and can pass indicies between 0 and 15.
+user-space to kernel-space. The UMEM addrs are passed in the ring. As
+an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
+16 chunks and can pass addrs between 0 and 64k.
 
 Frames passed to the kernel are used for the ingress path (RX rings).
 
-The user application produces UMEM indicies to this ring.
+The user application produces UMEM addrs to this ring. Note that the
+kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
+log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
+and 3000 refers to the same chunk.
+
 
 UMEM Completetion Ring
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -165,16 +169,15 @@ used.
 Frames passed from the kernel to user-space are frames that has been
 sent (TX ring) and can be used by user-space again.
 
-The user application consumes UMEM indicies from this ring.
+The user application consumes UMEM addrs from this ring.
 
 
 RX Ring
 ~~~~~~~
 
 The RX ring is the receiving side of a socket. Each entry in the ring
-is a struct xdp_desc descriptor. The descriptor contains UMEM index
-(idx), the length of the data (len), the offset into the frame
-(offset).
+is a struct xdp_desc descriptor. The descriptor contains UMEM offset
+(addr) and the length of the data (len).
 
 If no frames have been passed to kernel via the Fill ring, no
 descriptors will (or can) appear on the RX ring.
@@ -221,38 +224,50 @@ side is xdpsock_user.c and the XDP side xdpsock_kern.c.
 
 Naive ring dequeue and enqueue could look like this::
 
+    // struct xdp_rxtx_ring {
+    //     __u32 *producer;
+    //     __u32 *consumer;
+    //     struct xdp_desc *desc;
+    // };
+
+    // struct xdp_umem_ring {
+    //     __u32 *producer;
+    //     __u32 *consumer;
+    //     __u64 *desc;
+    // };
+
     // typedef struct xdp_rxtx_ring RING;
     // typedef struct xdp_umem_ring RING;
 
     // typedef struct xdp_desc RING_TYPE;
-    // typedef __u32 RING_TYPE;
+    // typedef __u64 RING_TYPE;
 
     int dequeue_one(RING *ring, RING_TYPE *item)
     {
-        __u32 entries = ring->ptrs.producer - ring->ptrs.consumer;
+        __u32 entries = *ring->producer - *ring->consumer;
 
         if (entries == 0)
             return -1;
 
        // read-barrier!
 
-        *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)];
-        ring->ptrs.consumer++;
+        *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
+        (*ring->consumer)++;
         return 0;
    }
 
    int enqueue_one(RING *ring, const RING_TYPE *item)
    {
-        u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer);
+        u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);
 
        if (free_entries == 0)
            return -1;
 
-        ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item;
+        ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;
 
        // write-barrier!
 
-        ring->ptrs.producer++;
+        (*ring->producer)++;
        return 0;
    }
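
The documentation's own fill-ring example (a 64k UMEM with 4k chunks, addrs between 0 and 64k) can be exercised with the naive enqueue_one() above. A sketch under those assumptions, reusing the RING/RING_TYPE aliases for the __u64-based UMEM ring from the snippet (illustrative only, assumes <linux/types.h> and the definitions shown above):

    /* Sketch only: hand every 4k chunk of a 64k UMEM to the kernel via
     * the fill ring, using the naive enqueue_one() from the snippet above
     * with RING = struct xdp_umem_ring and RING_TYPE = __u64.
     */
    static int fill_all_chunks(RING *fill_ring)
    {
        __u64 umem_size = 64 * 1024;
        __u64 chunk_size = 4 * 1024;
        __u64 addr;

        for (addr = 0; addr < umem_size; addr += chunk_size) {
            if (enqueue_one(fill_ring, &addr))
                return -1;      /* ring full */
        }
        return 0;               /* all 16 chunks now owned by the kernel */
    }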

include/uapi/linux/if_xdp.h

Lines changed: 5 additions & 7 deletions
@@ -48,8 +48,8 @@ struct xdp_mmap_offsets {
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
 	__u64 len; /* Length of packet data area */
-	__u32 frame_size; /* Frame size */
-	__u32 frame_headroom; /* Frame head room */
+	__u32 chunk_size;
+	__u32 headroom;
 };
 
 struct xdp_statistics {
@@ -66,13 +66,11 @@ struct xdp_statistics {
 
 /* Rx/Tx descriptor */
 struct xdp_desc {
-	__u32 idx;
+	__u64 addr;
 	__u32 len;
-	__u16 offset;
-	__u8 flags;
-	__u8 padding[5];
+	__u32 options;
 };
 
-/* UMEM descriptor is __u32 */
+/* UMEM descriptor is __u64 */
 
 #endif /* _LINUX_IF_XDP_H */
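
A hedged sketch of how a TX descriptor might be populated against the new uapi layout (the helper is illustrative, not from this patch; options is zeroed, since the header shown here defines no option bits):

    /* Sketch: populate the new-style Rx/Tx descriptor. addr is a relative
     * offset into the UMEM rather than a frame index.
     */
    #include <string.h>
    #include <linux/types.h>
    #include <linux/if_xdp.h>

    static void fill_tx_desc(struct xdp_desc *desc, __u64 umem_addr, __u32 len)
    {
        memset(desc, 0, sizeof(*desc));
        desc->addr = umem_addr;   /* e.g. chunk start + headroom */
        desc->len = len;          /* number of packet bytes at addr */
        desc->options = 0;
    }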

net/xdp/xdp_umem.c

Lines changed: 15 additions & 18 deletions
@@ -14,7 +14,7 @@
 
 #include "xdp_umem.h"
 
-#define XDP_UMEM_MIN_FRAME_SIZE 2048
+#define XDP_UMEM_MIN_CHUNK_SIZE 2048
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
@@ -151,12 +151,12 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)
 
 static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 {
-	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
+	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
+	unsigned int chunks, chunks_per_page;
 	u64 addr = mr->addr, size = mr->len;
-	unsigned int nframes, nfpp;
 	int size_chk, err;
 
-	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+	if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
 		/* Strictly speaking we could support this, if:
 		 * - huge pages, or*
 		 * - using an IOMMU, or
@@ -166,7 +166,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 		return -EINVAL;
 	}
 
-	if (!is_power_of_2(frame_size))
+	if (!is_power_of_2(chunk_size))
 		return -EINVAL;
 
 	if (!PAGE_ALIGNED(addr)) {
@@ -179,33 +179,30 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	if ((addr + size) < addr)
 		return -EINVAL;
 
-	nframes = (unsigned int)div_u64(size, frame_size);
-	if (nframes == 0 || nframes > UINT_MAX)
+	chunks = (unsigned int)div_u64(size, chunk_size);
+	if (chunks == 0)
 		return -EINVAL;
 
-	nfpp = PAGE_SIZE / frame_size;
-	if (nframes < nfpp || nframes % nfpp)
+	chunks_per_page = PAGE_SIZE / chunk_size;
+	if (chunks < chunks_per_page || chunks % chunks_per_page)
 		return -EINVAL;
 
-	frame_headroom = ALIGN(frame_headroom, 64);
+	headroom = ALIGN(headroom, 64);
 
-	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
+	size_chk = chunk_size - headroom - XDP_PACKET_HEADROOM;
 	if (size_chk < 0)
 		return -EINVAL;
 
 	umem->pid = get_task_pid(current, PIDTYPE_PID);
-	umem->size = (size_t)size;
 	umem->address = (unsigned long)addr;
-	umem->props.frame_size = frame_size;
-	umem->props.nframes = nframes;
-	umem->frame_headroom = frame_headroom;
+	umem->props.chunk_mask = ~((u64)chunk_size - 1);
+	umem->props.size = size;
+	umem->headroom = headroom;
+	umem->chunk_size_nohr = chunk_size - headroom;
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;
 
-	umem->frame_size_log2 = ilog2(frame_size);
-	umem->nfpp_mask = nfpp - 1;
-	umem->nfpplog2 = ilog2(nfpp);
 	refcount_set(&umem->users, 1);
 
 	err = xdp_umem_account_pages(umem);
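
For orientation, the constraints xdp_umem_reg() now enforces on chunk_size and headroom could be mirrored by a userspace pre-check along these lines (a sketch; the 4k page size and the 256-byte XDP_PACKET_HEADROOM value used here are assumptions, not part of this patch):

    /* Sketch: mirror the chunk_size/headroom checks from xdp_umem_reg()
     * before calling setsockopt(XDP_UMEM_REG).
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define UMEM_MIN_CHUNK_SIZE  2048
    #define UMEM_PAGE_SIZE       4096   /* assumption: 4k pages */
    #define UMEM_XDP_HEADROOM    256    /* assumption: XDP_PACKET_HEADROOM */

    static bool umem_config_ok(uint64_t size, uint32_t chunk_size, uint32_t headroom)
    {
        uint64_t chunks, chunks_per_page;

        if (chunk_size < UMEM_MIN_CHUNK_SIZE || chunk_size > UMEM_PAGE_SIZE)
            return false;
        if (chunk_size & (chunk_size - 1))         /* must be a power of two */
            return false;

        chunks = size / chunk_size;
        chunks_per_page = UMEM_PAGE_SIZE / chunk_size;
        if (chunks == 0 || chunks < chunks_per_page || chunks % chunks_per_page)
            return false;

        headroom = (headroom + 63) & ~63U;         /* kernel aligns headroom to 64 */
        return chunk_size >= headroom + UMEM_XDP_HEADROOM;
    }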

net/xdp/xdp_umem.h

Lines changed: 6 additions & 21 deletions
@@ -18,35 +18,20 @@ struct xdp_umem {
 	struct xsk_queue *cq;
 	struct page **pgs;
 	struct xdp_umem_props props;
-	u32 npgs;
-	u32 frame_headroom;
-	u32 nfpp_mask;
-	u32 nfpplog2;
-	u32 frame_size_log2;
+	u32 headroom;
+	u32 chunk_size_nohr;
 	struct user_struct *user;
 	struct pid *pid;
 	unsigned long address;
-	size_t size;
 	refcount_t users;
 	struct work_struct work;
+	u32 npgs;
 };
 
-static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
-{
-	u64 pg, off;
-	char *data;
-
-	pg = idx >> umem->nfpplog2;
-	off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
-
-	data = page_address(umem->pgs[pg]);
-	return data + off;
-}
-
-static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
-						    u32 idx)
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
-	return xdp_umem_get_data(umem, idx) + umem->frame_headroom;
+	return page_address(umem->pgs[addr >> PAGE_SHIFT]) +
+		(addr & (PAGE_SIZE - 1));
 }
 
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
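
The reworked xdp_umem_get_data() resolves a UMEM-relative addr by indexing the pinned page array with the page number and adding the in-page offset. A standalone sketch of the same arithmetic (illustrative only; assumes 4k pages, with pages[] standing in for the mapped umem->pgs array):

    /* Sketch of the address arithmetic in the new xdp_umem_get_data():
     * pick the page the addr falls into, then add the in-page offset.
     */
    #include <stdint.h>

    #define UMEM_PAGE_SHIFT 12
    #define UMEM_PAGE_SIZE  (1UL << UMEM_PAGE_SHIFT)

    static char *umem_get_data(char **pages, uint64_t addr)
    {
        char *page = pages[addr >> UMEM_PAGE_SHIFT];   /* which page */

        return page + (addr & (UMEM_PAGE_SIZE - 1));   /* offset within it */
    }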

net/xdp/xdp_umem_props.h

Lines changed: 2 additions & 2 deletions
@@ -7,8 +7,8 @@
 #define XDP_UMEM_PROPS_H_
 
 struct xdp_umem_props {
-	u32 frame_size;
-	u32 nframes;
+	u64 chunk_mask;
+	u64 size;
 };
 
 #endif /* XDP_UMEM_PROPS_H_ */
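
The props now carry only a chunk mask and the UMEM size. A hedged sketch of how an incoming fill-ring addr could be bounds-checked and chunk-aligned against them (illustrative only; the corresponding in-kernel checks live in files not shown in this excerpt):

    /* Sketch: reject addrs outside the UMEM, then chunk-align with
     * chunk_mask as the commit message describes (e.g. 2050 -> 2048 for
     * 2k chunks). struct umem_props mirrors struct xdp_umem_props.
     */
    #include <stdbool.h>
    #include <stdint.h>

    struct umem_props {
        uint64_t chunk_mask;    /* ~((u64)chunk_size - 1) */
        uint64_t size;          /* length of the UMEM in bytes */
    };

    static bool umem_addr_ok(const struct umem_props *p, uint64_t *addr)
    {
        if (*addr >= p->size)
            return false;           /* outside the UMEM */
        *addr &= p->chunk_mask;     /* align to the owning chunk */
        return true;
    }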
