Commit edab563

docs: add pvrdma device documentation.

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

docs/pvrdma.txt (new file, +255 lines)

Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
===============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver AS IS; no special guest modifications
are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines. It does not require an RDMA HCA in the host and
can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, which allows memory
over-commit. Migration support, while not implemented yet, will be possible
with some HW assistance.

A project presentation accompanies this document:
- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf


2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git
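
As a rough sketch, assuming a Fedora-like guest with the usual build
dependencies already installed, building rdma-core typically looks like:

  $ git clone https://github.com/linux-rdma/rdma-core.git
  $ cd rdma-core
  $ bash build.sh      # binaries and libraries end up under build/

The resulting libraries can then be used from the build directory or
installed system-wide, as described in the project's README.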


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SRIOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices,
each one requiring a separate instance (rxe or SRIOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions from:
  https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an ETH interface with rxe by running:
  rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma
backend.
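
A minimal host-side sketch, assuming the rxe_cfg tool from librxe/rdma-core
is installed and eth0 is the interface to be used (names are examples):

  $ sudo modprobe rdma_rxe     # load the kernel module if not already loaded
  $ sudo rxe_cfg start         # initialize the rxe configuration
  $ sudo rxe_cfg add eth0      # bind rxe to the eth0 interface
  $ ibv_devices                # rxe0 should now appear in the ibdevice list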


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with Infiniband links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can be used as the backend.
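
To check which ibdevice has an active port (mlx5_6 below is only an example
device name), the ibv_devinfo tool from libibverbs can be used:

  $ ibv_devinfo -d mlx5_6
  # look for a port whose state reads PORT_ACTIVE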


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag after installing
the required RDMA libraries.
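
For example (the target list is illustrative; any target that supports PCI
works):

  $ ./configure --target-list=x86_64-softmmu --enable-rdma
  $ make -j$(nproc)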


3. Usage
========
Currently the device works only with memory-backed RAM
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \

The pvrdma device is composed of two functions:
 - Function 0 is a vmxnet Ethernet Device which is redundant in the Guest
   but is required to pass the ibdevice GID using its MAC.
   Examples:
     For an rxe backend using the eth0 interface it will use its MAC:
       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
     For an SRIOV VF, we take the Ethernet Interface exposed by it:
       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
 - Function 1 is the actual device:
       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
   where the ibdevice can be rxe or an RDMA VF (e.g. mlx5_4)
Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
The rules of conversion are part of the RoCE spec, but since manual conversion
is not required, spotting problems is not hard:
    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
             MAC: 7c:fe:90:cb:74:3a
    Note the difference between the first byte of the MAC and the GID.
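
Putting it together, a complete invocation for an rxe backend might look like
the following sketch (the MAC, PCI slot, GID index and backend port are
examples and must match the actual host configuration):

   qemu-system-x86_64 -enable-kvm -m 1G \
      -object memory-backend-ram,id=mb1,size=1G,share \
      -numa node,memdev=mb1 \
      -netdev user,id=net0 \
      -device vmxnet3,netdev=net0,addr=10.0,multifunction=on,mac=7c:fe:90:cb:74:3a \
      -device pvrdma,addr=10.1,backend-dev=rxe0,backend-gid-idx=0,backend-port=1

Inside the guest, ibv_devices should then list the paravirtual ibdevice once
the pvrdma driver is loaded.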


4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the Guest Driver and the host
ibdevice interface.
On the configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
   a resource from the backend interface, maintaining a 1-1 mapping
   between the guest and the host.
On the data path:
 - Every post_send/receive received from the guest will be converted into
   a post_send/receive for the backend. The buffer data will not be touched
   or copied, resulting in near bare-metal performance for large enough
   buffers.
 - Completions from the backend interface will result in completions for
   the pvrdma device.


4.2 PCI BARs
============
PCI BARs:
 BAR 0 - MSI-X
     MSI-X vectors:
        (0) Command - used when execution of a command is completed.
        (1) Async - not in use.
        (2) Completion - used when a completion event is placed in
            the device's CQ ring.
 BAR 1 - Registers
     --------------------------------------------------------
     | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
     --------------------------------------------------------
        DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
                 - General info such as driver version
                 - Address of 'command' and 'response'
                 - Address of async ring
                 - Address of device's CQ ring
                 - Device capabilities
        CTL - Device control operations (activate, reset etc)
        IMR - Set interrupt mask
        REQ - Command execution register
        ERR - Operation status

 BAR 2 - UAR
     ---------------------------------------------------------
     | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
     ---------------------------------------------------------
        - Offset 0 used for QP operations (send and recv)
        - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
- Guest driver
  - Allocates pages for the CQ ring
  - Creates a page directory (pdir) to hold the CQ ring's pages
  - Initializes the CQ ring
  - Initializes the 'Create CQ' command object (cqe, pdir etc)
  - Copies the command to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the CQ object and initializes the CQ ring based on the pdir
  - Creates the backend CQ
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
- Guest driver
  - Allocates pages for the send and receive rings
  - Creates a page directory (pdir) to hold the rings' pages
  - Initializes the 'Create QP' command object (max_send_wr,
    send_cq_handle, recv_cq_handle, pdir etc)
  - Copies the object to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the QP object and initializes
    - The send and recv rings based on the pdir
    - The send and recv ring state
  - Creates the backend QP
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
- Guest driver
  - Initializes a wqe and places it on the recv ring
  - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
- Device
  - Extracts the qpn from the UAR
  - Walks through the ring and does the following for each wqe
    - Prepares the backend CQE context to be used when
      receiving a completion from the backend (wr_id, op_code, emu_cq_num)
    - For each sge, prepares a backend sge
    - Calls the backend's post_recv

4.3.4 Process backend events
============================
- Done by a dedicated thread used to process backend events;
  at initialization it is attached to the device and creates
  the communication channel.
- Thread main loop:
  - Polls for completions
  - Extracts the QEMU _cq_num, wr_id and op_code from the context
  - Writes the CQE to the CQ ring
  - Writes the CQ number to the device CQ
  - Sends a completion-interrupt to the guest
  - Deallocates the context
  - Acks the event to the backend


5. Limitations
==============
- The device is obviously limited by the features of the VMware device API
  that the guest Linux driver implements.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not on the data path it should not matter much. If the default max mr
  size is increased, be aware that memory registration can take up to 0.5
  seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to init if the requirements are not met.
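
Since QEMU relies on the MADV_HUGEPAGE hint here, it can be useful to check
the host's transparent huge page configuration; a quick sketch:

  $ cat /sys/kernel/mm/transparent_hugepage/enabled
  # 'always' or 'madvise' allows QEMU's MADV_HUGEPAGE hint to take effect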


6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however, for medium buffers it becomes
close to bare metal, and from 1MB buffers and up it reaches bare metal
performance.
(Tested with 2 VMs, with the pvrdma devices connected to 2 VFs of the same
device.)

All the above assumes no memory registration is done on the data path.
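
A simple way to reproduce such a measurement between two guests is a
standard perftest bandwidth test (the 1MB message size and the server IP are
examples; the perftest package is assumed to be installed in both guests):

  # on the server guest
  $ ib_write_bw -s 1048576
  # on the client guest
  $ ib_write_bw -s 1048576 <server guest IP>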
