Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
==============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver as is, with no need for any special
guest modifications.

While it complies with the VMware device API, it can also communicate with
bare-metal RDMA-enabled machines. It does not require an RDMA HCA in the
host; it can also work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit. Although not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions require
updating the kernel to 4.14 or newer to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
 https://github.com/linux-rdma/rdma-core.git
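As a rough sketch (the rdma-core README is authoritative, and the build.sh
helper is assumed to be present in current trees), the library can be built
with:
  git clone https://github.com/linux-rdma/rdma-core.git
  cd rdma-core
  bash build.sh
The resulting libraries are left under the build/ directory unless installed
system-wide.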


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA hardware,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions at:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
 rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma
backend.
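To verify that the new ibdevice came up, the standard tools can be used
(ibv_devices is part of the libibverbs utilities):
 rxe_cfg status
 ibv_devices
The rxe0 device should appear in both listings.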


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can be used as the backend.
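To pick a suitable backend, list the host's ibdevices and their port states
with ibv_devinfo (also part of the libibverbs utilities); the chosen device
should report a port in the PORT_ACTIVE state:
 ibv_devinfo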


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag, after installing
the required RDMA libraries.
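A minimal sketch of the build steps (the parallelism level is illustrative):
 ./configure --enable-rdma
 make -j8
configure will fail if the required RDMA libraries are missing.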



3. Usage
========
Currently the device works only with memory-backend RAM,
which must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \

The pvrdma device is composed of two functions:
 - Function 0 is a vmxnet3 Ethernet device which is redundant in the guest
   but is required to pass the ibdevice GID using its MAC address.
   Examples:
     For an rxe backend using the eth0 interface, use its MAC:
       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
     For an SR-IOV VF, take the MAC of the Ethernet interface exposed by it:
       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
 - Function 1 is the actual device:
       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
   where the ibdevice can be an rxe device or an RDMA VF (e.g. mlx5_4)
   Note: Pay special attention that the GID at backend-gid-idx matches the
   vmxnet3 device's MAC. The conversion rules are part of the RoCE spec, but
   since manual conversion is not required, spotting problems is not hard:
     Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
              MAC: 7c:fe:90:cb:74:3a
     Note the difference between the first byte of the MAC and the GID
     (the universal/local bit is flipped), and the ff:fe inserted in the
     middle.
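The GID above is the MAC address mapped through the modified EUI-64 rules.
A minimal C sketch for cross-checking a MAC against the expected link-local
GID (purely illustrative, not part of QEMU or the driver):

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  /* Derive the expected link-local RoCE GID from a MAC address. */
  static void mac_to_gid(const uint8_t mac[6], uint8_t gid[16])
  {
      memset(gid, 0, 16);        /* gid[2..7] stay zero */
      gid[0]  = 0xfe;            /* link-local prefix fe80::/64 */
      gid[1]  = 0x80;
      gid[8]  = mac[0] ^ 0x02;   /* flip the universal/local bit */
      gid[9]  = mac[1];
      gid[10] = mac[2];
      gid[11] = 0xff;            /* ff:fe inserted in the middle */
      gid[12] = 0xfe;
      gid[13] = mac[3];
      gid[14] = mac[4];
      gid[15] = mac[5];
  }

  int main(void)
  {
      const uint8_t mac[6] = { 0x7c, 0xfe, 0x90, 0xcb, 0x74, 0x3a };
      uint8_t gid[16];
      int i;

      mac_to_gid(mac, gid);
      for (i = 0; i < 16; i += 2) {
          printf("%02x%02x%s", gid[i], gid[i + 1], i < 14 ? ":" : "\n");
      }
      return 0;   /* prints fe80:0000:0000:0000:7efe:90ff:fecb:743a */
  }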



4. Implementation details
=========================


4.1 Overview
============
The device acts as a proxy between the guest driver and the host
ibdevice interface.
On the configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma device
   requests a corresponding resource from the backend interface,
   maintaining a 1-1 mapping between guest and host.
On the data path:
 - Every post_send/post_recv received from the guest is converted into
   a post_send/post_recv for the backend (see the sketch after this
   list). The buffer data is not touched or copied, resulting in near
   bare-metal performance for large enough buffers.
 - Completions from the backend interface result in completions for
   the pvrdma device.
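A hedged sketch of what the zero-copy conversion amounts to in libibverbs
terms; guest_sge, to_host_va and forward_post_send are hypothetical names,
not the actual QEMU internals:

  #include <infiniband/verbs.h>

  /* Hypothetical guest WQE scatter/gather entry. */
  struct guest_sge {
      uint64_t guest_pa;   /* guest physical address of the buffer */
      uint32_t length;
      uint32_t lkey;       /* key of the 1-1 mapped backend MR */
  };

  static int forward_post_send(struct ibv_qp *backend_qp, uint64_t wr_id,
                               const struct guest_sge *gsge, int num_sge,
                               uint64_t (*to_host_va)(uint64_t guest_pa))
  {
      struct ibv_sge sge[4];   /* assume num_sge <= 4 for the sketch */
      struct ibv_send_wr wr = { 0 }, *bad_wr;
      int i;

      for (i = 0; i < num_sge; i++) {
          /* Point the backend directly at the guest buffer: no copy. */
          sge[i].addr   = to_host_va(gsge[i].guest_pa);
          sge[i].length = gsge[i].length;
          sge[i].lkey   = gsge[i].lkey;
      }
      wr.wr_id      = wr_id;
      wr.sg_list    = sge;
      wr.num_sge    = num_sge;
      wr.opcode     = IBV_WR_SEND;
      wr.send_flags = IBV_SEND_SIGNALED;
      return ibv_post_send(backend_qp, &wr, &bad_wr);
  }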


4.2 PCI BARs
============
PCI BARs:
 BAR 0 - MSI-X
        MSI-X vectors:
        (0) Command - used when execution of a command is completed.
        (1) Async - not in use.
        (2) Completion - used when a completion event is placed in
            the device's CQ ring.
 BAR 1 - Registers
        --------------------------------------------------------
        | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
        --------------------------------------------------------
        DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
                   - General info such as driver version
                   - Address of 'command' and 'response'
                   - Address of async ring
                   - Address of device's CQ ring
                   - Device capabilities
        CTL - Device control operations (activate, reset etc.)
        REQ - Command execution register
        ERR - Operation status
        ICR - Interrupt cause
        IMR - Interrupt mask, set by the driver

 BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
        ---------------------------------------------------------
        - Offset 0 used for QP operations (send and recv)
        - Offset 4 used for CQ operations (arm and poll)
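As a reference sketch, the BAR 1 layout can be expressed as a C register
map; the offsets below are assumptions based on the guest driver's
pvrdma_dev_api.h and should be verified against that header:

  /* BAR 1 register offsets (assumed; DSR needs two 32-bit writes). */
  enum pvrdma_regs {
      PVRDMA_REG_VERSION = 0x00, /* R:  device version */
      PVRDMA_REG_DSRLOW  = 0x04, /* W:  DSR physical address, low 32 bits */
      PVRDMA_REG_DSRHIGH = 0x08, /* W:  DSR physical address, high 32 bits */
      PVRDMA_REG_CTL     = 0x0c, /* W:  control (activate, reset, ...) */
      PVRDMA_REG_REQ     = 0x10, /* W:  command execution trigger */
      PVRDMA_REG_ERR     = 0x14, /* R:  operation status */
      PVRDMA_REG_ICR     = 0x18, /* R:  interrupt cause */
      PVRDMA_REG_IMR     = 0x1c, /* RW: interrupt mask */
      PVRDMA_REG_MACL    = 0x20, /* RW: MAC address, low part */
      PVRDMA_REG_MACH    = 0x24, /* RW: MAC address, high part */
  };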


4.3 Major flows
===============

4.3.1 Create CQ
===============
 - Guest driver
    - Allocates pages for the CQ ring
    - Creates a page directory (pdir) to hold the CQ ring's pages
    - Initializes the CQ ring
    - Initializes the 'Create CQ' command object (cqe, pdir etc.)
    - Copies the command to the 'command' address
    - Writes 0 into the REQ register
 - Device
    - Reads the request object from the 'command' address
    - Allocates the CQ object and initializes the CQ ring based on the pdir
    - Creates the backend CQ (see the sketch after this flow)
    - Writes the operation status to the ERR register
    - Posts a command-interrupt to the guest
 - Guest driver
    - Reads the HW response code from the ERR register
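On the device side, creating the backend CQ boils down to a single
libibverbs call. A minimal sketch, assuming the backend context and
completion channel were opened at device initialization (handle_create_cq
is a hypothetical name):

  #include <infiniband/verbs.h>

  /* A guest 'Create CQ' command maps 1-1 to one backend CQ. */
  static struct ibv_cq *handle_create_cq(struct ibv_context *backend_ctx,
                                         struct ibv_comp_channel *comp_chan,
                                         int guest_cqe)
  {
      struct ibv_cq *cq = ibv_create_cq(backend_ctx, guest_cqe, NULL,
                                        comp_chan, 0);

      if (cq) {
          /* Ask for notifications so the events thread (see 4.3.4) is
           * woken up through comp_chan. */
          ibv_req_notify_cq(cq, 0);
      }
      return cq;
  }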

4.3.2 Create QP
===============
 - Guest driver
    - Allocates pages for the send and receive rings
    - Creates a page directory (pdir) to hold the rings' pages
    - Initializes the 'Create QP' command object (max_send_wr,
      send_cq_handle, recv_cq_handle, pdir etc.)
    - Copies the object to the 'command' address
    - Writes 0 into the REQ register
 - Device
    - Reads the request object from the 'command' address
    - Allocates the QP object and initializes
        - Send and recv rings based on the pdir
        - Send and recv ring state
    - Creates the backend QP (see the sketch after this flow)
    - Writes the operation status to the ERR register
    - Posts a command-interrupt to the guest
 - Guest driver
    - Reads the HW response code from the ERR register
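The backend QP creation is equally direct; a minimal sketch, assuming the
backend PD and the two backend CQs already exist from earlier commands
(handle_create_qp and the sge limits are illustrative):

  #include <infiniband/verbs.h>
  #include <stdint.h>

  static struct ibv_qp *handle_create_qp(struct ibv_pd *backend_pd,
                                         struct ibv_cq *send_cq,
                                         struct ibv_cq *recv_cq,
                                         uint32_t max_send_wr,
                                         uint32_t max_recv_wr)
  {
      struct ibv_qp_init_attr attr = {
          .send_cq = send_cq,    /* backend CQs mapped 1-1 to guest CQs */
          .recv_cq = recv_cq,
          .qp_type = IBV_QPT_RC,
          .cap = {
              .max_send_wr  = max_send_wr,
              .max_recv_wr  = max_recv_wr,
              .max_send_sge = 4,
              .max_recv_sge = 4,
          },
      };

      return ibv_create_qp(backend_pd, &attr);
  }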

4.3.3 Post receive
==================
 - Guest driver
    - Initializes a wqe and places it on the recv ring
    - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
 - Device
    - Extracts the qpn from the UAR
    - Walks through the ring and does the following for each wqe
        - Prepares the backend CQE context to be used when receiving a
          completion from the backend (wr_id, op_code, emu_cq_num)
        - Prepares a backend sge for each sge
        - Calls the backend's post_recv (see the sketch after this flow)
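The forwarding step, sketched with libibverbs calls (comp_ctx and
forward_post_recv are hypothetical names; QEMU's real context bookkeeping
differs):

  #include <infiniband/verbs.h>
  #include <stdint.h>
  #include <stdlib.h>

  /* Hypothetical per-WQE completion context (wr_id, op_code, emu_cq_num). */
  struct comp_ctx {
      uint64_t wr_id;       /* guest's wr_id, restored at completion time */
      uint32_t op_code;
      uint32_t emu_cq_num;  /* emulated CQ to be notified */
  };

  static int forward_post_recv(struct ibv_qp *backend_qp,
                               uint64_t guest_wr_id, uint32_t op_code,
                               uint32_t emu_cq_num,
                               struct ibv_sge *sge, int num_sge)
  {
      struct ibv_recv_wr wr = { 0 }, *bad_wr;
      struct comp_ctx *ctx = malloc(sizeof(*ctx));

      if (!ctx) {
          return -1;
      }
      ctx->wr_id      = guest_wr_id;
      ctx->op_code    = op_code;
      ctx->emu_cq_num = emu_cq_num;

      wr.wr_id   = (uintptr_t)ctx;   /* backend wr_id carries the context */
      wr.sg_list = sge;
      wr.num_sge = num_sge;
      return ibv_post_recv(backend_qp, &wr, &bad_wr);
  }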

4.3.4 Process backend events
============================
 - Done by a dedicated thread used to process backend events;
   at initialization it is attached to the device and creates
   the communication channel.
 - Thread main loop (see the sketch after this list):
    - Polls for completions
    - Extracts emu_cq_num, wr_id and op_code from the completion context
    - Writes the CQE to the CQ ring
    - Writes the CQ number to the device CQ
    - Sends a completion-interrupt to the guest
    - Deallocates the context
    - Acks the event to the backend
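A sketch of the loop in libibverbs terms; comp_ctx matches the hypothetical
context from 4.3.3 and deliver_guest_cqe stands in for writing the CQE to
the guest ring and raising MSI-X vector 2:

  #include <infiniband/verbs.h>
  #include <stdint.h>
  #include <stdlib.h>

  struct comp_ctx {
      uint64_t wr_id;
      uint32_t op_code;
      uint32_t emu_cq_num;
  };

  /* Stand-in: write the CQE to the guest CQ ring, note the CQ number in
   * the device CQ ring and send the completion interrupt. */
  void deliver_guest_cqe(uint32_t emu_cq_num, uint64_t wr_id,
                         enum ibv_wc_status status);

  static void poll_backend(struct ibv_comp_channel *chan)
  {
      struct ibv_cq *cq;
      struct ibv_wc wc;
      void *ev_ctx;

      for (;;) {
          /* Blocks until the backend signals a completion event. */
          if (ibv_get_cq_event(chan, &cq, &ev_ctx)) {
              break;
          }
          ibv_req_notify_cq(cq, 0);        /* re-arm before draining */
          while (ibv_poll_cq(cq, 1, &wc) > 0) {
              struct comp_ctx *ctx = (struct comp_ctx *)(uintptr_t)wc.wr_id;

              deliver_guest_cqe(ctx->emu_cq_num, ctx->wr_id, wc.status);
              free(ctx);                   /* deallocate the context */
          }
          ibv_ack_cq_events(cq, 1);        /* ack the event to the backend */
      }
  }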



5. Limitations
==============
- The device is obviously limited by the guest Linux driver's implementation
  of the VMware device API.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not on the data path it should not matter much. If the default max MR
  size is increased, be aware that memory registration can take up to 0.5
  seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to initialize.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to initialize if the requirements are not met.



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however for medium buffers it gets
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance. (Tested with two VMs whose pvrdma devices were connected to
two VFs of the same physical device.)

All of the above assumes no memory registration is done on the data path.