ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 <IP> - Couldn't allocate MR #126

francisguillier · 2021-10-04T16:50:19Z

Hi,

We tried to test GPUDirect RDMA.

The test pod was deployed from https://github.com/Mellanox/k8s-images

We deployed 2 pods:

Server pod:

root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0

Waiting for client to connect... *

Client pod:

root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 192.168.111.1
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 02:00

Picking device No. 0
[pid = 56, dev = 0] device name = [NVIDIA A30-8C]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 262144 bytes GPU buffer
allocated GPU buffer address at 0000010013000000 pointer=0x10013000000
Couldn't allocate MR
failed to create mr
Failed to create MR
Failed to initialize RDMA contexts.
ERRNO: Bad address.
Failed to handle RDMA CM event.
ERRNO: Bad address.
Failed to connect RDMA CM events.
ERRNO: Bad address.
Segmentation fault (core dumped)

What does "Couldn't allocate MR" mean?

Thanks in advance.

francisguillier · 2021-10-04T16:58:33Z

Sorry, to provide some more context:
I am testing GPU Operator + Network Operator.
nv-peermem was enabled in the GPU Operator deployment.

wangku0 · 2023-05-24T07:39:42Z

Hi!
Have you solved this problem yet? I have run into the same issue and would like to know how you fixed it.
Thanks in advance.

zpkhor · 2024-01-09T11:44:53Z

Try sudo modprobe nvidia-peermem. See:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-nic-communication
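
Before rerunning ib_write_bw, it may help to confirm that a peer-memory module is actually loaded. The following is a minimal sketch, not an official procedure: the function name peermem_loaded and its optional file argument are hypothetical, and it assumes the module shows up in /proc/modules as nvidia_peermem (current driver module) or nv_peer_mem (older standalone module). In a Kubernetes pod the module belongs to the node's kernel, so loading it has to happen on the node or through the driver deployment rather than inside the test pod.

```shell
#!/bin/sh
# Return 0 if a GPUDirect peer-memory module appears in the module list.
# The list defaults to /proc/modules; the optional path argument exists only
# so the check can be run against a saved copy (hypothetical convenience).
peermem_loaded() {
    modules_file="${1:-/proc/modules}"
    # /proc/modules prints module names with underscores, one per line,
    # followed by a space and the module size.
    grep -Eq '^(nvidia_peermem|nv_peer_mem) ' "$modules_file" 2>/dev/null
}

if peermem_loaded; then
    echo "peer-memory module loaded"
else
    echo "peer-memory module not loaded; try: sudo modprobe nvidia-peermem"
fi
```

If the module is reported as loaded and MR allocation still fails with "Bad address", the cause is probably elsewhere and this check will not narrow it down further.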
