NCCL GPU affinity (nvidia-smi topo -m) on VM: GDRDMA fails at PIX/PHB/PXB levels, but performance is good on BM #1464

dobiup opened this issue Sep 28, 2024 · 4 comments

dobiup commented Sep 28, 2024

Hi, I seriously need your help.

First, the NVIDIA HW/SW components for intra-node and inter-node communication are all installed, compatible, and enabled:
NVSwitch (NVLink, Fabric Manager), GPU (GPU driver, CUDA, NCCL), HCA (OFED, nv_peer_mem or nvidia_peermem), and so on.

NCCL 2.23.4 for CUDA 12.2 on the Proxmox hypervisor.
Ubuntu 22.04.5, Data Center GPU driver 535.183.06, Mellanox OFED 23.10-3.2.2.
nvidia_peermem 535.183.06 (bundled with the GPU driver); it was also installed manually.

Plain InfiniBand RDMA performance (not GPU Direct RDMA) is also good, matching the bare-metal environment according to ib_send/ib_receive bandwidth and latency results.
The problem is that efficient GPU Direct RDMA (PIX, PHB, PXB) does not work in NCCL because of the intra-node topology the VM exposes, as explained below.
As you know, GPU Direct RDMA only brings a performance benefit when the HCA reaches GPU memory through a PCIe switch.
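
For reference, the GPU-to-HCA path that NCCL cares about can be read directly from the topology matrix; the device name below is an example, not taken from this system:

nvidia-smi topo -m              # GPU<->NIC paths: PIX / PXB / PHB / NODE / SYS
lsmod | grep nvidia_peermem     # confirm the GPUDirect RDMA kernel module is loaded
ibstat mlx5_0                   # confirm the HCA is active (example device name)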

NCCL log (as you can see, GPU Direct RDMA was enabled at first, but then disabled due to the topology differences):
[screenshot: NCCL log, GPU Direct RDMA enabled]
[screenshot: NCCL log, GPU Direct RDMA disabled]

So, as a workaround we widened the GPU Direct RDMA distance check to the system-memory-copy level (via the CPU, which does not bring a significant performance advantage):
NCCL_NET_GDR_LEVEL=SYS
NCCL log:
[screenshot: NCCL log with NCCL_NET_GDR_LEVEL=SYS]
GPU Direct RDMA via system memory copy:
[screenshot: GPU Direct RDMA via system memory copy]
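
For context, NCCL_NET_GDR_LEVEL sets the maximum GPU-to-NIC distance at which GPU Direct RDMA is still used (values include LOC, PIX, PXB, PHB, SYS, from closest to most permissive). A minimal sketch of forcing it for a run; only the variable value and the paths come from this thread, the rest of the command line is illustrative:

export NCCL_NET_GDR_LEVEL=SYS    # allow GDRDMA even when GPU and NIC only meet across the CPU/system
export NCCL_DEBUG=INFO           # log whether GDRDMA ends up enabled or disabled
mpirun -np 16 -N 8 --hostfile /home/singtel/nccl-tests/hosts-2 \
    -x NCCL_NET_GDR_LEVEL -x NCCL_DEBUG \
    /home/singtel/nccl-tests/build/all_reduce_perf -b 128 -e 16G -f 2 -g 1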

Ultimately, what we most want is the same intra-node topology (nvidia-smi topo -m) as on bare metal.
But, with NVIDIA's help, we first want to make GPU Direct RDMA work at the PIX, PXB, and PHB levels.

The key question is how to configure the VM environment (GPU affinity together with CPU affinity) so that the exposed topology allows GPU Direct RDMA at PIX, PHB, or PXB.

VM intra-node topology:
[screenshot: nvidia-smi topo -m inside the VM]

BM intra-node topology:
[screenshot: nvidia-smi topo -m on bare metal]

mpirun \
  -np 16 \
  -N 8 \
  --bind-to socket \
  --hostfile /home/singtel/nccl-tests/hosts-2 \
  -x NCCL_IB_CUDA_SUPPORT=1 \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH \
  -x NCCL_UCX_TLS=rc_x,cuda_copy \
  -x NCCL_UCX_RNDV_THRESH=0 \
  -x UCX_MEMTYPE_CACHE=n \
  -x NCCL_COLLNET_ENABLE=0 \
  -x NCCL_PLUGIN_P2P=ucx \
  -x NCCL_DEBUG=info \
  -x NCCL_DEBUG_SUBSYS=NET \
  -x NCCL_IB_HCA=mlx5 \
  /home/singtel/nccl-tests/build/all_reduce_perf -b 128 -e 16G -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1

[screenshots: run output]

sjeaugey (Member) commented:

So, are you asking how to inject a topology inside NCCL?

Would this comment help:
NVIDIA/nccl-tests#86 (comment)


dobiup commented Sep 30, 2024

Thank you for the update.

Yes, that is exactly what I want. Where can I get the NCCL_TOPO_FILE?
Can only the VM provider (Proxmox or KVM) provide this?


dobiup commented Oct 1, 2024

Hi, SJ

Where can I find the NCCL topology XML file? I couldn't find it based on the documentation and the code below:

NCCL_TOPO_FILE
(since 2.6)
Path to an XML file to load before detecting the topology. By default, NCCL will load /var/run/nvidia-topologyd/virtualTopology.xml if present.

// Try default XML topology location
NCCLCHECKGOTO(ncclTopoGetXmlFromFile("/var/run/nvidia-topologyd/virtualTopology.xml", xml, 0), ret, fail);
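
For reference, once such a file exists (whether provided by the platform or written by hand), it only needs to be pointed to before launch; the path below is an example, not one from this system:

export NCCL_TOPO_FILE=/etc/nccl/virtualTopology.xml
# or exported through the Open MPI command shown earlier:
# mpirun ... -x NCCL_TOPO_FILE=/etc/nccl/virtualTopology.xml ...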


sjeaugey commented Oct 1, 2024

If your cloud provider doesn't provide one (or if you are launching your VM yourself), you'd need to write it, based on the physical topology.
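
For illustration only: the safest way to get a correct template is to dump the detected topology on the bare-metal host with NCCL_TOPO_DUMP_FILE and trim it down. Below is a minimal hand-written sketch, assuming a single NUMA node with one GPU and one HCA behind the same PCIe switch; every bus ID, CPU mask, vendor/model field, link value, and the output path are placeholders to be replaced with values from the physical host:

cat > /etc/nccl/virtualTopology.xml <<'EOF'
<system version="1">
  <cpu numaid="0" affinity="ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <!-- PCIe switch (placeholder bus ID) -->
    <pci busid="0000:40:00.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <!-- GPU behind the switch (class 0x030200 = 3D controller) -->
      <pci busid="0000:41:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <!-- HCA behind the same switch (class 0x020700 = InfiniBand controller) -->
      <pci busid="0000:42:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
</system>
EOF

With a file like this loaded via NCCL_TOPO_FILE, NCCL should see the GPU and the HCA as sharing a PCIe switch (PIX) instead of going through the CPU, provided the bus IDs match what the guest actually exposes.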
