NCCL GPU affinity (nvidia-smi topo -m) on VM: GDRDMA fails at PIX/PHB/PXB levels, but performance is good on BM #1464

dobiup opened this issue Sep 28, 2024 · 4 comments

dobiup commented Sep 28, 2024

Hi, I seriously need your help.

First, the NVIDIA HW/SW components for intra-node and inter-node communication are all installed, compatible, and enabled:
NVSwitch (NVLink, Fabric Manager), GPU (GPU driver, CUDA, NCCL), HCA (OFED, nv_peer_mem or nvidia_peermem), and so on.

NCCL 2.23.4 for CUDA 12.2 on the Proxmox hypervisor.
Ubuntu 22.04.5, Data Center GPU driver 535.183.06, Mellanox OFED 23.10-3.2.2.
nvidia_peermem 535.183.06 (bundled with the GPU driver); it was also installed manually.

Plain InfiniBand RDMA performance (not GPU Direct RDMA) is also good, matching the bare-metal environment according to ib_send/ib_receive bandwidth and latency results.
The problem is that efficient GPU Direct RDMA (PIX, PHB, PXB) does not work in NCCL because of the intra-node topology the VM exposes, as explained below.
As you know, GPU Direct RDMA only brings a performance benefit when the HCA reaches GPU memory through a PCIe switch.
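
For reference, the GPU-to-HCA path that NCCL cares about can be read directly from the topology matrix; the device name below is an example, not taken from this system:

nvidia-smi topo -m              # GPU<->NIC paths: PIX / PXB / PHB / NODE / SYS
lsmod | grep nvidia_peermem     # confirm the GPUDirect RDMA kernel module is loaded
ibstat mlx5_0                   # confirm the HCA is active (example device name)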

NCCL log (as you can see, GPU Direct RDMA was enabled at first, but then disabled due to the topology differences):
[screenshot: NCCL log, GPU Direct RDMA enabled]
[screenshot: NCCL log, GPU Direct RDMA disabled]

So, as a workaround we widened the GPU Direct RDMA distance check to the system-memory-copy level (via the CPU, which does not bring a significant performance advantage):
NCCL_NET_GDR_LEVEL=SYS
NCCL log:
[screenshot: NCCL log with NCCL_NET_GDR_LEVEL=SYS]
GPU Direct RDMA via system memory copy:
[screenshot: GPU Direct RDMA via system memory copy]
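
For context, NCCL_NET_GDR_LEVEL sets the maximum GPU-to-NIC distance at which GPU Direct RDMA is still used (values include LOC, PIX, PXB, PHB, SYS, from closest to most permissive). A minimal sketch of forcing it for a run; only the variable value and the paths come from this thread, the rest of the command line is illustrative:

export NCCL_NET_GDR_LEVEL=SYS    # allow GDRDMA even when GPU and NIC only meet across the CPU/system
export NCCL_DEBUG=INFO           # log whether GDRDMA ends up enabled or disabled
mpirun -np 16 -N 8 --hostfile /home/singtel/nccl-tests/hosts-2 \
    -x NCCL_NET_GDR_LEVEL -x NCCL_DEBUG \
    /home/singtel/nccl-tests/build/all_reduce_perf -b 128 -e 16G -f 2 -g 1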

Ultimately, what we most want is the same intra-node topology (nvidia-smi topo -m) as on bare metal.
But, with NVIDIA's help, we first want to make GPU Direct RDMA work at the PIX, PXB, and PHB levels.

The key question is how to configure the VM environment (GPU affinity together with CPU affinity) so that the exposed topology allows GPU Direct RDMA at PIX, PHB, or PXB.

VM intra-node topology:
[screenshot: nvidia-smi topo -m inside the VM]

BM intra-node topology:
[screenshot: nvidia-smi topo -m on bare metal]

mpirun \
  -np 16 \
  -N 8 \
  --bind-to socket \
  --hostfile /home/singtel/nccl-tests/hosts-2 \
  -x NCCL_IB_CUDA_SUPPORT=1 \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH \
  -x NCCL_UCX_TLS=rc_x,cuda_copy \
  -x NCCL_UCX_RNDV_THRESH=0 \
  -x UCX_MEMTYPE_CACHE=n \
  -x NCCL_COLLNET_ENABLE=0 \
  -x NCCL_PLUGIN_P2P=ucx \
  -x NCCL_DEBUG=info \
  -x NCCL_DEBUG_SUBSYS=NET \
  -x NCCL_IB_HCA=mlx5 \
  /home/singtel/nccl-tests/build/all_reduce_perf -b 128 -e 16G -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1

[screenshots: run output]

sjeaugey (Member) commented:

So, are you asking how to inject a topology inside NCCL?

Would this comment help:
NVIDIA/nccl-tests#86 (comment)


dobiup commented Sep 30, 2024

Thank you for the update.

Yes, that is exactly what I want. Where can I get the NCCL_TOPO_FILE?
Can only the VM provider (Proxmox or KVM) provide this?


dobiup commented Oct 1, 2024

Hi, SJ

Where can I find the NCCL topology XML file? I couldn't find it based on the documentation and the code below:

NCCL_TOPO_FILE
(since 2.6)
Path to an XML file to load before detecting the topology. By default, NCCL will load /var/run/nvidia-topologyd/virtualTopology.xml if present.

// Try default XML topology location
NCCLCHECKGOTO(ncclTopoGetXmlFromFile("/var/run/nvidia-topologyd/virtualTopology.xml", xml, 0), ret, fail);
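
For reference, once such a file exists (whether provided by the platform or written by hand), it only needs to be pointed to before launch; the path below is an example, not one from this system:

export NCCL_TOPO_FILE=/etc/nccl/virtualTopology.xml
# or exported through the Open MPI command shown earlier:
# mpirun ... -x NCCL_TOPO_FILE=/etc/nccl/virtualTopology.xml ...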


sjeaugey commented Oct 1, 2024

If your cloud provider doesn't provide one (or if you are launching your VM yourself), you'd need to write it, based on the physical topology.
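
For illustration only: the safest way to get a correct template is to dump the detected topology on the bare-metal host with NCCL_TOPO_DUMP_FILE and trim it down. Below is a minimal hand-written sketch, assuming a single NUMA node with one GPU and one HCA behind the same PCIe switch; every bus ID, CPU mask, vendor/model field, link value, and the output path are placeholders to be replaced with values from the physical host:

cat > /etc/nccl/virtualTopology.xml <<'EOF'
<system version="1">
  <cpu numaid="0" affinity="ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <!-- PCIe switch (placeholder bus ID) -->
    <pci busid="0000:40:00.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <!-- GPU behind the switch (class 0x030200 = 3D controller) -->
      <pci busid="0000:41:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <!-- HCA behind the same switch (class 0x020700 = InfiniBand controller) -->
      <pci busid="0000:42:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
</system>
EOF

With a file like this loaded via NCCL_TOPO_FILE, NCCL should see the GPU and the HCA as sharing a PCIe switch (PIX) instead of going through the CPU, provided the bus IDs match what the guest actually exposes.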
