NCCL GPU affinity (nvidia-smi topo -m) on a VM: PIX/PHB/PXB GPUDirect RDMA fails, but performance is good on bare metal #1464
Comments
So, are you asking how to inject a topology into NCCL? Would that comment help:
Thank you for the update. Yes, that's exactly what I want. Where can I get the NCCL_TOPO_FILE?
Hi SJ, where can I find the NCCL topology XML file? I couldn't find it under NCCL_TOPO_FILE.
If your cloud provider doesn't provide one (or if you are launching your VM yourself), you'd need to write it, based on the physical topology.
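To expand on that: there is no stock NCCL_TOPO_FILE to download, because NCCL normally detects the topology itself; the file only exists to override what a VM mis-reports. A practical way to get a correct template is to dump one from a bare-metal run with `NCCL_TOPO_DUMP_FILE=<path>` and then edit it. The sketch below is a hypothetical minimal file illustrating the shape such a dump takes; every bus ID, class code, and attribute value here is a placeholder to be replaced with your machine's real values.

```xml
<!-- Hypothetical NCCL topology sketch: all busids, speeds, and attribute
     values are placeholders. Dump a real file with NCCL_TOPO_DUMP_FILE
     on bare metal and adapt it rather than writing one from scratch. -->
<system version="1">
  <cpu numaid="0" affinity="0000ffff" arch="x86_64">
    <!-- PCIe switch (class 0x060400 = PCI-to-PCI bridge) -->
    <pci busid="0000:16:00.0" class="0x060400" link_speed="16.0 GT/s" link_width="16">
      <!-- GPU behind the switch (class 0x030200 = 3D controller) -->
      <pci busid="0000:17:00.0" class="0x030200" link_speed="16.0 GT/s" link_width="16">
        <gpu dev="0" sm="80" rank="0" gdr="1"/>
      </pci>
      <!-- HCA behind the SAME switch (class 0x020700 = InfiniBand):
           this placement is what makes NCCL classify the path as PIX -->
      <pci busid="0000:18:00.0" class="0x020700" link_speed="16.0 GT/s" link_width="16">
        <nic>
          <net name="mlx5_0" dev="0" speed="200000" port="1" gdr="1"/>
        </nic>
      </pci>
    </pci>
  </cpu>
</system>
```

NCCL is then pointed at the file by adding `-x NCCL_TOPO_FILE=/path/to/topo.xml` to the mpirun command.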
Hi, I seriously need your help.
First, the NVIDIA hardware/software components for intra-node and inter-node communication are installed, compatible, and working well:
NVSwitch (NVLink, Fabric Manager), GPU (GPU driver, CUDA, NCCL), HCA (OFED, nv_peer_mem or nvidia_peermem), and so on.
NCCL 2.23.4 for CUDA 12.2, on a Proxmox hypervisor.
Ubuntu 22.04.5, data-center GPU driver 535.183.06, Mellanox OFED 23.10-3.2.2.
nv_peer_mem 535.183.06 (bundled with the GPU driver); also installed manually.
InfiniBand RDMA performance (without GPUDirect RDMA) is also good, matching the bare-metal environment according to ib_send/ib_receive bandwidth and latency results.
The problem is that efficient GPUDirect RDMA (PIX, PHB, PXB) does not work in NCCL because of the intra-node topology the VM presents, as explained below.
As you know, GPUDirect RDMA only gives a performance benefit when traffic between the HCA and GPU memory goes through a PCIe switch.
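For readers following along, the topology classes referred to throughout this thread come from the connection matrix printed by `nvidia-smi topo -m`; its legend (as printed by the tool itself) maps directly onto NCCL's path types:

```shell
# Print the GPU/NIC connection matrix. In the legend:
#   X    = self
#   PIX  = single PCIe bridge (GPU and NIC share one switch)  <- best for GDR
#   PXB  = multiple PCIe bridges, without crossing a host bridge
#   PHB  = path crosses a PCIe host bridge (CPU root complex)
#   NODE = crosses host bridges within one NUMA node
#   SYS  = crosses the NUMA/SMP interconnect (e.g. QPI/UPI)  <- worst case
nvidia-smi topo -m
```

On this VM the matrix shows only the distant classes, which is why NCCL disables GPUDirect RDMA by default.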
NCCL log (as you can see, GPUDirect RDMA was enabled, but then disabled because of the topology differences):
Enabled
Disabled
So we checked by relaxing GPUDirect RDMA to the system-memory-copy level (via the CPU, which gives no significant performance advantage):
NCCL_NET_GDR_LEVEL=SYS
NCCL log
GPU Direct RDMA via system memory copy
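For reference, `NCCL_NET_GDR_LEVEL` sets the maximum NIC-to-GPU path distance at which NCCL will still use GPUDirect RDMA, and its documented values mirror the topology classes above. A minimal sketch of the setting used in this thread:

```shell
# NCCL_NET_GDR_LEVEL: maximum NIC<->GPU PCIe distance at which GPUDirect
# RDMA is still used (values per the NCCL environment-variable docs):
#   LOC - same PCI device only
#   PIX - at most one PCIe bridge (same switch)
#   PXB - multiple PCIe bridges, no host bridge in between
#   PHB - through the CPU's PCIe host bridge
#   SYS - across the NUMA/SMP interconnect, i.e. always allow GDR
export NCCL_NET_GDR_LEVEL=SYS   # the fallback used in this thread
```

In an mpirun launch this is passed through as `-x NCCL_NET_GDR_LEVEL=SYS`, as with the other NCCL variables below.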
What we ultimately want is the same intra-node topology (nvidia-smi topo -m) as on bare metal.
But, with NVIDIA's help, we first want to make GPUDirect RDMA work at the PIX, PXB, and PHB levels.
The key question is how to configure a VM environment (GPU affinity together with CPU affinity) whose topology allows GPUDirect RDMA at PIX, PHB, or PXB.
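One way to approach that on a QEMU-based hypervisor such as Proxmox is to make the guest's PCIe topology match the physical one: expose an emulated PCIe switch and pass the GPU and its HCA through downstream ports of the same switch, so the guest sees PIX instead of a flat root bus. The fragment below is a hypothetical sketch using QEMU's generic root port and TI X3130 switch devices (on Proxmox this would go in the VM's `args:` line); the host BDFs are placeholders, other VM options are omitted, and the usual passthrough caveats (IOMMU groups, ACS) still apply.

```shell
# Hypothetical sketch: pass GPU 0000:17:00.0 and HCA 0000:18:00.0
# (placeholder host addresses) through downstream ports of ONE emulated
# PCIe switch, so the guest's nvidia-smi topo -m can report PIX.
qemu-system-x86_64 -machine q35 \
  -device pcie-root-port,id=rp1,bus=pcie.0,chassis=1,slot=1 \
  -device x3130-upstream,id=sw1up,bus=rp1 \
  -device xio3130-downstream,id=sw1dn0,bus=sw1up,chassis=2,slot=0 \
  -device xio3130-downstream,id=sw1dn1,bus=sw1up,chassis=2,slot=1 \
  -device vfio-pci,host=0000:17:00.0,bus=sw1dn0 \
  -device vfio-pci,host=0000:18:00.0,bus=sw1dn1
```

Whether NCCL then uses the fast path also depends on the guest seeing working peer-to-peer DMA between the two devices, which is host- and ACS-dependent.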
VM intra topology
BM intra topology
```shell
mpirun \
  -np 16 \
  -N 8 \
  --bind-to socket \
  --hostfile /home/singtel/nccl-tests/hosts-2 \
  -x NCCL_IB_CUDA_SUPPORT=1 \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH \
  -x NCCL_UCX_TLS=rc_x,cuda_copy \
  -x NCCL_UCX_RNDV_THRESH=0 \
  -x UCX_MEMTYPE_CACHE=n \
  -x NCCL_COLLNET_ENABLE=0 \
  -x NCCL_PLUGIN_P2P=ucx \
  -x NCCL_DEBUG=info \
  -x NCCL_DEBUG_SUBSYS=NET \
  -x NCCL_IB_HCA=mlx5 \
  /home/singtel/nccl-tests/build/all_reduce_perf -b 128 -e 16G -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1
```