nsys profile hangs when NCCL_P2P_USE_CUDA_MEMCPY is enabled #1480

Open
PhdShi opened this issue Oct 15, 2024 · 5 comments

@PhdShi

PhdShi commented Oct 15, 2024

I am using the Nsight Systems tool to observe the behavior of all_reduce_perf on a server with 8 H800 GPUs. I found that when NCCL_P2P_USE_CUDA_MEMCPY is enabled, the nsys profile command hangs after all_reduce_perf finishes and never generates the corresponding report file.
Here is my run script:

#!/bin/bash
/usr/local/mpi/bin/mpirun --allow-run-as-root --mca btl_openib_warn_no_device_params_found 0 --mca btl_tcp_if_include bond0 --hostfile iplist --map-by ppr:8:node -np 8 -x NCCL_IB_TC=136 -x NCCL_IB_SL=5 -x NCCL_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME=bond -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5 -x NCCL_IB_TIMEOUT=22 -x NCCL_IB_QPS_PER_CONNECTION=8 -x NCCL_NET_PLUGIN=none -x NCCL_ALGO=Ring -x NCCL_P2P_USE_CUDA_MEMCPY=1 -x LD_PRELOAD=/workspace/nccl2.21.5/build/lib/libnccl.so.2 /usr/bin/all_reduce_perf -b 4k -e 8G -g 1 -f 2 -n 50 -w 10

This is the command I execute: nsys profile -o allreduce_ce_default.nsys-rep bash runtest.sh

NGC image version: nvcr.io/nvidia/pytorch:24.06-py3

@sjeaugey
Member

Why are you setting NCCL_P2P_USE_CUDA_MEMCPY?

@PhdShi
Author

PhdShi commented Oct 15, 2024

Why are you setting NCCL_P2P_USE_CUDA_MEMCPY?

I noticed that issue #922 mentioned that turning on NCCL_P2P_USE_CUDA_MEMCPY can bring some performance improvements, and I wanted to test it. But my test data shows that NCCL_P2P_USE_CUDA_MEMCPY causes poor allreduce performance for large messages.

@sjeaugey
Member

sjeaugey commented Oct 15, 2024

Which is expected. It's not doing what you think it does. As with most other environment variables (aside from node configuration), you should not set it unless you really need it.

@PhdShi
Author

PhdShi commented Oct 15, 2024

Which is expected. It's not doing what you think it does. As with most other environment variables (aside from node configuration), you should not set it unless you really need it.

Can you explain why the performance decline is expected? I ran the cuda-samples/p2pBandwidthLatencyTest and found that copy-engine performance is much better than SM copy. Does this mean that allreduce should perform better when NCCL_P2P_USE_CUDA_MEMCPY is enabled?
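
To be concrete about what I mean by copy engine vs. SM copy, below is a minimal sketch I put together to compare the two paths. It is only my own illustration (not code from the CUDA sample and not what NCCL does internally); it assumes GPUs 0 and 1 are P2P-capable, skips error checking and warm-up, and the buffer size and launch configuration are arbitrary.

// Compare a copy-engine transfer (cudaMemcpyPeerAsync) with an SM-driven copy
// (a plain kernel dereferencing a peer pointer). Illustrative sketch only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smCopy(float* dst, const float* src, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];                       // the SMs move the data word by word
}

int main() {
    const size_t n = 256u << 20;               // 256M floats = 1 GiB
    const size_t bytes = n * sizeof(float);

    float *buf0, *buf1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);          // assumes GPU 0 can access GPU 1 directly
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // 1) Copy-engine path: the DMA engine moves the data, no SMs involved.
    cudaEventRecord(start);
    cudaMemcpyPeerAsync(buf1, 1, buf0, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("CE copy: %.1f GB/s\n", bytes / ms / 1e6);

    // 2) SM path: a kernel on GPU 0 writes directly into GPU 1's memory.
    cudaEventRecord(start);
    smCopy<<<1024, 256>>>(buf1, buf0, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("SM copy: %.1f GB/s\n", bytes / ms / 1e6);
    return 0;
}

In p2pBandwidthLatencyTest the copy-engine numbers I see are clearly better than the SM-copy numbers, which is why I expected the same to carry over to allreduce.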

@sjeaugey
Member

NCCL_P2P_USE_CUDA_MEMCPY is not doing what you think. No, it won't improve performance on your system, unless you have a system where SM-based copy is a disaster (like 10x slower than CE). Then it could help -- sometimes. Again, don't use undocumented environment variables.
