nsys profile hangs when NCCL_P2P_USE_CUDA_MEMCPY is enabled #1480

Open
PhdShi opened this issue Oct 15, 2024 · 5 comments

@PhdShi

PhdShi commented Oct 15, 2024

I am using the Nsight Systems tool to observe the behavior of all_reduce_perf on a server with 8 H800 GPUs. I found that when NCCL_P2P_USE_CUDA_MEMCPY is enabled, the nsys profile command hangs after all_reduce_perf finishes and never generates the corresponding report file.
Here is my run script:

#!/bin/bash
/usr/local/mpi/bin/mpirun --allow-run-as-root --mca btl_openib_warn_no_device_params_found 0 --mca btl_tcp_if_include bond0 --hostfile iplist --map-by ppr:8:node -np 8 -x NCCL_IB_TC=136 -x NCCL_IB_SL=5 -x NCCL_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME=bond -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5 -x NCCL_IB_TIMEOUT=22 -x NCCL_IB_QPS_PER_CONNECTION=8 -x NCCL_NET_PLUGIN=none -x NCCL_ALGO=Ring -x NCCL_P2P_USE_CUDA_MEMCPY=1 -x LD_PRELOAD=/workspace/nccl2.21.5/build/lib/libnccl.so.2 /usr/bin/all_reduce_perf -b 4k -e 8G -g 1 -f 2 -n 50 -w 10

This is the command I execute: nsys profile -o allreduce_ce_default.nsys-rep bash runtest.sh

NGC image version: nvcr.io/nvidia/pytorch:24.06-py3

@sjeaugey
Member

Why are you setting NCCL_P2P_USE_CUDA_MEMCPY?

@PhdShi
Author

PhdShi commented Oct 15, 2024

Why are you setting NCCL_P2P_USE_CUDA_MEMCPY?

I noticed that issue #922 mentioned that turning on NCCL_P2P_USE_CUDA_MEMCPY can bring some performance improvements, and I wanted to test it. But my test data shows that NCCL_P2P_USE_CUDA_MEMCPY causes poor allreduce performance for large messages.

@sjeaugey
Member

sjeaugey commented Oct 15, 2024

Which is expected. It's not doing what you think it does. As with most other environment variables (aside from node configuration), you should not set it unless you really need it.

@PhdShi
Author

PhdShi commented Oct 15, 2024

Which is expected. It's not doing what you think it does. As with most other environment variables (aside from node configuration), you should not set it unless you really need it.

Can you explain why the performance decline is expected? I ran the cuda-samples/p2pBandwidthLatencyTest and found that copy-engine performance is much better than SM copy. Does this mean that allreduce should perform better when NCCL_P2P_USE_CUDA_MEMCPY is enabled?
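
To be concrete about what I mean by copy engine vs. SM copy, below is a minimal sketch I put together to compare the two paths. It is only my own illustration (not code from the CUDA sample and not what NCCL does internally); it assumes GPUs 0 and 1 are P2P-capable, skips error checking and warm-up, and the buffer size and launch configuration are arbitrary.

// Compare a copy-engine transfer (cudaMemcpyPeerAsync) with an SM-driven copy
// (a plain kernel dereferencing a peer pointer). Illustrative sketch only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smCopy(float* dst, const float* src, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];                       // the SMs move the data word by word
}

int main() {
    const size_t n = 256u << 20;               // 256M floats = 1 GiB
    const size_t bytes = n * sizeof(float);

    float *buf0, *buf1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);          // assumes GPU 0 can access GPU 1 directly
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // 1) Copy-engine path: the DMA engine moves the data, no SMs involved.
    cudaEventRecord(start);
    cudaMemcpyPeerAsync(buf1, 1, buf0, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("CE copy: %.1f GB/s\n", bytes / ms / 1e6);

    // 2) SM path: a kernel on GPU 0 writes directly into GPU 1's memory.
    cudaEventRecord(start);
    smCopy<<<1024, 256>>>(buf1, buf0, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("SM copy: %.1f GB/s\n", bytes / ms / 1e6);
    return 0;
}

In p2pBandwidthLatencyTest the copy-engine numbers I see are clearly better than the SM-copy numbers, which is why I expected the same to carry over to allreduce.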

@sjeaugey
Member

NCCL_P2P_USE_CUDA_MEMCPY is not doing what you think. No, it won't improve performance on your system, unless you have a system where SM-based copy is a disaster (like 10x slower than CE). Then it could help -- sometimes. Again, don't use undocumented environment variables.
