Perf About HSA_FORCE_FINE_GRAIN_PCIE=1
#92
We are still in the process of optimizing the RCCL integration with TensorFlow. We are aware that the performance of our preliminary integration can vary significantly from run to run, with or without HSA_FORCE_FINE_GRAIN_PCIE=1. In the meantime, we have found the following TF CNN Benchmark command-line option helpful for reproducible performance: it all-reduces all of the gradients in one shot, rather than one at a time while back-propagation is taking place.
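The specific flag is not preserved in the quoted comment. As a hedged illustration only, the sketch below combines the tf_cnn_benchmarks flags used later in this thread with --gradient_repacking, a tf_cnn_benchmarks option that packs gradients so they are all-reduced in larger chunks under --variable_update=replicated; this is an assumption, not necessarily the option the comment refers to.

```bash
# Hedged example only: --variable_update/--all_reduce_spec come from the
# command used later in this thread; --gradient_repacking is an assumption
# about which tf_cnn_benchmarks option packs the gradients so they are
# all-reduced together rather than tensor-by-tensor during back-propagation.
python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 \
  --variable_update=replicated --all_reduce_spec=nccl --gradient_repacking=8
```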
@jeffdaily Is the current RCCL more stable and efficient?
@wenkaidu please comment on RCCL itself, and please address the use of HSA_FORCE_FINE_GRAIN_PCIE=1. @ghostplant we have made a few improvements to the TensorFlow RCCL integration that significantly improve performance. These improvements have already made their way into the upstream repo via PRs; the one remaining PR concerns test correctness (tensorflow/tensorflow#32296). If you are using our TF fork, consider using the …
@jeffdaily Thanks. I have some 4-GPU AMD machines. Some of them report "P2P Access Supported between different cards", but others report that it is not supported.
@jeffdaily Besides, I am using RCCL for ROCm 2.6, and I saw the RCCL::ncclAllreduce performance keep dropping:
[GPU: 4/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0% (560.31 images/sec)
[GPU: 2/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0% (561.83 images/sec)
[GPU: 3/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0% (561.36 images/sec)
[GPU: 1/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0% (559.70 images/sec)
[GPU: 4/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0% (544.37 images/sec)
[GPU: 2/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0% (544.37 images/sec)
[GPU: 3/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0% (544.41 images/sec)
[GPU: 1/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0% (544.42 images/sec)
[GPU: 4/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0% (508.63 images/sec)
[GPU: 3/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0% (508.67 images/sec)
[GPU: 2/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0% (508.67 images/sec)
[GPU: 1/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0% (508.93 images/sec)
[GPU: 3/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0% (476.16 images/sec)
[GPU: 4/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0% (476.15 images/sec)
[GPU: 2/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0% (476.17 images/sec)
[GPU: 1/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0% (476.65 images/sec)
[GPU: 3/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0% (439.61 images/sec)
[GPU: 4/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0% (439.68 images/sec)
[GPU: 2/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0% (439.81 images/sec)
It also happens in tf_cnn_benchmark using …
@ghostplant HSA_FORCE_FINE_GRAIN_PCIE=1 is always needed for PCIe P2P. If you are connecting GPUs using XGMI, then this flag is not required. Hardware (CPU and motherboard), CPU BIOS (enable PCIe large BAR), and GPU VBIOS can all affect multi-GPU PCIe P2P support. If the GPUs and VBIOS are identical but some machines have PCIe P2P and others don't, then the difference is on the CPU side. Typically PCIe P2P requires all GPUs to be under the same PLX bridge; you can confirm this by checking "lspci -t -v". There are some improvements to RCCL performance and stability starting from ROCm 2.8, but you need to update to a recent ROCm, because the latest RCCL cannot be built on ROCm 2.6.
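A small sketch of the kind of check described above. The lspci command comes from the comment itself; rocm-smi --showtopo is an assumption drawn from standard ROCm tooling for showing how GPUs are linked.

```bash
# Inspect the PCIe tree: P2P typically requires the GPUs to sit under the
# same PLX bridge/switch, as noted in the comment above.
lspci -t -v

# Assumption: rocm-smi's topology view also shows whether GPUs are connected
# via PCIe or XGMI, which decides whether HSA_FORCE_FINE_GRAIN_PCIE=1 matters.
rocm-smi --showtopo
```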
@wenkaidu Thanks, does this log message have a bad impact on inter-node RCCL performance?
[1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)
No, that log message comes from NCCL's plugin architecture and can be ignored. NCCL ships with sockets and IB verbs transports built in; the plugin mechanism allows additional network implementations, such as Mellanox SHARP, to be loaded.
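A hedged way to confirm which built-in transport is actually selected on an inter-node run. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL/RCCL environment variables; the benchmark command is the one from this thread, and the grep pattern is an assumption about the usual log format.

```bash
# "NET/Plugin : No plugin found" only means no optional libnccl-net.so was
# loaded; the "NET/IB" or "NET/Socket" lines show the transport actually used.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
  python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 \
    --variable_update=replicated --all_reduce_spec=nccl 2>&1 | grep "NET/"
```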
@jeffdaily Thanks!
Does HSA_FORCE_FINE_GRAIN_PCIE=1 improve the performance? By evaluating
HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
I saw that the bandwidth is much higher than without setting HSA_FORCE_FINE_GRAIN_PCIE=1, and the time cost in µs is also shorter. However, when enabling HSA_FORCE_FINE_GRAIN_PCIE=1 for tensorflow-rocm and evaluating the RCCL allreduce benchmark with
HSA_FORCE_FINE_GRAIN_PCIE=1 python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 --variable_update=replicated --all_reduce_spec=nccl
the performance is much worse.
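A minimal sketch of the with/without comparison described above. The all_reduce_perf and tf_cnn_benchmarks commands are taken verbatim from the report; only the explicit pairing of runs with and without the environment variable is added here.

```bash
# RCCL microbenchmark (rccl-tests), without and with fine-grain PCIe memory:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

# End-to-end ResNet-50 benchmark (tf_cnn_benchmarks), same with/without pairing:
python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 \
  --variable_update=replicated --all_reduce_spec=nccl
HSA_FORCE_FINE_GRAIN_PCIE=1 python3 tf_cnn_benchmarks.py --num_gpus=4 \
  --model resnet50 --batch_size=128 --variable_update=replicated --all_reduce_spec=nccl
```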