Perf About HSA_FORCE_FINE_GRAIN_PCIE=1 #92

Closed
ghostplant opened this issue Jul 1, 2019 · 9 comments

ghostplant commented Jul 1, 2019

Does HSA_FORCE_FINE_GRAIN_PCIE=1 improve performance?
Running HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4, I saw much higher bandwidth than without HSA_FORCE_FINE_GRAIN_PCIE=1, and the time cost (in us) is also shorter.

However, when enabling HSA_FORCE_FINE_GRAIN_PCIE=1 for tensorflow-rocm and running the RCCL allreduce benchmark with HSA_FORCE_FINE_GRAIN_PCIE=1 python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 --variable_update=replicated --all_reduce_spec=nccl, the performance is much worse.
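
For reference, the two runs I compared look roughly like this (the build path and GPU count are specific to my setup):

# baseline, without fine-grain PCIe memory
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# forcing fine-grain PCIe memory
HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4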

@jeffdaily (Contributor)

We are still in the process of optimizing the RCCL integration with TensorFlow. We are aware that the performance of our preliminary integration can vary significantly from run to run, with or without HSA_FORCE_FINE_GRAIN_PCIE=1.

In the meantime, we have found the following TF CNN Benchmark command-line option helpful for reproducible performance. This option will all-reduce all of the gradients in one shot, rather than one at a time while back propagation is taking place.

--gradient_repacking=1
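
As a rough sketch, combined with the tf_cnn_benchmarks command from the original report (with or without the environment variable), that looks like:

python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 \
    --variable_update=replicated --all_reduce_spec=nccl --gradient_repacking=1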

@ghostplant (Author)

@jeffdaily Is the current RCCL more stable and efficient?

@jeffdaily (Contributor)

@wenkaidu please comment on RCCL itself, and please address the use of HSA_FORCE_FINE_GRAIN_PCIE. Thanks.

@ghostplant we have made a few improvements to the TensorFlow RCCL integration that significantly improve performance. The performance improvements have already made their way into the upstream repo via PRs. The one remaining PR concerns test correctness (tensorflow/tensorflow#32296).

If you are using our TF fork, consider using the r1.15-rocm or develop-upstream branches. Those branches have all TF+RCCL performance enhancements that we have made to date.
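
For example, assuming the fork in question is the ROCm TensorFlow repository at github.com/ROCmSoftwarePlatform/tensorflow-upstream (the repository name is my assumption, not stated above), switching to one of those branches would look roughly like:

# clone URL assumed; substitute the actual location of the fork you use
git clone https://github.com/ROCmSoftwarePlatform/tensorflow-upstream.git
cd tensorflow-upstream
git checkout r1.15-rocm    # or: git checkout develop-upstream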

ghostplant (Author) commented Oct 10, 2019

@jeffdaily Thanks. I have several 4-GPU AMD machines. Some of them report "P2P Access Supported between different cards", but others report it is not supported.
I tested their RCCL performance separately: the machines that support P2P access show nearly the expected scaling, while the others perform very badly. It seems that whether the GPUs support P2P access is what makes the difference.
My question: for multiple AMD GPUs, is multi-GPU P2P access determined by hardware only, so that the inter-GPU memcpyDtoD performance cannot be fixed by the ROCm software/driver stack?

ghostplant (Author) commented Oct 10, 2019

@jeffdaily Besides, I am using RCCL for ROCm 2.6, and the RCCL::ncclAllreduce performance keeps dropping:

[GPU: 4/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0%  (560.31 images/sec)
[GPU: 2/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0%  (561.83 images/sec)
[GPU: 3/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0%  (561.36 images/sec)
[GPU: 1/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0%  (559.70 images/sec)
[GPU: 4/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0%  (544.37 images/sec)
[GPU: 2/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0%  (544.37 images/sec)
[GPU: 3/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0%  (544.41 images/sec)
[GPU: 1/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0%  (544.42 images/sec)
[GPU: 4/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0%  (508.63 images/sec)
[GPU: 3/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0%  (508.67 images/sec)
[GPU: 2/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0%  (508.67 images/sec)
[GPU: 1/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0%  (508.93 images/sec)
[GPU: 3/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0%  (476.16 images/sec)
[GPU: 4/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0%  (476.15 images/sec)
[GPU: 2/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0%  (476.17 images/sec)
[GPU: 1/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0%  (476.65 images/sec)
[GPU: 3/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0%  (439.61 images/sec)
[GPU: 4/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0%  (439.68 images/sec)
[GPU: 2/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0%  (439.81 images/sec)

It also happens with tf_cnn_benchmarks using --local_parameter_device=gpu --variable_update=replicated --all_reduce_spec=nccl. Are you aware of this issue, and is it solved in RCCL for ROCm 2.8/2.9?

@wenkaidu (Collaborator)

@ghostplant HSA_FORCE_FINE_GRAIN_PCIE=1 is always needed for PCIe P2P. If you are connecting GPUs using XGMI, then this flag is not required. Hardware (CPU and motherboard), CPU BIOS (enable PCIe large BAR), and GPU VBIOS can all affect multi-GPU PCIe P2P support. If the GPUs/VBIOS are identical but some machines have PCIe P2P and some don't, then the difference is on the CPU side. Typically PCIe P2P requires all GPUs to sit under the same PLX bridge. You can confirm this by checking "lspci -t -v".
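
A quick way to inspect this (a sketch; exact output depends on the installed lspci and rocm-smi versions, and rocm-smi --showtopo may not be available on older ROCm releases):

# print the PCIe tree and check whether all GPUs sit under the same PLX bridge
lspci -t -v
# optionally, show the GPU-to-GPU link type (PCIE vs XGMI) as reported by ROCm
rocm-smi --showtopo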

There are some RCCL performance and stability improvements starting from ROCm 2.8, but you need to update to a recent ROCm because the latest RCCL cannot be built on ROCm 2.6.

ghostplant (Author) commented Oct 14, 2019

@wenkaidu Thanks. Does this log message have a negative impact on inter-node RCCL performance?

[1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)

@jeffdaily (Contributor)

No, that log message comes from NCCL's plugin architecture and can be ignored. NCCL comes with sockets and IB verbs built in; the plugin interface allows additional network implementations such as Mellanox SHARP.

@ghostplant (Author)

@jeffdaily Thanks!

hubertlu-tw added a commit to ROCm/nccl-rccl-parser that referenced this issue Jul 28, 2021
HSA_FORCE_FINE_GRAIN_PCIE=1 is always needed for PCIe P2P. If you are connecting GPUs using XGMI, then this flag will not be required. 
Ref: ROCm/rccl#92