Perf About HSA_FORCE_FINE_GRAIN_PCIE=1 #92

Closed
ghostplant opened this issue Jul 1, 2019 · 9 comments

ghostplant commented Jul 1, 2019

Does HSA_FORCE_FINE_GRAIN_PCIE=1 improve performance?
Running HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4, I saw much higher bandwidth than without HSA_FORCE_FINE_GRAIN_PCIE=1, and the time cost (in us) is also shorter.

However, when enabling HSA_FORCE_FINE_GRAIN_PCIE=1 for tensorflow-rocm and running the RCCL allreduce benchmark with HSA_FORCE_FINE_GRAIN_PCIE=1 python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 --variable_update=replicated --all_reduce_spec=nccl, the performance is much worse.
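
For reference, the two runs I compared look roughly like this (the build path and GPU count are specific to my setup):

# baseline, without fine-grain PCIe memory
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# forcing fine-grain PCIe memory
HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4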

@jeffdaily (Contributor)

We are still in the process of optimizing the RCCL integration with TensorFlow. We are aware that the performance of our preliminary integration can vary significantly from run to run, with or without HSA_FORCE_FINE_GRAIN_PCIE=1.

In the meantime, we have found the following TF CNN Benchmark command-line option helpful for reproducible performance. This option will all-reduce all of the gradients in one shot, rather than one at a time while back propagation is taking place.

--gradient_repacking=1
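
As a rough sketch, combined with the tf_cnn_benchmarks command from the original report (with or without the environment variable), that looks like:

python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 \
    --variable_update=replicated --all_reduce_spec=nccl --gradient_repacking=1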

@ghostplant (Author)

@jeffdaily Is the current RCCL more stable and efficient?

@jeffdaily (Contributor)

@wenkaidu please comment on RCCL itself, and please address the use of HSA_FORCE_FINE_GRAIN_PCIE. Thanks.

@ghostplant we have made a few improvements to the TensorFlow RCCL integration that significantly improve performance. The performance improvements have already made their way into the upstream repo via PRs. The one remaining PR concerns test correctness (tensorflow/tensorflow#32296).

If you are using our TF fork, consider using the r1.15-rocm or develop-upstream branches. Those branches have all TF+RCCL performance enhancements that we have made to date.
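
For example, assuming the fork in question is the ROCm TensorFlow repository at github.com/ROCmSoftwarePlatform/tensorflow-upstream (the repository name is my assumption, not stated above), switching to one of those branches would look roughly like:

# clone URL assumed; substitute the actual location of the fork you use
git clone https://github.com/ROCmSoftwarePlatform/tensorflow-upstream.git
cd tensorflow-upstream
git checkout r1.15-rocm    # or: git checkout develop-upstream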

ghostplant (Author) commented Oct 10, 2019

@jeffdaily Thanks. I have several 4-GPU AMD machines. Some of them report "P2P Access Supported between different cards", but others report it is not supported.
I tested their RCCL performance separately: the machines that support P2P access show nearly the expected scaling, while the others perform very badly. It seems that whether the GPUs support P2P access is what makes the difference.
My question: for multiple AMD GPUs, is multi-GPU P2P access determined by hardware only, so that the inter-GPU memcpyDtoD performance cannot be fixed by the ROCm software/driver stack?

ghostplant (Author) commented Oct 10, 2019

@jeffdaily Besides, I am using RCCL for ROCm 2.6, and the RCCL::ncclAllreduce performance keeps dropping:

[GPU: 4/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0%  (560.31 images/sec)
[GPU: 2/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0%  (561.83 images/sec)
[GPU: 3/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0%  (561.36 images/sec)
[GPU: 1/4] step = 51, loss = 6.8023, top1 = 100.0%, top5 = 100.0%, val_loss = 6.8023, val_top1 = 100.0%, val_top5 = 100.0%  (559.70 images/sec)
[GPU: 4/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0%  (544.37 images/sec)
[GPU: 2/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0%  (544.37 images/sec)
[GPU: 3/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0%  (544.41 images/sec)
[GPU: 1/4] step = 101, loss = 6.7923, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7923, val_top1 = 100.0%, val_top5 = 100.0%  (544.42 images/sec)
[GPU: 4/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0%  (508.63 images/sec)
[GPU: 3/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0%  (508.67 images/sec)
[GPU: 2/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0%  (508.67 images/sec)
[GPU: 1/4] step = 151, loss = 6.7823, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7823, val_top1 = 100.0%, val_top5 = 100.0%  (508.93 images/sec)
[GPU: 3/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0%  (476.16 images/sec)
[GPU: 4/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0%  (476.15 images/sec)
[GPU: 2/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0%  (476.17 images/sec)
[GPU: 1/4] step = 201, loss = 6.7723, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7723, val_top1 = 100.0%, val_top5 = 100.0%  (476.65 images/sec)
[GPU: 3/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0%  (439.61 images/sec)
[GPU: 4/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0%  (439.68 images/sec)
[GPU: 2/4] step = 251, loss = 6.7623, top1 = 100.0%, top5 = 100.0%, val_loss = 6.7623, val_top1 = 100.0%, val_top5 = 100.0%  (439.81 images/sec)

It also happens with tf_cnn_benchmarks using --local_parameter_device=gpu --variable_update=replicated --all_reduce_spec=nccl. Are you aware of this issue, and is it solved in RCCL for ROCm 2.8/2.9?

@wenkaidu (Collaborator)

@ghostplant HSA_FORCE_FINE_GRAIN_PCIE=1 is always needed for PCIe P2P. If you are connecting GPUs using XGMI, then this flag is not required. Hardware (CPU and motherboard), CPU BIOS (enable PCIe large BAR), and GPU VBIOS can all affect multi-GPU PCIe P2P support. If the GPUs/VBIOS are identical but some machines have PCIe P2P and some don't, then the difference is on the CPU side. Typically PCIe P2P requires all GPUs to sit under the same PLX bridge. You can confirm this by checking "lspci -t -v".
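
A quick way to inspect this (a sketch; exact output depends on the installed lspci and rocm-smi versions, and rocm-smi --showtopo may not be available on older ROCm releases):

# print the PCIe tree and check whether all GPUs sit under the same PLX bridge
lspci -t -v
# optionally, show the GPU-to-GPU link type (PCIE vs XGMI) as reported by ROCm
rocm-smi --showtopo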

There are some RCCL performance and stability improvements starting from ROCm 2.8, but you need to update to a recent ROCm because the latest RCCL cannot be built on ROCm 2.6.

ghostplant (Author) commented Oct 14, 2019

@wenkaidu Thanks. Does this log message have a negative impact on inter-node RCCL performance?

[1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so)

@jeffdaily (Contributor)

No, that log message comes from NCCL's plugin architecture and can be ignored. NCCL comes with sockets and IB verbs built in; the plugin interface allows additional network implementations such as Mellanox SHARP.

@ghostplant (Author)

@jeffdaily Thanks!

hubertlu-tw added a commit to ROCm/nccl-rccl-parser that referenced this issue Jul 28, 2021
HSA_FORCE_FINE_GRAIN_PCIE=1 is always needed for PCIe P2P. If you are connecting GPUs using XGMI, then this flag will not be required. 
Ref: ROCm/rccl#92