
Scalability difference osu_bcast vs osu_scatter for multi-rails on InfiniBand #11939



Closed
vasslavich opened this issue Sep 21, 2023 · 6 comments

vasslavich commented Sep 21, 2023

Hello, dear colleagues!

The scalability problem

I'm benchmarking OSU collective operations on InfiniBand. I have 4 nodes available, each with 3 rails.
For osu_bcast I get ~3x scaling when going from 1 rail to 3 rails,
but the scaling of osu_scatter is only ~1.8-2x.
I can't understand why the latency scaling is so different. In general, both MPI operations (MPI_Bcast and MPI_Scatter) just send their source buffers to the receivers.
Thank you for any ideas!

osu_scatter scalability

Size (bytes)  IB 1 rail (µs)  IB 2 rails (µs)  scaling (1r:2r)  IB 3 rails (µs)  scaling (1r:3r)
1048576 95.9 63.81 1.50 52.63 1.82
2097152 207.51 139.68 1.49 102.24 2.03
4194304 454.62 278.5 1.63 218.01 2.09
8388608 898.91 530.83 1.69 426.01 2.11
16777216 1921.04 1131.58 1.70 899.27 2.14
33554432 4531.54 2970.59 1.53 2435 1.86
67108864 10203.65 6745.33 1.51 5595.11 1.82
134217728 21348.49 14114.07 1.51 11707.44 1.82
268435456 44219.3 28810.01 1.53 23595.44 1.87

osu_bcast scalability

Size (bytes)  IB 1 rail (µs)  IB 2 rails (µs)  scaling (1r:2r)  IB 3 rails (µs)  scaling (1r:3r)
262144 18.38 10.5 1.75 7.91 2.32
524288 39.64 21.81 1.82 15.91 2.49
1048576 35.08 19.16 1.83 13.94 2.52
2097152 58.25 30.11 1.93 21.19 2.75
4194304 111.91 57.17 1.96 38.87 2.88
8388608 221.27 112.13 1.97 75.32 2.94
16777216 445.84 221.65 2.01 148.46 3.00
33554432 968.51 441.59 2.19 302.62 3.20
67108864 2087.72 981.76 2.13 628.92 3.32
134217728 4466.42 2098.79 2.13 1347.08 3.32
268435456 9710.48 4487.54 2.16 2875.29 3.38

Command line
osu_scatter
user@IB-1:~$ ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_scatter -m $((1024*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))
osu_bcast
user@IB-1:~$ ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -m $((256*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))

SW versions

  • UCX 1.12.1
  • OpenMPI 4.1.1
  • OSU Micro-Benchmarks 7.0.1

System Configuration

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

CPU

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.027
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

UCX info
It contains a lot of data, so I attached it as the file IB_ucx_info.txt.

IB_ucx_info.txt

bosilca (Member) commented Sep 25, 2023

There is a logical difference between these two collectives, and this difference translates into a different amount of data being put on the wire. In the bcast case, the root sends more data than the message size: it sends the message once to every child in the bcast topology. While the topology depends on the algorithm used by the collective, the number of children at the root is in general more than 1.

  1. Thus, you can picture the root as having multiple very large messages to send at the same time, allowing the lower-level communication library to send one on each link and stay close to perfect scalability.
  2. In the scatter case, things are slightly different: you are sending less to each peer, following a specific pattern.

Assuming you are running one process per node, you have 4 processes in your benchmark. For this number of processes and the sizes you are looking at, the bcast uses a binary topology, while the scatter uses a blocking ring. You could play with the different algorithms we have available; you should be able to quickly improve your scatter performance by using the non-blocking linear algorithm (#4). Add --mca coll_tuned_scatter_algorithm 3 to your mpirun command line, as shown below.
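
For example, your scatter command line from above with the forced algorithm would look roughly like this (if I remember correctly, forcing a specific coll_tuned algorithm also requires enabling the dynamic rules, so it is worth double-checking with ompi_info):

$ ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root \
    --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_scatter_algorithm 3 \
    --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib \
    -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 \
    ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_scatter \
    -m $((1024*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))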

Check the output of ompi_info --param coll tuned --level 9 for more information on the possible choices.
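
For instance, to list just the scatter-related knobs and see how the algorithms are numbered:

$ ompi_info --param coll tuned --level 9 | grep scatter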

afanasyev-ilya commented

Hi, @bosilca

thank you for your help.

I wonder why reduce and all_reduce (which is effectively reduce + bcast) also do not scale, similarly to scatter. I think we should investigate their algorithms more deeply....


mdvizov commented Oct 6, 2023

Hi, @bosilca

We collected scaling numbers for all 7 reduce algorithms. It seems there is no significant difference between those algorithms in the results. Reduce is close to bcast in topology and in the volume of data sent, but we don't get scaling like bcast's. Could you share your opinion about this problem? Thank you!
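
For reference, the sweep can be scripted along these lines (a sketch; the osu_reduce path is assumed to follow the same layout as the other benchmarks above, and the valid algorithm numbers can be checked with ompi_info):

$ for alg in 0 1 2 3 4 5 6; do
    # 0 lets the library pick; the other values force a specific reduce algorithm
    ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root \
        --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_reduce_algorithm $alg \
        --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib \
        -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 \
        ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce \
        -m $((1024*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))
  done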

(chart with the results attached)

bosilca (Member) commented Oct 6, 2023

How many processes do you have? Enough to start seeing the benefit of the logarithmic depth of the binary/binomial topologies? Unlike the bcast, the reduction has little opportunity for overlap between communications (because it would require doubling the temporary buffers).

Also, the reduction has the overhead of the MPI_Op at each step. This is something you can benchmark using the tests/datatype/reduce_local.c test in OMPI.
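
In case it is useful, a rough sketch of building and running that test from an Open MPI tree (the exact directory, make target, and the benchmark's own options are best checked in the tree itself):

$ cd openmpi-4.1.1/test/datatype    # directory layout may differ slightly from the path quoted above
$ make reduce_local
$ mpirun -np 1 ./reduce_local       # measures the local MPI_Op cost, so one process is enough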


mdvizov commented Oct 16, 2023

> How many processes do you have? Enough to start seeing the benefit of the logarithmic depth of the binary/binomial topologies? Unlike the bcast, the reduction has little opportunity for overlap between communications (because it would require doubling the temporary buffers).
>
> Also, the reduction has the overhead of the MPI_Op at each step. This is something you can benchmark using the tests/datatype/reduce_local.c test in OMPI.

Thank you for your answer. There are only 4 nodes (one process per node). Possibly, that's just not enough...

bosilca (Member) commented Oct 16, 2023

Certainly not enough to highlight the differences between the topologies. Indeed, if you look at the 3 main topologies (binary vs. binomial vs. linear), the difference in topology depth is minimal at 4 processes: a binomial tree over 4 processes has a depth of log2(4) = 2 rounds, while a linear chain needs 3, so the logarithmic algorithms have little room to pay off. There are many scientific papers (including some of mine) that model the collective algorithms and can give you an idea of their performance at any message size and/or number of participants.
