
Scalability difference osu_bcast vs osu_scatter for multi-rails on InfiniBand #11939



Closed
vasslavich opened this issue Sep 21, 2023 · 6 comments

vasslavich commented Sep 21, 2023

Hello, dear colleagues!

The scalability problem

I'm benchmarking OSU collective operations on InfiniBand. I have 4 nodes available, each with 3 rails.
For osu_bcast I get ~3x scaling when going from 1 rail to 3 rails,
but the scaling of osu_scatter is only ~1.8-2x.
I can't understand why the latency scaling is so different. In general, both MPI operations (MPI_Bcast and MPI_Scatter) just send their source buffers to the receivers.
Thank you for any ideas!

osu_scatter scalability

Size (bytes)  IB 1 rail (µs)  IB 2 rails (µs)  scaling (1r:2r)  IB 3 rails (µs)  scaling (1r:3r)
1048576 95.9 63.81 1.50 52.63 1.82
2097152 207.51 139.68 1.49 102.24 2.03
4194304 454.62 278.5 1.63 218.01 2.09
8388608 898.91 530.83 1.69 426.01 2.11
16777216 1921.04 1131.58 1.70 899.27 2.14
33554432 4531.54 2970.59 1.53 2435 1.86
67108864 10203.65 6745.33 1.51 5595.11 1.82
134217728 21348.49 14114.07 1.51 11707.44 1.82
268435456 44219.3 28810.01 1.53 23595.44 1.87

osu_bcast scalability

Size (bytes)  IB 1 rail (µs)  IB 2 rails (µs)  scaling (1r:2r)  IB 3 rails (µs)  scaling (1r:3r)
262144 18.38 10.5 1.75 7.91 2.32
524288 39.64 21.81 1.82 15.91 2.49
1048576 35.08 19.16 1.83 13.94 2.52
2097152 58.25 30.11 1.93 21.19 2.75
4194304 111.91 57.17 1.96 38.87 2.88
8388608 221.27 112.13 1.97 75.32 2.94
16777216 445.84 221.65 2.01 148.46 3.00
33554432 968.51 441.59 2.19 302.62 3.20
67108864 2087.72 981.76 2.13 628.92 3.32
134217728 4466.42 2098.79 2.13 1347.08 3.32
268435456 9710.48 4487.54 2.16 2875.29 3.38

Command line
osu_scatter
user@IB-1:~$ ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_scatter -m $((1024*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))
osu_bcast
user@IB-1:~$ ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -m $((256*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))

SW versions

  • UCX 1.12.1
  • OpenMPI 4.1.1
  • OSU Micro-Benchmarks 7.0.1

System Configuration

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

CPU

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.027
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

UCX info
It contains a lot of data, so I attached it as the file IB_ucx_info.txt.

IB_ucx_info.txt

bosilca (Member) commented Sep 25, 2023

There is a logical difference between these two collectives, and this difference translates into a different amount of data being put on the wire. In the bcast case, the root sends more data than the message size: it sends the message once to every child in the bcast topology. While the topology depends on the algorithm used by the collective, the number of children at the root is in general more than 1.

  1. Thus, you can picture the root as having multiple very large messages to send at the same time, allowing the lower-level communication library to send one on each link and stay close to perfect scalability.
  2. In the scatter case, things are slightly different: you are sending less to each peer, following a specific pattern.

Assuming you are running one process per node, you have 4 processes in your benchmark. For this number of processes and the sizes you are looking at, the bcast uses a binary topology, while the scatter uses a blocking ring. You could play with the different algorithms we have available; you should be able to quickly improve your scatter performance by using the non-blocking linear algorithm (#4). Add --mca coll_tuned_scatter_algorithm 3 to your mpirun command line, as shown below.
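
For example, your scatter command line from above with the forced algorithm would look roughly like this (if I remember correctly, forcing a specific coll_tuned algorithm also requires enabling the dynamic rules, so it is worth double-checking with ompi_info):

$ ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root \
    --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_scatter_algorithm 3 \
    --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib \
    -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 \
    ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_scatter \
    -m $((1024*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))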

Check the output of ompi_info --param coll tuned --level 9 for more information on the possible choices.
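
For instance, to list just the scatter-related knobs and see how the algorithms are numbered:

$ ompi_info --param coll tuned --level 9 | grep scatter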

afanasyev-ilya commented

Hi, @bosilca

thank you for your help.

I wonder why reduce and all_reduce (which is effectively reduce + bcast) also do not scale, similarly to scatter. I think we should investigate their algorithms more deeply....


mdvizov commented Oct 6, 2023

Hi, @bosilca

We collected scaling numbers for all 7 reduce algorithms. It seems there is no significant difference between those algorithms in the results. Reduce is close to bcast in topology and in the volume of data sent, but we don't get scaling like bcast's. Could you share your opinion about this problem? Thank you!
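
For reference, the sweep can be scripted along these lines (a sketch; the osu_reduce path is assumed to follow the same layout as the other benchmarks above, and the valid algorithm numbers can be checked with ompi_info):

$ for alg in 0 1 2 3 4 5 6; do
    # 0 lets the library pick; the other values force a specific reduce algorithm
    ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root \
        --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_reduce_algorithm $alg \
        --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib \
        -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 \
        ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce \
        -m $((1024*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))
  done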

(chart with the results attached)

bosilca (Member) commented Oct 6, 2023

How many processes do you have? Enough to start seeing the benefit of the logarithmic depth of the binary/binomial topologies? Unlike the bcast, the reduction has little opportunity for overlap between communications (because it would require doubling the temporary buffers).

Also, the reduction has the overhead of the MPI_Op at each step. This is something you can benchmark using the tests/datatype/reduce_local.c test in OMPI.
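
In case it is useful, a rough sketch of building and running that test from an Open MPI tree (the exact directory, make target, and the benchmark's own options are best checked in the tree itself):

$ cd openmpi-4.1.1/test/datatype    # directory layout may differ slightly from the path quoted above
$ make reduce_local
$ mpirun -np 1 ./reduce_local       # measures the local MPI_Op cost, so one process is enough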


mdvizov commented Oct 16, 2023

> How many processes do you have? Enough to start seeing the benefit of the logarithmic depth of the binary/binomial topologies? Unlike the bcast, the reduction has little opportunity for overlap between communications (because it would require doubling the temporary buffers).
>
> Also, the reduction has the overhead of the MPI_Op at each step. This is something you can benchmark using the tests/datatype/reduce_local.c test in OMPI.

Thank you for your answer. There are only 4 nodes (one process per node). Possibly, that's just not enough...

bosilca (Member) commented Oct 16, 2023

Certainly not enough to highlight the differences between the topologies. Indeed, if you look at the 3 main topologies (binary vs. binomial vs. linear), the difference in topology depth is minimal at 4 processes: a binomial tree over 4 processes has a depth of log2(4) = 2 rounds, while a linear chain needs 3, so the logarithmic algorithms have little room to pay off. There are many scientific papers (including some of mine) that model the collective algorithms and can give you an idea of their performance at any message size and/or number of participants.
