Scalability difference osu_bcast vs osu_scatter for multi-rails on InfiniBand #11939
Comments
There is a logical difference between these two collectives, and this difference translates into a different amount of data being put on the wire. In the bcast case, the root sends more data than the message size: it sends the full message once to every one of its children in the bcast topology. The number of children depends on the algorithm used by the collective, but it is in general more than 1.
Assuming you are running one process per node, you have 4 processes in your benchmark. For this number of processes and the sizes you are looking at, the bcast is using a binary topology, while the scatter is using a blocking ring. You could play with the different algorithms we have available; you should be able to quickly improve your scatter performance by using the non-blocking linear algorithm (#4). Check the output of ...
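A hedged sketch of how the available scatter algorithms can be listed and one of them forced through Open MPI's coll/tuned MCA parameters (the parameter names and the ompi_info invocation below are my assumptions about the standard tuned component, not taken from this thread; the exact algorithm numbering should be verified on the build in use before trusting "#4"):

# list the scatter algorithms known to the tuned component and their numbering
ompi_info --param coll tuned --level 9 | grep scatter_algorithm

# force one algorithm for a run (dynamic rules must be enabled for the override to apply)
mpirun -np 4 --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_scatter_algorithm 4 \
       <other mpirun options> osu_scatter <osu options>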
Hi @bosilca, thank you for your help. I wonder, then, why reduce and allreduce (which is effectively reduce + bcast) also do not scale, similarly to scatter. I think we should investigate their algorithms more deeply...
Hi @bosilca, we collected scaling results for all 7 reduce algorithms. There is no significant difference between those algorithms in our results. Reduce is close to bcast in topology and in the volume of data sent, but we do not get scaling like the bcast scaling. Could you share your opinion on this problem? Thank you!
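A minimal sketch of how such a per-algorithm sweep can be scripted, assuming the same coll/tuned parameters as above and an osu_reduce binary from the OSU install used elsewhere in this issue; the valid algorithm range should be taken from the ompi_info output, not from the 1..7 placeholder here:

# hypothetical sweep; replace 1..7 with the range reported by ompi_info
for alg in $(seq 1 7); do
    echo "== coll_tuned_reduce_algorithm=$alg =="
    mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 \
        --mca coll_tuned_use_dynamic_rules 1 \
        --mca coll_tuned_reduce_algorithm $alg \
        ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -m $((1024*1024)):$((256*1024*1024))
done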
How many processes do you have? Enough to start seeing the benefit of the logarithmic depth of the binary/binomial topologies? Unlike the bcast, the reduction has little opportunity for overlap between communications (because that would require doubling the temporary buffers). The reduction also has the overhead of the MPI_Op between each step. This is something you can benchmark using the ...
Thank you for your answer. There are only 4 nodes (one process per node). Possibly that's not really enough...
Certainly not enough to highlight the differences between the topologies. Indeed, if you look at the 3 main topologies, binary vs. binomial vs. linear, the difference in topology depth is minimal at 4 processes. There are many scientific papers (including some of mine) that model the collective algorithms and could give you an idea of their performance for any message size and/or number of participants.
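As a rough illustration of that point (back-of-the-envelope numbers of my own, counting one communication round per tree level and ignoring overlap and message size):

depth(linear chain)  = P - 1          -> 3 rounds at P = 4, 63 rounds at P = 64
depth(binomial tree) = ceil(log2(P))  -> 2 rounds at P = 4,  6 rounds at P = 64

At 4 processes the logarithmic topologies save essentially nothing; the gap only becomes visible at larger process counts.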
Hello, dear colleagues!
A scalability problem
I'm benchmarking OSU collective operations on InfiniBand. I have 4 nodes available, each with 3 rails.
So far I see ~3x scalability for osu_bcast when moving from 1 rail to 3 rails,
but the scalability of osu_scatter is only ~1.8-2x.
I can't understand why the latency scalability is so different. In general, both MPI operations (MPI_Bcast and MPI_Scatter) only send their source buffers to the receivers.
Thank you for any ideas!
osu_scatter scalability
bcast scalability
Command line
osu_scatter
user@IB-1:~$ ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_scatter -m $((1024*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))
osu_bcast
user@IB-1:~$ ~/install/openmpi-4.1.1/release/bin/mpirun -np 4 -host IB-1:1,IB-2:1,IB-3:1,IB-4:1 --allow-run-as-root --mca pml_ucx_tls any -mca pml_ucx_devices any --mca btl ^openib -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1 -x UCX_MAX_EAGER_RAILS=3 -x UCX_MAX_RNDV_RAILS=3 ~/install/osu/release/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -m $((256*1024)):$((256*1024*1024)) --mem-limit $((4*256*1024*1024))
SW versions
System Configuration
CPU
UCX info
It contains a lot of data, so I attached it as the file IB_ucx_info.txt.
IB_ucx_info.txt