
Reduction latency variation with coll/tuned #10947


Open
devreal opened this issue Oct 18, 2022 · 0 comments

@devreal
Contributor

devreal commented Oct 18, 2022

As part of working on #10347 I found that the performance of coll/tuned varies significantly for small reductions, depending on the number of nodes I run on. All experiments were run on Hawk, the dual-socket AMD EPYC Rome system with ConnectX-6 installed at HLRS, using the main branch of Open MPI.

coll/han provides better performance when ranking by node (no surprise, since that reduces cross-node traffic) and fairly consistent performance with by-core ranking. coll/tuned, on the other hand, swings between higher and lower latency than coll/han depending on the node count.

I collected data for 4-byte reductions using the OSU benchmarks with the -f flag, which reports the average, minimum, and maximum latency observed on any process. Interestingly, it is not just the average that varies but (to a lesser degree) also the maximum latency, which suggests the variation is not an artifact of the benchmark's timing.

To avoid wasting too many node hours, I allocated nodes in multiples of 8 and, within an allocation of N nodes, ran on N-7..N nodes (a rough sketch of the sweep follows the benchmark command below). I tested the partitions 1-8, 9-16, ... as well as 5-12, 13-20, 21-28, ... to make sure that the effects I'm seeing are not an artifact of this partitioning; the results are the same. In all cases I ran with 64 processes per node. I also have data with 48 processes per node, where the effects are the same, albeit with slightly shifted switching points.

Average latency:
[figure: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-2]

Maximum latency:
[figure: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-6]

Minimum latency:
[figure: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-4]

Here is how I run the benchmarks:

mpirun --mca coll ^hcoll --rank-by ${rankby} -N $npn -n $((npn*nodes)) --bind-to core --mca coll_tuned_priority 100 --mca btl ^uct mpi/collective/osu_reduce -f -m 4:4

Here, rankby is either node or core, and npn is 64 for the data above. I had to disable the uct BTL because I otherwise get segfaults (a different story, not sure why).
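
For reference, the sweep over node counts described above is essentially a loop around this command. A rough sketch (the loop structure and variable names are illustrative, not my actual job script):

# Within an allocation of $nodes nodes (a multiple of 8), run on nodes-7..nodes nodes,
# once ranked by core and once ranked by node.
for n in $(seq $((nodes-7)) $nodes); do
  for rankby in core node; do
    mpirun --mca coll ^hcoll --rank-by $rankby -N $npn -n $((npn*n)) \
           --bind-to core --mca coll_tuned_priority 100 --mca btl ^uct \
           mpi/collective/osu_reduce -f -m 4:4
  done
done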

As far as I can tell from the decision function, all runs use the binomial reduction tree (https://github.com/open-mpi/ompi/blob/main/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c#L662).
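
To rule out that the decision function picks different algorithms between runs, the reduce algorithm can also be pinned explicitly via the tuned component's dynamic rules. I believe the relevant parameters are coll_tuned_use_dynamic_rules and coll_tuned_reduce_algorithm; ompi_info shows the exact algorithm numbering. Sketch:

# List the reduce algorithm numbering known to coll/tuned:
ompi_info --param coll tuned --level 9 | grep reduce_algorithm

# Pin the reduce algorithm (replace <N> with the number reported for binomial):
mpirun --mca coll ^hcoll --mca coll_tuned_priority 100 \
       --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_reduce_algorithm <N> \
       --rank-by core -N $npn -n $((npn*nodes)) --bind-to core --mca btl ^uct \
       mpi/collective/osu_reduce -f -m 4:4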

I dumped the binomial trees used in coll/tuned for both the by-core and the by-node ranking, and they look just like what you would expect: mostly intra-node communication with by-core ranking and mostly inter-node communication with by-node ranking. This is reflected in the performance above. I did not find any change in the tree between slow and fast runs of coll/tuned.

In the graphs below, node colors indicate the compute node (same color, same compute node). Dashed lines represent shared-memory communication and solid lines represent inter-node communication.

By-core ranking:
[figure: bmtree_leafs_4x64_bycore.dot]

By-node ranking:
[figure: bmtree_leafs_4x64_bynode.dot]
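
Assuming the dumps are Graphviz .dot files (as the figure names suggest), they can be rendered with, e.g.:

dot -Tpng bmtree_leafs_4x64_bycore.dot -o bmtree_leafs_4x64_bycore.png
dot -Tpng bmtree_leafs_4x64_bynode.dot -o bmtree_leafs_4x64_bynode.png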

No surprise here, but it confirms that we're doing the right thing in coll/tuned with a linear distribution across nodes (at least with binomial trees).

@janjust could you please run a similar set of benchmarks on your machines to make sure I'm not chasing a machine artifact?
