As part of working on #10347 I found that the performance of coll/tuned varies significantly for small reductions, depending on the number of nodes I run on. All experiments are done on Hawk, the dual-socket AMD EPYC Rome system using ConnectX-6 installed at HLRS. I'm using the `main` branch of Open MPI.

coll/han does seem to provide better performance if the ranking is done by-node (no surprise since it reduces cross-node traffic) and provides fairly consistent performance with by-core ranking. coll/tuned varies between higher and lower latency than coll/han.
I have collected data for 4B reductions using the OSU benchmarks, with the `-f` flag to gather the average, min, and max latency observed on any process. It is interesting to note that it is not just the average that varies but (to a lesser degree) also the maximum latency. That suggests that it's not an artifact of the benchmark's timing.

In order not to waste too many node hours, I allocated nodes in multiples of 8 and ran on N-7..N nodes within an allocation of N nodes. I tested partitions 1-8, 9-16, ... and 5-12, 13-20, 21-28, ... to make sure that the effects I'm seeing aren't an artifact of this partitioning. The results are the same. In all cases I ran with 64 processes per node. I also have data with 48 processes per node where the effects are the same, albeit with slightly shifted switching points.
Average latency: (graph)

Maximum latency: (graph)

Minimum latency: (graph)
Here is how I run the benchmarks (sketched below): `rankby` is either `node` or `core`, and `npn` is 64 in the data above. I had to disable uct because I get segfaults otherwise (different story, not sure why).

As far as I can tell from the decision function, all runs use the binomial reduction tree (https://github.com/open-mpi/ompi/blob/main/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c#L662).
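Roughly, the invocation looks like this (a sketch only: `osu_reduce`, the `ppr` mapping, the message-size flag, and excluding the `uct` btl are illustrative choices and may not match the original command verbatim):

```bash
# Sketch of the invocation (flag choices illustrative, not verbatim):
#   rankby is either "node" or "core"; npn is 64 in the data above;
#   nnodes varied as described (N-7..N within an N-node allocation).
rankby=node   # or: core
npn=64
nnodes=8

mpirun -np $((nnodes * npn)) \
       --map-by ppr:${npn}:node \
       --rank-by ${rankby} \
       --mca btl ^uct \
       ./osu_reduce -m 4:4 -f
```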
I dumped the binomial trees used in coll/tuned for both the by-core and the by-node ranking, and they look just like what you would expect: mostly intra-node communication with by-core ranking and mostly inter-node communication with by-node ranking. This is reflected in the performance above. I didn't find any change in the tree between slow and fast runs of coll/tuned.

In the graphs below, the color of a node in the tree indicates the compute node it runs on (same color, same compute node). Dashed lines represent shared-memory communication and solid lines represent inter-node communication.
By-core ranking: (graph)

By-node ranking: (graph)
No surprise here, but it confirms that we're doing the right thing in coll/tuned with a linear distribution across nodes (at least with binomial trees).
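To make the by-core vs. by-node picture concrete, here is a small sketch (my own illustration: it uses the textbook binomial tree in which rank r's parent is r with its lowest set bit cleared, which is not necessarily identical to Open MPI's in-order bmtree) that counts how many tree edges cross node boundaries under the two rankings, assuming 4 nodes with 64 processes per node:

```bash
#!/usr/bin/env bash
# Illustration only (not Open MPI code): count inter-node edges of a textbook
# binomial tree rooted at rank 0 under the two rank-to-node mappings.
ppn=64; nnodes=4; np=$((ppn * nnodes))
bycore_cross=0; bynode_cross=0
for ((r = 1; r < np; r++)); do
  parent=$((r & (r - 1)))            # clear the lowest set bit -> parent rank
  # by-core ranking: consecutive ranks share a node (node = rank / ppn)
  if (( r / ppn != parent / ppn )); then ((++bycore_cross)); fi
  # by-node ranking: ranks are dealt round-robin (node = rank % nnodes)
  if (( r % nnodes != parent % nnodes )); then ((++bynode_cross)); fi
done
echo "inter-node edges, by-core ranking: ${bycore_cross} of $((np - 1))"
echo "inter-node edges, by-node ranking: ${bynode_cross} of $((np - 1))"
```

For this layout the sketch reports only nnodes-1 = 3 inter-node edges with by-core ranking, versus the large majority of edges with by-node ranking, which matches the trees pictured above.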
@janjust could you please run a similar set of benchmarks on your machines to make sure I'm not chasing a machine artifact?