As part of working on #10347 I found that the performance of coll/tuned varies significantly for small reductions, depending on the number of nodes I run on. All experiments are done on Hawk, the dual-socket AMD EPYC Rome system using ConnectX-6 installed at HLRS. I'm using the `main` branch of Open MPI.

coll/han does seem to provide better performance if the ranking is done by-node (no surprise since it reduces cross-node traffic) and provides fairly consistent performance with by-core ranking. coll/tuned varies between higher and lower latency than coll/han.
I have collected data for 4B reductions using the OSU benchmarks, with the `-f` flag to gather the average, min, and max latency observed on any process. It is interesting to note that it is not just the average that varies but (to a lesser degree) also the maximum latency. That suggests that it's not an artifact of the benchmark's timing.

In order not to waste too many node hours, I allocated nodes in multiples of 8 and ran on N-7..N nodes within an allocation of N nodes. I tested partitions 1-8, 9-16, ... and 5-12, 13-20, 21-28, ... to make sure that the effects I'm seeing aren't an artifact of this partitioning. The results are the same. In all cases I ran with 64 processes per node. I also have data with 48 processes per node where the effects are the same, albeit with slightly shifted switching points.
Average latency: (graph)

Maximum latency: (graph)

Minimum latency: (graph)
Here is how I run the benchmarks (sketched below): `rankby` is either `node` or `core`, and `npn` is 64 in the data above. I had to disable uct because I get segfaults otherwise (different story, not sure why).

As far as I can tell from the decision function, all runs use the binomial reduction tree (https://github.com/open-mpi/ompi/blob/main/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c#L662).
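Roughly, the invocation looks like this (a sketch only: `osu_reduce`, the `ppr` mapping, the message-size flag, and excluding the `uct` btl are illustrative choices and may not match the original command verbatim):

```bash
# Sketch of the invocation (flag choices illustrative, not verbatim):
#   rankby is either "node" or "core"; npn is 64 in the data above;
#   nnodes varied as described (N-7..N within an N-node allocation).
rankby=node   # or: core
npn=64
nnodes=8

mpirun -np $((nnodes * npn)) \
       --map-by ppr:${npn}:node \
       --rank-by ${rankby} \
       --mca btl ^uct \
       ./osu_reduce -m 4:4 -f
```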
I dumped the binomial trees used in coll/tuned for both the by-core and the by-node ranking, and they look just like what you would expect: mostly intra-node communication with by-core ranking and mostly inter-node communication with by-node ranking. This is reflected in the performance above. I didn't find any change in the tree between slow and fast runs of coll/tuned.

In the graphs below, the color of a node in the tree indicates the compute node it runs on (same color, same compute node). Dashed lines represent shared-memory communication and solid lines represent inter-node communication.
By-core ranking: (graph)

By-node ranking: (graph)
No surprise here, but it confirms that we're doing the right thing in coll/tuned with a linear distribution across nodes (at least with binomial trees).
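To make the by-core vs. by-node picture concrete, here is a small sketch (my own illustration: it uses the textbook binomial tree in which rank r's parent is r with its lowest set bit cleared, which is not necessarily identical to Open MPI's in-order bmtree) that counts how many tree edges cross node boundaries under the two rankings, assuming 4 nodes with 64 processes per node:

```bash
#!/usr/bin/env bash
# Illustration only (not Open MPI code): count inter-node edges of a textbook
# binomial tree rooted at rank 0 under the two rank-to-node mappings.
ppn=64; nnodes=4; np=$((ppn * nnodes))
bycore_cross=0; bynode_cross=0
for ((r = 1; r < np; r++)); do
  parent=$((r & (r - 1)))            # clear the lowest set bit -> parent rank
  # by-core ranking: consecutive ranks share a node (node = rank / ppn)
  if (( r / ppn != parent / ppn )); then ((++bycore_cross)); fi
  # by-node ranking: ranks are dealt round-robin (node = rank % nnodes)
  if (( r % nnodes != parent % nnodes )); then ((++bynode_cross)); fi
done
echo "inter-node edges, by-core ranking: ${bycore_cross} of $((np - 1))"
echo "inter-node edges, by-node ranking: ${bynode_cross} of $((np - 1))"
```

For this layout the sketch reports only nnodes-1 = 3 inter-node edges with by-core ranking, versus the large majority of edges with by-node ranking, which matches the trees pictured above.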
@janjust could you please run a similar set of benchmarks on your machines to make sure I'm not chasing a machine artifact?