Coll/HAN and Coll/Adapt not default on 5.0.x #10347
We never bumped the priority of the HAN and ADAPT collective components on the 5.0.x branch.
I'm not submitting a PR right now (bumping the priority should be easy) because, at least on EFA, Allgather and Allreduce got considerably slower when using the HAN components. It might be user error, but we need to dig more.
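(For reference, a minimal sketch of what bumping the priorities at run time looks like; the value 100 and the application name are placeholders, chosen only to outrank coll/tuned's default priority. Making HAN/ADAPT the default would instead mean raising the default priorities registered inside the components themselves.)

```shell
# Force coll/han and coll/adapt to outrank coll/tuned for this run only.
# 100 is an arbitrary illustrative value; ./my_app is a placeholder binary.
mpirun -np 64 \
    --mca coll_han_priority 100 \
    --mca coll_adapt_priority 100 \
    ./my_app
```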
I don't recall if there was any discussion of what priority they should be. @bosilca @janjust @gpaulsen @hppritcha @jsquyres
That was definitely the plan: to "preview" han/adapt in the 4.x series and then make them the default, replacing "tuned", in 5.x.
That was my understanding as well.
So the priority for both should be higher than tuned. Did one of han/adapt need to be a higher priority than the other? Or should they be the same priority?
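One way to see what the defaults currently are before picking new values is to ask ompi_info for each component's priority parameter (a sketch; the exact output format varies across versions):

```shell
# List the registered priority parameters for tuned, han, and adapt.
ompi_info --param coll tuned --level 9 | grep priority
ompi_info --param coll han --level 9 | grep priority
ompi_info --param coll adapt --level 9 | grep priority
```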
IIRC, HAN has more collectives implemented. We are going to do some performance testing on HAN/Adapt/Tuned.
Before we change the priorities, someone with a device other than EFA really needs to run and see if there is benefit close to that promised in George's paper.
@bosilca can you post the performance numbers or provide a link for reference?
Not sure what I'm expected to provide here?
On EFA, we see essentially no performance difference between today's v5.0.x branch and running with HAN/ADAPT enabled.
I'm planning to investigate #9062 soon and will also look at the general performance of coll/adapt and coll/han on an IB system. I will report back once I have the numbers.
@bwbarrett is this with the OSU microbenchmark collectives? IMB collectives? Other? Which versions?
We talked about this on the call today. @bwbarrett will be sending out some information to the devel list (and/or here) about what he ran for AWS. A bunch of people on the call today agreed to run collective tests and see how HAN/ADAPT compared to tuned on their networks / environments. Bottom line: we need more data than just this single EFA data point.
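As a rough sketch of the kind of A/B comparison being requested (node/rank counts and the benchmark path are placeholders; any OSU or IMB collective test would do): one run with HAN/ADAPT excluded, one with them forced above tuned.

```shell
# Baseline: exclude coll/han and coll/adapt so coll/tuned is selected.
mpirun -np 256 --map-by ppr:32:node \
    --mca coll ^han,adapt \
    ./osu_allreduce

# Candidate: force coll/han and coll/adapt above coll/tuned (100 is illustrative).
mpirun -np 256 --map-by ppr:32:node \
    --mca coll_han_priority 100 --mca coll_adapt_priority 100 \
    ./osu_allreduce
```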
Cornelis Omni-Path results, 2 nodes, 22 ranks per node. One run of each benchmark in each configuration, so I haven't measured variance. Configurations tested:
- ompi/v4.1.x, no coll/han or coll/adapt MCA arguments
- ompi/v4.1.x, with coll/han and coll/adapt MCA arguments
- ompi/main, no coll/han or coll/adapt MCA arguments
- ompi/main, with coll/han and coll/adapt MCA arguments
x86 ConnectX-6 cluster, 32 nodes, 40 PPN. Any value above 0% is Adapt/HAN outperforming Tuned.
Here are some runs with HAN + ADAPT compared to the current collective defaults, using ob1 on POWER9 with the v5.0.x branch: 6 nodes at 16 ppn. Testing at 40 ppn seems to have similar (or maybe better) results for HAN, but I did not aggregate them. A negative % indicates that HAN/ADAPT did better; a higher percentage means it did worse. Running on these same machines with MOFED 4.9 + UCX 1.10.1 showed little to no difference when comparing the defaults vs. HAN/ADAPT, so I didn't bother posting the graphs.
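For what it's worth, the ob1-vs-UCX comparison can be made explicit by pinning the PML, since the transport underneath also affects the collectives; a sketch with placeholder counts and benchmark:

```shell
# Same collective settings, explicitly selecting the ob1 PML...
mpirun -np 96 --mca pml ob1 \
    --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./osu_allreduce

# ...and the UCX PML, to separate PML effects from collective-component effects.
mpirun -np 96 --mca pml ucx \
    --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./osu_allreduce
```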
I seem to be running into issues running with HAN/ADAPT with the IMB benchmarks, for example with --map-by node, which is slightly worrisome. Has anyone else run IMB with HAN/ADAPT and gotten actual numbers? I seem to get numbers with --map-by core.
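For anyone trying to reproduce this, a sketch of the two mappings being compared (rank counts, benchmark selection, and the IMB binary path are placeholders):

```shell
# Works here: ranks packed by core.
mpirun -np 128 --map-by core \
    --mca coll_han_priority 100 --mca coll_adapt_priority 100 \
    ./IMB-MPI1 Allreduce Bcast

# Problematic here: ranks laid out round-robin across nodes.
mpirun -np 128 --map-by node \
    --mca coll_han_priority 100 --mca coll_adapt_priority 100 \
    ./IMB-MPI1 Allreduce Bcast
```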
Oh, foo. You are right, that isn't very clear... It's actually the opposite. The Y axis measures the performance improvement of HAN/ADAPT vs. the defaults, and the X axis is message size. So for my graphs, anything below 0 means that HAN/ADAPT's time for the same test was X% lower than the default, so below 0 represents an improvement for HAN/ADAPT. It's still on my to-do list to re-run these to confirm my findings.
So it looks like HAN/ADAPT outperform Tuned except at the largest message sizes, at least for your 6-node tests.
Correct. I would like confirmation of that. I will re-run these numbers, perhaps at a slightly larger scale. I will try to do that by the end of this week.
I'm also attaching some benchmarks I performed at some point. I have only experimented with bcast, reduce, allreduce, and barrier.
Settings:
For the fine-tuned configurations:
- tuned fine-tuned
- adapt fine-tuned
The full collection: plots.tar.gz
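The exact fine-tuned settings aren't reproduced above. As a generic illustration only (not the configuration used for these plots), per-collective tuning of coll/tuned is typically driven through its dynamic-rules parameters:

```shell
# Illustrative only: point coll/tuned at a hand-written dynamic rules file
# that pins specific algorithms per communicator and message size.
mpirun -np 128 \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_dynamic_rules_filename /path/to/tuned_rules.conf \
    ./osu_allreduce
```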
Sorry for the delay, but here are some measurements for HAN on Hawk. I measured several different configurations, including 8, 24, 48, and 64 processes per node. I also ran with
Takeaway: there are certain configurations where
Unfortunately, not all runs were successful (all runs in that job aborted).
8 procs per node:
24 procs per node:
48 procs per node:
64 procs per node:
Here is how I configured my runs: