Coll/HAN and Coll/Adapt not default on 5.0.x #10347

Closed
bwbarrett opened this issue May 3, 2022 · 25 comments

Comments

@bwbarrett
Member

We never bumped the priority of the HAN and ADAPT collective components on the 5.0.x branch.

I'm not submitting a PR right now (bumping the priority should be easy) because, at least on EFA, Allgather and Allreduce got considerably slower when using the HAN components. Might be user error, but need to dig more.
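For anyone who wants to try this ahead of a priority change, here is a minimal sketch of forcing HAN/ADAPT on (the value 100 is illustrative, not necessarily what a PR would pick; <N> and <prefix> are placeholders for your process count and install prefix):

    # Per run, on the mpirun command line:
    mpirun --mca coll_han_priority 100 --mca coll_adapt_priority 100 -np <N> ./app

    # Or persistently, via the MCA parameter file:
    cat >> <prefix>/etc/openmpi-mca-params.conf <<'EOF'
    coll_han_priority = 100
    coll_adapt_priority = 100
    EOF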

@awlauria
Contributor

I don't recall if there was any discussion of what priority they should be. @bosilca @janjust @gpaulsen @hppritcha @jsquyres

@jsquyres
Member

That was definitely the plan: to "preview" han/adapt in the 4.x series and then make it the default to replace "tuned" in 5.x.

@hppritcha
Member

That was my understanding as well.

@awlauria
Contributor

awlauria commented May 10, 2022

So the priority for both should be higher than tuned's.

Did one of han/adapt need a higher priority than the other, or should they be the same priority?
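For reference, the current default priorities can be checked with ompi_info (a quick sketch; the output format varies a bit between versions):

    # Show the priority parameters (and their current defaults) for each component
    ompi_info --param coll tuned --level 9 | grep priority
    ompi_info --param coll han   --level 9 | grep priority
    ompi_info --param coll adapt --level 9 | grep priority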

@wckzhang
Contributor

IIRC, HAN has more collectives implemented. We are going to do some performance testing on HAN/Adapt/Tuned.

@bwbarrett
Member Author

Before we change the priorities, someone with a device other than EFA really needs to run the benchmarks and see if there is a benefit close to what was promised in George's paper.

@awlauria
Contributor

@bosilca can you post the performance numbers or provide a link for reference?

@bosilca
Member

bosilca commented May 12, 2022

Not sure what I'm expected to provide here?

@awlauria
Contributor

  1. Are there any known performance data for han/adapt compared to the current ompi defaults?
  2. Is there a link to the paper we can view?
  3. Did you want to open the PR to raise the priorities? You probably have a better understanding of where these priorities should lie.

@bwbarrett
Member Author

On EFA, we see essentially no performance difference between today's v5.0.x branch and running with --mca coll_han_priority 100 --mca coll_adapt_priority 100 on the OSU collective benchmarks. In some cases (allreduce in particular), it hurt performance. There are some oddities in EFA's performance on v5.0.x right now, so this may be because of EFA. My ask was that someone with a network that isn't EFA or TCP try the change and verify that the collectives actually do something good. Otherwise, we're advertising improvements that aren't going to be there.
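For reference, the two runs I'm comparing differ only in the priority flags, roughly like this (the process count and benchmark path are placeholders):

    # Baseline: today's v5.0.x defaults (tuned handles most collectives)
    mpirun -np <N> ./osu_allreduce

    # Same benchmark with HAN/ADAPT forced on
    mpirun -np <N> --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./osu_allreduce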

@devreal
Contributor

devreal commented May 13, 2022

I'm planning to investigate #9062 soon and will also look at the general performance of coll/adapt and coll/han on an IB system. Will report back once I have the numbers.

@BrendanCunningham
Member

@bwbarrett is this with the OSU microbenchmark collectives? IMB collectives? Other? Which versions?

@jsquyres
Member

jsquyres commented May 24, 2022

We talked about this on the call today. @bwbarrett will be sending out some information to the devel list (and/or here) about what he ran for AWS.

A bunch of people on the call today agreed to run collective tests and see how HAN/ADAPT compare to tuned on their networks/environments. Bottom line: we need more data than just this single EFA data point:

  • NVIDIA
  • ORNL
  • Cornelis
  • IBM
  • UTK

@BrendanCunningham
Member

BrendanCunningham commented Jun 10, 2022

Cornelis Omni-Path results, 2 nodes, 22 ranks per node. One run of each benchmark in each configuration, so I haven't measured variance.

(ompi/v4.1.x, no coll/han coll/adapt MCA arguments):

+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      ./mpi/collective/osu_allgather

# OSU MPI Allgather Latency Test v5.9
# Size       Avg Latency(us)
1                       9.54
2                       9.73
4                       9.95
8                      11.39
16                     11.91
32                     12.65
64                     12.58
128                    15.75
256                    22.78
512                    37.90
1024                   94.52
2048                  176.84
4096                  322.08
8192                  642.53
16384                1251.69
32768                2312.39
65536                5103.28
131072              11283.70
262144              22618.81
524288              45343.99
1048576             90598.81

+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      ./mpi/collective/osu_allreduce

# OSU MPI Allreduce Latency Test v5.9
# Size       Avg Latency(us)
4                       7.21
8                       7.17
16                      7.24
32                      9.37
64                      9.17
128                    17.09
256                    18.03
512                    18.06
1024                   19.10
2048                   22.77
4096                   29.90
8192                   42.97
16384                  80.84
32768                 138.44
65536                 242.87
131072                504.71
262144                974.78
524288               1922.23
1048576              3875.57

(ompi/v4.1.x, with coll/han coll/adapt MCA arguments):

+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allgather

# OSU MPI Allgather Latency Test v5.9
# Size       Avg Latency(us)
1                      10.67
2                      10.34
4                      11.25
8                      11.57
16                     10.35
32                     11.39
64                     12.60
128                    15.61
256                    21.30
512                    40.22
1024                   68.57
2048                  128.42
4096                  244.77
8192                  463.33
16384                 825.39
32768                1563.50
65536                3060.89
131072               7140.29
262144              16518.99
524288              32637.53
1048576             63536.21

+ /home/bcunningham/projects/STL-63691/ompi-v4.1.x/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allreduce

# OSU MPI Allreduce Latency Test v5.9
# Size       Avg Latency(us)
4                       8.44
8                       8.29
16                      9.22
32                     11.92
64                     11.42
128                    11.76
256                    12.04
512                    12.64
1024                   14.87
2048                   18.22
4096                   26.91
8192                   43.80
16384                  62.63
32768                 115.12
65536                 223.10
131072                432.17
262144                735.30
524288               1545.48
1048576              3173.03

(ompi/main, no coll/han coll/adapt MCA arguments):

+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      ./mpi/collective/osu_allgather

# OSU MPI Allgather Latency Test v5.9
# Size       Avg Latency(us)
1                       7.70
2                       7.80
4                       8.08
8                       9.19
16                     10.07
32                     11.41
64                     12.90
128                    15.88
256                    23.07
512                    38.38
1024                   95.73
2048                  166.95
4096                  327.29
8192                  652.92
16384                1263.14
32768                2332.43
65536                5140.10
131072              11231.33
262144              22839.35
524288              45512.69
1048576             90426.54

+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      ./mpi/collective/osu_allreduce

# OSU MPI Allreduce Latency Test v5.9
# Size       Avg Latency(us)
4                       7.76
8                       7.64
16                      7.53
32                      9.33
64                      9.24
128                    17.82
256                    17.74
512                    18.11
1024                   18.94
2048                   23.07
4096                   29.85
8192                   42.78
16384                  81.19
32768                 133.46
65536                 248.76
131072                510.12
262144                981.97
524288               1989.39
1048576              3901.91

(ompi/main, with coll/han coll/adapt MCA arguments):

+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allgather

# OSU MPI Allgather Latency Test v5.9
# Size       Avg Latency(us)
1                       9.43
2                       8.97
4                       9.89
8                      10.20
16                     10.74
32                     12.23
64                     13.29
128                    16.28
256                    22.00
512                    40.14
1024                   67.05
2048                  126.75
4096                  259.95
8192                  448.19
16384                 823.10
32768                1589.41
65536                3067.19
131072               7037.30
262144              16388.66
524288              32337.04
1048576             63411.71

+ /home/bcunningham/projects/STL-63691/ompi-main/bin/mpirun -np 44 --map-by ppr:22:node \
      -host cn-priv-03:22,cn-priv-04:22 --mca mtl ofi --mca btl ofi -x FI_PROVIDER=psm2 \
      --mca coll_han_priority 100 --mca coll_adapt_priority 100 ./mpi/collective/osu_allreduce

# OSU MPI Allreduce Latency Test v5.9
# Size       Avg Latency(us)
4                      12.36
8                      11.79
16                     13.08
32                     14.66
64                     14.34
128                    14.52
256                    14.82
512                    15.36
1024                   17.52
2048                   20.18
4096                   25.30
8192                   36.60
16384                  67.08
32768                 114.37
65536                 236.39
131072                407.80
262144                668.37
524288               1473.07
1048576              2795.97

@janjust
Contributor

janjust commented Jun 13, 2022

X86 ConnectX-6 cluster, 32 nodes, 40 PPN.
OMPI V5.0.X

Any value above 0% is Adapt/HAN outperforming Tuned.

mpirun -np 1280 --map-by node -x UCX_NET_DEVICES=mlx5_0:1 --mca pml ucx --mca coll tuned,libnbc,basic -x UCX_WARN_UNUSED_ENV_VARS=n -x LD_LIBRARY_PATH ${exe}

mpirun -np 1280 --map-by node -x UCX_NET_DEVICES=mlx5_0:1 --mca pml ucx --mca coll adapt,han,libnbc,basic --mca coll_adapt_priority 100 --mca coll_han_priority 100 -x UCX_WARN_UNUSED_ENV_VARS=n -x LD_LIBRARY_PATH ${exe}

[Three performance comparison charts, % difference by message size (images)]

@awlauria
Contributor

awlauria commented Jun 14, 2022

Here are some runs with han + adapt compared to the current collective defaults, using ob1 on POWER9 with the v5.0.x branch.

6 nodes at 16 ppn. Testing at 40 ppn seems to have similar (or maybe better) results for han, but I did not aggregate them. A negative % indicates that han/adapt did better; a higher percentage means it did worse.

Running on these same machines with mofed 4.9 + ucx 1.10.1 showed little to no difference when comparing the defaults v. han/adapt, so I didn't bother posting the graphs.

[Three performance comparison graphs (images)]

@awlauria
Contributor

awlauria commented Jun 15, 2022

I seem to be running into issues with han/adapt on the IMB benchmarks, for example (with --map-by node):

# Iscatter

#------------------------------------------------------------------------------------------
# Benchmarking Iscatter 
# #processes = 640 
#------------------------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]      defects
            0           11         0.00         0.00         0.00         0.00         0.00
            1           11         0.00         0.00         0.00         0.00         0.00
            2           11         0.00         0.00         0.00         0.00         0.00
            4           11         0.00         0.00         0.00         0.00         0.00
            8           11         0.00         0.00         0.00         0.00         0.00
           16           11         0.00         0.00         0.00         0.00         0.00
           32           11         0.00         0.00         0.00         0.00         0.00
           64           11         0.00         0.00         0.00         0.00         0.00
          128           11         0.00         0.00         0.00         0.00         0.00
          256           11         0.00         0.00         0.00         0.00         0.00
          512           11         0.00         0.00         0.00         0.00         0.00
         1024           11         0.00         0.00         0.00         0.00         0.00

which is slightly worrisome. Has anyone else run IMB with han/adapt and gotten actual numbers? I do get numbers with --map-by core; --map-by node is what triggers the above.
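One way I can think of to narrow this down (a sketch, not a verified diagnosis) is to turn up the coll framework verbosity and check which component actually ends up servicing Iscatter; coll_base_verbose is a standard coll framework parameter, though the exact messages vary between versions:

    # Log collective component selection during communicator setup,
    # then grep for selection messages (the grep pattern is a guess).
    mpirun -np 640 --map-by node \
        --mca coll_han_priority 100 --mca coll_adapt_priority 100 \
        --mca coll_base_verbose 10 \
        ./IMB-NBC Iscatter 2>&1 | grep -i coll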

@wckzhang
Contributor

wckzhang commented Jul 6, 2022

@awlauria, your graphs don't have x/y labels. Is your y axis the same as @janjust's y scale (i.e., anything above 0 is han/adapt performing better, and anything below is worse)?

@awlauria
Contributor

awlauria commented Jul 6, 2022

Oh, foo. You are right, that isn't very clear...

It's actually the opposite. The Y axis measures the performance change of han/adapt vs. the defaults, and the X axis is message size. For my graphs, anything below 0 means han/adapt's time for the same test was X% lower than the default, so below 0 represents an improvement for han/adapt.
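To make the metric concrete, this is essentially how such a percentage can be computed from two OSU latency outputs (a quick sketch; the file names are hypothetical):

    # default.txt and han.txt: OSU output with "size  avg_latency" columns.
    # Computes (t_han - t_default) / t_default * 100 per message size;
    # negative values mean han/adapt was faster.
    paste default.txt han.txt | awk '/^[0-9]/ { printf "%8d %8.2f%%\n", $1, ($4 - $2) / $2 * 100 }'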

It's still on my to-do list to re-run these to confirm my findings.

@wckzhang
Contributor

wckzhang commented Jul 6, 2022

So it looks like HAN/Adapt outperform Tuned except at the largest message sizes, at least for your 6 node tests.

@awlauria
Contributor

awlauria commented Jul 6, 2022

Correct. I would like confirmation of that. I will re-run these numbers, perhaps at a slightly larger scale. Will try to do that by the end of this week.

@gkatev
Contributor

gkatev commented Aug 11, 2022

I'm also attaching some benchmarks I performed at some point. I have only experimented with bcast, reduce, allreduce, barrier.

Settings:

  • v5.0.x (rc7, I believe)
  • map by & bind to core
  • pml=ob1, btl=sm,uct, smsc=xpmem

For the fine-tuned configurations:

tuned fine-tuned:

  • use_dynamic_rules=true
  • bcast_algorithm=7 (knomial)
  • bcast_algorithm_segmentsize=128K
  • bcast_algorithm_knomial_radix=2
  • reduce_algorithm=6 (in-order binary)
  • reduce_algorithm_segmentsize=128K
  • allreduce_algorithm=2 (nonoverlapping (reduce+bcast))

adapt fine-tuned:

  • bcast_algorithm=2 (in_order_binomial)
  • bcast_segment_size=128K
  • reduce_algorithm=2 (in_order_binomial)
  • reduce_segment_size=128K

The <component>+<component> configurations imply HAN (format: <up module>+<low module>); a command-line sketch with these parameters is shown below.
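For concreteness, this is roughly how I'd expect those settings to be spelled on an mpirun command line, assuming the abbreviated names above carry the usual coll_tuned_/coll_adapt_ prefixes (double-check the exact parameter names with ompi_info; <N> and ./benchmark are placeholders):

    # tuned fine-tuned (sketch; 128K = 131072 bytes)
    mpirun --mca pml ob1 --mca btl sm,uct --mca smsc xpmem \
        --mca coll_tuned_use_dynamic_rules true \
        --mca coll_tuned_bcast_algorithm 7 \
        --mca coll_tuned_bcast_algorithm_segmentsize 131072 \
        --mca coll_tuned_bcast_algorithm_knomial_radix 2 \
        --mca coll_tuned_reduce_algorithm 6 \
        --mca coll_tuned_reduce_algorithm_segmentsize 131072 \
        --mca coll_tuned_allreduce_algorithm 2 \
        -np <N> ./benchmark

    # adapt fine-tuned (sketch)
    mpirun --mca pml ob1 --mca btl sm,uct --mca smsc xpmem \
        --mca coll_adapt_priority 100 --mca coll_han_priority 100 \
        --mca coll_adapt_bcast_algorithm 2 \
        --mca coll_adapt_bcast_segment_size 131072 \
        --mca coll_adapt_reduce_algorithm 2 \
        --mca coll_adapt_reduce_segment_size 131072 \
        -np <N> ./benchmark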

Full collection: plots.tar.gz

Some of them:

[Plots: dp-dam-bcast-8x, tie-allreduce-5x, tie-reduce-5x, dp-dam-barrier-10x]

@gkatev
Contributor

gkatev commented Aug 18, 2022

FYI at the moment, these 3 issues can impact HAN:

#10335
#10456
#10458

So if you attempt to use the MCA parameters and adjust the chosen sub-components, I suggest applying the fixes and/or verifying that the expected components are actually used.

@devreal
Contributor

devreal commented Aug 25, 2022

Sorry for the delay but here are some measurements for HAN on Hawk. I measured several different configurations, including 8, 24, 48, and 64 processes per node. I also ran with coll/sm as the backend but the differences seem minor. I also increased the HAN segment size to 256k (and did the same for coll/sm), which seems to have a positive impact on larger data sizes.

Takeaway: there are certain configurations where coll/tuned is faster than coll/han:

  1. For small messages at 1k procs on 16 nodes. I assume there is some algorithm change that plays nicely with the way OSU measures time.
  2. coll/han seems to consistently get slower for larger data sizes. The increased segment size has helped here, but further increasing it didn't have an impact for me. That needs some more investigation.
  3. Overall, coll/han shows consistent performance whereas coll/tuned has huge variations between configurations. I assume this is an artifact of the opaque tuning decisions.

Unfortunately, not all runs were successful (all runs in that job aborted).

8 Procs per node:

[Plots: reduce_8_osu_han_8_8_1752645 hawk-pbs5, pdf 1-3]

24 Procs per node:

[Plots: reduce_24_osu_han_8_24_1752649 hawk-pbs5, pdf 01-04]

48 Procs per node:

[Plots: reduce_48_osu_han_8_48_1752653 hawk-pbs5, pdf 1-3]

64 Procs per node:

[Plots: reduce_64_osu_han_8_64_1752657 hawk-pbs5, pdf 1-3; reduce_64_osu_han_8_64_1752774 hawk-pbs5, pdf 04]

Here is how I configured my runs:

  • coll/tuned: mpirun --mca coll ^hcoll --rank-by ${rankby} $mapby -N $npn -n $nprocs --bind-to core --mca coll_han_priority 0 --mca coll_hcoll_enable 0
  • coll/han: mpirun --mca coll ^hcoll --rank-by ${rankby} $mapby -N $npn -n $nprocs --bind-to core --mca coll_han_priority 100 --mca coll_han_reduce_segsize $((256*1024)) --mca coll_hcoll_enable 0
  • coll/han with coll/sm: mpirun --rank-by ${rankby} $mapby -N $npn -n $((nprocs)) --bind-to core --mca coll_han_priority 100 --mca coll_hcoll_enable 0 --mca coll_han_reduce_segsize $((256*1024)) --mca coll_sm_priority 80 --mca coll_sm_fragment_size $((260*1024))

@gpaulsen
Member

Closed by #11362 and #11389
