test/bench: add bcast/get/put benchmark #7157

hzhou · 2024-10-02T23:11:17Z

Pull Request Description

Following #6907, this PR adds collective benchmarks, starting with barrier and bcast.

Also adds get_bw and put_bw, which work almost exactly like p2p_bw.

On my desktop:

$ export MPITEST_VERBOSE=1
$ mpirun -l -n 8 ./bcast -memtype=device
[7] Allocating buffer: memtype=device, device=7, size=5000000
[0] Allocating buffer: memtype=device, device=0, size=5000000
[0] TEST bcast:
[6] Allocating buffer: memtype=device, device=6, size=5000000
[2] Allocating buffer: memtype=device, device=2, size=5000000
[3] Allocating buffer: memtype=device, device=3, size=5000000
[5] Allocating buffer: memtype=device, device=5, size=5000000
[1] Allocating buffer: memtype=device, device=1, size=5000000
[4] Allocating buffer: memtype=device, device=4, size=5000000
[0] Barrier latency 33.098 +/- 0.743 us
[0]            1    283.896      4.881            0.004
[0]            2    282.667      0.864            0.007
[0]            4    282.172      0.955            0.014
[0]            8    282.547      0.987            0.028
[0]           16    281.973      0.960            0.057
[0]           32    258.795      0.529            0.124
[0]           64    258.640      1.185            0.247
[0]          128    258.742      0.917            0.495
[0]          256    288.946      3.025            0.886
[0]          512    275.079     10.361            1.861
[0]         1024    278.241      3.867            3.680
[0]         2048    276.950      3.892            7.395
[0]         4096    279.772      4.040           14.641
[0]         8192    284.587      3.501           28.786
[0]        16384    293.846      5.354           55.757
[0]        32768    112.957      2.848          290.093
[0]        65536    176.461     26.542          371.392
[0]       131072    279.492     12.952          468.965
[0]       262144    471.296     17.447          556.219
[0]       524288   1091.045     17.324          480.537
[0]      1048576   2338.931     39.098          448.314
[0]      2097152   3355.044     36.571          625.074
[0]      4194304   6950.164     71.623          603.483
[0]
[0]  No Errors[0]

$ mpirun -n 2 ./get_bw -sendmem=host -recvmem=host
Allocating buffer: memtype=host, device=0, size=5000000
Allocating buffer: memtype=host, device=1, size=5000000
TEST get_bw:
     msgsize    latency(us)  sigma(us)    bandwidth(MB/s)
           1      1.797      0.153            0.556
           2      1.745      0.007            1.146
           4      1.748      0.005            2.289
           8      1.747      0.005            4.579
          16      1.748      0.005            9.153
          32      1.747      0.005           18.319
          64      1.741      0.005           36.754
         128      1.746      0.006           73.321
         256      1.747      0.003          146.510
         512      1.756      0.005          291.540
        1024      1.850      0.005          553.436
        2048      1.922      0.005         1065.349
        4096      2.054      0.006         1993.963
        8192      2.342      0.008         3497.714
       16384      4.022      0.009         4073.417
       32768      9.231      0.023         3549.756
       65536     16.963      0.027         3863.518
      131072     32.997      0.297         3972.189
      262144     64.729      0.043         4049.897
      524288    128.234      0.089         4088.523
     1048576    167.923      0.535         6244.399
     2097152    346.319      0.630         6055.552
     4194304    744.334     97.164         5634.973

 No Errors

$ mpirun -n 4 ./barrier
TEST barrier:
Barrier latency 0.897 +/- 0.003 us
Barrier latency 0.900 +/- 0.003 us
Barrier latency 0.901 +/- 0.003 us
Barrier latency 0.895 +/- 0.003 us
Barrier latency 0.900 +/- 0.003 us
Barrier latency 0.899 +/- 0.002 us
Barrier latency 0.901 +/- 0.003 us
Barrier latency 0.894 +/- 0.003 us
Barrier latency 0.898 +/- 0.009 us
Barrier latency 0.901 +/- 0.003 us

[skip warnings]

Author Checklist

Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
Contribution Agreement
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.

* Add or improve some comments.
* Move MAX_BUFSIZE and NUM_REPEAT to bench_frame.def, since they are not specific to p2p tests.
* Make grank and gsize global so we don't need to pass them between functions.
* Add the MEM_TYPES macro to cover the different ways of setting memtype and memory devices.
* Use the foreach_size macro in benchmarks.
* Macro run_stat optionally accepts RUN_STAT_VARIANCE to calculate the variance rather than the standard deviation. This is useful if we want to collectively reduce the variance.
* Macro warm_up adds an optional WARM_UP_NUM_BEST to control the quality of the warm-up. The higher the value, the longer the warm-up.
* Macro warm_up allows a custom MIN_ITER.
* Macro warm_up sets a ceiling of iter=10000. Some collectives take a shortcut and skip the work for 0-sized data; the ceiling prevents too many unnecessary warm-up iterations.
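The reason a variance is easier to reduce collectively than a standard deviation can be shown with a small Python sketch. This is not the run_stat macro itself: the function names are hypothetical, the samples are made up, and it assumes every rank takes the same number of samples, so that the pooled within-rank variance is simply the mean of the per-rank variances.

```python
import math


def mean_and_variance(samples):
    """Return (mean, population variance) of one rank's latency samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


def pooled_sigma(per_rank_variances):
    """Sum the per-rank variances (as a sum-reduction would), average, then take one sqrt."""
    total = sum(per_rank_variances)  # stands in for a sum-reduction across ranks
    return math.sqrt(total / len(per_rank_variances))


# Made-up samples for two ranks (illustrative only).
rank_samples = [
    [1.0, 1.0, 1.0, 1.0],  # perfectly steady rank: variance 0
    [1.0, 3.0, 1.0, 3.0],  # noisy rank: mean 2, variance 1
]
variances = [mean_and_variance(s)[1] for s in rank_samples]

sigma_from_variances = pooled_sigma(variances)
# Averaging the standard deviations instead gives a different, smaller number.
sigma_from_averaging = sum(math.sqrt(v) for v in variances) / len(variances)
```

Here the pooled sigma is sqrt(0.5), about 0.707, while averaging the two standard deviations gives 0.5, which understates the spread.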

Add bench_coll.def for common collective benchmark macros. Add barrier benchmarks. Since the barrier test is trivial, we run it 10 times to show some fluctuations.

Add bcast bench test. This version measures the cumulative latency (as in p2p_latency) of barrier + bcast combined. The reported data subtracts the barrier latency.
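The "measure barrier + bcast together, then subtract the barrier" idea can be sketched in Python with a simulated clock. Everything here is a stand-in: `FakeClock`, `barrier`, `bcast`, and the costs of 3 and 7 time units are hypothetical, chosen only so the arithmetic is easy to follow. A real benchmark would call the MPI collectives and read a real timer instead.

```python
class FakeClock:
    """Deterministic clock so the example does not depend on real timing."""

    def __init__(self):
        self.now = 0.0

    def advance(self, dt):
        self.now += dt

    def __call__(self):
        return self.now


def cumulative_latency(op, iters, clock):
    """Time `iters` back-to-back calls of op() and return the mean per-call latency."""
    start = clock()
    for _ in range(iters):
        op()
    return (clock() - start) / iters


clock = FakeClock()


def barrier():
    clock.advance(3.0)  # hypothetical barrier cost


def bcast():
    clock.advance(7.0)  # hypothetical bcast cost


def barrier_then_bcast():
    barrier()
    bcast()


barrier_latency = cumulative_latency(barrier, 100, clock)
combined_latency = cumulative_latency(barrier_then_bcast, 100, clock)
# The reported bcast latency is the combined latency minus the barrier latency.
bcast_latency = combined_latency - barrier_latency
```

With these stand-in costs the combined loop averages 10 units per iteration, and subtracting the 3-unit barrier leaves 7 units attributed to bcast.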

The barrier often does not exit uniformly, especially if node topology is in play. This affects different collective algorithms differently, thus using the combined latency does not hide too much detail for algorithm comparisons. The OSU micro-benchmarks measure collective latency individually, then reduce for min, max, and average. While that approach is still susceptible to barrier behavior, it does provide more detail for some insights when comparing different algorithms.
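For comparison, the min/max/average style of reporting can be sketched in a few lines of Python. The per-rank latencies below are made up and the function name is hypothetical; in an MPI program the three values would typically come from reductions with MPI_MIN, MPI_MAX, and MPI_SUM rather than from a local list.

```python
def reduce_latency(per_rank_latency):
    """Return (min, max, average) over the per-rank latencies."""
    lo = min(per_rank_latency)
    hi = max(per_rank_latency)
    avg = sum(per_rank_latency) / len(per_rank_latency)
    return lo, hi, avg


# Made-up latencies in microseconds, one entry per rank (illustrative only).
per_rank_latency_us = [10.0, 12.0, 20.0, 14.0]
lat_min, lat_max, lat_avg = reduce_latency(per_rank_latency_us)
```

The gap between the minimum (10.0) and the maximum (20.0) is the kind of detail that a single combined number does not show.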

Measure the bandwidth for put and get similar to p2p_bw. Note: use "-sendmem [type] -recvmem [type]" to set memory types.
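The columns in the get_bw sample output are consistent with bandwidth = msgsize / latency, since one byte per microsecond equals one MB/s (with 1 MB taken as 10^6 bytes). A minimal Python check against three rows copied from that output follows; the function name is hypothetical, and small differences are expected because the printed latencies are rounded to three decimals.

```python
import math


def bandwidth_mb_per_s(msgsize_bytes, latency_us):
    """Bytes per microsecond is numerically equal to MB/s (1 MB = 10**6 bytes)."""
    return msgsize_bytes / latency_us


# (msgsize, latency in us, reported bandwidth in MB/s) from the get_bw output.
rows = [
    (4096, 2.054, 1993.963),
    (1048576, 167.923, 6244.399),
    (4194304, 744.334, 5634.973),
]

computed = [bandwidth_mb_per_s(size, lat) for size, lat, _ in rows]
matches = [
    math.isclose(bw, reported, rel_tol=1e-3)
    for bw, (_, _, reported) in zip(computed, rows)
]
```

All three rows agree with the reported bandwidth to within 0.1%.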

hzhou force-pushed the 2410_bench branch from a267916 to b4fe32c on October 2, 2024 23:15

hzhou force-pushed the 2410_bench branch 2 times, most recently from 8da66ac to da964c2 on November 4, 2024 01:00

hzhou force-pushed the 2410_bench branch from da964c2 to 45775b1 on January 19, 2025 14:57

hzhou changed the title ~~test/bench: add bcast benchmark~~ test/bench: add bcast/get/put benchmark Jan 19, 2025

hzhou force-pushed the 2410_bench branch from 45775b1 to 442386f on January 19, 2025 15:17

hzhou requested a review from raffenet January 19, 2025 15:18

raffenet approved these changes Jan 22, 2025

hzhou added 4 commits January 22, 2025 16:16

test/bench: add get_bw and put_bw

d637e57

Measure the bandwidth for put and get similar to p2p_bw. Note: use "-sendmem [type] -recvmem [type]" to set memory types.

hzhou force-pushed the 2410_bench branch from 442386f to d637e57 on January 22, 2025 22:16

hzhou merged commit 980cf27 into pmodels:main Jan 22, 2025
4 checks passed

hzhou deleted the 2410_bench branch January 22, 2025 22:30
