test/bench: add bcast/get/put benchmark #7157

Merged 4 commits on Jan 22, 2025

@hzhou hzhou commented Oct 2, 2024

Pull Request Description

Following #6907, this PR adds collective benchmarks, starting with barrier and bcast.

It also adds get_bw and put_bw, which work almost exactly like p2p_bw.

On my desktop:

$ export MPITEST_VERBOSE=1
$ mpirun -l -n 8 ./bcast -memtype=device
[7] Allocating buffer: memtype=device, device=7, size=5000000
[0] Allocating buffer: memtype=device, device=0, size=5000000
[0] TEST bcast:
[6] Allocating buffer: memtype=device, device=6, size=5000000
[2] Allocating buffer: memtype=device, device=2, size=5000000
[3] Allocating buffer: memtype=device, device=3, size=5000000
[5] Allocating buffer: memtype=device, device=5, size=5000000
[1] Allocating buffer: memtype=device, device=1, size=5000000
[4] Allocating buffer: memtype=device, device=4, size=5000000
[0] Barrier latency 33.098 +/- 0.743 us
[0]            1    283.896      4.881            0.004
[0]            2    282.667      0.864            0.007
[0]            4    282.172      0.955            0.014
[0]            8    282.547      0.987            0.028
[0]           16    281.973      0.960            0.057
[0]           32    258.795      0.529            0.124
[0]           64    258.640      1.185            0.247
[0]          128    258.742      0.917            0.495
[0]          256    288.946      3.025            0.886
[0]          512    275.079     10.361            1.861
[0]         1024    278.241      3.867            3.680
[0]         2048    276.950      3.892            7.395
[0]         4096    279.772      4.040           14.641
[0]         8192    284.587      3.501           28.786
[0]        16384    293.846      5.354           55.757
[0]        32768    112.957      2.848          290.093
[0]        65536    176.461     26.542          371.392
[0]       131072    279.492     12.952          468.965
[0]       262144    471.296     17.447          556.219
[0]       524288   1091.045     17.324          480.537
[0]      1048576   2338.931     39.098          448.314
[0]      2097152   3355.044     36.571          625.074
[0]      4194304   6950.164     71.623          603.483
[0]
[0]  No Errors[0]

$ mpirun -n 2 ./get_bw -sendmem=host -recvmem=host
Allocating buffer: memtype=host, device=0, size=5000000
Allocating buffer: memtype=host, device=1, size=5000000
TEST get_bw:
     msgsize    latency(us)  sigma(us)    bandwidth(MB/s)
           1      1.797      0.153            0.556
           2      1.745      0.007            1.146
           4      1.748      0.005            2.289
           8      1.747      0.005            4.579
          16      1.748      0.005            9.153
          32      1.747      0.005           18.319
          64      1.741      0.005           36.754
         128      1.746      0.006           73.321
         256      1.747      0.003          146.510
         512      1.756      0.005          291.540
        1024      1.850      0.005          553.436
        2048      1.922      0.005         1065.349
        4096      2.054      0.006         1993.963
        8192      2.342      0.008         3497.714
       16384      4.022      0.009         4073.417
       32768      9.231      0.023         3549.756
       65536     16.963      0.027         3863.518
      131072     32.997      0.297         3972.189
      262144     64.729      0.043         4049.897
      524288    128.234      0.089         4088.523
     1048576    167.923      0.535         6244.399
     2097152    346.319      0.630         6055.552
     4194304    744.334     97.164         5634.973

 No Errors
$ mpirun -n 4 ./barrier
TEST barrier:
Barrier latency 0.897 +/- 0.003 us
Barrier latency 0.900 +/- 0.003 us
Barrier latency 0.901 +/- 0.003 us
Barrier latency 0.895 +/- 0.003 us
Barrier latency 0.900 +/- 0.003 us
Barrier latency 0.899 +/- 0.002 us
Barrier latency 0.901 +/- 0.003 us
Barrier latency 0.894 +/- 0.003 us
Barrier latency 0.898 +/- 0.009 us
Barrier latency 0.901 +/- 0.003 us

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2410_bench branch 2 times, most recently from 8da66ac to da964c2 Compare November 4, 2024 01:00
@hzhou hzhou changed the title test/bench: add bcast benchmark test/bench: add bcast/get/put benchmark Jan 19, 2025
@hzhou hzhou requested a review from raffenet January 19, 2025 15:18
hzhou added 4 commits January 22, 2025 16:16
* Add or improve some comments

* Move MAX_BUFSIZE and NUM_REPEAT to bench_frame.def since they are
not specific to p2p tests

* Make grank and gsize global so we don't need to pass them between
functions.

* Add MEM_TYPES macro to support different ways of setting memtype and
memory devices.

* Use the foreach_size macro in benchmarks

* Macro run_stat optionally accepts RUN_STAT_VARIANCE to calculate
variance rather than standard deviation. This is useful if we want to
collectively reduce variance.

* Macro warm_up adds an optional WARM_UP_NUM_BEST to control the quality
of the warm-up. The higher the value, the longer the warm-up takes.
* Macro warm_up allows a custom MIN_ITER.
* Macro warm_up sets a ceiling of iter=10000. Some collectives take a
shortcut and skip the work for 0-sized data; the ceiling prevents too
many unnecessary warm-up iterations.
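
One likely reason the RUN_STAT_VARIANCE option helps with a collective reduction: variances can be summed across ranks and averaged, whereas standard deviations cannot be combined that way. The sketch below is illustrative only and is not the macro from bench_frame.def. The names `run_stat` (as a Python function), `want_variance`, and `reduce_across_ranks` are assumptions, a plain list stands in for a sum reduction over ranks, and the population variance (divide by n) is also an assumption.

```python
import math


def run_stat(samples, want_variance=False):
    """Return (mean, spread); spread is the variance or the standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n  # population variance (assumed)
    return mean, (variance if want_variance else math.sqrt(variance))


def reduce_across_ranks(per_rank):
    """Average per-rank (mean, variance) pairs, then take one square root.

    The list stands in for a sum reduction across ranks (hypothetical helper).
    """
    size = len(per_rank)
    mean = sum(m for m, _ in per_rank) / size
    variance = sum(v for _, v in per_rank) / size
    return mean, math.sqrt(variance)


# Two pretend ranks, each with its own latency samples in microseconds.
rank_samples = [[1.0, 3.0], [2.0, 6.0]]
per_rank = [run_stat(s, want_variance=True) for s in rank_samples]
mean, sigma = reduce_across_ranks(per_rank)
print(f"{mean:.3f} +/- {sigma:.3f} us")
```

Averaging per-rank variances this way captures the typical within-rank fluctuation. It does not include the spread between the per-rank means.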

Add bench_coll.def for common collective benchmark macros.

Add Barrier benchmarks. Since this is trivial, we run it 10 times to
show some fluctuations.

Add a bcast benchmark. This version measures the cumulative latency
(as in p2p_latency) of barrier + bcast combined. The reported data
subtracts the barrier latency.
The barrier often does not exit uniformly, especially when node topology
is in play. This affects different collective algorithms differently, so
using the combined latency doesn't hide too much detail for algorithm
comparisons.
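
The measurement described above can be sketched roughly as follows: time a loop of barriers alone, time a loop of barrier followed by bcast, and report the difference. This is a sketch under assumptions, not the code in bench_coll.def. The names `mean_latency` and `bcast_latency` are hypothetical, `barrier` and `bcast` are caller-supplied callables standing in for the MPI calls, and the `clock` parameter exists only so the example can be exercised without MPI.

```python
import time


def mean_latency(op, iters, clock=time.perf_counter):
    """Cumulative timing: one timer around the whole loop, divided by iters."""
    start = clock()
    for _ in range(iters):
        op()
    return (clock() - start) / iters


def bcast_latency(barrier, bcast, iters, clock=time.perf_counter):
    """Latency of barrier + bcast combined, minus the barrier-only latency."""
    def barrier_then_bcast():
        barrier()
        bcast()

    t_barrier = mean_latency(barrier, iters, clock)
    t_combined = mean_latency(barrier_then_bcast, iters, clock)
    return t_combined - t_barrier


if __name__ == "__main__":
    # Stand-in operations; a real benchmark would call the MPI collectives here.
    print(bcast_latency(lambda: None, lambda: sum(range(1000)), iters=1000))
```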

The OSU micro-benchmarks measure collective latency individually, then
reduce for min, max, and average. While this is still susceptible to
barrier behavior, it does provide more detail, which gives some insight
when comparing different algorithms.
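
For contrast, the min/max/average style of reporting can be sketched as below. Each rank contributes its own measured latency, and the three summaries are what reductions with MPI_MIN, MPI_MAX, and MPI_SUM (divided by the number of ranks) would produce. The function name `summarize_latency` and the sample numbers are made up for illustration, and a plain list stands in for the per-rank values.

```python
def summarize_latency(per_rank_latency_us):
    """Return the min, max, and average of the per-rank latencies."""
    size = len(per_rank_latency_us)
    return {
        "min": min(per_rank_latency_us),
        "max": max(per_rank_latency_us),
        "avg": sum(per_rank_latency_us) / size,
    }


# Hypothetical latencies from three ranks, in microseconds.
print(summarize_latency([10.0, 12.0, 20.0]))
```

A large gap between min and max indicates that ranks leave the collective at very different times, which a single combined number would not show.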

Measure the bandwidth of put and get, similar to p2p_bw.

Note: use "-sendmem [type] -recvmem [type]" to set memory types.
@hzhou hzhou merged commit 980cf27 into pmodels:main Jan 22, 2025
4 checks passed
@hzhou hzhou deleted the 2410_bench branch January 22, 2025 22:30