Offload reduction operations to accelerator devices #12318

devreal · 2024-02-06T20:34:08Z

This PR is an attempt to offload reduction operations in MPI_Allreduce to accelerator devices if the input buffer is located on a device.

A few notes:

There is a heuristic that decides whether to launch a kernel for the reduction or to pull the data to the host and reduce it there. It has not been well tested, and its parameters probably need to be determined at startup.
We need to pass streams through the call hierarchy so that copies and kernel launches can be stream-ordered. That requires changes to the operation API.
Data movement on the device is expensive, so I adjusted the algorithms to use the 3buff variants of the ops whenever possible instead of moving data explicitly. That turned out to be beneficial on the host as well, especially for larger reductions.
This is still WIP, but I'm hoping that it can serve as a starting point for others working on device integration, as I have currently run out of time. I will try to rebase and fix conflicts soon.
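
For readers unfamiliar with the terms in the notes above, here is a minimal host-only sketch in C of a count-based selection heuristic and of the difference between a 2-buffer and a 3-buffer reduction op. It is an illustration under stated assumptions, not the code in this PR: every name (`sketch_select_target`, `sketch_sum_2buff`, `sketch_sum_3buff`) and the 4096-element cutoff are hypothetical, and the loops stand in for what would be a device kernel on the accelerator path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cutoff: reductions with fewer elements stay on the host.
 * The real parameters would likely have to be determined at startup. */
#define SKETCH_DEVICE_MIN_COUNT 4096

typedef enum { SKETCH_REDUCE_HOST, SKETCH_REDUCE_DEVICE } sketch_target_t;

/* Decide where to reduce, based only on buffer location and element count. */
sketch_target_t sketch_select_target(bool buf_on_device, size_t count)
{
    if (!buf_on_device) {
        return SKETCH_REDUCE_HOST;
    }
    return (count >= SKETCH_DEVICE_MIN_COUNT) ? SKETCH_REDUCE_DEVICE
                                              : SKETCH_REDUCE_HOST;
}

/* 2-buffer sum: inout[i] += in[i]. The result overwrites one operand, so
 * that operand must already sit in the destination buffer. */
void sketch_sum_2buff(const int *in, int *inout, size_t count)
{
    assert(in != NULL && inout != NULL);
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
}

/* 3-buffer sum: out[i] = in1[i] + in2[i]. The result is written straight to
 * the destination, so neither operand has to be copied there first. */
void sketch_sum_3buff(const int *in1, const int *in2, int *out, size_t count)
{
    assert(in1 != NULL && in2 != NULL && out != NULL);
    for (size_t i = 0; i < count; ++i) {
        out[i] = in1[i] + in2[i];
    }
}
```

With the 2-buffer form, an algorithm that wants the result in a separate buffer has to copy one operand there before reducing. The 3-buffer form folds that copy into the reduction itself, which is the saving the third note refers to.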

If the target process is unable to execute an RDMA operation, it instructs the origin to change the communication protocol. When this happens, the origin must be informed so that it cancels all pending RDMA operations and releases the rdma_frag.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

edgargabriel · 2024-02-08T14:20:39Z

@devreal thank you for all of this work! Can I make a suggestion? This is a massive PR as it stands. Could we try to break it down into multiple smaller pieces that are more manageable? E.g.

a pr that contains the changes to the accelerator framework components
a pr that contains the changes to the op framework
...
the last one probably being the changes required to pull everything together and use the code

Ideally, even if a new feature in one of the components is not used initially, it can be reviewed and resolved independently, and if we do it right it shouldn't cause any issues as long as it's not used. I am more than happy to help with that process if you want.

jiaxiyan · 2024-02-14T18:47:27Z

@devreal Can you share how you configure the build? It seems that the C++ dependency is wrong when I build it.

devreal · 2024-02-20T15:39:12Z

@edgargabriel I agree, this should be split up. I will start with the accelerator framework.

jiaxiyan · 2024-02-23T00:20:35Z

@devreal We built with libfabric and collected some osu-micro-benchmarks performance data on GPU instances (p4d.24xlarge).

osu_reduce latency on a single node with 96 processes

$ mpirun -np 96 --use-hwthread-cpus --mca pml ob1 -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f

# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                      50.89             18.27             89.23        1000
2                      50.20             18.24             88.37        1000
4                      49.70             17.48            105.89        1000
8                      46.01             16.58             91.06        1000
16                     44.55             16.75             79.57        1000
32                     45.75             16.70             81.72        1000
64                     45.17             16.56             81.88        1000
128                    45.51             16.62             81.47        1000
256                    45.46             16.61             81.15        1000
512                    46.62             17.25             83.92        1000
1024                   46.39             17.17             82.71        1000
2048                   48.14             18.45             87.07        1000
4096                   59.19             23.05             97.78        1000
8192                   72.53             26.28            133.33        1000
16384                  91.32             33.93            265.90         100
32768                 115.79             42.75            200.79         100
65536                 183.01             86.72            319.37         100
131072                337.60            153.89            597.25         100
262144                653.80            328.93           1140.21         100
524288               1413.32            883.30           2473.55         100
1048576              4315.48           1318.93           8312.18         100


$ mpirun -np 96 --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f

# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                      52.15             23.16             92.76        1000
2                      52.96             22.55             96.52        1000
4                      51.36             21.01            103.20        1000
8                      47.36             20.21             85.74        1000
16                     45.91             20.14             78.97        1000
32                     45.37             20.05             78.47        1000
64                     45.63             17.62             79.62        1000
128                    46.00             17.89             81.53        1000
256                    46.86             18.01             84.94        1000
512                    46.27             18.25             82.34        1000
1024                   54.11             18.39            113.24        1000
2048                   48.34             18.96             87.42        1000
4096                   50.72             19.41             92.88        1000
8192                   68.85             42.03            119.58        1000
16384                  81.37             48.91            145.23         100
32768                 114.14             65.15            215.77         100
65536                 182.49            109.55            336.56         100
131072                331.24            182.98            600.06         100
262144                652.47            374.06           1201.86         100
524288               1373.64            862.05           2473.60         100
1048576              3890.68           1835.44           7693.43         100

osu_reduce latency on 2 nodes with 96 processes per node

$ mpirun -np 192 --hostfile /home/ec2-user/PortaFiducia/hostfile --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f

# OSU MPI-CUDA Reduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                      53.34             24.80            158.96        1000
2                      52.71             23.64            127.55        1000
4                      52.65             23.31            125.60        1000
8                      52.79             22.93            128.73        1000
16                     52.97             23.85            129.14        1000
32                     53.07             23.99            131.40        1000
64                     53.17             23.27            131.36        1000
128                    53.91             23.22            160.13        1000
256                    53.96             23.56            131.35        1000
512                    54.57             23.78            136.56        1000
1024                   55.48             24.05            134.61        1000
2048                   58.17             25.93            167.34        1000
4096                   60.06             28.02            144.41        1000
8192                   80.41             53.82            175.23        1000
16384                  96.45             64.75            188.37         100
32768                 130.03             81.92            257.46         100
65536                 216.15            137.95            670.17         100
131072                458.93            312.95           1157.73         100
262144               1099.14            807.35           2219.35         100
524288               3153.31           2125.70           4677.10         100
1048576              7281.71           6291.95           8827.05         100

osu_allreduce latency on a single node with 96 processes

$ mpirun -np 96 --use-hwthread-cpus --mca pml ob1 -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f


# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                    1901.97           1895.70           1909.23        1000
2                    1866.46           1856.51           1890.17        1000
4                    1849.15           1846.66           1864.29        1000
8                    1865.48           1857.94           1869.82        1000
16                   1889.69           1887.12           1892.54        1000
32                   1887.29           1874.57           1892.87        1000
64                   1866.39           1862.39           1878.10        1000
128                  1902.69           1881.43           1950.76        1000
256                  6846.33           6153.07           7516.33        1000
512                  4960.26           4586.52           5442.11        1000
1024                 1869.71           1867.50           1879.32        1000
2048                 1860.47           1850.71           1869.47        1000
4096                 1920.09           1914.88           1925.71        1000
8192                 1946.90           1940.31           1954.75        1000
16384                1954.01           1941.61           1966.76         100
32768                2169.78           2095.39           2231.91         100
65536                2208.15           2149.11           2267.01         100
131072               1294.21           1152.11           1401.04         100
262144               2070.86           1986.83           2133.47         100
524288               4630.58           4475.43           4727.98         100
1048576              9268.92           8822.77           9553.14         100


$ mpirun -np 96 --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f

# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                    1888.06           1874.91           1902.12        1000
2                    1870.06           1867.39           1881.06        1000
4                    1886.02           1881.17           1890.68        1000
8                    1885.91           1878.96           1909.19        1000
16                   1873.81           1852.46           1890.93        1000
32                   1866.67           1857.62           1887.08        1000
64                   1854.75           1848.20           1867.71        1000
128                  1856.85           1851.18           1865.18        1000
256                  1886.99           1879.35           1894.61        1000
512                  1879.01           1875.44           1888.82        1000
1024                 1879.15           1867.80           1897.18        1000
2048                 1886.56           1876.13           1905.26        1000
4096                 1891.85           1885.14           1909.16        1000
8192                 1964.46           1946.55           1980.99        1000
16384                1996.35           1977.00           2010.00         100
32768                2081.54           2046.57           2107.62         100
65536                2267.33           2206.32           2321.61         100
131072               1600.58           1421.01           1698.15         100
262144               2057.70           1933.97           2141.18         100
524288               4341.36           4102.91           4491.09         100
1048576              9190.83           8568.45           9572.64         100

osu_allreduce latency on 2 nodes with 96 processes per node

$ mpirun -np 192 --hostfile /home/ec2-user/PortaFiducia/hostfile --use-hwthread-cpus --mca pml cm -x LD_LIBRARY_PATH=/home/ec2-user/libfabric/install/lib:/home/ec2-user/ompi/install/lib -x PATH=/home/ec2-user/ompi/install/bin /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f

# OSU MPI-CUDA Allreduce Latency Test v7.3
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                    1918.58           1898.68           1950.81        1000
2                    1923.21           1902.49           1956.78        1000
4                    1919.62           1898.32           1954.29        1000
8                    1966.44           1945.62           1986.83        1000
16                   1932.10           1911.17           1967.35        1000
32                   1905.98           1885.25           1945.11        1000
64                   1938.28           1916.89           1980.00        1000
128                  1922.34           1898.96           1957.66        1000
256                  1923.65           1900.70           1959.54        1000
512                  2026.75           2003.67           2056.36        1000
1024                 1939.01           1916.19           1975.81        1000
2048                 1931.89           1907.90           1965.60        1000
4096                 1944.01           1919.52           1970.58        1000
8192                 1978.66           1956.17           2020.18        1000
16384                1972.88           1946.22           1994.44         100
32768                2315.21           2258.95           2487.19         100
65536                2493.70           2352.68           2578.54         100
131072               1861.39           1687.25           1978.94         100
262144               3510.34           3260.06           3757.74         100
524288               6721.11           6278.76           6963.19         100
1048576             14090.28          13764.27          14250.96         100

We found that ireduce and iallreduce segfault with the following output:

[ip-172-31-26-243.ec2.internal:06486] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.ip-172-31-26-243.1000/jf.0/1369899008/shared_mem_cuda_pool.ip-172-31-26-243 could be created.
[ip-172-31-26-243.ec2.internal:06486] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728 
ERROR: No suitable module for op MPI_SUM on type MPI_CHAR found for device memory!

jiaxiyan · 2024-02-23T00:28:24Z

On a single node with UCX

$  mpirun -np 96  --use-hwthread-cpus --mca pml ucx /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda -f


# OSU MPI-CUDA Reduce Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                      49.54             41.59             57.18        1000
2                      49.45             41.00             58.08        1000
4                      49.67             41.47             57.88        1000
8                      50.98             40.16             69.13        1000
16                     49.80             40.49             59.02        1000
32                     50.35             42.51             58.11        1000
64                     52.15             42.93             66.98        1000
128                    52.04             42.93             59.65        1000
256                    51.89             42.68             60.94        1000
512                    51.82             42.38             60.71        1000
1024                   53.97             42.73             72.85        1000
2048                   54.96             45.69             71.26        1000
4096                   56.78             48.69             66.27        1000
8192                   63.23             54.73             84.31        1000
16384                  77.46             65.07             89.25         100
32768                 104.21             93.89            117.12         100
65536                 168.33            148.43            191.16         100
131072                321.52            283.51            366.13         100
262144                713.00            670.85            759.71         100
524288               1485.87           1429.74           1556.93         100
1048576              3981.97           3750.12           4288.07         100

$ mpirun -n 96  --use-hwthread-cpus --mca pml ucx /home/ec2-user/osu-micro-benchmarks/install/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -f

# OSU MPI-CUDA Allreduce Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
1                    1100.07            643.62           1575.28        1000
2                    1117.94            645.60           1569.19        1000
4                    1115.97            666.85           1574.60        1000
8                    1105.40            658.64           1532.47        1000
16                   1142.66            699.44           1610.27        1000
32                   1102.44            773.93           1468.40        1000
64                   1085.40            766.52           1442.15        1000
128                  1115.23            851.01           1473.89        1000
256                  1098.70            839.22           1431.71        1000
512                  1104.72            814.51           1441.95        1000
1024                 1112.75            857.42           1431.18        1000
2048                 1102.28            790.27           1490.52        1000
4096                 1126.28            767.91           1545.33        1000
8192                 1170.84            749.98           1673.32        1000
16384                1403.39            737.39           1912.97         100
32768                1371.41            728.90           2004.60         100
65536                1659.78            879.67           2435.50         100
131072               1407.15           1239.39           1499.39         100
262144               2490.19           2254.67           2633.81         100
524288               4649.12           4118.11           5064.59         100
1048576             10059.62           8996.23          10887.01         100

Joseph Schuchart and others added 30 commits November 7, 2023 18:09

Initial draft of CUDA device support for ops

35ff1da

Signed-off-by: Joseph Schuchart <jschuchart@leconte.icl.utk.edu>

First working version of CUDA op support

b7e6f89

Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>

Update copyright header

164388a

Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>

Fix minor bugs to get osu_allreduce working

d8110ac

Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>

cuMemAllocAsync is supported since CUDA 11.2.0

f609127

Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>

coll/base/allreduce: Condition device allocation on op/dtype support

8ae3dac

Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>

Make sure the device op callbacks are zero-initialized

655948f

Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>

Be more graceful when creating a context and stream

7cdc828

Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>

fix wrong call to memset

bdb16a1

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Add detector for cudart

5934f43

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Add CUDA stream-based allocator and memory pools

c2c3d0e

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Don't memset the CUDA op component, we need the version

5df449c

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Set the memory pool release threshold

812d068

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Implement device-compatible allocator to cache coll temporaries

a688c84

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Fix devicebucket allocator for larger sizes

bbd362d

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Stream-based reduction and ddt copy and 3buff cuda kernels, adopted for allreduce recursive doubling

f2f0f2d

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Remove extra copies from allreduce redscat and ring

8f5b503

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Allow ops and memcpy on managed memory from the host

1c68d17

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

reduce_local: add support for device memory

70dde0f

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Draft of ompi_op_select_device

e603bcc

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Second draft of ompi_op_select_device

60dd446

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Fix undefined symbols in cuda op component

c485ecf

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Fix off-by-one error in device-bucket allocator

793863c

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Heuristic to select op device based on element count

d2e8677

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

init op_rocm, not compilable yet

cd7e578

Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>

implemented funcs in accelerator_rocm modules

2ccaa87

Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>

add -I include path to Makefile

a6f1cce

Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>

added rocm codes into test example

ce0b88d

Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>

fixed kernel launches in hip

ad420fe

Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>

devreal added 14 commits November 7, 2023 18:48

Allreduce: don't copy inputs if data can be accessed from the host

56bcfee

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Be more careful when releasing temporary receive buffers

a1f089e

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Remove debug output and dead code

33616e6

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Bump max devicebucket allocator max size to 1GB

9da8b54

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

accelerator/cuda: fix error message

93ded5e

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

CUDA: Select compute capability 52 by default

182e6fa

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Squash const correctness warnings

e5eb45f

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Squash warnings about mismatched function pointer types

14a5372

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Squash printfs

1f63809

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Replace fprintf with show_help

3d9f33a

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Squash compiler warnings

c878c4f

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Clean up cuda and rocm op codes

1c6667d

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Minor tweak to CUDA op configury

7bb4b95

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Fix rebase errors

d1382c3

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

devreal added ⚠️ WIP-DNM! Target: main labels Feb 6, 2024

devreal requested review from bosilca, edgargabriel and wenduwan February 6, 2024 20:34

devreal marked this pull request as draft February 6, 2024 20:34

devreal mentioned this pull request Feb 20, 2024

Add stream operations to accelerator components #12356

Merged

This was referenced May 23, 2024

Add CUDA/HIP implementations of reduction operators #12569

Open

Make datatype copy stream-aware #12570

Open

This was referenced Jun 10, 2024

Add bucket allocator for device memory #12610

Open

Add accelerator-awareness to most allreduce implementations #12611

Open
