Add CUDA/HIP implementations of reduction operators #12569
base: main
Conversation
}

if (MCA_ACCELERATOR_NO_DEVICE_ID == target_device) {
    opal_accelerator.mem_release_stream(device, target, stream);
Just as a thought for a subsequent PR: we could get rid of the mem_alloc and mem_release functions in the accelerator framework interfaces and have only the stream-based versions, with the default stream being used if no stream argument has been provided by the user. This would reduce the API functions a bit and avoid nearly identical code.
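A minimal sketch of what the consolidated, stream-only interface could look like (the wrapper name and the default-stream handle are assumptions for illustration, not the actual framework API):

/* Hypothetical consolidated entry point: callers always pass a stream,
 * and a NULL stream falls back to the accelerator's default stream. */
static inline int accelerator_mem_alloc(int dev_id, void **ptr, size_t size,
                                        opal_accelerator_stream_t *stream)
{
    if (NULL == stream) {
        stream = default_stream;  /* assumed handle to the default stream */
    }
    return opal_accelerator.mem_alloc_stream(dev_id, ptr, size, stream);
}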
ompi/mca/op/rocm/Makefile.am
Outdated
sources = op_rocm_component.c op_rocm.h op_rocm_functions.c op_rocm_impl.h
rocm_sources = op_rocm_impl.hip

HIPCC = hipcc
We might have to change that in the near future: hipcc is going away and we should be using amdclang with --offload-arch arguments. It's ok to leave it as is for now.
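A rough sketch of what the replacement rule could look like with amdclang (compiler name, flags, and the gfx target list are illustrative and would need to match the actual ROCm install):

# Hypothetical replacement for the hipcc-based rule: amdclang++ compiles HIP
# sources directly when told the source language and the target GPU architectures.
HIPCXX = amdclang++
HIPCXXFLAGS = -x hip --offload-arch=gfx90a --offload-arch=gfx942

op_rocm_impl.o: op_rocm_impl.hip
	$(HIPCXX) $(HIPCXXFLAGS) -fPIC -c $< -o $@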
#define xstr(x) #x
#define str(x) xstr(x)

#define CHECK(fn, args) \
We don't abort inside the software stack.
Proposal: add a return value to the internal operator API and wrap user-defined operators that don't provide a return. That adds quite a bit of churn to this PR and touches many more places. Maybe that should be a separate change?
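A minimal sketch of the non-aborting variant, assuming the internal operator functions gained an int return value (the macro name and error handling are illustrative):

/* Hypothetical error-propagating version of CHECK: log the failure and
 * return an error code to the caller instead of aborting in the library. */
#define CHECK_RET(fn, args)                                          \
    do {                                                             \
        CUresult rc_ = fn args;                                      \
        if (CUDA_SUCCESS != rc_) {                                   \
            const char *msg_ = NULL;                                 \
            cuGetErrorString(rc_, &msg_);                            \
            opal_output(0, "%s failed: %s", #fn,                     \
                        (NULL != msg_) ? msg_ : "unknown error");    \
            return OMPI_ERROR;                                       \
        }                                                            \
    } while (0)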
@bosilca Are you ok with deferring the change of the internal operator API to a separate PR and leaving the abort in for now?
# -o $($@.o:.lo)

# Open MPI components can be compiled two ways:
This is especially not true for this component; it can only be built dynamically.
The operator support should only be built dynamically? @edgargabriel suggested that they should be made dynamic by default, but should we disallow static building entirely?
If I understand correctly, allowing static builds forces libompi.so to have a dependency on CUDA. This will break the build on non-CUDA machines.
The accelerator components are dynamic-by-default (#12055) but I couldn't find a similar mechanism for OMPI. We should still allow building the ops statically, for those who know what they are doing.
As soon as a component calls into libcuda (or, more precisely in this case, libcudart) it can never be built statically.
I'm not sure why that is. The OMPI library would have to be linked against libcudart, but that's possible if you build for a CUDA environment specifically. I marked the two op modules as dso-by-default now.
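For reference, the default could still be overridden at configure time; a sketch, assuming the components end up being named op-cuda and op-rocm:

# Build the accelerator op components as DSOs (the proposed default):
./configure --with-cuda=/usr/local/cuda --enable-mca-dso=op-cuda,op-rocm ...

# Or pull them into the main library, accepting the libcudart dependency:
./configure --with-cuda=/usr/local/cuda --enable-mca-static=op-cuda,op-rocm ...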
Just for the sake of it, I built ompi/main with CUDA from scratch, and now the dependency on libcudart exists everywhere, including ompi_info.
Yes, this is broken on main; this branch doesn't change that.
ompi/mca/op/cuda/op_cuda_impl.cu
Outdated
const int stride = blockDim.x * gridDim.x;              \
for (int i = index; i < n/vlen; i += stride) {          \
    vtype vin = ((vtype*)in)[i];                        \
    vtype vinout = ((vtype*)inout)[i];                  \
Why don't you use the templated op defined earlier in the file? Or, if you don't need it, you should remove it.
I am reworking the vectorization to make it more flexible and to avoid some of the stuff I had to do to map the fixed-size integers onto vectors of variable-size integers.
I reworked the vectorization with a custom type and some template work. The goal now is to consistently have 128-bit loads and stores.
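A minimal sketch of the idea, assuming a helper vector type whose element count is derived from the element size so that each access is 128 bits wide (names are illustrative, not the actual types in the PR):

// Hypothetical fixed-width vector: pack as many T as fit into 16 bytes so
// that a load or store of one vec<T> is a single 128-bit access.
template <typename T>
struct alignas(16) vec {
    static constexpr int len = 16 / sizeof(T);
    T data[len];
};

template <typename T, typename Op>
__global__ void reduce_vec(const T *in, T *inout, int n, Op op) {
    using V = vec<T>;
    const int index  = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n / V::len; i += stride) {
        V vin    = reinterpret_cast<const V*>(in)[i];  // 128-bit load
        V vinout = reinterpret_cast<V*>(inout)[i];     // 128-bit load
        for (int j = 0; j < V::len; ++j) {
            vinout.data[j] = op(vin.data[j], vinout.data[j]);
        }
        reinterpret_cast<V*>(inout)[i] = vinout;       // 128-bit store
    }
    // the tail (n % V::len elements) would be handled element-wise
}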
/** Function pointers for all the different datatypes to be used
    with the MPI_Op that this module is used with */
ompi_op_base_handler_fn_1_0_0_t opm_fns[OMPI_OP_BASE_TYPE_MAX];
ompi_op_base_3buff_handler_fn_1_0_0_t opm_3buff_fns[OMPI_OP_BASE_TYPE_MAX];
union {
Overly complicated, but I can't think of anything significantly better right now.
Force-pushed from 3ab3371 to 3e4425d.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wgnu-zero-variadic-macro-arguments"

static inline void device_op_pre(const void *orig_source1,
@bosilca @edgargabriel If device_op_pre and device_op_post use the accelerator framework, they are pretty much independent of the model (minus the last two lines). I wonder whether they should be moved to a header in base/ and shared between the two implementations. The last two lines can be taken out and put into the op macro from where they are called.
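A rough sketch of what a shared helper in base/ could look like, built only on the accelerator framework; the exact framework signatures and the error handling are simplified here, and the model-specific tail stays in the callers:

/* Hypothetical shared staging helper: ensure the input buffer is usable on
 * target_device, allocating and copying through the accelerator framework
 * if it is not already resident there. */
static inline void device_op_pre_stage(const void *orig_source, void **source,
                                       size_t nbytes, int target_device,
                                       opal_accelerator_stream_t *stream)
{
    uint64_t flags = 0;
    int source_device = MCA_ACCELERATOR_NO_DEVICE_ID;
    int rc = opal_accelerator.check_addr(orig_source, &source_device, &flags);
    if (rc > 0 && source_device == target_device) {
        *source = (void *)orig_source;  /* already on the target device */
        return;
    }
    /* stage onto the target device; errors ignored here for brevity */
    opal_accelerator.mem_alloc_stream(target_device, source, nbytes, stream);
    opal_accelerator.mem_copy_async(target_device, source_device, *source,
                                    orig_source, nbytes, stream,
                                    MCA_ACCELERATOR_TRANSFER_UNSPEC);
}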
Force-pushed from 3d1ef9c to 6a85957.
Force-pushed from f689d6d to 25c24c9.
Let me add a generic comment here, mostly as a reminder to self: I don't think this is how we should use these op modules, especially not with accelerators. In my vision we decide once and for all, for each operation (or collective), which MPI_Op implementation we will use, and we stay with it for the entire duration. First, because there is no reason to execute half of the MPI_Op on the host and the other half on the device; it is all or none. Second, because we definitely don't want to start each kernel independently; the overhead will be just too costly, annihilating most of the benefits.

Instead, once we start a collective, we would start a "service" bound to a specific context (GPU or CPU), and this service will remain active for as long as we are in a collective that needs a GPU op, removing all costs related to kernel submission. The GPU threads will poll a well-defined memory location for work updates, and the CPU will post new ops into this queue. The only drawback I can see is that the service will take resources from the application, but this loss is very small, as a single SM (or two) is more than enough to saturate the network bandwidth. Once we are outside collectives requiring the GPU op, we can release these resources back to the application.
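A very rough sketch of the "service" idea as a persistent kernel polling a host-visible work slot; the descriptor layout, the single-slot queue, and the hard-coded float sum are all illustrative assumptions:

// Hypothetical work descriptor, placed in pinned/mapped host memory so the
// CPU can post work and the GPU can poll it without new kernel launches.
struct op_work {
    volatile int ready;   // 0 = empty, 1 = work posted, -1 = shut down
    const float *in;
    float *inout;
    int count;
};

// Persistent "service" kernel: one block stays resident, waits for work,
// applies the reduction, and hands the slot back to the CPU.
__global__ void op_service(op_work *slot) {
    for (;;) {
        if (0 == threadIdx.x) {
            while (0 == slot->ready) { /* spin until the CPU posts work */ }
        }
        __syncthreads();
        if (slot->ready < 0) break;               // shutdown requested
        for (int i = threadIdx.x; i < slot->count; i += blockDim.x) {
            slot->inout[i] += slot->in[i];        // example: MPI_SUM on float
        }
        __threadfence_system();                   // make results host-visible
        __syncthreads();
        if (0 == threadIdx.x) {
            slot->ready = 0;                      // mark the slot free again
        }
        __syncthreads();
    }
}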
The CUDA test fails because we detect CUDA but NVCC is not available (at least in …).
Force-pushed from 4d73198 to a4a84f5.
I updated the PR to have precious variables …
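For context, a sketch of how precious variables are typically declared in a configure.m4 fragment (the exact variables the PR declares may differ):

# Declaring NVCC/NVCCFLAGS as precious lets users pin them on the configure
# command line and makes autoconf record and re-check them on reconfigure.
AC_ARG_VAR([NVCC], [nvcc compiler to use for building CUDA kernels])
AC_ARG_VAR([NVCCFLAGS], [flags passed to nvcc when building CUDA kernels])
AS_IF([test -z "$NVCC"],
      [AC_PATH_PROG([NVCC], [nvcc], [])])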
Force-pushed from 82d0090 to 360e7a9.
The operators are generated from macros. Function pointers to kernel launch functions are stored inside the ompi_op_t as a pointer to a struct that is filled if accelerator support is available. The ompi_op* API is extended to include versions taking streams and device IDs to allow enqueuing operators on streams. The old functions map to the stream versions with a NULL stream.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
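A minimal sketch of how the old entry point can forward to the stream variant (the stream function name and the trailing parameters are assumptions based on the description above):

/* Hypothetical forwarding wrapper: legacy callers keep the old signature and
 * implicitly run with no device and a NULL stream. */
static inline void ompi_op_reduce(ompi_op_t *op, void *source, void *target,
                                  size_t count, struct ompi_datatype_t *dtype)
{
    ompi_op_reduce_stream(op, source, target, count, dtype,
                          MCA_ACCELERATOR_NO_DEVICE_ID, /* device */
                          NULL);                        /* stream */
}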
CUDA provides only limited vector widths and only for variable-width integer types. We use our own vector type and some C++ templates to get more flexible vectors. We aim to get 128-bit loads by adjusting the width based on the type size.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Force-pushed from 360e7a9 to 71f10c9.
Force-pushed from 71f10c9 to c200c02.
# try to find nvcc in PATH
[AC_PATH_PROG([NVCC], [nvcc], [])])

# disable ussage of NVCC if explicitly specified
Suggested change:
-# disable ussage of NVCC if explicitly specified
+# disable usage of NVCC if explicitly specified
In addition to the comments I left on the PR, I have one issue with the lazy initialization part. In general it was a good idea to delay the expensive but necessary initialization until we know we need it. Fair. However, here we don't even know we can support it, so that module will always be loaded and in our way. Basically, we have no way of removing it if the lazy initialization fails.
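One way to make a failed lazy initialization stick would be a tri-state guard, so that after the first failure the module immediately reports itself unavailable and callers fall back to the host operators; a sketch with illustrative names:

#include <stdbool.h>
#include <cuda.h>

/* Hypothetical guard: -1 = not tried yet, 0 = failed, 1 = available. */
static int op_cuda_ready = -1;

static bool op_cuda_available(void)
{
    if (-1 == op_cuda_ready) {
        int num_devices = 0;
        if (CUDA_SUCCESS != cuInit(0) ||
            CUDA_SUCCESS != cuDeviceGetCount(&num_devices) ||
            0 == num_devices) {
            op_cuda_ready = 0;   /* failed once, stays disabled */
        } else {
            op_cuda_ready = 1;
        }
    }
    return 1 == op_cuda_ready;
}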
int num_devices;
int rc;
// TODO: is this init needed here?
cuInit(0);
I think this part should be done only once for all CUDA-related components. We might need to move it into the common component.
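A minimal sketch of a shared, once-only initialization that CUDA-related components could call instead of each invoking cuInit themselves (placement and names are assumptions):

#include <pthread.h>
#include <cuda.h>

/* Hypothetical shared guard: cuInit() runs exactly once per process no
 * matter how many components call into this helper. */
static pthread_once_t cuda_init_once = PTHREAD_ONCE_INIT;
static CUresult cuda_init_result = CUDA_ERROR_NOT_INITIALIZED;

static void cuda_do_init(void)
{
    cuda_init_result = cuInit(0);
}

static inline CUresult common_cuda_init(void)
{
    pthread_once(&cuda_init_once, cuda_do_init);
    return cuda_init_result;
}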
} else {
    /* copy from one device to another device */
    /* TODO: does this actually work? Can we enable P2P? */
    CHECK(cuMemcpyDtoDAsync, ((CUdeviceptr)*source2, (CUdeviceptr)orig_source2, nbytes, *(CUstream*)stream->stream));
Thinking more about this, I realized that this entire logic needs to be changed. I see three cases:
- data located on a different GPU belonging to the same process: manually copying the data upfront is bad for performance; GPUs are really good at doing this automatically, especially within the same process.
- data located on a different GPU belonging to a different process: we don't cover that case yet, as it will require different reduction algorithms (this capability would remove one explicit communication).
- data located in main memory: here we only need to copy explicitly if the GPU does not have direct access to the data. We can determine this using the VMM patch that made it into main a few days ago.
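For the first case, a sketch of how direct peer access could be checked with the driver API before falling back to the staging copy shown in the diff above; source_ctx and the surrounding variables are assumptions about the calling code, not part of the PR:

/* Hypothetical helper: can dst_dev read src_dev's memory directly? */
static int device_can_access_peer(CUdevice dst_dev, CUdevice src_dev)
{
    int can_access = 0;
    if (CUDA_SUCCESS != cuDeviceCanAccessPeer(&can_access, dst_dev, src_dev)) {
        return 0;
    }
    return can_access;
}

/* At the call site: use the remote buffer in place if peer access works. */
if (device_can_access_peer(target_device, source_device)) {
    cuCtxEnablePeerAccess(source_ctx, 0);  /* enable P2P once per context pair */
    *source2 = (void *)orig_source2;
} else {
    /* no peer access: fall back to the staging copy */
    CHECK(cuMemcpyDtoDAsync, ((CUdeviceptr)*source2, (CUdeviceptr)orig_source2,
                              nbytes, *(CUstream*)stream->stream));
}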
As I said earlier: please provide a patch for the changes you want. I've run out of time to spend on this.
This is the second part of #12318, which provides the device-side reduction operators and adds stream semantics to ompi_op_reduce.

As usual, the operators are generated from macros. Function pointers to kernel launch functions are stored inside the ompi_op_t as a pointer to a struct that is filled if accelerator support is available.

There are two pieces to the cuda/hip implementation:

Currently not supported are short float and long double, since they are either not supported everywhere or not standardized. I hope I caught all other types, including pair types for loc functions. Since the implementations are agnostic of OMPI/OPAL headers, the code has to map the Fortran types to C types in the implementation.

The device_op_pre and device_op_post functions are there to set up the environment for the kernel, including allocating memory on the device if one of the inputs is not on the chosen device. Operators cannot return an error, so whatever the caller feeds us we have to eat. Not pretty, but hopefully better than aborting.

This branch requires #12356. I will rebase once that is merged.
Questions: