Merge pull request #1665 from LLNL/feature/burmark1/multireduce
Add MultiReducer
MrBurmark authored Jul 12, 2024
2 parents 25b4f0d + 758d065 commit c1cffa9
Showing 73 changed files with 7,487 additions and 448 deletions.
1 change: 1 addition & 0 deletions docs/sphinx/user_guide/cook_book.rst
@@ -20,4 +20,5 @@ to provide users with usage examples beyond what can be found in
:maxdepth: 2

cook_book/reduction
cook_book/multi-reduction

160 changes: 160 additions & 0 deletions docs/sphinx/user_guide/cook_book/multi-reduction.rst
@@ -0,0 +1,160 @@
.. ##
.. ## Copyright (c) 2016-24, Lawrence Livermore National Security, LLC
.. ## and other RAJA project contributors. See the RAJA/LICENSE file
.. ## for details.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##
.. _cook-book-multi-reductions-label:

============================
Cooking with MultiReductions
============================

Please see the following section for an overview of RAJA multi-reductions:

* :ref:`feat-multi-reductions-label`.


---------------------------------
MultiReductions with RAJA::forall
---------------------------------

Here is the setup for a simple multi-reduction example::

const int N = 1000;
const int num_bins = 10;

int vec[N];
int bins[N];

for (int i = 0; i < N; ++i) {

vec[i] = 1;
bins[i] = i % num_bins;

}

Here is a simple sum multi-reduction performed in a C-style for-loop::

int vsum[num_bins] {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

// Run a kernel using the multi-reduction objects
for (int i = 0; i < N; ++i) {

vsum[bins[i]] += vec[i];

}

The results of these operations will yield the following values:

* ``vsum[0] == 100``
* ``vsum[1] == 100``
* ``vsum[2] == 100``
* ``vsum[3] == 100``
* ``vsum[4] == 100``
* ``vsum[5] == 100``
* ``vsum[6] == 100``
* ``vsum[7] == 100``
* ``vsum[8] == 100``
* ``vsum[9] == 100``

RAJA uses policy types to specify how kernels and multi-reductions are implemented.

The forall *execution policy* specifies how the loop is run by the ``RAJA::forall``
method. For example, ``RAJA::seq_exec`` runs a C-style for-loop sequentially on a
CPU, while ``RAJA::cuda_exec_with_reduce<256>`` runs the operation as a CUDA GPU
kernel with 256 threads per block and other CUDA kernel launch parameters, such as
the number of blocks, chosen for performance with multi-reducers. Here are several
execution policies that could be applied::

using exec_policy = RAJA::seq_exec;
// using exec_policy = RAJA::omp_parallel_for_exec;
// using exec_policy = RAJA::cuda_exec_with_reduce<256>;
// using exec_policy = RAJA::hip_exec_with_reduce<256>;

The multi-reduction policy specifies how the multi-reduction is performed and must
be compatible with the execution policy. For example, ``RAJA::seq_multi_reduce``
does a sequential multi-reduction and can only be used with sequential execution
policies, while the ``RAJA::cuda_multi_reduce_atomic`` policy uses atomics and can
only be used with CUDA execution policies. The same pattern holds for other RAJA
execution back-ends, such as HIP and OpenMP. Here are example RAJA multi-reduction
policies whose names indicate which execution policies they work with::

using multi_reduce_policy = RAJA::seq_multi_reduce;
// using multi_reduce_policy = RAJA::omp_multi_reduce;
// using multi_reduce_policy = RAJA::cuda_multi_reduce_atomic;
// using multi_reduce_policy = RAJA::hip_multi_reduce_atomic;

Here a simple sum multi-reduction is performed using RAJA::

RAJA::MultiReduceSum<multi_reduce_policy, int> vsum(num_bins, 0);

RAJA::forall<exec_policy>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

vsum[bins[i]] += vec[i];

});

The results of these operations will yield the following values:

* ``vsum[0].get() == 100``
* ``vsum[1].get() == 100``
* ``vsum[2].get() == 100``
* ``vsum[3].get() == 100``
* ``vsum[4].get() == 100``
* ``vsum[5].get() == 100``
* ``vsum[6].get() == 100``
* ``vsum[7].get() == 100``
* ``vsum[8].get() == 100``
* ``vsum[9].get() == 100``

Another option for the execution policy when using the CUDA or HIP back-ends is
the base policy, which has a boolean parameter to choose between the general-use
``cuda/hip_exec`` policy and the ``cuda/hip_exec_with_reduce`` policy::

// static constexpr bool with_reduce = ...;
// using exec_policy = RAJA::cuda_exec_base<with_reduce, 256>;
// using exec_policy = RAJA::hip_exec_base<with_reduce, 256>;


---------------------------
Rarely Used MultiReductions
---------------------------

Multi-reductions consume resources even if they are not used in a loop kernel.
For example, if a multi-reducer is used only conditionally to set an error flag,
then even when the multi-reduction is not exercised at runtime, its setup and
finalization are still done and its resources are still allocated and deallocated.
To minimize these overheads, some back-ends have special policies that reduce the
amount of work the multi-reducer does when it is compiled into a loop kernel but
not used at runtime. Here are example RAJA multi-reduction policies that have
minimal overhead::

using rarely_used_multi_reduce_policy = RAJA::seq_multi_reduce;
// using rarely_used_multi_reduce_policy = RAJA::omp_multi_reduce;
// using rarely_used_multi_reduce_policy = RAJA::cuda_multi_reduce_atomic_low_performance_low_overhead;
// using rarely_used_multi_reduce_policy = RAJA::hip_multi_reduce_atomic_low_performance_low_overhead;

Here is a simple rarely used bitwise-or multi-reduction performed using RAJA::

RAJA::MultiReduceBitOr<rarely_used_multi_reduce_policy, int> vor(num_bins, 0);

RAJA::forall<exec_policy>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

if (vec[i] < 0) {
vor[0] |= 1;
}

});

The results of these operations will yield the following value if the condition
is never met:

* ``vor[0].get() == 0``

or yield the following value if the condition is ever met:

* ``vor[0].get() == 1``
227 changes: 227 additions & 0 deletions docs/sphinx/user_guide/feature/multi-reduction.rst
@@ -0,0 +1,227 @@
.. ##
.. ## Copyright (c) 2016-24, Lawrence Livermore National Security, LLC
.. ## and other RAJA project contributors. See the RAJA/LICENSE file
.. ## for details.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##
.. _feat-multi-reductions-label:

=========================
MultiReduction Operations
=========================

RAJA provides multi-reduction types that allow users to perform a runtime number
of reduction operations in kernels launched using ``RAJA::forall``, ``RAJA::kernel``,
and ``RAJA::launch`` methods in a portable, thread-safe manner. Users may
use as many multi-reduction objects in a loop kernel as they need. If a small
fixed number of reductions is required in a loop kernel then standard RAJA reduction objects can be
used. Available RAJA multi-reduction types are described in this section.

.. note:: All RAJA multi-reduction types are located in the namespace ``RAJA``.


.. note:: * Each RAJA multi-reduction type is templated on a **multi-reduction policy**
and a **reduction value type** for the multi-reduction variable. The
**multi-reduction policy type must be compatible with the execution
policy used by the kernel in which it is used.** For example, in
a CUDA kernel, a CUDA multi-reduction policy must be used.
* Each RAJA multi-reduction type accepts an **initial reduction value or
values** at construction (see below).
* Each RAJA multi-reduction type has a 'get' method to access reduced
values after kernel execution completes.

Please see the following section for a description of standard RAJA reducers:

* :ref:`feat-reductions-label`.

Please see the following cook book section for guidance on policy usage:

* :ref:`cook-book-multi-reductions-label`.


--------------------
MultiReduction Types
--------------------

RAJA supports three common multi-reduction types:

* ``MultiReduceSum< multi_reduce_policy, data_type >`` - Sum of values.

* ``MultiReduceMin< multi_reduce_policy, data_type >`` - Min value.

* ``MultiReduceMax< multi_reduce_policy, data_type >`` - Max value.

and two less common bitwise multi-reduction types:

* ``MultiReduceBitAnd< multi_reduce_policy, data_type >`` - Bitwise 'and' of values (i.e., ``a & b``).

* ``MultiReduceBitOr< multi_reduce_policy, data_type >`` - Bitwise 'or' of values (i.e., ``a | b``).

.. note:: ``RAJA::MultiReduceBitAnd`` and ``RAJA::MultiReduceBitOr`` reduction types are designed to work on integral data types because **in C++, at the language level, there is no such thing as a bitwise operator on floating-point numbers.**

-----------------------
MultiReduction Examples
-----------------------

Next, we provide a few examples to illustrate basic usage of RAJA multi-reduction
types.

Here is a simple RAJA multi-reduction example that shows how to use a sum
multi-reduction type::

const int N = 1000;
const int B = 10;

//
// Initialize an array of length N with all ones, and another array to
// integers between 0 and B-1
//
int vec[N];
int bins[N];
for (int i = 0; i < N; ++i) {
vec[i] = 1;
bins[i] = i % B;
}

// Create a sum multi-reduction object with a size of B, and initial
// values of zero
RAJA::MultiReduceSum< RAJA::omp_multi_reduce, int > vsum(B, 0);

// Run a kernel using the multi-reduction object
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

vsum[bins[i]] += vec[i];

});

// After kernel is run, extract the reduced values
int my_vsums[B];
for (int bin = 0; bin < B; ++bin) {
my_vsums[bin] = vsum[bin].get();
}

The results of these operations will yield the following values:

* ``my_vsums[0] == 100``
* ``my_vsums[1] == 100``
* ``my_vsums[2] == 100``
* ``my_vsums[3] == 100``
* ``my_vsums[4] == 100``
* ``my_vsums[5] == 100``
* ``my_vsums[6] == 100``
* ``my_vsums[7] == 100``
* ``my_vsums[8] == 100``
* ``my_vsums[9] == 100``


Here is the same example but using values stored in a container::

const int N = 1000;
const int B = 10;

//
// Initialize an array of length N with all ones, and another array to
// integers between 0 and B-1
//
int vec[N];
int bins[N];
for (int i = 0; i < N; ++i) {
vec[i] = 1;
bins[i] = i % B;
}

// Create a vector with a size of B, and initial values of zero
std::vector<int> my_vsums(B, 0);

// Create a multi-reducer initialized with size and values from my_vsums
RAJA::MultiReduceSum< RAJA::omp_multi_reduce, int > vsum(my_vsums);

// Run a kernel using the multi-reduction object
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

vsum[bins[i]] += vec[i];

});

// After kernel is run, extract the reduced values back into my_vsums
vsum.get_all(my_vsums);

The results of these operations will yield the following values:

* ``my_vsums[0] == 100``
* ``my_vsums[1] == 100``
* ``my_vsums[2] == 100``
* ``my_vsums[3] == 100``
* ``my_vsums[4] == 100``
* ``my_vsums[5] == 100``
* ``my_vsums[6] == 100``
* ``my_vsums[7] == 100``
* ``my_vsums[8] == 100``
* ``my_vsums[9] == 100``





Here is an example of a bitwise-or multi-reduction::

const int N = 128;
const int B = 8;

//
// Initialize an array of length N to integers between 0 and B-1
//
int bins[N];
for (int i = 0; i < N; ++i) {
bins[i] = i % B;
}

// Create a bitwise-or multi-reduction object with a size of B and
// initial values of zero
RAJA::MultiReduceBitOr< RAJA::omp_multi_reduce, int > vor(B, 0);

// Run a kernel using the multi-reduction object
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

vor[bins[i]] |= i;

});

// After kernel is run, extract the reduced values
int my_vors[B];
for (int bin = 0; bin < B; ++bin) {
my_vors[bin] = vor[bin].get();
}

The results of these operations will yield the following values:

* ``my_vors[0] == 120 == 0b1111000``
* ``my_vors[1] == 121 == 0b1111001``
* ``my_vors[2] == 122 == 0b1111010``
* ``my_vors[3] == 123 == 0b1111011``
* ``my_vors[4] == 124 == 0b1111100``
* ``my_vors[5] == 125 == 0b1111101``
* ``my_vors[6] == 126 == 0b1111110``
* ``my_vors[7] == 127 == 0b1111111``

The results of the multi-reduction start at 120 and increase to 127. In binary
representation (i.e., bits), :math:`120 = 0b1111000` and :math:`127 = 0b1111111`.
The bins were chosen so that all the integers in a bin have the same remainder
modulo 8, so their last 3 binary digits are identical while their upper binary
digits vary. Because bitwise-or keeps every bit that is set in any value, the
upper bits of each result are all set, since at least one integer in each bin
sets each of them. The last 3 bits of each result equal the bin number, since
they are the same in every integer in that bin.

-----------------------
MultiReduction Policies
-----------------------

For more information about available RAJA multi-reduction policies and guidance
on which to use with RAJA execution policies, please see
:ref:`multi-reducepolicy-label`.
