Merge pull request #1665 from LLNL/feature/burmark1/multireduce
Add MultiReducer
MrBurmark authored Jul 12, 2024
2 parents 25b4f0d + 758d065 commit c1cffa9
Showing 73 changed files with 7,487 additions and 448 deletions.
1 change: 1 addition & 0 deletions docs/sphinx/user_guide/cook_book.rst
@@ -20,4 +20,5 @@ to provide users with usage examples beyond what can be found in
:maxdepth: 2

cook_book/reduction
cook_book/multi-reduction

160 changes: 160 additions & 0 deletions docs/sphinx/user_guide/cook_book/multi-reduction.rst
@@ -0,0 +1,160 @@
.. ##
.. ## Copyright (c) 2016-24, Lawrence Livermore National Security, LLC
.. ## and other RAJA project contributors. See the RAJA/LICENSE file
.. ## for details.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##
.. _cook-book-multi-reductions-label:

============================
Cooking with MultiReductions
============================

Please see the following section for an overview of RAJA multi-reductions:

* :ref:`feat-multi-reductions-label`.


---------------------------------
MultiReductions with RAJA::forall
---------------------------------

Here is the setup for a simple multi-reduction example::

const int N = 1000;
const int num_bins = 10;

int vec[N];
int bins[N];

for (int i = 0; i < N; ++i) {

vec[i] = 1;
bins[i] = i % num_bins;

}

Here is a simple sum multi-reduction performed in a C-style for-loop::

int vsum[num_bins] {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

// Run a kernel using the multi-reduction objects
for (int i = 0; i < N; ++i) {

vsum[bins[i]] += vec[i];

}

The results of these operations will yield the following values:

* ``vsum[0] == 100``
* ``vsum[1] == 100``
* ``vsum[2] == 100``
* ``vsum[3] == 100``
* ``vsum[4] == 100``
* ``vsum[5] == 100``
* ``vsum[6] == 100``
* ``vsum[7] == 100``
* ``vsum[8] == 100``
* ``vsum[9] == 100``

RAJA uses policy types to specify how kernels and multi-reductions are implemented.

The forall *execution policy* specifies how the loop is run by the ``RAJA::forall``
method. For example, ``RAJA::seq_exec`` runs a C-style for-loop sequentially on a
CPU, while ``RAJA::cuda_exec_with_reduce<256>`` runs the operation as a CUDA GPU
kernel with 256 threads per block and other CUDA kernel launch parameters, such as
the number of blocks, chosen for performance with multi-reducers. Here are several
execution policies that could be applied::

using exec_policy = RAJA::seq_exec;
// using exec_policy = RAJA::omp_parallel_for_exec;
// using exec_policy = RAJA::cuda_exec_with_reduce<256>;
// using exec_policy = RAJA::hip_exec_with_reduce<256>;

The multi-reduction policy specifies how the multi-reduction is performed and must
be compatible with the execution policy. For example, ``RAJA::seq_multi_reduce``
does a sequential multi-reduction and can only be used with sequential execution
policies, while the ``RAJA::cuda_multi_reduce_atomic`` policy uses atomics and can
only be used with CUDA execution policies. The same pattern holds for other RAJA
execution back-ends, such as HIP and OpenMP. Here are example RAJA multi-reduction
policies whose names indicate which execution policies they work with::

using multi_reduce_policy = RAJA::seq_multi_reduce;
// using multi_reduce_policy = RAJA::omp_multi_reduce;
// using multi_reduce_policy = RAJA::cuda_multi_reduce_atomic;
// using multi_reduce_policy = RAJA::hip_multi_reduce_atomic;

Here a simple sum multi-reduction is performed using RAJA::

RAJA::MultiReduceSum<multi_reduce_policy, int> vsum(num_bins, 0);

RAJA::forall<exec_policy>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

vsum[bins[i]] += vec[i];

});

The results of these operations will yield the following values:

* ``vsum[0].get() == 100``
* ``vsum[1].get() == 100``
* ``vsum[2].get() == 100``
* ``vsum[3].get() == 100``
* ``vsum[4].get() == 100``
* ``vsum[5].get() == 100``
* ``vsum[6].get() == 100``
* ``vsum[7].get() == 100``
* ``vsum[8].get() == 100``
* ``vsum[9].get() == 100``

Another option for the execution policy when using the CUDA or HIP back-ends is
the base policy, which has a boolean parameter to choose between the general-use
``cuda/hip_exec`` policy and the ``cuda/hip_exec_with_reduce`` policy::

// static constexpr bool with_reduce = ...;
// using exec_policy = RAJA::cuda_exec_base<with_reduce, 256>;
// using exec_policy = RAJA::hip_exec_base<with_reduce, 256>;


---------------------------
Rarely Used MultiReductions
---------------------------

Multi-reductions consume resources even if they are not used in a loop kernel.
For example, if a multi-reducer is used only conditionally to set an error flag,
then even when the multi-reduction is not exercised at runtime, its setup and
finalization are still done and its resources are still allocated and deallocated.
To minimize these overheads, some back-ends have special policies that reduce the
amount of work the multi-reducer does when it is compiled into a loop kernel but
not used at runtime. Here are example RAJA multi-reduction policies that have
minimal overhead::

using rarely_used_multi_reduce_policy = RAJA::seq_multi_reduce;
// using rarely_used_multi_reduce_policy = RAJA::omp_multi_reduce;
// using rarely_used_multi_reduce_policy = RAJA::cuda_multi_reduce_atomic_low_performance_low_overhead;
// using rarely_used_multi_reduce_policy = RAJA::hip_multi_reduce_atomic_low_performance_low_overhead;

Here is a simple rarely used bitwise-or multi-reduction performed using RAJA::

RAJA::MultiReduceBitOr<rarely_used_multi_reduce_policy, int> vor(num_bins, 0);

RAJA::forall<exec_policy>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

if (vec[i] < 0) {
vor[0] |= 1;
}

});

The results of these operations will yield the following value if the condition
is never met:

* ``vor[0].get() == 0``

or yield the following value if the condition is ever met:

* ``vor[0].get() == 1``
227 changes: 227 additions & 0 deletions docs/sphinx/user_guide/feature/multi-reduction.rst
@@ -0,0 +1,227 @@
.. ##
.. ## Copyright (c) 2016-24, Lawrence Livermore National Security, LLC
.. ## and other RAJA project contributors. See the RAJA/LICENSE file
.. ## for details.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##
.. _feat-multi-reductions-label:

=========================
MultiReduction Operations
=========================

RAJA provides multi-reduction types that allow users to perform a runtime number
of reduction operations in kernels launched using ``RAJA::forall``, ``RAJA::kernel``,
and ``RAJA::launch`` methods in a portable, thread-safe manner. Users may
use as many multi-reduction objects in a loop kernel as they need. If a small
fixed number of reductions is required in a loop kernel then standard RAJA reduction objects can be
used. Available RAJA multi-reduction types are described in this section.

.. note:: All RAJA multi-reduction types are located in the namespace ``RAJA``.


.. note:: * Each RAJA multi-reduction type is templated on a **multi-reduction policy**
and a **reduction value type** for the multi-reduction variable. The
**multi-reduction policy type must be compatible with the execution
policy used by the kernel in which it is used.** For example, in
a CUDA kernel, a CUDA multi-reduction policy must be used.
* Each RAJA multi-reduction type accepts an **initial reduction value or
values** at construction (see below).
* Each RAJA multi-reduction type has a 'get' method to access reduced
values after kernel execution completes.

Please see the following section for a description of standard RAJA reducers:

* :ref:`feat-reductions-label`.

Please see the following cook book section for guidance on policy usage:

* :ref:`cook-book-multi-reductions-label`.


--------------------
MultiReduction Types
--------------------

RAJA supports three common multi-reduction types:

* ``MultiReduceSum< multi_reduce_policy, data_type >`` - Sum of values.

* ``MultiReduceMin< multi_reduce_policy, data_type >`` - Min value.

* ``MultiReduceMax< multi_reduce_policy, data_type >`` - Max value.

and two less common bitwise multi-reduction types:

* ``MultiReduceBitAnd< multi_reduce_policy, data_type >`` - Bitwise 'and' of values (i.e., ``a & b``).

* ``MultiReduceBitOr< multi_reduce_policy, data_type >`` - Bitwise 'or' of values (i.e., ``a | b``).

.. note:: ``RAJA::MultiReduceBitAnd`` and ``RAJA::MultiReduceBitOr`` reduction types are designed to work on integral data types because **in C++, at the language level, there is no such thing as a bitwise operator on floating-point numbers.**

-----------------------
MultiReduction Examples
-----------------------

Next, we provide a few examples to illustrate basic usage of RAJA multi-reduction
types.

Here is a simple RAJA multi-reduction example that shows how to use a sum
multi-reduction type::

const int N = 1000;
const int B = 10;

//
// Initialize an array of length N with all ones, and another array to
// integers between 0 and B-1
//
int vec[N];
int bins[N];
for (int i = 0; i < N; ++i) {
vec[i] = 1;
bins[i] = i % B;
}

// Create a sum multi-reduction object with a size of B, and initial
// values of zero
RAJA::MultiReduceSum< RAJA::omp_multi_reduce, int > vsum(B, 0);

// Run a kernel using the multi-reduction object
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

vsum[bins[i]] += vec[i];

});

// After kernel is run, extract the reduced values
int my_vsums[B];
for (int bin = 0; bin < B; ++bin) {
my_vsums[bin] = vsum[bin].get();
}

The results of these operations will yield the following values:

* ``my_vsums[0] == 100``
* ``my_vsums[1] == 100``
* ``my_vsums[2] == 100``
* ``my_vsums[3] == 100``
* ``my_vsums[4] == 100``
* ``my_vsums[5] == 100``
* ``my_vsums[6] == 100``
* ``my_vsums[7] == 100``
* ``my_vsums[8] == 100``
* ``my_vsums[9] == 100``


Here is the same example but using values stored in a container::

const int N = 1000;
const int B = 10;

//
// Initialize an array of length N with all ones, and another array to
// integers between 0 and B-1
//
int vec[N];
int bins[N];
for (int i = 0; i < N; ++i) {
vec[i] = 1;
bins[i] = i % B;
}

// Create a vector with a size of B, and initial values of zero
std::vector<int> my_vsums(B, 0);

// Create a multi-reducer initialized with size and values from my_vsums
RAJA::MultiReduceSum< RAJA::omp_multi_reduce, int > vsum(my_vsums);

// Run a kernel using the multi-reduction object
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

vsum[bins[i]] += vec[i];

});

// After kernel is run, extract the reduced values back into my_vsums
vsum.get_all(my_vsums);

The results of these operations will yield the following values:

* ``my_vsums[0] == 100``
* ``my_vsums[1] == 100``
* ``my_vsums[2] == 100``
* ``my_vsums[3] == 100``
* ``my_vsums[4] == 100``
* ``my_vsums[5] == 100``
* ``my_vsums[6] == 100``
* ``my_vsums[7] == 100``
* ``my_vsums[8] == 100``
* ``my_vsums[9] == 100``





Here is an example of a bitwise-or multi-reduction::

const int N = 128;
const int B = 8;

//
// Initialize an array of length N to integers between 0 and B-1
//
int bins[N];
for (int i = 0; i < N; ++i) {
bins[i] = i % B;
}

// Create a bitwise-or multi-reduction object with a size of B and
// initial values of zero
RAJA::MultiReduceBitOr< RAJA::omp_multi_reduce, int > vor(B, 0);

// Run a kernel using the multi-reduction object
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N),
[=](RAJA::Index_type i) {

vor[bins[i]] |= i;

});

// After kernel is run, extract the reduced values
int my_vors[B];
for (int bin = 0; bin < B; ++bin) {
my_vors[bin] = vor[bin].get();
}

The results of these operations will yield the following values:

* ``my_vors[0] == 120 == 0b1111000``
* ``my_vors[1] == 121 == 0b1111001``
* ``my_vors[2] == 122 == 0b1111010``
* ``my_vors[3] == 123 == 0b1111011``
* ``my_vors[4] == 124 == 0b1111100``
* ``my_vors[5] == 125 == 0b1111101``
* ``my_vors[6] == 126 == 0b1111110``
* ``my_vors[7] == 127 == 0b1111111``

The results of the multi-reduction start at 120 and increase to 127. In binary
representation (i.e., bits), :math:`120 = 0b1111000` and :math:`127 = 0b1111111`.
The bins were chosen so that all the integers in a bin have the same remainder
modulo 8, so their last 3 binary digits are identical while their upper binary
digits vary. Because bitwise-or keeps every bit that is set in any value, the
upper bits of each result are all set, since at least one integer in each bin
sets each of them. The last 3 bits of each result equal the bin number, since
they are the same in every integer in that bin.

-----------------------
MultiReduction Policies
-----------------------

For more information about available RAJA multi-reduction policies and guidance
on which to use with RAJA execution policies, please see
:ref:`multi-reducepolicy-label`.
