Merge pull request #1665 from LLNL/feature/burmark1/multireduce
Add MultiReducer
Showing 73 changed files with 7,487 additions and 448 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,160 @@ | ||
.. ## | ||
.. ## Copyright (c) 2016-24, Lawrence Livermore National Security, LLC | ||
.. ## and other RAJA project contributors. See the RAJA/LICENSE file | ||
.. ## for details. | ||
.. ## | ||
.. ## SPDX-License-Identifier: (BSD-3-Clause) | ||
.. ## | ||
.. _cook-book-multi-reductions-label: | ||
|
||
============================ | ||
Cooking with MultiReductions | ||
============================ | ||
|
||
Please see the following section for an overview of RAJA multi-reductions: | ||
|
||
* :ref:`feat-multi-reductions-label`. | ||
|
||
|
||
--------------------------------- | ||
MultiReductions with RAJA::forall | ||
--------------------------------- | ||
|
||
Here is the setup for a simple multi-reduction example:: | ||
|
||
const int N = 1000; | ||
const int num_bins = 10; | ||
|
||
int vec[N]; | ||
int bins[N]; | ||
|
||
for (int i = 0; i < N; ++i) { | ||
|
||
vec[i] = 1; | ||
bins[i] = i % num_bins; | ||
|
||
} | ||
|
||
Here is a simple sum multi-reduction performed in a C-style for-loop:: | ||
|
||
int vsum[num_bins] {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; | ||
|
||
// Accumulate each value into the sum for its bin | ||
for (int i = 0; i < N; ++i) { | ||
|
||
vsum[bins[i]] += vec[i]; | ||
|
||
} | ||
|
||
The results of these operations will yield the following values: | ||
|
||
* ``vsum[0] == 100`` | ||
* ``vsum[1] == 100`` | ||
* ``vsum[2] == 100`` | ||
* ``vsum[3] == 100`` | ||
* ``vsum[4] == 100`` | ||
* ``vsum[5] == 100`` | ||
* ``vsum[6] == 100`` | ||
* ``vsum[7] == 100`` | ||
* ``vsum[8] == 100`` | ||
* ``vsum[9] == 100`` | ||
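
For readers who want to run the sequential baseline end to end, the setup and loop above can be wrapped into one small function. This is a minimal sketch in plain C++ with no RAJA dependency; the name ``bin_sums`` is a hypothetical helper introduced here for illustration only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of RAJA): builds the same inputs as the
// setup above and sums the values that fall into each bin.
std::vector<int> bin_sums(int n, int num_bins)
{
  assert(n >= 0 && num_bins > 0);

  std::vector<int> vec(n, 1);      // every element contributes 1
  std::vector<int> bins(n);
  for (int i = 0; i < n; ++i) {
    bins[i] = i % num_bins;        // round-robin bin assignment
  }

  std::vector<int> vsum(num_bins, 0);
  for (int i = 0; i < n; ++i) {
    vsum[bins[i]] += vec[i];       // one running sum per bin
  }
  return vsum;
}
```

Calling ``bin_sums(1000, 10)`` returns ten entries, each equal to 100, because the 1000 ones are spread evenly over the 10 bins.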
|
||
RAJA uses policy types to specify how things are implemented. | ||
|
||
The forall *execution policy* specifies how the loop is run by the ``RAJA::forall`` method. The following discussion includes examples of several RAJA execution policies that could be applied. | ||
For example, ``RAJA::seq_exec`` runs a C-style for-loop sequentially on a CPU. The | ||
``RAJA::cuda_exec_with_reduce<256>`` policy runs the operation as a CUDA GPU kernel with | ||
256 threads per block and with other CUDA kernel launch parameters, such as the | ||
number of blocks, optimized for performance with multi-reducers:: | ||
|
||
using exec_policy = RAJA::seq_exec; | ||
// using exec_policy = RAJA::omp_parallel_for_exec; | ||
// using exec_policy = RAJA::cuda_exec_with_reduce<256>; | ||
// using exec_policy = RAJA::hip_exec_with_reduce<256>; | ||
|
||
The multi-reduction policy specifies how the multi-reduction is done and must be compatible with the | ||
execution policy. For example, ``RAJA::seq_multi_reduce`` does a sequential multi-reduction | ||
and can only be used with sequential execution policies. The | ||
``RAJA::cuda_multi_reduce_atomic`` policy uses atomics and can only be used with | ||
CUDA execution policies. The same applies to other RAJA execution back-ends, such as | ||
HIP and OpenMP. Here are example RAJA multi-reduction policies whose names | ||
indicate which execution policies they work with:: | ||
|
||
using multi_reduce_policy = RAJA::seq_multi_reduce; | ||
// using multi_reduce_policy = RAJA::omp_multi_reduce; | ||
// using multi_reduce_policy = RAJA::cuda_multi_reduce_atomic; | ||
// using multi_reduce_policy = RAJA::hip_multi_reduce_atomic; | ||
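
To give a feel for what an atomics-based multi-reduction does, here is a conceptual sketch in standard C++. It is not RAJA's implementation, and the function ``atomic_bin_add`` is a hypothetical name used only for this illustration. Each update is an atomic read-modify-write on a shared array of bin totals, so several threads could safely process disjoint index ranges at the same time.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Conceptual sketch only; not RAJA's implementation.
// Adds vec[i] into totals[bins[i]] for i in [begin, end) using atomic
// read-modify-write operations, so several threads may call this function
// concurrently on disjoint index ranges while sharing one totals array.
void atomic_bin_add(std::vector<std::atomic<int>>& totals,
                    const std::vector<int>& vec,
                    const std::vector<int>& bins,
                    std::size_t begin, std::size_t end)
{
  assert(begin <= end && end <= vec.size() && vec.size() == bins.size());
  for (std::size_t i = begin; i < end; ++i) {
    assert(bins[i] >= 0 && static_cast<std::size_t>(bins[i]) < totals.size());
    // Relaxed ordering is enough for a pure accumulation whose result is
    // read only after all contributing threads have finished.
    totals[bins[i]].fetch_add(vec[i], std::memory_order_relaxed);
  }
}
```

A sequential policy would instead use a plain ``+=`` on ordinary integers, which is cheaper but unsafe when updates can overlap. That difference is one reason the multi-reduction policy has to match the execution policy.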
|
||
Here a simple sum multi-reduction is performed using RAJA:: | ||
|
||
RAJA::MultiReduceSum<multi_reduce_policy, int> vsum(num_bins, 0); | ||
|
||
RAJA::forall<exec_policy>( RAJA::RangeSegment(0, N), | ||
[=](RAJA::Index_type i) { | ||
|
||
vsum[bins[i]] += vec[i]; | ||
|
||
}); | ||
|
||
The results of these operations will yield the following values: | ||
|
||
* ``vsum[0].get() == 100`` | ||
* ``vsum[1].get() == 100`` | ||
* ``vsum[2].get() == 100`` | ||
* ``vsum[3].get() == 100`` | ||
* ``vsum[4].get() == 100`` | ||
* ``vsum[5].get() == 100`` | ||
* ``vsum[6].get() == 100`` | ||
* ``vsum[7].get() == 100`` | ||
* ``vsum[8].get() == 100`` | ||
* ``vsum[9].get() == 100`` | ||
|
||
Another option for the execution policy when using the CUDA or HIP backends is | ||
to use the base policies, which take a boolean parameter that chooses between the general-use | ||
``cuda/hip_exec`` policy and the ``cuda/hip_exec_with_reduce`` policy:: | ||
|
||
// static constexpr bool with_reduce = ...; | ||
// using exec_policy = RAJA::cuda_exec_base<with_reduce, 256>; | ||
// using exec_policy = RAJA::hip_exec_base<with_reduce, 256>; | ||
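
The idea of selecting a policy with a compile-time boolean can be sketched in plain C++ as follows. All type names here (``general_exec``, ``exec_with_reduce``, ``exec_base``) are hypothetical stand-ins, not RAJA's types, and the sketch assumes that ``true`` selects the reduction-tuned variant.

```cpp
#include <type_traits>

// Hypothetical stand-in tag types; these are NOT RAJA's policy types.
template <int BlockSize> struct general_exec {};
template <int BlockSize> struct exec_with_reduce {};

// Stand-in "base" alias: the boolean picks which concrete policy is used.
template <bool WithReduce, int BlockSize>
using exec_base = std::conditional_t<WithReduce,
                                     exec_with_reduce<BlockSize>,
                                     general_exec<BlockSize>>;

// One switch controls the choice wherever exec_policy is used.
static constexpr bool with_reduce = true;
using exec_policy = exec_base<with_reduce, 256>;

static_assert(std::is_same<exec_policy, exec_with_reduce<256>>::value,
              "with_reduce == true selects the reduction-tuned policy");
```

Keeping the choice in a single ``constexpr`` boolean lets one code path switch between the two variants without editing each kernel.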
|
||
|
||
--------------------------- | ||
Rarely Used MultiReductions | ||
--------------------------- | ||
|
||
Multi-reductions consume resources even if they are not used in a | ||
loop kernel. For example, if a multi-reducer is used only conditionally to set an error flag, | ||
the setup and finalization for the multi-reduction are still performed, and its resources are | ||
still allocated and deallocated, even if the multi-reducer is never touched at runtime. | ||
To reduce this overhead, some backends provide | ||
special policies that minimize the work the multi-reducer does when | ||
it is compiled into a loop kernel but not used at runtime. | ||
Here are example RAJA multi-reduction policies that have minimal overhead:: | ||
|
||
using rarely_used_multi_reduce_policy = RAJA::seq_multi_reduce; | ||
// using rarely_used_multi_reduce_policy = RAJA::omp_multi_reduce; | ||
// using rarely_used_multi_reduce_policy = RAJA::cuda_multi_reduce_atomic_low_performance_low_overhead; | ||
// using rarely_used_multi_reduce_policy = RAJA::hip_multi_reduce_atomic_low_performance_low_overhead; | ||
|
||
Here is a simple rarely used bitwise-or multi-reduction performed using RAJA:: | ||
|
||
RAJA::MultiReduceBitOr<rarely_used_multi_reduce_policy, int> vor(num_bins, 0); | ||
|
||
RAJA::forall<exec_policy>( RAJA::RangeSegment(0, N), | ||
[=](RAJA::Index_type i) { | ||
|
||
if (vec[i] < 0) { | ||
vor[0] |= 1; | ||
} | ||
|
||
}); | ||
|
||
The results of these operations will yield the following value if the condition | ||
is never met: | ||
|
||
* ``vor[0].get() == 0`` | ||
|
||
or yield the following value if the condition is ever met: | ||
|
||
* ``vor[0].get() == 1`` |
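
Both outcomes can be checked with a sequential stand-in for the kernel above. This is a minimal sketch in plain C++; ``negative_value_flags`` is a hypothetical helper and not part of RAJA.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of RAJA): sequential equivalent of the
// kernel above. Bin 0 acts as an error flag that is set only if a
// negative value is encountered; the other bins are left untouched.
std::vector<int> negative_value_flags(const std::vector<int>& vec, int num_bins)
{
  assert(num_bins > 0);
  std::vector<int> vor(num_bins, 0);   // 0 is the identity for bitwise-or
  for (int v : vec) {
    if (v < 0) {
      vor[0] |= 1;                     // rarely taken branch
    }
  }
  return vor;
}
```

With an input of all ones, the flag in bin 0 stays 0. If any element is negative, bin 0 becomes 1, no matter how many elements trigger the condition.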
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,227 @@ | ||
.. ## | ||
.. ## Copyright (c) 2016-24, Lawrence Livermore National Security, LLC | ||
.. ## and other RAJA project contributors. See the RAJA/LICENSE file | ||
.. ## for details. | ||
.. ## | ||
.. ## SPDX-License-Identifier: (BSD-3-Clause) | ||
.. ## | ||
.. _feat-multi-reductions-label: | ||
|
||
========================= | ||
MultiReduction Operations | ||
========================= | ||
|
||
RAJA provides multi-reduction types that allow users to perform a number of reduction operations that is | ||
determined at run time in kernels launched using the ``RAJA::forall``, ``RAJA::kernel``, | ||
and ``RAJA::launch`` methods in a portable, thread-safe manner. Users may | ||
use as many multi-reduction objects in a loop kernel as they need. If only a small, | ||
fixed number of reductions is required in a loop kernel, standard RAJA reduction objects can be | ||
used instead. Available RAJA multi-reduction types are described in this section. | ||
|
||
.. note:: All RAJA multi-reduction types are located in the namespace ``RAJA``. | ||
|
||
Also | ||
|
||
.. note:: * Each RAJA multi-reduction type is templated on a **multi-reduction policy** | ||
and a **reduction value type** for the multi-reduction variable. The | ||
**multi-reduction policy type must be compatible with the execution | ||
policy used by the kernel in which it is used.** For example, in | ||
a CUDA kernel, a CUDA multi-reduction policy must be used. | ||
* Each RAJA multi-reduction type accepts an **initial reduction value or | ||
values** at construction (see below). | ||
* Each RAJA multi-reduction type provides a ``get`` method to access reduced | ||
values after kernel execution completes (e.g., ``vsum[bin].get()`` in the examples below). | ||
|
||
Please see the following sections for a description of reducers: | ||
|
||
* :ref:`feat-reductions-label`. | ||
|
||
Please see the following cook book sections for guidance on policy usage: | ||
|
||
* :ref:`cook-book-multi-reductions-label`. | ||
|
||
|
||
-------------------- | ||
MultiReduction Types | ||
-------------------- | ||
|
||
RAJA supports three common multi-reduction types: | ||
|
||
* ``MultiReduceSum< multi_reduce_policy, data_type >`` - Sum of values. | ||
|
||
* ``MultiReduceMin< multi_reduce_policy, data_type >`` - Min value. | ||
|
||
* ``MultiReduceMax< multi_reduce_policy, data_type >`` - Max value. | ||
|
||
and two less common bitwise multi-reduction types: | ||
|
||
* ``MultiReduceBitAnd< multi_reduce_policy, data_type >`` - Bitwise 'and' of values (i.e., ``a & b``). | ||
|
||
* ``MultiReduceBitOr< multi_reduce_policy, data_type >`` - Bitwise 'or' of values (i.e., ``a | b``). | ||
|
||
.. note:: ``RAJA::MultiReduceBitAnd`` and ``RAJA::MultiReduceBitOr`` reduction types are designed to work on integral data types because **in C++, at the language level, there is no such thing as a bitwise operator on floating-point numbers.** | ||
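
The restriction in the note can be illustrated with a small sketch in plain C++. The helper ``bit_or_all`` is hypothetical and not part of RAJA; its ``static_assert`` rejects non-integral types at compile time, mirroring the fact that ``|`` is not defined for floating-point operands.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of RAJA): bitwise-or of all values.
// The static_assert mirrors the restriction described in the note.
template <typename T>
T bit_or_all(const std::vector<T>& values)
{
  static_assert(std::is_integral<T>::value,
                "bitwise operators require an integral type");
  T result = 0;                 // 0 is the identity for bitwise-or
  for (T v : values) {
    result |= v;
  }
  return result;
}

// bit_or_all(std::vector<double>{1.0, 2.0});  // would fail to compile
```

For example, ``bit_or_all(std::vector<int>{1, 4, 8})`` combines the bit patterns ``0001``, ``0100`` and ``1000`` into ``1101``, which is 13.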
|
||
----------------------- | ||
MultiReduction Examples | ||
----------------------- | ||
|
||
Next, we provide a few examples to illustrate basic usage of RAJA multi-reduction | ||
types. | ||
|
||
Here is a simple RAJA multi-reduction example that shows how to use a sum | ||
multi-reduction type:: | ||
|
||
const int N = 1000; | ||
const int B = 10; | ||
|
||
// | ||
// Initialize an array of length N with all ones, and another array of | ||
// length N with bin indices between 0 and B-1 | ||
// | ||
int vec[N]; | ||
int bins[N]; | ||
for (int i = 0; i < N; ++i) { | ||
vec[i] = 1; | ||
bins[i] = i % B; | ||
} | ||
|
||
// Create a sum multi-reduction object with a size of B, and initial | ||
// values of zero | ||
RAJA::MultiReduceSum< RAJA::omp_multi_reduce, int > vsum(B, 0); | ||
|
||
// Run a kernel using the multi-reduction object | ||
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N), | ||
[=](RAJA::Index_type i) { | ||
|
||
vsum[bins[i]] += vec[i]; | ||
|
||
}); | ||
|
||
// After kernel is run, extract the reduced values | ||
int my_vsums[B]; | ||
for (int bin = 0; bin < B; ++bin) { | ||
my_vsums[bin] = vsum[bin].get(); | ||
} | ||
|
||
The results of these operations will yield the following values: | ||
|
||
* my_vsums[0] == 100 | ||
* my_vsums[1] == 100 | ||
* my_vsums[2] == 100 | ||
* my_vsums[3] == 100 | ||
* my_vsums[4] == 100 | ||
* my_vsums[5] == 100 | ||
* my_vsums[6] == 100 | ||
* my_vsums[7] == 100 | ||
* my_vsums[8] == 100 | ||
* my_vsums[9] == 100 | ||
|
||
|
||
Here is the same example but using values stored in a container:: | ||
|
||
const int N = 1000; | ||
const int B = 10; | ||
|
||
// | ||
// Initialize an array of length N with all ones, and another array of | ||
// length N with bin indices between 0 and B-1 | ||
// | ||
int vec[N]; | ||
int bins[N]; | ||
for (int i = 0; i < N; ++i) { | ||
vec[i] = 1; | ||
bins[i] = i % B; | ||
} | ||
|
||
// Create a vector with a size of B, and initial values of zero | ||
std::vector<int> my_vsums(B, 0); | ||
|
||
// Create a multi-reducer initialized with size and values from my_vsums | ||
RAJA::MultiReduceSum< RAJA::omp_multi_reduce, int > vsum(my_vsums); | ||
|
||
// Run a kernel using the multi-reduction object | ||
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N), | ||
[=](RAJA::Index_type i) { | ||
|
||
vsum[bins[i]] += vec[i]; | ||
|
||
}); | ||
|
||
// After kernel is run, extract the reduced values back into my_vsums | ||
vsum.get_all(my_vsums); | ||
|
||
The results of these operations will yield the following values: | ||
|
||
* my_vsums[0] == 100 | ||
* my_vsums[1] == 100 | ||
* my_vsums[2] == 100 | ||
* my_vsums[3] == 100 | ||
* my_vsums[4] == 100 | ||
* my_vsums[5] == 100 | ||
* my_vsums[6] == 100 | ||
* my_vsums[7] == 100 | ||
* my_vsums[8] == 100 | ||
* my_vsums[9] == 100 | ||
|
||
|
||
|
||
|
||
|
||
Here is an example of a bitwise-or multi-reduction:: | ||
|
||
const int N = 128; | ||
const int B = 8; | ||
|
||
// | ||
// Initialize an array of length N to integers between 0 and B-1 | ||
// | ||
int bins[N]; | ||
for (int i = 0; i < N; ++i) { | ||
bins[i] = i % B; | ||
} | ||
|
||
// Create a bitwise-or multi-reduction object with initial value of '0' | ||
RAJA::MultiReduceBitOr< RAJA::omp_multi_reduce, int > vor(B, 0); | ||
|
||
// Run a kernel using the multi-reduction object | ||
RAJA::forall<RAJA::omp_parallel_for_exec>( RAJA::RangeSegment(0, N), | ||
[=](RAJA::Index_type i) { | ||
|
||
vor[bins[i]] |= i; | ||
|
||
}); | ||
|
||
// After kernel is run, extract the reduced values | ||
int my_vors[B]; | ||
for (int bin = 0; bin < B; ++bin) { | ||
my_vors[bin] = vor[bin].get(); | ||
} | ||
|
||
The results of these operations will yield the following values: | ||
|
||
* my_vors[0] == 120 == 0b1111000 | ||
* my_vors[1] == 121 == 0b1111001 | ||
* my_vors[2] == 122 == 0b1111010 | ||
* my_vors[3] == 123 == 0b1111011 | ||
* my_vors[4] == 124 == 0b1111100 | ||
* my_vors[5] == 125 == 0b1111101 | ||
* my_vors[6] == 126 == 0b1111110 | ||
* my_vors[7] == 127 == 0b1111111 | ||
|
||
The results of the multi-reduction start at 120 and increase to 127. In binary | ||
representation (i.e., bits), :math:`120 = 0b1111000` and :math:`127 = 0b1111111`. | ||
The bins were chosen so that all the integers in a bin have the same | ||
remainder modulo 8, so their last 3 binary digits are identical while their | ||
upper binary digits vary. Because bitwise-or keeps every set bit, the upper | ||
bits are all set, since at least one integer in each bin sets each of them. The last | ||
3 bits are the same in all the integers in a bin, so in the result they equal the | ||
bin number. | ||
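
The values listed above can be reproduced with a sequential stand-in for the kernel. This is a minimal sketch in plain C++; ``or_indices_by_bin`` is a hypothetical helper and not part of RAJA.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of RAJA): sequential bitwise-or of the
// indices 0..n-1, grouped by index modulo num_bins.
std::vector<int> or_indices_by_bin(int n, int num_bins)
{
  assert(n >= 0 && num_bins > 0);
  std::vector<int> vor(num_bins, 0);
  for (int i = 0; i < n; ++i) {
    vor[i % num_bins] |= i;     // same update as in the RAJA kernel
  }
  return vor;
}
```

For ``or_indices_by_bin(128, 8)``, bin ``b`` receives the integers ``b``, ``b + 8``, ..., ``b + 120``. The multiples of 8 contribute the upper bits 8, 16, 32 and 64, which sum to 120, and the low three bits contribute ``b``, so each result is ``120 + b``.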
|
||
----------------------- | ||
MultiReduction Policies | ||
----------------------- | ||
|
||
For more information about available RAJA multi-reduction policies and guidance | ||
on which to use with RAJA execution policies, please see | ||
:ref:`multi-reducepolicy-label`. |