forked from E3SM-Project/E3SM
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Added design document for global reductions/collectives (E3SM-Project#11
) * Added design document for global reductions/collectives The design describes reproducible global sums and global min/max functionality.
- Loading branch information
1 parent
2c50fce
commit 86be118
Showing
2 changed files
with
295 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,294 @@ | ||
<!--- OMEGA global reductions requirements and design -------------------------> | ||
|
||
(omega-design-global-reductions)= | ||
|
||
|
||
# Global Reductions | ||
|
||
## 1 Overview | ||
|
||
In a parallel application like OMEGA, it is often necessary to perform | ||
averages, sums or other global collective operations on the distributed | ||
data. These reductions are described here. | ||
|
||
## 2 Requirements | ||
|
||
### 2.1 Requirement: Global sums | ||
|
||
The ability to perform the sum of all elements of a distributed array | ||
is required | ||
|
||
### 2.2 Requirement: Global min/max | ||
|
||
The ability to find the minimum or maximum value in a distributed | ||
array is required. | ||
|
||
### 2.3 Requirement: Data types | ||
|
||
There must be reductions for all supported arithmetic types | ||
(integers, floating point) and all array dimensions from | ||
0-d (scalars) to 5-d. | ||
|
||
### 2.4 Requirement: Sums with product | ||
|
||
In many cases, a global sum is often performed on a product of two | ||
arrays. For example, a sum can be performed with a multiplicative | ||
mask or a sum may be a dot product of two vectors within a solver. | ||
A global sum interface that multiplies two arrays before the sum | ||
is required. | ||
|
||
### 2.5 Requirement: Restricted address space | ||
|
||
In order to compute reductions over only owned and active | ||
cells/edges or for regional diagnostics, there must be an option | ||
to restrict the address space over which the local portion of sums | ||
are computed. | ||
|
||
### 2.6 Requirement: Multiple fields/arrays | ||
|
||
For performance reasons, it is often beneficial to compute reductions | ||
for multiple arrays at the same time to reduce the number of | ||
messages and associated message latency. An interface for computing | ||
independent reductions for multiple arrays at once is needed. Note | ||
that the size of the arrays and any associated options must be the | ||
same for all arrays in a multi-field reduction. It would be possible | ||
to relax this restriction in the future. | ||
|
||
### 2.7 Requirement: Other environments/decompositions | ||
|
||
Because OMEGA supports multiple machine environments (communicators | ||
and processor sets) and decompositions, the reductions must be | ||
callable from environments other than the default. | ||
|
||
### 2.8 Requirement: Reproducibility | ||
|
||
The results (esp sums) must be bit-for-bit reproducible on a given | ||
machine with respect to processor and thread count | ||
|
||
### 2.9 Requirement: Accelerators and location | ||
|
||
When running on accelerated architectures, the ability to specify | ||
whether the reduction occurs on host or device is needed. | ||
|
||
### 2.10 Requirement: CPU threading | ||
|
||
For reductions performed on the CPU, the operations must support | ||
and MPI+OpenMP model while still meeting reproducibility requirements. | ||
|
||
### 2.11 Desired: Other reduction operations | ||
|
||
It will be desireable in the future to add other reduction operations | ||
like minloc/maxloc. | ||
|
||
## 3 Algorithmic Formulation | ||
|
||
For integers, the sum is straightforward. For single-precision floating | ||
point, we will perform sums in double precision and convert back to | ||
single before returning. In each of the above cases, the local YAKL | ||
versions of sum, minval, maxval will be used for the local sum before | ||
accumulating the MPI sum across ranks. | ||
|
||
For double-precision sums, we will begin with | ||
a version of the double-double (DDPDD) algorithm of Donald Knuth, | ||
further improved by David A Bailey and outlined in He and Ding (2001 | ||
J. of Supercomputing, 18, 259). In the future, we will also support | ||
the more robust version by Pat Worley in the E3SM reprosum routines. | ||
|
||
## 4 Design | ||
|
||
### 4.1 Data types and parameters | ||
|
||
#### 4.1.1 Parameters | ||
|
||
In the future, if we support multiple reproducible algorithms, | ||
we will need an enum to define choice of algorithm. | ||
|
||
#### 4.1.2 Class/structs/data types | ||
|
||
There are no new classes or data types associated with this | ||
functionality. | ||
|
||
### 4.2 Methods | ||
|
||
#### 4.2.1 Global sum | ||
|
||
For each supported array type ArrayTTDD (where TT is the data | ||
type I4, I8, R4, R8, Real and DD is the dimension 1D thru 5D) | ||
or supported scalar data type, there will be a sum function | ||
aliased to the globalSum interface: | ||
|
||
```c++ | ||
sum = globalSum(const ArrayTTDD array, | ||
const I4 mpiComm, | ||
const std::vector<I4> indxRange); | ||
``` | ||
|
||
where mpiComm is the MPI communicator extracted from the | ||
relevant MachEnv (eg defaultEnv.comm) and indxRange is a | ||
`std::vector` of length 2x the number of array dimensions. | ||
The entries of this vector will be the min, max index of | ||
each array dimension. So, for example an array dimensioned | ||
`(nCellsAll+1, nVertLevels+1)` might need to be summed over | ||
the `indxRange{0, nCellsOwned-1, 0, nVertLevels-1}`. Note | ||
that the indxRange supports a constant scalar values only so | ||
does not support a variable index range like | ||
minLevelCell/maxLevelCell. That is best managed either | ||
by masking with the sum-product interface below or | ||
ensuring the array has been set to zero in non-active | ||
entries. | ||
|
||
For scalar sums, the interface is the same with the | ||
indxRange argument absent. | ||
|
||
We will assume that YAKL host arrays will be summed on the | ||
host and device arrays will be summed on the device. | ||
|
||
#### 4.2.2 Global sum with product | ||
|
||
There will be a similar interface for a sum with product: | ||
|
||
```c++ | ||
sum = globalSum(const ArrayTTDD array1, | ||
const ArrayTTDD array2, | ||
const I4 mpiComm, | ||
const std::vector<I4> indxRange); | ||
``` | ||
|
||
that returns the sum of `array1*array2`. The two arrays | ||
must be of the same data type. It is not required that | ||
the two arrays be of exactly the same size but both arrays | ||
must be of appropriate size to accomodate the indxRange | ||
and the indices must align appropriately. This typically | ||
means that if the sizes differ, any extra entries occur | ||
at the end of the respective dimension. | ||
|
||
#### 4.2.3 Global sum multi-field | ||
|
||
To sum multiple fields at once, a `std::vector` of | ||
arrays must be constructed and the result will be | ||
stored in a `std::vector` of appropriate type: | ||
|
||
```c++ | ||
std::vector<type> sum = | ||
globalSum(const std::vector<ArrayTTDD> arrays, | ||
const I4 mpiComm, | ||
const std::vector<I4> indxRange) | ||
``` | ||
As is the case with sum-product, each array must be | ||
of the same data type but can have slightly different | ||
sizes as long as the dimension lengths can accomodate | ||
the indxRange and that the indices align appropriately | ||
with the indxRange requested. | ||
#### 4.2.4 Global sum multi-field with product | ||
The multi-field sum with product is still to be determined, but | ||
should have an interface like: | ||
```c++ | ||
std::vector<type> sum = | ||
globalSum(const std::vector<ArrayTTDD> arrays1, | ||
const std::vector<ArrayTTDD> arrays2, | ||
const I4 mpiComm, | ||
const std::vector<I4> indxRange) | ||
``` | ||
|
||
The implementation detail that needs to be determined is how | ||
to manage two use cases. A typical use case is for the second | ||
array in the product to be the same for all fields (eg a mask or | ||
an area array), but we may wish to support a case where each | ||
array product has a different array for both operands. | ||
Because YAKL arrays contain metadata and a pointer to | ||
the data, it may be possible to create a std::vector of | ||
the same array, allowing both the case of a fixed array | ||
to be used for all or for all the array products to have | ||
different arrays. This requires some testing of approaches. | ||
|
||
#### 4.2.5 Global minval | ||
|
||
The global minval will support interfaces similar to the | ||
global sums, but will return the minimum value instead. | ||
|
||
```c++ | ||
varMin = globalMinval(const ArrayTTDD array, | ||
const I4 mpiComm, | ||
const std::vector<I4> indxRange); | ||
|
||
varMin = globalMinval(const ArrayTTDD array1, | ||
const ArrayTTDD array2, | ||
const I4 mpiComm, | ||
const std::vector<I4> indxRange); | ||
|
||
std::vector<type> varMin = | ||
globalMinval(const std::vector<ArrayTTDD> arrays, | ||
const I4 mpiComm, | ||
const std::vector<I4> indxRange) | ||
``` | ||
Note that the minval routine will not be cognizant of any | ||
special values so any inactive entries should either be | ||
masked or set to an appropriately high value. | ||
#### 4.2.6 Global maxval | ||
The maxval will have the identical form of minval but will | ||
return the maximum value. The same restrictions and comments | ||
on special values apply. | ||
## 5 Verification and Testing | ||
### 5.1 Test basics | ||
For each data type, a set of reference arrays will be created | ||
with random numbers that extend across the entire range of that | ||
type (for later reproducibility tests). A reference sum will | ||
be computed serially using the same reproducible algorithm | ||
as the global implementation. The basic test will compare the | ||
global sum with the reference serial sum. Because the data types | ||
include both host and device arrays, this will also test reductions | ||
on either location. | ||
- tests requirement 2.1, 2.3, part of 2.8, 2.9 | ||
### 5.2 Reproducibility | ||
A MachEnv and decomposition for a subset of MPI ranks will be | ||
created and the reference arrays above will be distributed across | ||
the new decomposition. Global sums in each case will be compared | ||
to the serial reference value and the full MPI rank case | ||
for bit reproducibility. Similarly, a test with CPU threading | ||
on will test reproducibility under threading. | ||
- tests requirement 2.7, 2.8, 2.10 | ||
### 5.3 Restricted index range | ||
The sums will be tested with a reduced index range on the same | ||
reference arrays from above and compared with the | ||
similarly-restricted serial reference sums. | ||
- tests requirement 2.5 | ||
### 5.4 Sum with product | ||
A mask array will be defined for each type and used with the | ||
sum-with-product interface and compared against a serial | ||
version of the same. | ||
- tests requirement 2.4 | ||
### 5.5 Multi-field sums | ||
Additional arrays similar to the reference arrays will be created | ||
with reference serial sums. Then the multi-field interface will | ||
be tested against these reference sums. | ||
- tests requirement 2.6 | ||
### 5.6 Min/Max | ||
A similar approach to the sums above will be used to test the | ||
min and max functions in all combinations of interfaces. | ||
- tests requirement 2.2 | ||
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -44,6 +44,7 @@ design/Halo | |
design/Logging | ||
design/MachEnv | ||
design/TimeMgr | ||
design/Reductions | ||
design/OmegaDesignTemplate | ||
``` |