Added design document for global reductions/collectives (E3SM-Project#11

) * Added design document for global reductions/collectives The design describes reproducible global sums and global min/max functionality.
philipwjones · May 31, 2023 · d028d14 · d028d14
1 parent 9b81f62
commit d028d14
Show file tree

Hide file tree

Showing 2 changed files with 295 additions and 0 deletions.
diff --git a/components/omega/doc/design/Reductions.md b/components/omega/doc/design/Reductions.md
@@ -0,0 +1,294 @@
+<!--- OMEGA global reductions requirements and design ------------------------->
+
+(omega-design-global-reductions)=
+
+
+# Global Reductions
+
+## 1 Overview
+
+In a parallel application like OMEGA, it is often necessary to perform
+averages, sums or other global collective operations on the distributed
+data. These reductions are described here.
+
+## 2 Requirements
+
+### 2.1 Requirement: Global sums
+
+The ability to perform the sum of all elements of a distributed array
+is required 
+
+### 2.2 Requirement: Global min/max
+
+The ability to find the minimum or maximum value in a distributed
+array is required.
+
+### 2.3 Requirement: Data types
+
+There must be reductions for all supported arithmetic types
+(integers, floating point) and all array dimensions from
+0-d (scalars) to 5-d.
+
+### 2.4 Requirement: Sums with product
+
+In many cases, a global sum is often performed on a product of two
+arrays. For example, a sum can be performed with a multiplicative
+mask or a sum may be a dot product of two vectors within a solver.
+A global sum interface that multiplies two arrays before the sum
+is required.
+
+### 2.5 Requirement: Restricted address space
+
+In order to compute reductions over only owned and active 
+cells/edges or for regional diagnostics, there must be an option
+to restrict the address space over which the local portion of sums
+are computed.
+
+### 2.6 Requirement: Multiple fields/arrays
+
+For performance reasons, it is often beneficial to compute reductions
+for multiple arrays at the same time to reduce the number of
+messages and associated message latency. An interface for computing
+independent reductions for multiple arrays at once is needed. Note
+that the size of the arrays and any associated options must be the
+same for all arrays in a multi-field reduction. It would be possible
+to relax this restriction in the future.
+
+### 2.7 Requirement: Other environments/decompositions
+
+Because OMEGA supports multiple machine environments (communicators
+and processor sets) and decompositions, the reductions must be
+callable from environments other than the default.
+
+### 2.8 Requirement: Reproducibility
+
+The results (esp sums) must be bit-for-bit reproducible on a given
+machine with respect to processor and thread count
+
+### 2.9 Requirement: Accelerators and location
+
+When running on accelerated architectures, the ability to specify
+whether the reduction occurs on host or device is needed.
+
+### 2.10 Requirement: CPU threading
+
+For reductions performed on the CPU, the operations must support
+and MPI+OpenMP model while still meeting reproducibility requirements.
+
+### 2.11 Desired: Other reduction operations
+
+It will be desireable in the future to add other reduction operations
+like minloc/maxloc.
+
+## 3 Algorithmic Formulation
+
+For integers, the sum is straightforward. For single-precision floating
+point, we will perform sums in double precision and convert back to
+single before returning. In each of the above cases, the local YAKL
+versions of sum, minval, maxval will be used for the local sum before
+accumulating the MPI sum across ranks. 
+
+For double-precision sums, we will begin with
+a version of the double-double (DDPDD) algorithm of Donald Knuth,
+further improved by David A Bailey and outlined in He and Ding (2001
+J. of Supercomputing, 18, 259). In the future, we will also support
+the more robust version by Pat Worley in the E3SM reprosum routines.
+
+## 4 Design
+
+### 4.1 Data types and parameters
+
+#### 4.1.1 Parameters 
+
+In the future, if we support multiple reproducible algorithms,
+we will need an enum to define choice of algorithm.
+
+#### 4.1.2 Class/structs/data types
+
+There are no new classes or data types associated with this
+functionality.
+
+### 4.2 Methods
+
+#### 4.2.1 Global sum
+
+For each supported array type ArrayTTDD (where TT is the data
+type I4, I8, R4, R8, Real and DD is the dimension 1D thru 5D)
+or supported scalar data type, there will be a sum function
+aliased to the globalSum interface:
+
+```c++
+    sum = globalSum(const ArrayTTDD array, 
+                    const I4 mpiComm, 
+                    const std::vector<I4> indxRange);
+```
+
+where mpiComm is the MPI communicator extracted from the
+relevant MachEnv (eg defaultEnv.comm) and indxRange is a
+`std::vector` of length 2x the number of array dimensions.
+The entries of this vector will be the min, max index of
+each array dimension. So, for example an array dimensioned
+`(nCellsAll+1, nVertLevels+1)` might need to be summed over
+the `indxRange{0, nCellsOwned-1, 0, nVertLevels-1}`. Note
+that the indxRange supports a constant scalar values only so
+does not support a variable index range like
+minLevelCell/maxLevelCell. That is best managed either
+by masking with the sum-product interface below or 
+ensuring the array has been set to zero in non-active
+entries.
+
+For scalar sums, the interface is the same with the 
+indxRange argument absent.
+
+We will assume that YAKL host arrays will be summed on the
+host and device arrays will be summed on the device.
+
+#### 4.2.2 Global sum with product
+
+There will be a similar interface for a sum with product:
+
+```c++
+    sum = globalSum(const ArrayTTDD array1,
+                    const ArrayTTDD array2, 
+                    const I4 mpiComm, 
+                    const std::vector<I4> indxRange);
+```
+
+that returns the sum of `array1*array2`. The two arrays
+must be of the same data type. It is not required that
+the two arrays be of exactly the same size but both arrays
+must be of appropriate size to accomodate the indxRange
+and the indices must align appropriately. This typically
+means that if the sizes differ, any extra entries occur
+at the end of the respective dimension.
+
+#### 4.2.3 Global sum multi-field
+
+To sum multiple fields at once, a `std::vector` of
+arrays must be constructed and the result will be
+stored in a `std::vector` of appropriate type:
+
+```c++
+    std::vector<type> sum = 
+         globalSum(const std::vector<ArrayTTDD> arrays,
+                   const I4 mpiComm,
+                   const std::vector<I4> indxRange)
+```
+
+As is the case with sum-product, each array must be
+of the same data type but can have slightly different
+sizes as long as the dimension lengths can accomodate
+the indxRange and that the indices align appropriately
+with the indxRange requested.
+
+#### 4.2.4 Global sum multi-field with product
+
+The multi-field sum with product is still to be determined, but
+should have an interface like:
+
+```c++
+    std::vector<type> sum = 
+         globalSum(const std::vector<ArrayTTDD> arrays1,
+                   const std::vector<ArrayTTDD> arrays2,
+                   const I4 mpiComm,
+                   const std::vector<I4> indxRange)
+```
+
+The implementation detail that needs to be determined is how
+to manage two use cases. A typical use case is for the second
+array in the product to be the same for all fields (eg a mask or
+an area array), but we may wish to support a case where each
+array product has a different array for both operands.
+Because YAKL arrays contain metadata and a pointer to
+the data, it may be possible to create a std::vector of
+the same array, allowing both the case of a fixed array
+to be used for all or for all the array products to have 
+different arrays. This requires some testing of approaches.
+
+#### 4.2.5 Global minval
+
+The global minval will support interfaces similar to the
+global sums, but will return the minimum value instead.
+
+```c++
+    varMin = globalMinval(const ArrayTTDD array, 
+                          const I4 mpiComm, 
+                          const std::vector<I4> indxRange);
+
+    varMin = globalMinval(const ArrayTTDD array1,
+                          const ArrayTTDD array2, 
+                          const I4 mpiComm, 
+                          const std::vector<I4> indxRange);
+
+    std::vector<type> varMin = 
+         globalMinval(const std::vector<ArrayTTDD> arrays,
+                      const I4 mpiComm,
+                      const std::vector<I4> indxRange)
+```
+
+Note that the minval routine will not be cognizant of any
+special values so any inactive entries should either be
+masked or set to an appropriately high value.
+
+#### 4.2.6 Global maxval
+
+The maxval will have the identical form of minval but will
+return the maximum value. The same restrictions and comments
+on special values apply.
+
+## 5 Verification and Testing
+
+### 5.1 Test basics
+
+For each data type, a set of reference arrays will be created
+with random numbers that extend across the entire range of that 
+type (for later reproducibility tests). A reference sum will
+be computed serially using the same reproducible algorithm
+as the global implementation. The basic test will compare the
+global sum with the reference serial sum. Because the data types
+include both host and device arrays, this will also test reductions
+on either location. 
+  - tests requirement 2.1, 2.3, part of 2.8, 2.9
+
+### 5.2 Reproducibility
+
+A MachEnv and decomposition for a subset of MPI ranks will be
+created and the reference arrays above will be distributed across
+the new decomposition. Global sums in each case will be compared
+to the serial reference value and the full MPI rank case
+for bit reproducibility. Similarly, a test with CPU threading
+on will test reproducibility under threading.
+  - tests requirement 2.7, 2.8, 2.10
+
+### 5.3 Restricted index range
+
+The sums will be tested with a reduced index range on the same
+reference arrays from above and compared with the 
+similarly-restricted serial reference sums.
+  - tests requirement 2.5
+
+### 5.4 Sum with product
+
+A mask array will be defined for each type and used with the
+sum-with-product interface and compared against a serial
+version of the same.
+  - tests requirement 2.4
+
+### 5.5 Multi-field sums
+
+Additional arrays similar to the reference arrays will be created
+with reference serial sums. Then the multi-field interface will
+be tested against these reference sums.
+  - tests requirement 2.6
+
+### 5.6 Min/Max
+
+A similar approach to the sums above will be used to test the
+min and max functions in all combinations of interfaces.
+  - tests requirement 2.2
+
+
+
+
+
+
diff --git a/components/omega/doc/index.md b/components/omega/doc/index.md
@@ -44,6 +44,7 @@ design/Halo
 design/Logging
 design/MachEnv
 design/TimeMgr
+design/Reductions
 
 design/OmegaDesignTemplate
 ```