Conversation

@pcarruscag
Member

Proposed Changes

The first work item from #824 is kind of in place code-wise (lots of testing required).
I will use this PR to document the implementation, while continuing the work on the main PR, and to keep the discussion "encapsulated".

Related Work

#789, #824

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

Comment on lines -1028 to -1046
 * \brief A virtual member.
 * \return Total number of nodes in a simulation across all processors (including halos).
 */
inline virtual unsigned long GetGlobal_nPoint() const { return 0; }

/*!
 * \brief A virtual member.
 * \return Total number of nodes in a simulation across all processors (excluding halos).
 */
inline virtual unsigned long GetGlobal_nPointDomain() const { return 0; }

/*!
 * \brief A virtual member.
 * \param[in] val_global_npoint - Global number of points in the mesh (excluding halos).
 */
inline virtual void SetGlobal_nPointDomain(unsigned long val_global_npoint) {}

/*!
 * \brief A virtual member.
Member Author

This still contains a bit of CGeometry clean-up: many small methods did not need to be virtual (getting the number of prisms and so on; a lot of these are only used in the legacy output), and the same goes for a lot (all?) of the small turbomachinery set/get methods.

Comment on lines +164 to +168
CCompressedSparsePatternUL
finiteVolumeCSRFill0, /*!< \brief 0-fill FVM sparsity. */
finiteVolumeCSRFillN, /*!< \brief N-fill FVM sparsity (e.g. for ILUn preconditioner). */
finiteElementCSRFill0, /*!< \brief 0-fill FEM sparsity. */
finiteElementCSRFillN; /*!< \brief N-fill FEM sparsity (e.g. for ILUn preconditioner). */
Member Author

As I mentioned in #789, the sparsity patterns (row pointer and column index) that CSysMatrix requires are now stored in CGeometry. This allows re-use (bulk and turbulence, for example) and amortises the "edge map" structure (of comparable size) that I introduced to accelerate the update of CSysMatrix blocks.
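
For context, a minimal sketch of how such a CSR pattern is consumed (a fragment, assuming geometry, type and iPoint are in scope; outerPtr/innerIdx are the accessors used later in this PR):

/*--- Illustrative only: visit the non-zero blocks of row iPoint. ---*/
const auto& csr = geometry->GetSparsePattern(type,0);
const auto* rowPtr = csr.outerPtr();   // size nPoint+1, start of each row
const auto* colIdx = csr.innerIdx();   // size nnz, column of each block
for (auto k = rowPtr[iPoint]; k < rowPtr[iPoint+1]; ++k) {
  auto jPoint = colIdx[k];   // block k of the matrix couples iPoint with jPoint
}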

Comment on lines +174 to +178
CCompressedSparsePatternUL
edgeColoring, /*!< \brief Edge coloring structure for thread-based parallelization. */
elemColoring; /*!< \brief Element coloring structure for thread-based parallelization. */
unsigned long edgeColorGroupSize = 1; /*!< \brief Size of the edge groups within each color. */
unsigned long elemColorGroupSize = 1; /*!< \brief Size of the element groups within each color. */
Member Author

This is not yet used by this PR; these colorings will allow parallelizing edge/element loops using threads, more on that later (a rough sketch of the intended use follows).
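
As an illustration only (getOuterSize is an assumed accessor, outerPtr/innerIdx as in the CSR pattern, plain pragma instead of the SU2 macros):

/*--- Sketch: edges of one color touch disjoint points, so each color can be
      swept by all threads without write conflicts between iterations. ---*/
for (auto color = 0ul; color < edgeColoring.getOuterSize(); ++color) {
  #pragma omp parallel for schedule(static)
  for (auto k = edgeColoring.outerPtr()[color]; k < edgeColoring.outerPtr()[color+1]; ++k) {
    auto iEdge = edgeColoring.innerIdx()[k];
    /* ... update the two points of iEdge safely ... */
  }
}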

Comment on lines 92 to 77
ScalarType *block; /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *block_inverse; /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *block_weight; /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *prod_row_vector; /*!< \brief Internal array to store the product of a matrix-by-blocks "row" with a vector. */
ScalarType *aux_vector; /*!< \brief Auxiliary array to store intermediate results. */
ScalarType *sum_vector; /*!< \brief Auxiliary array to store intermediate results. */

enum : size_t { MAXNVAR = 8 };   /*!< \brief Maximum number of variables the matrix can handle. The static
                                      size is needed for fast, per-thread, static memory allocation. */
Member Author

First drawback of using threads: we need to be mindful of the thread-safety of the routines. The small working structures we had (block and so on) cannot be used by multiple threads simultaneously; the alternatives are:

  • Allocate a larger chunk of memory and distribute it among the threads. This is a bit ugly and unnatural in small parallel-for constructs, but reasonable for sections where each thread does a lot of work.
  • Local dynamic allocation, which would hurt the performance of light routines and prevent optimizations.
  • Local static allocation, which hurts generality: as the name MAXNVAR implies, if the matrix is asked to work with more variables than that, bad things will happen (a runtime error is thrown telling the user to compile with a larger number). This is the option used here; see the sketch below.
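
A minimal sketch of the third option (an illustrative routine, not the exact SU2 code):

#include <cassert>

template<class ScalarType>
void someLightRoutine(unsigned long nVar) {
  assert(nVar <= MAXNVAR);   /*--- the generality limit mentioned above ---*/
  /*--- Per-thread scratch space on the stack: nothing is shared between
        threads and there is no dynamic allocation in a hot routine. ---*/
  ScalarType weight[MAXNVAR*MAXNVAR], aux_vector[MAXNVAR];
  /* ... use weight/aux_vector for this thread's work ... */
}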

Comment on lines +515 to +546
/*!
 * \brief Update 4 blocks ii, ij, ji, jj (add to i*, subtract from j*).
 * \note The template parameter Sign can be used to create a "subtractive"
 * update, i.e. subtract from row i and add to row j instead.
 * \param[in] iEdge - Index of edge that connects iPoint and jPoint.
 * \param[in] iPoint - Row to which we add the blocks.
 * \param[in] jPoint - Row from which we subtract the blocks.
 * \param[in] block_i - Adds to ii, subtracts from ji.
 * \param[in] block_j - Adds to ij, subtracts from jj.
 */
template<class OtherType, int Sign = 1>
inline void UpdateBlocks(unsigned long iEdge, unsigned long iPoint, unsigned long jPoint,
                         OtherType **block_i, OtherType **block_j) {

  ScalarType *bii = &matrix[dia_ptr[iPoint]*nVar*nEqn];
  ScalarType *bjj = &matrix[dia_ptr[jPoint]*nVar*nEqn];
  ScalarType *bij = &matrix[edge_ptr(iEdge,0)*nVar*nEqn];
  ScalarType *bji = &matrix[edge_ptr(iEdge,1)*nVar*nEqn];

  unsigned long iVar, jVar, offset = 0;

  for (iVar = 0; iVar < nVar; iVar++) {
    for (jVar = 0; jVar < nEqn; jVar++) {
      bii[offset] += PassiveAssign<ScalarType,OtherType>(block_i[iVar][jVar]) * Sign;
      bij[offset] += PassiveAssign<ScalarType,OtherType>(block_j[iVar][jVar]) * Sign;
      bji[offset] -= PassiveAssign<ScalarType,OtherType>(block_i[iVar][jVar]) * Sign;
      bjj[offset] -= PassiveAssign<ScalarType,OtherType>(block_j[iVar][jVar]) * Sign;
      ++offset;
    }
  }
}

Member Author

This is the "fast" block update I mentioned, instead of looking for the indices of the blocks in the sparse structure we use the edge pointer to directly obtain them. This edge map is populated once (via the normal search process).
This is now being called by all FVM solvers (for FEM this would not pay off as many blocks are referenced by each element).
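
For comparison, a caller-side sketch (the commented-out AddBlock/SubtractBlock calls stand for the old search-based path; only the UpdateBlocks call is from this PR):

/*--- Old path: four independent updates, each searching the row for the column index. ---*/
// Jacobian.AddBlock(iPoint, iPoint, Jacobian_i);      Jacobian.AddBlock(iPoint, jPoint, Jacobian_j);
// Jacobian.SubtractBlock(jPoint, iPoint, Jacobian_i); Jacobian.SubtractBlock(jPoint, jPoint, Jacobian_j);
/*--- New path: one call, block addresses resolved directly through the edge map. ---*/
Jacobian.UpdateBlocks(iEdge, iPoint, jPoint, Jacobian_i, Jacobian_j);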

Comment on lines 121 to 126
const auto& csr = geometry->GetSparsePattern(type,0);

}
row_ptr = csr.outerPtr();
col_ind = csr.innerIdx();
dia_ptr = csr.diagPtr();
nnz = csr.getNumNonZeros();
Member Author

This is the pattern coming from the CGeometry associated with the matrix. CGeometry does lazy construction of those patterns, so there is no extra work and no complicated logic on the config settings to determine the type of solver, etc.
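
The lazy construction could look roughly like this (ConnectivityType, empty() and buildCSRPattern are my assumptions; fill-level dispatch omitted):

const CCompressedSparsePatternUL& CGeometry::GetSparsePattern(ConnectivityType type, unsigned long fillLvl) {
  /*--- Sketch: build the pattern on first request, re-use it afterwards. ---*/
  auto& pattern = (type == ConnectivityType::FiniteVolume)? finiteVolumeCSRFill0 : finiteElementCSRFill0;
  if (pattern.empty())                                 /*--- hypothetical emptiness check ---*/
    pattern = buildCSRPattern(*this, type, fillLvl);   /*--- hypothetical builder ---*/
  return pattern;
}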

Comment on lines 485 to 487
SU2_OMP_PAR_FOR_STAT(OMP_STAT_SIZE)
for (unsigned long index = 0; index < nnz*nVar*nEqn; index++)
matrix[index] = 0.0;
Member Author

A simple example of the use of one of those "macro encapsulated" pragmas.
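
For readers unfamiliar with these macros, a plausible expansion (an assumption, not checked against the actual header):

#define SU2_PRAGMA(X) _Pragma(#X)   /*--- hypothetical helper ---*/
#ifdef _OPENMP
#define SU2_OMP_PAR_FOR_STAT(CHUNK) SU2_PRAGMA(omp parallel for schedule(static,CHUNK))
#else
#define SU2_OMP_PAR_FOR_STAT(CHUNK)   /*--- expands to nothing in serial builds ---*/
#endif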

Comment on lines 631 to 638
SU2_OMP_PAR_FOR_DYN(omp_chunk_size)
for (auto row_i = 0ul; row_i < nPointDomain; row_i++) {
  auto prod_begin = row_i*nVar; // offset to beginning of block row_i
  for (auto iVar = 0ul; iVar < nVar; iVar++)
    prod[prod_begin+iVar] = 0.0;
  for (auto index = row_ptr[row_i]; index < row_ptr[row_i+1]; index++) {
    auto vec_begin = col_ind[index]*nVar; // offset to beginning of block col_ind[index]
    auto mat_begin = index*nVar*nVar;     // offset to beginning of matrix block[row_i][col_ind[index]]
Member Author

For non-transposed matrix multiplication this is all we need to make it parallel; it is however not ideal, as I will explain in a bit. Note that all the index variables are local: if they were declared outside the loop we could have data races.

Comment on lines 745 to 753
SU2_OMP_PARALLEL_ON(omp_num_parts)
{
  int thread = omp_get_thread_num();
  const auto begin = omp_partitions[thread];
  const auto end = omp_partitions[thread+1];

  ScalarType weight[MAXNVAR*MAXNVAR], aux_block[MAXNVAR*MAXNVAR];

  for (auto iPoint = begin+1; iPoint < end; iPoint++) {
Member Author

For preconditioners that are serial by nature, we have a parallel section where each thread works on its own large chunk of the matrix.
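
The omp_partitions array itself is presumably just an even split of the rows, something like this sketch:

/*--- Sketch: split [0, nPointDomain) into omp_num_parts contiguous chunks;
      thread t then owns rows [omp_partitions[t], omp_partitions[t+1]). ---*/
omp_partitions[0] = 0;
for (int part = 1; part <= omp_num_parts; ++part)
  omp_partitions[part] = (nPointDomain * part) / omp_num_parts;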

Comment on lines -525 to +528
/*--- Points in edge ---*/
iPoint = geometry->edge[iEdge]->GetNode(0);
jPoint = geometry->edge[iEdge]->GetNode(1);
numerics->SetNormal(geometry->edge[iEdge]->GetNormal());
Member Author

A few indentation issues were fixed here and there too.

@pcarruscag
Member Author

I have covered all operations used in non-adjoint runs. The non-ideal part of the implementation I mentioned above is that the parallelization is "local", i.e. we get to the operation we want to make parallel and launch the threads there; for simple vector-vector operations the overhead may be significant.
Ideally we would have a parallel construct at a higher level, say CSysSolve::Solve, so that the threads are already in flight when we get to those small operations.
In principle it is not too hard to do that, but it needs to be done carefully, especially when the execution gets to an MPI part of the code (which thread(s) communicate, etc.).
I will try to benchmark this to put numbers on the performance / simplicity trade-off.

@pcarruscag
Member Author

Ok the "simple" version of "going parallel" whenever we get to a linear algebra operation did not make the cut.
On an older architecture there was a 10% slowdown of the linear solvers at ~10k nodes per core and about the same on a newer architecture but only at ~1k node per core.
Since hybrid parallel is supposed to be good for strong scaling, this was not good enough... With the new strategy it is ok (see "performance" below), hence this is ready for review.

Overall Strategy

The strategy now is to start a parallel section in CSysSolve::Solve that covers building the preconditioner and solving the linear system.
Linear algebra routines called within this section have worksharing constructs instead of parallel ones, i.e. the work is distributed among however many threads arrive at that routine. This also makes the routines safe to call in serial.
The only "dangerous" things to do in parallel are managing memory for a shared object (multiple threads call new, but there is only one shared pointer on which to call delete) and writing to the same memory locations concurrently.
I tried to make the first issue debuggable by asserting that the initialization routines of CSysMatrix and CSysVector are only called by the master thread.
For the second issue I made the associated classes as const-correct as possible; that should at least make someone think twice before changing a member variable of those classes. The risk is still there for input variables, as an algorithm-development aspect... For example, MatrixVectorProductTransposed cannot be made thread-parallel as simply/naively as its normal counterpart.
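
Schematically, with plain pragmas and illustrative names, the pattern is (a self-contained sketch, not SU2 code):

#include <vector>

/*--- The routine only has a worksharing construct: it distributes the loop among
      whatever team of threads calls it, and runs serially outside a parallel region. ---*/
void scaleVector(std::vector<double>& v, double factor) {
  #pragma omp for schedule(static) nowait
  for (size_t i = 0; i < v.size(); ++i) v[i] *= factor;
}

void solveLike(std::vector<double>& v) {
  #pragma omp parallel   /*--- threads started once, stay in flight for all calls below ---*/
  {
    scaleVector(v, 0.5);   /*--- nowait is safe here: same bounds and static schedule, ---*/
    scaleVector(v, 2.0);   /*--- so each thread re-visits exactly its own indices. ---*/
  }
}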

Communication Model

The MPI + threads communication model is very simple: currently only the master thread calls MPI routines (including Error), and this requires thread barriers before and after the communication to make sure the correct values are passed and seen by all threads.
We can test other alternatives in the future, but at the moment this does not seem to be a significant bottleneck.
Worksharing constructs have implicit barriers at completion; for CSysVector routines I used nowait modifiers, as it is safe to call those routines in sequence since the loop sizes and static work-scheduling specifications are identical.
However, routines that access a CSysVector in a different way should have an explicit barrier before using the vector (or risk undefined behaviour). You will see these barriers on entry to the matrix-vector product and every ComputeXXXPreconditioner (if you don't, let me know xD). I think those routines are large enough to amortise the cost of this.
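
Schematically, inside the parallel section (SU2_OMP_BARRIER/SU2_OMP_MASTER are assumed macro names in the style of the ones above; myValue/value are illustrative variables):

SU2_OMP_BARRIER   /*--- every thread has finished producing its contribution ---*/
SU2_OMP_MASTER
{
  SU2_MPI::Allreduce(&myValue, &value, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
SU2_OMP_BARRIER   /*--- master has no implicit barrier; make "value" visible to all threads ---*/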

Performance

Disclaimer:

  • We are talking about linear solvers only, you will not see a global improvement yet.
  • The large global improvements from "hybridization" will come from the multigrid behaving better on less-decomposed domains, and from the ability to independently tune the number of cores used in the linear preconditioners. For now the objective is "just" not to lose performance while gaining flexibility.
  • The performance of MPI+threads with 1 thread per rank will be worse than pure MPI (no free lunches).

With this small case, using 8 cores of a machine with two 2650v4 CPUs (Intel MPI 2018 + GCC 8.2), the hybrid approach (2 ranks of 4 threads) is about 5% faster than the MPI-only one (8 ranks); I expect larger cases to have identical performance.

How To

  • Compile: Add -fopenmp to the compiler and linker arguments.
  • Run: Set the number of threads with the environment variable OMP_NUM_THREADS (eventually I will make that a command line parameter). For best performance set OMP_WAIT_POLICY=ACTIVE, and beware of thread-binding settings: use mpirun --bind-to socket or mpirun --bind-to numa, never core (see the example below).
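
For example (illustrative command lines only; adjust rank/thread counts and paths to your machine):

export OMP_NUM_THREADS=4
export OMP_WAIT_POLICY=ACTIVE
mpirun -n 2 --bind-to socket ./SU2_CFD config.cfg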

Comment on lines +224 to +225
ScalarType CSysVector<ScalarType>::dot(const CSysVector<ScalarType> & u) const {
#if !defined(CODI_FORWARD_TYPE) && !defined(CODI_REVERSE_TYPE)
Member Author

Readability-wise, the dot product operation (now a member of CSysVector) is as bad as it can get, as we need to perform a reduction over both threads and ranks.
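
A simplified, self-contained sketch of the two-level reduction (here the parallel region is created locally; in the PR the threads are already in flight, which is part of what makes the real code uglier):

#include <mpi.h>
#include <cstddef>

double dotProduct(const double* a, const double* b, std::size_t n) {
  double localSum = 0.0;
  /*--- Thread-level reduction over this rank's entries. ---*/
  #pragma omp parallel for schedule(static) reduction(+:localSum)
  for (std::size_t i = 0; i < n; ++i) localSum += a[i]*b[i];
  /*--- Rank-level reduction (in the PR only the master thread communicates,
        with barriers around the call, see "Communication Model" above). ---*/
  double globalSum = 0.0;
  MPI_Allreduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return globalSum;
}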

@economon
Member

Following along here... will you eventually move the parallel section in CSysSolve up to the full application level (in main() maybe) when you move to the next steps? In the earlier work we found that, as you have seen in the linear solver routines, spawning parallel sections kernel-wise carried a large overhead. The best performance was given by spawning the threads right at the start and carrying them through the entire program, just like the MPI ranks.

My only other comments, which it sounds like you are addressing, are to make the threads as transparent as possible to developers (they shouldn't need to touch them unless they want to, like the MPI), and to make the compilation painless (easy to enable/disable). Have you connected it to meson somehow yet?

@economon
Member

I should also mention, though, that moving the threading to a single high-level parallel section is also very problematic for readability/development. Folks will have to be aware that the threads are active, and it can be very error-prone. This was one of the major detractors of bringing the OpenMP framework, as we had it in the C&F paper, into the develop branch, even though the performance was quite good (as was the interoperability of threading and AD at the time). Any clever suggestions/techniques for hiding the threading as much as possible are most welcome.

@pcarruscag
Member Author

pcarruscag commented Dec 17, 2019

No meson option yet; it would be a very small one though, the whole system works by detecting -fopenmp.

Your second comment is the main argument against moving the parallel section further up. Allocation routines have the highest risk of making a mess, but even seemingly innocuous things, like the small auxiliary arrays we allocate e.g. in CSolver and then use in derived classes, are a problem. I am almost done making the FEA solver completely hybrid parallel and I had to refactor most uses of those arrays. This is also why I took a more functional approach to the new limiter and gradient routines. The way we use CConfig is also not thread-safe; we would need to make all the "SetSomething" methods atomic, which would be monumental.

Initially I would have a few parallel sections (it is not too difficult to move them up later if we think that is the way to go). I want to use the FEA solver to get an idea of the relative performance; after seeing the effect of OMP_WAIT_POLICY I am optimistic.

@talbring left a comment (Member)

Thanks @pcarruscag! I like how this is implemented. The OMP structure does not really add a lot of overhead in terms of readability. I am curious to see what else we can do with this.

Comment on lines +296 to +307
inline void CMediMPIWrapper::Init_thread(int *argc, char ***argv, int required, int* provided) {
AMPI_Init_thread(argc,argv,required,provided);
MediTool::init();
AMPI_Comm_rank(convertComm(currentComm), &Rank);
AMPI_Comm_size(convertComm(currentComm), &Size);

MinRankError = Size;
MPI_Win_create(&MinRankError, sizeof(int), sizeof(int), MPI_INFO_NULL,
currentComm, &winMinRankError);
winMinRankErrorInUse = true;
}

Member

That seems to be alright as long as either Init_thread or Init is called, and not both.

* the data is managed by CGeometry to allow re-use. ---*/

const auto& csr = geometry->GetSparsePattern(type,0);
Member

I don't know what your plans are, but in the future we should try to pass the sparsity pattern to the constructor, in order to remove the dependency of the matrix class on the geometry class.

Member Author

That would be nice, but we still need the geometry for MPI.

Member

Actually I think we can move the point-to-point communication to its own structure. What do you think @economon ?

@pcarruscag
Member Author

@talbring I think GitHub does not recognize the "approved" label as an approval (the bot put it back to un-reviewed).

@pcarruscag
Member Author

I think this is stable enough to be merged. While testing #834 and #843 I think I went through most combinations of solvers/preconditioners and things seem to work fine; the only source of problems was the dot product (it was locking under some conditions), but all seems fine now.

@talbring
Member

> @talbring I think GitHub does not recognize the "approved" label as an approval (the bot put it back to un-reviewed).

Hm yeah, the label will be removed if there has been a new commit ... unfortunately there is no way to change that ...
