Hybrid parallel (OpenMP) implementation for Linear Algebra classes #830
Conversation
…rnings in non-threaded compilation
```cpp
/*!
 * \brief A virtual member.
 * \return Total number of nodes in a simulation across all processors (including halos).
 */
inline virtual unsigned long GetGlobal_nPoint() const { return 0; }

/*!
 * \brief A virtual member.
 * \return Total number of nodes in a simulation across all processors (excluding halos).
 */
inline virtual unsigned long GetGlobal_nPointDomain() const { return 0; }

/*!
 * \brief A virtual member.
 * \param[in] val_global_npoint - Global number of points in the mesh (excluding halos).
 */
inline virtual void SetGlobal_nPointDomain(unsigned long val_global_npoint) {}

/*!
 * \brief A virtual member.
```
This still contains a bit of CGeometry clean-up: basically a lot of small methods that did not need to be virtual (get number of prisms and so on; many of these are only used in the legacy output), and a lot (all?) of the small turbomachinery set/get methods.
```cpp
CCompressedSparsePatternUL
finiteVolumeCSRFill0,  /*!< \brief 0-fill FVM sparsity. */
finiteVolumeCSRFillN,  /*!< \brief N-fill FVM sparsity (e.g. for ILUn preconditioner). */
finiteElementCSRFill0, /*!< \brief 0-fill FEM sparsity. */
finiteElementCSRFillN; /*!< \brief N-fill FEM sparsity (e.g. for ILUn preconditioner). */
```
As I mentioned in #789, the sparsity patterns (row ptr and col idx) that CSysMatrix requires are now stored in CGeometry. This allows re-use (bulk and turbulence, for example) and amortises the "edge map" structure (of comparable size) I introduced to accelerate the update of CSysMatrix blocks.
```cpp
CCompressedSparsePatternUL
edgeColoring, /*!< \brief Edge coloring structure for thread-based parallelization. */
elemColoring; /*!< \brief Element coloring structure for thread-based parallelization. */
unsigned long edgeColorGroupSize = 1; /*!< \brief Size of the edge groups within each color. */
unsigned long elemColorGroupSize = 1; /*!< \brief Size of the element groups within each color. */
```
This is not yet used by this PR; these colorings will allow parallelizing edge/element loops using threads. More on that later.
```cpp
ScalarType *block;           /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *block_inverse;   /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *block_weight;    /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *prod_row_vector; /*!< \brief Internal array to store the product of a matrix-by-blocks "row" with a vector. */
ScalarType *aux_vector;      /*!< \brief Auxiliary array to store intermediate results. */
ScalarType *sum_vector;      /*!< \brief Auxiliary array to store intermediate results. */

enum : size_t { MAXNVAR = 8 }; /*!< \brief Maximum number of variables the matrix can handle. The static
                                           size is needed for fast, per-thread, static memory allocation. */
```
First drawback of using threads: we need to be mindful of the thread-safety of the routines. The small working structures we had (block and so on) cannot be used by multiple threads simultaneously. The alternatives are:
- Allocate a larger chunk of memory and distribute it by the threads; this is a bit ugly and unnatural in small parallel for constructs, but reasonable for sections where each thread does a lot of work.
- Local dynamic allocation, which would hurt the performance of light routines and prevent optimizations.
- Local static allocation, which hurts generality: for example, as the name MAXNVAR implies, if the matrix is asked to work on more than that, bad things will happen (a runtime error is thrown saying that the code should be compiled with a larger number).
```cpp
/*!
 * \brief Update 4 blocks ii, ij, ji, jj (add to i* sub from j*).
 * \note The template parameter Sign can be used to create a "subtractive"
 * update, i.e. subtract from row i and add to row j instead.
 * \param[in] iEdge - Index of edge that connects iPoint and jPoint.
 * \param[in] iPoint - Row to which we add the blocks.
 * \param[in] jPoint - Row from which we subtract the blocks.
 * \param[in] block_i - Adds to ii, subs from ji.
 * \param[in] block_j - Adds to ij, subs from jj.
 */
template<class OtherType, int Sign = 1>
inline void UpdateBlocks(unsigned long iEdge, unsigned long iPoint, unsigned long jPoint,
                         OtherType **block_i, OtherType **block_j) {

  ScalarType *bii = &matrix[dia_ptr[iPoint]*nVar*nEqn];
  ScalarType *bjj = &matrix[dia_ptr[jPoint]*nVar*nEqn];
  ScalarType *bij = &matrix[edge_ptr(iEdge,0)*nVar*nEqn];
  ScalarType *bji = &matrix[edge_ptr(iEdge,1)*nVar*nEqn];

  unsigned long iVar, jVar, offset = 0;

  for (iVar = 0; iVar < nVar; iVar++) {
    for (jVar = 0; jVar < nEqn; jVar++) {
      bii[offset] += PassiveAssign<ScalarType,OtherType>(block_i[iVar][jVar]) * Sign;
      bij[offset] += PassiveAssign<ScalarType,OtherType>(block_j[iVar][jVar]) * Sign;
      bji[offset] -= PassiveAssign<ScalarType,OtherType>(block_i[iVar][jVar]) * Sign;
      bjj[offset] -= PassiveAssign<ScalarType,OtherType>(block_j[iVar][jVar]) * Sign;
      ++offset;
    }
  }
}
```
This is the "fast" block update I mentioned: instead of looking for the indices of the blocks in the sparse structure, we use the edge pointer to obtain them directly. This edge map is populated once (via the normal search process).
It is now being called by all FVM solvers (for FEM this would not pay off, as many blocks are referenced by each element).
```cpp
const auto& csr = geometry->GetSparsePattern(type,0);

row_ptr = csr.outerPtr();
col_ind = csr.innerIdx();
dia_ptr = csr.diagPtr();
nnz = csr.getNumNonZeros();
```
This is the pattern coming from the CGeometry associated with the matrix. CGeometry does lazy construction of those patterns, so there is no extra work or complicated logic on the config settings to determine the type of solver, etc.
```cpp
SU2_OMP_PAR_FOR_STAT(OMP_STAT_SIZE)
for (unsigned long index = 0; index < nnz*nVar*nEqn; index++)
  matrix[index] = 0.0;
```
Simple example of the use of one of those "macro-encapsulated" pragmas.
```cpp
SU2_OMP_PAR_FOR_DYN(omp_chunk_size)
for (auto row_i = 0ul; row_i < nPointDomain; row_i++) {
  auto prod_begin = row_i*nVar; // offset to beginning of block row_i
  for (auto iVar = 0ul; iVar < nVar; iVar++)
    prod[prod_begin+iVar] = 0.0;
  for (auto index = row_ptr[row_i]; index < row_ptr[row_i+1]; index++) {
    auto vec_begin = col_ind[index]*nVar; // offset to beginning of block col_ind[index]
    auto mat_begin = index*nVar*nVar; // offset to beginning of matrix block[row_i][col_ind[index]]
```
For non-transposed matrix multiplication this is all we need to make it parallel. It is, however, not ideal, as I will explain in a bit. Note that all the index variables are local; if they were declared outside the loop we could have problems (races between threads).
```cpp
SU2_OMP_PARALLEL_ON(omp_num_parts)
{
  int thread = omp_get_thread_num();
  const auto begin = omp_partitions[thread];
  const auto end = omp_partitions[thread+1];

  ScalarType weight[MAXNVAR*MAXNVAR], aux_block[MAXNVAR*MAXNVAR];

  for (auto iPoint = begin+1; iPoint < end; iPoint++) {
```
For preconditioners that are serial by nature, we have a parallel section where each thread works on its own large chunk of the matrix.
```cpp
/*--- Points in edge ---*/
iPoint = geometry->edge[iEdge]->GetNode(0);
jPoint = geometry->edge[iEdge]->GetNode(1);
numerics->SetNormal(geometry->edge[iEdge]->GetNormal());
```
A few indentation issues fixed here and there too.
I have covered all operations used in non-adjoint mode. The non-ideal part of the implementation I mentioned above is that the parallelization is "local", i.e. we get to the operation we want to make parallel and launch the threads there; for simple vector-vector operations the overhead may be significant.
…o add sparse matrices
Ok, the "simple" version of "going parallel" whenever we get to a linear algebra operation did not make the cut.

Overall Strategy
The strategy now is to start a parallel section in CSysSolve::Solve that covers building the preconditioner and solving the linear system.

Communication Model
The MPI + Threads communication model is very simple: currently only the master thread calls MPI routines (including …).

Performance
Disclaimer:
With this small case, using 8 cores of a machine with two 2650v4 CPUs, Intel MPI 2018 + GCC 8.2, the hybrid approach (2 ranks of 4 threads) is about 5% faster than the MPI-only one (8 ranks); I expect larger cases to have identical performance.

How To
```cpp
ScalarType CSysVector<ScalarType>::dot(const CSysVector<ScalarType> & u) const {
#if !defined(CODI_FORWARD_TYPE) && !defined(CODI_REVERSE_TYPE)
```
Readability-wise, the dot product operation (now a member of CSysVector) is as bad as it can get, since we need to perform a reduction over threads and over ranks.
Following along here… will you eventually move the parallel section in CSysSolve up to the full application level (in main() maybe) when you move to the next steps? In the earlier work, we found that, as you have seen in the linear solver routines, spawning parallel sections kernel-wise carried a large overhead. We found that the best performance was given by spawning right at the start and carrying the threads through the entire program, just like the MPI ranks. My only other comments, which it sounds like you are addressing, are to make the threads as transparent as possible to developers (they shouldn't need to touch them unless they want to, like the MPI), and to make the compilation painless (disable/enable). Have you connected it to meson somehow yet?
I should also mention, though, that moving the threading to a single high-level parallel section is also very problematic for readability/development. Folks will have to be aware that the threads are active, and it can be very error-prone. This was one of the major detractors of implementing the OpenMP framework as we had it in the C&F paper into the develop branch, even though the performance was quite good (and also the interoperability of threading and AD at the time). Any clever suggestions/techniques for hiding the threading as much as possible are most welcome.
No meson option yet; it is a very small one though, the whole system works by detecting
talbring left a comment:
Thanks @pcarruscag ! I like how this is implemented. The OMP structure does not really lead to a lot of overhead in terms of readability. I am curious to see what else we can do with this.
```cpp
inline void CMediMPIWrapper::Init_thread(int *argc, char ***argv, int required, int* provided) {
  AMPI_Init_thread(argc,argv,required,provided);
  MediTool::init();
  AMPI_Comm_rank(convertComm(currentComm), &Rank);
  AMPI_Comm_size(convertComm(currentComm), &Size);

  MinRankError = Size;
  MPI_Win_create(&MinRankError, sizeof(int), sizeof(int), MPI_INFO_NULL,
                 currentComm, &winMinRankError);
  winMinRankErrorInUse = true;
}
```
That seems to be alright as long as either Init_thread or Init is called, and not both.
```cpp
 *        the data is managed by CGeometry to allow re-use. ---*/

const auto& csr = geometry->GetSparsePattern(type,0);
```
I don't know what your plans are, but in the future we should try to pass the sparsity pattern to the constructor in order to remove the dependency of the matrix class on the geometry class.
That would be nice, but we still need the geometry for MPI.
Actually I think we can move the point-to-point communication to its own structure. What do you think @economon ?
@talbring I think GitHub does not recognize the "approved" label as an approval (the bot put it back to un-reviewed).
Hm yeah, the label will be removed if there has been a new commit… unfortunately there is no way to change that…
Proposed Changes
The first work item from #824 is kind of in place code-wise (lots of testing required).
I will use this PR to document the implementation, while continuing the work on the main PR, and to keep the discussion "encapsulated".
Related Work
#789, #824
PR Checklist