Conversation

@pcarruscag
Member

Proposed Changes

The first work item from #824 is kind of in place code-wise (lots of testing required).
I will use this PR to document the implementation, while continuing the work on the main PR, and to keep the discussion "encapsulated".

Related Work

#789, #824

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

Comment on lines -1028 to -1046
 * \brief A virtual member.
 * \return Total number of nodes in a simulation across all processors (including halos).
 */
inline virtual unsigned long GetGlobal_nPoint() const { return 0; }

/*!
 * \brief A virtual member.
 * \return Total number of nodes in a simulation across all processors (excluding halos).
 */
inline virtual unsigned long GetGlobal_nPointDomain() const { return 0; }

/*!
 * \brief A virtual member.
 * \param[in] val_global_npoint - Global number of points in the mesh (excluding halos).
 */
inline virtual void SetGlobal_nPointDomain(unsigned long val_global_npoint) {}

/*!
 * \brief A virtual member.
Member Author

This still contains a bit of CGeometry clean-up: many small methods did not need to be virtual (getting the number of prisms and so on; a lot of these are only used in the legacy output), and the same goes for a lot (all?) of the small turbomachinery set/get methods.

Comment on lines +164 to +168
CCompressedSparsePatternUL
finiteVolumeCSRFill0, /*!< \brief 0-fill FVM sparsity. */
finiteVolumeCSRFillN, /*!< \brief N-fill FVM sparsity (e.g. for ILUn preconditioner). */
finiteElementCSRFill0, /*!< \brief 0-fill FEM sparsity. */
finiteElementCSRFillN; /*!< \brief N-fill FEM sparsity (e.g. for ILUn preconditioner). */
Member Author

As I mentioned in #789, the sparsity patterns (row pointer and column index) that CSysMatrix requires are now stored in CGeometry. This allows re-use (bulk and turbulence, for example) and amortises the "edge map" structure (of comparable size) that I introduced to accelerate the update of CSysMatrix blocks.
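
For context, a minimal sketch of how such a CSR pattern is consumed (a fragment, assuming geometry, type and iPoint are in scope; outerPtr/innerIdx are the accessors used later in this PR):

/*--- Illustrative only: visit the non-zero blocks of row iPoint. ---*/
const auto& csr = geometry->GetSparsePattern(type,0);
const auto* rowPtr = csr.outerPtr();   // size nPoint+1, start of each row
const auto* colIdx = csr.innerIdx();   // size nnz, column of each block
for (auto k = rowPtr[iPoint]; k < rowPtr[iPoint+1]; ++k) {
  auto jPoint = colIdx[k];   // block k of the matrix couples iPoint with jPoint
}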

Comment on lines +174 to +178
CCompressedSparsePatternUL
edgeColoring, /*!< \brief Edge coloring structure for thread-based parallelization. */
elemColoring; /*!< \brief Element coloring structure for thread-based parallelization. */
unsigned long edgeColorGroupSize = 1; /*!< \brief Size of the edge groups within each color. */
unsigned long elemColorGroupSize = 1; /*!< \brief Size of the element groups within each color. */
Member Author

This is not yet used by this PR; these colorings will allow parallelizing edge/element loops using threads, more on that later (a rough sketch of the intended use follows).
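
As an illustration only (getOuterSize is an assumed accessor, outerPtr/innerIdx as in the CSR pattern, plain pragma instead of the SU2 macros):

/*--- Sketch: edges of one color touch disjoint points, so each color can be
      swept by all threads without write conflicts between iterations. ---*/
for (auto color = 0ul; color < edgeColoring.getOuterSize(); ++color) {
  #pragma omp parallel for schedule(static)
  for (auto k = edgeColoring.outerPtr()[color]; k < edgeColoring.outerPtr()[color+1]; ++k) {
    auto iEdge = edgeColoring.innerIdx()[k];
    /* ... update the two points of iEdge safely ... */
  }
}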

Comment on lines 92 to 77
ScalarType *block; /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *block_inverse; /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *block_weight; /*!< \brief Internal array to store a subblock of the matrix. */
ScalarType *prod_row_vector; /*!< \brief Internal array to store the product of a matrix-by-blocks "row" with a vector. */
ScalarType *aux_vector; /*!< \brief Auxiliary array to store intermediate results. */
ScalarType *sum_vector; /*!< \brief Auxiliary array to store intermediate results. */

enum : size_t { MAXNVAR = 8 };   /*!< \brief Maximum number of variables the matrix can handle. The static
                                      size is needed for fast, per-thread, static memory allocation. */
Member Author

First drawback of using threads: we need to be mindful of the thread-safety of the routines. The small working structures we had (block and so on) cannot be used by multiple threads simultaneously; the alternatives are:

  • Allocate a larger chunk of memory and distribute it among the threads. This is a bit ugly and unnatural in small parallel-for constructs, but reasonable for sections where each thread does a lot of work.
  • Local dynamic allocation, which would hurt the performance of light routines and prevent optimizations.
  • Local static allocation, which hurts generality: as the name MAXNVAR implies, if the matrix is asked to work with more variables than that, bad things will happen (a runtime error is thrown telling the user to compile with a larger number). This is the option used here; see the sketch below.
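
A minimal sketch of the third option (an illustrative routine, not the exact SU2 code):

#include <cassert>

template<class ScalarType>
void someLightRoutine(unsigned long nVar) {
  assert(nVar <= MAXNVAR);   /*--- the generality limit mentioned above ---*/
  /*--- Per-thread scratch space on the stack: nothing is shared between
        threads and there is no dynamic allocation in a hot routine. ---*/
  ScalarType weight[MAXNVAR*MAXNVAR], aux_vector[MAXNVAR];
  /* ... use weight/aux_vector for this thread's work ... */
}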

Comment on lines +515 to +546
/*!
 * \brief Update 4 blocks ii, ij, ji, jj (add to i*, subtract from j*).
 * \note The template parameter Sign can be used to create a "subtractive"
 * update, i.e. subtract from row i and add to row j instead.
 * \param[in] iEdge - Index of edge that connects iPoint and jPoint.
 * \param[in] iPoint - Row to which we add the blocks.
 * \param[in] jPoint - Row from which we subtract the blocks.
 * \param[in] block_i - Adds to ii, subtracts from ji.
 * \param[in] block_j - Adds to ij, subtracts from jj.
 */
template<class OtherType, int Sign = 1>
inline void UpdateBlocks(unsigned long iEdge, unsigned long iPoint, unsigned long jPoint,
                         OtherType **block_i, OtherType **block_j) {

  ScalarType *bii = &matrix[dia_ptr[iPoint]*nVar*nEqn];
  ScalarType *bjj = &matrix[dia_ptr[jPoint]*nVar*nEqn];
  ScalarType *bij = &matrix[edge_ptr(iEdge,0)*nVar*nEqn];
  ScalarType *bji = &matrix[edge_ptr(iEdge,1)*nVar*nEqn];

  unsigned long iVar, jVar, offset = 0;

  for (iVar = 0; iVar < nVar; iVar++) {
    for (jVar = 0; jVar < nEqn; jVar++) {
      bii[offset] += PassiveAssign<ScalarType,OtherType>(block_i[iVar][jVar]) * Sign;
      bij[offset] += PassiveAssign<ScalarType,OtherType>(block_j[iVar][jVar]) * Sign;
      bji[offset] -= PassiveAssign<ScalarType,OtherType>(block_i[iVar][jVar]) * Sign;
      bjj[offset] -= PassiveAssign<ScalarType,OtherType>(block_j[iVar][jVar]) * Sign;
      ++offset;
    }
  }
}

Member Author

This is the "fast" block update I mentioned, instead of looking for the indices of the blocks in the sparse structure we use the edge pointer to directly obtain them. This edge map is populated once (via the normal search process).
This is now being called by all FVM solvers (for FEM this would not pay off as many blocks are referenced by each element).
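
For comparison, a caller-side sketch (the commented-out AddBlock/SubtractBlock calls stand for the old search-based path; only the UpdateBlocks call is from this PR):

/*--- Old path: four independent updates, each searching the row for the column index. ---*/
// Jacobian.AddBlock(iPoint, iPoint, Jacobian_i);      Jacobian.AddBlock(iPoint, jPoint, Jacobian_j);
// Jacobian.SubtractBlock(jPoint, iPoint, Jacobian_i); Jacobian.SubtractBlock(jPoint, jPoint, Jacobian_j);
/*--- New path: one call, block addresses resolved directly through the edge map. ---*/
Jacobian.UpdateBlocks(iEdge, iPoint, jPoint, Jacobian_i, Jacobian_j);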

Comment on lines 121 to 126
const auto& csr = geometry->GetSparsePattern(type,0);

}
row_ptr = csr.outerPtr();
col_ind = csr.innerIdx();
dia_ptr = csr.diagPtr();
nnz = csr.getNumNonZeros();
Member Author

This is the pattern coming from the CGeometry associated with the matrix. CGeometry does lazy construction of those patterns, so there is no extra work and no complicated logic on the config settings to determine the type of solver, etc.
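
The lazy construction could look roughly like this (ConnectivityType, empty() and buildCSRPattern are my assumptions; fill-level dispatch omitted):

const CCompressedSparsePatternUL& CGeometry::GetSparsePattern(ConnectivityType type, unsigned long fillLvl) {
  /*--- Sketch: build the pattern on first request, re-use it afterwards. ---*/
  auto& pattern = (type == ConnectivityType::FiniteVolume)? finiteVolumeCSRFill0 : finiteElementCSRFill0;
  if (pattern.empty())                                 /*--- hypothetical emptiness check ---*/
    pattern = buildCSRPattern(*this, type, fillLvl);   /*--- hypothetical builder ---*/
  return pattern;
}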

Comment on lines 485 to 487
SU2_OMP_PAR_FOR_STAT(OMP_STAT_SIZE)
for (unsigned long index = 0; index < nnz*nVar*nEqn; index++)
matrix[index] = 0.0;
Member Author

A simple example of the use of one of those "macro encapsulated" pragmas.
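
For readers unfamiliar with these macros, a plausible expansion (an assumption, not checked against the actual header):

#define SU2_PRAGMA(X) _Pragma(#X)   /*--- hypothetical helper ---*/
#ifdef _OPENMP
#define SU2_OMP_PAR_FOR_STAT(CHUNK) SU2_PRAGMA(omp parallel for schedule(static,CHUNK))
#else
#define SU2_OMP_PAR_FOR_STAT(CHUNK)   /*--- expands to nothing in serial builds ---*/
#endif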

Comment on lines 631 to 638
SU2_OMP_PAR_FOR_DYN(omp_chunk_size)
for (auto row_i = 0ul; row_i < nPointDomain; row_i++) {
  auto prod_begin = row_i*nVar; // offset to beginning of block row_i
  for (auto iVar = 0ul; iVar < nVar; iVar++)
    prod[prod_begin+iVar] = 0.0;
  for (auto index = row_ptr[row_i]; index < row_ptr[row_i+1]; index++) {
    auto vec_begin = col_ind[index]*nVar; // offset to beginning of block col_ind[index]
    auto mat_begin = index*nVar*nVar;     // offset to beginning of matrix block[row_i][col_ind[index]]
Member Author

For non-transposed matrix multiplication this is all we need to make it parallel; it is however not ideal, as I will explain in a bit. Note that all the index variables are local: if they were declared outside the loop we could have data races.

Comment on lines 745 to 753
SU2_OMP_PARALLEL_ON(omp_num_parts)
{
  int thread = omp_get_thread_num();
  const auto begin = omp_partitions[thread];
  const auto end = omp_partitions[thread+1];

  ScalarType weight[MAXNVAR*MAXNVAR], aux_block[MAXNVAR*MAXNVAR];

  for (auto iPoint = begin+1; iPoint < end; iPoint++) {
Member Author

For preconditioners that are serial by nature, we have a parallel section where each thread works on its own large chunk of the matrix.
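
The omp_partitions array itself is presumably just an even split of the rows, something like this sketch:

/*--- Sketch: split [0, nPointDomain) into omp_num_parts contiguous chunks;
      thread t then owns rows [omp_partitions[t], omp_partitions[t+1]). ---*/
omp_partitions[0] = 0;
for (int part = 1; part <= omp_num_parts; ++part)
  omp_partitions[part] = (nPointDomain * part) / omp_num_parts;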

Comment on lines -525 to +528
/*--- Points in edge ---*/
iPoint = geometry->edge[iEdge]->GetNode(0);
jPoint = geometry->edge[iEdge]->GetNode(1);
numerics->SetNormal(geometry->edge[iEdge]->GetNormal());
Member Author

A few indentation issues were fixed here and there too.

@pcarruscag
Member Author

I have covered all operations used in non-adjoint runs. The non-ideal part of the implementation I mentioned above is that the parallelization is "local", i.e. we get to the operation we want to make parallel and launch the threads there; for simple vector-vector operations the overhead may be significant.
Ideally we would have a parallel construct at a higher level, say CSysSolve::Solve, so that the threads are already in flight when we get to those small operations.
In principle it is not too hard to do that, but it needs to be done carefully, especially when the execution gets to an MPI part of the code (which thread(s) communicate, etc.).
I will try to benchmark this to put numbers on the performance / simplicity trade-off.

@pcarruscag
Member Author

Ok the "simple" version of "going parallel" whenever we get to a linear algebra operation did not make the cut.
On an older architecture there was a 10% slowdown of the linear solvers at ~10k nodes per core and about the same on a newer architecture but only at ~1k node per core.
Since hybrid parallel is supposed to be good for strong scaling, this was not good enough... With the new strategy it is ok (see "performance" below), hence this is ready for review.

Overall Strategy

The strategy now is to start a parallel section in CSysSolve::Solve that covers building the preconditioner and solving the linear system.
Linear algebra routines called within this section have worksharing constructs instead of parallel ones, i.e. the work is distributed among however many threads arrive at that routine. This also makes the routines safe to call in serial.
The only "dangerous" things to do in parallel are managing memory for a shared object (multiple threads call new, but there is only one shared pointer on which to call delete) and writing to the same memory locations concurrently.
I tried to make the first issue debuggable by asserting that the initialization routines of CSysMatrix and CSysVector are only called by the master thread.
For the second issue I made the associated classes as const-correct as possible; that should at least make someone think twice before changing a member variable of those classes. The risk is still there for input variables, as an algorithm-development aspect... For example, MatrixVectorProductTransposed cannot be made thread-parallel as simply/naively as its normal counterpart.
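
Schematically, with plain pragmas and illustrative names, the pattern is (a self-contained sketch, not SU2 code):

#include <vector>

/*--- The routine only has a worksharing construct: it distributes the loop among
      whatever team of threads calls it, and runs serially outside a parallel region. ---*/
void scaleVector(std::vector<double>& v, double factor) {
  #pragma omp for schedule(static) nowait
  for (size_t i = 0; i < v.size(); ++i) v[i] *= factor;
}

void solveLike(std::vector<double>& v) {
  #pragma omp parallel   /*--- threads started once, stay in flight for all calls below ---*/
  {
    scaleVector(v, 0.5);   /*--- nowait is safe here: same bounds and static schedule, ---*/
    scaleVector(v, 2.0);   /*--- so each thread re-visits exactly its own indices. ---*/
  }
}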

Communication Model

The MPI + threads communication model is very simple: currently only the master thread calls MPI routines (including Error), and this requires thread barriers before and after the communication to make sure the correct values are passed and seen by all threads.
We can test other alternatives in the future, but at the moment this does not seem to be a significant bottleneck.
Worksharing constructs have implicit barriers at completion; for CSysVector routines I used nowait modifiers, as it is safe to call those routines in sequence since the loop sizes and static work-scheduling specifications are identical.
However, routines that access a CSysVector in a different way should have an explicit barrier before using the vector (or risk undefined behaviour). You will see these barriers on entry to the matrix-vector product and every ComputeXXXPreconditioner (if you don't, let me know xD). I think those routines are large enough to amortise the cost of this.
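
Schematically, inside the parallel section (SU2_OMP_BARRIER/SU2_OMP_MASTER are assumed macro names in the style of the ones above; myValue/value are illustrative variables):

SU2_OMP_BARRIER   /*--- every thread has finished producing its contribution ---*/
SU2_OMP_MASTER
{
  SU2_MPI::Allreduce(&myValue, &value, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
SU2_OMP_BARRIER   /*--- master has no implicit barrier; make "value" visible to all threads ---*/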

Performance

Disclaimer:

  • We are talking about linear solvers only, you will not see a global improvement yet.
  • The large global improvements from "hybridization" will come from the multigrid behaving better on less-decomposed domains, and from the ability to independently tune the number of cores used in the linear preconditioners. For now the objective is "just" not to lose performance while gaining flexibility.
  • The performance of MPI+threads with 1 thread per rank will be worse than pure MPI (no free lunches).

With this small case, using 8 cores of a machine with two 2650v4 CPUs (Intel MPI 2018 + GCC 8.2), the hybrid approach (2 ranks of 4 threads) is about 5% faster than the MPI-only one (8 ranks); I expect larger cases to have identical performance.

How To

  • Compile: Add -fopenmp to the compiler and linker arguments.
  • Run: Set the number of threads with the environment variable OMP_NUM_THREADS (eventually I will make that a command line parameter). For best performance set OMP_WAIT_POLICY=ACTIVE, and beware of thread-binding settings: use mpirun --bind-to socket or mpirun --bind-to numa, never core (see the example below).
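
For example (illustrative command lines only; adjust rank/thread counts and paths to your machine):

export OMP_NUM_THREADS=4
export OMP_WAIT_POLICY=ACTIVE
mpirun -n 2 --bind-to socket ./SU2_CFD config.cfg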

Comment on lines +224 to +225
ScalarType CSysVector<ScalarType>::dot(const CSysVector<ScalarType> & u) const {
#if !defined(CODI_FORWARD_TYPE) && !defined(CODI_REVERSE_TYPE)
Member Author

Readability-wise, the dot product operation (now a member of CSysVector) is as bad as it can get, as we need to perform a reduction over both threads and ranks.
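
A simplified, self-contained sketch of the two-level reduction (here the parallel region is created locally; in the PR the threads are already in flight, which is part of what makes the real code uglier):

#include <mpi.h>
#include <cstddef>

double dotProduct(const double* a, const double* b, std::size_t n) {
  double localSum = 0.0;
  /*--- Thread-level reduction over this rank's entries. ---*/
  #pragma omp parallel for schedule(static) reduction(+:localSum)
  for (std::size_t i = 0; i < n; ++i) localSum += a[i]*b[i];
  /*--- Rank-level reduction (in the PR only the master thread communicates,
        with barriers around the call, see "Communication Model" above). ---*/
  double globalSum = 0.0;
  MPI_Allreduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return globalSum;
}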

@economon
Member

Following along here... will you eventually move the parallel section in CSysSolve up to the full application level (in main() maybe) when you move to the next steps? In the earlier work we found that, as you have seen in the linear solver routines, spawning parallel sections kernel-wise carried a large overhead. The best performance was given by spawning the threads right at the start and carrying them through the entire program, just like the MPI ranks.

My only other comments, which it sounds like you are addressing, are to make the threads as transparent as possible to developers (they shouldn't need to touch them unless they want to, like the MPI), and to make the compilation painless (easy to enable/disable). Have you connected it to meson somehow yet?

@economon
Member

I should also mention, though, that moving the threading to a single high-level parallel section is also very problematic for readability/development. Folks will have to be aware that the threads are active, and it can be very error-prone. This was one of the major detractors of bringing the OpenMP framework, as we had it in the C&F paper, into the develop branch, even though the performance was quite good (as was the interoperability of threading and AD at the time). Any clever suggestions/techniques for hiding the threading as much as possible are most welcome.

@pcarruscag
Member Author

pcarruscag commented Dec 17, 2019

No meson option yet; it would be a very small one though, the whole system works by detecting -fopenmp.

Your second comment is the main argument against moving the parallel section further up. Allocation routines have the highest risk of making a mess, but even seemingly innocuous things, like the small auxiliary arrays we allocate e.g. in CSolver and then use in derived classes, are a problem. I am almost done making the FEA solver completely hybrid parallel and I had to refactor most uses of those arrays. This is also why I took a more functional approach to the new limiter and gradient routines. The way we use CConfig is also not thread-safe; we would need to make all the "SetSomething" methods atomic, which would be monumental.

Initially I would have a few parallel sections (it is not too difficult to move them up later if we think that is the way to go). I want to use the FEA solver to get an idea of the relative performance; after seeing the effect of OMP_WAIT_POLICY I am optimistic.

@talbring left a comment (Member)

Thanks @pcarruscag! I like how this is implemented. The OMP structure does not really add a lot of overhead in terms of readability. I am curious to see what else we can do with this.

Comment on lines +296 to +307
inline void CMediMPIWrapper::Init_thread(int *argc, char ***argv, int required, int* provided) {
AMPI_Init_thread(argc,argv,required,provided);
MediTool::init();
AMPI_Comm_rank(convertComm(currentComm), &Rank);
AMPI_Comm_size(convertComm(currentComm), &Size);

MinRankError = Size;
MPI_Win_create(&MinRankError, sizeof(int), sizeof(int), MPI_INFO_NULL,
currentComm, &winMinRankError);
winMinRankErrorInUse = true;
}

Member

That seems to be alright as long as either Init_thread or Init is called, and not both.

* the data is managed by CGeometry to allow re-use. ---*/

const auto& csr = geometry->GetSparsePattern(type,0);
Member

I don't know what your plans are, but in the future we should try to pass the sparsity pattern to the constructor, in order to remove the dependency of the matrix class on the geometry class.

Member Author

That would be nice, but we still need the geometry for MPI.

Member

Actually I think we can move the point-to-point communication to its own structure. What do you think @economon ?

@pcarruscag
Member Author

@talbring I think GitHub does not recognize the "approved" label as an approval (the bot put it back to un-reviewed).

@pcarruscag
Member Author

I think this is stable enough to be merged. While testing #834 and #843 I think I went through most combinations of solvers/preconditioners and things seem to work fine; the only source of problems was the dot product (it was locking under some conditions), but all seems fine now.

@talbring
Member

> @talbring I think GitHub does not recognize the "approved" label as an approval (the bot put it back to un-reviewed).

Hm yeah, the label will be removed if there has been a new commit ... unfortunately there is no way to change that ...
