
Conversation

@pcarruscag
Member

Proposed Changes

The hybrid parallel implementation uses grid coloring (edge and element) to avoid race conditions when building the residual vector and Jacobian matrix.
Depending on grid size, type, and ordering, it may be difficult to find a good coloring, i.e. one that provides locality and enough parallelism (all threads getting roughly the same number of chunks of work per color).

This PR extends the "reducer strategy" (introduced and described in #861 for coarse grids, which before #894 I thought would be the only problematic scenario) to fine grids, as a fallback that is automatically selected when the coloring efficiency (highlighted below) drops below a threshold value. The reducer strategy can also be forced by setting EDGE_COLORING_GROUP_SIZE to 0 (for some cases it seems to give better performance, contrary to the initial assessment from #789).
For element loops (CFEASolver and CMeshSolver, modified in #843) an automatic fallback was also introduced, but it uses "locks" (explained below), since a reducer strategy would use too much memory there, and element loops are more compute intensive (and therefore less affected by the overhead of using these locks).
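
For reference, the relevant config settings (the values and descriptions below are taken from the config_template excerpt further down in this conversation):

% default, works well with few threads per MPI rank
EDGE_COLORING_GROUP_SIZE= 512
% maximum parallelism, usually slower
% EDGE_COLORING_GROUP_SIZE= 1
% disable coloring and force the reducer strategy
% EDGE_COLORING_GROUP_SIZE= 0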

Related Work

Part of #789, namely:
  • Improves scaling of the work from #861 #843 (should help with #894)
  • Fixes a bug from #830

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cfg), if necessary.

@pcarruscag pcarruscag changed the base branch from master to develop March 11, 2020 11:08
Member Author

@pcarruscag pcarruscag left a comment

General comments:

Comment on lines -87 to -91
/*!
* \brief Destructor of the CAkimaInterpolation class.
*/
~CAkimaInterpolation(){}

Member Author

This was removed for old compiler compatibility.

Comment on lines +641 to +647
/*--- Ideally compute time is proportional to total work over number of threads. ---*/
su2double ideal = coloring.getNumNonZeros() / su2double(numThreads);

/*--- In practice the total work is quantized first by colors and then by chunks. ---*/
Index_t real = 0;
for(Index_t color = 0; color < coloring.getOuterSize(); ++color)
real += chunkSize * roundUpDiv(roundUpDiv(coloring.getNumNonZeros(color), chunkSize), numThreads);
Member Author

The computation of coloring efficiency is shown here; it is just a simple heuristic.
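
A minimal standalone sketch of the heuristic (hypothetical plain-integer inputs instead of the actual coloring object; the efficiency is presumably the ratio ideal/real, compared against the 0.875 threshold mentioned in the config template below):

#include <vector>

double coloringEfficiency(const std::vector<unsigned long>& edgesPerColor,
                          unsigned long numThreads, unsigned long chunkSize) {
  auto roundUpDiv = [](unsigned long n, unsigned long d) { return (n + d - 1) / d; };

  /*--- Ideally compute time is proportional to total work over number of threads. ---*/
  unsigned long totalWork = 0;
  for (auto n : edgesPerColor) totalWork += n;
  double ideal = double(totalWork) / numThreads;

  /*--- In practice the work is quantized first by colors and then by chunks. ---*/
  unsigned long real = 0;
  for (auto n : edgesPerColor)
    real += chunkSize * roundUpDiv(roundUpDiv(n, chunkSize), numThreads);

  return ideal / double(real);
}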

Comment on lines -3680 to +3681
}
}
}
}
Member Author

A few indentation fixes here and there.

Comment on lines 663 to 667
if (!smooth_ready) {
SU2_OMP_BARRIER
SU2_OMP_MASTER
{
auto nVar = b.GetNVar();
Member Author

Bug fix related to #830; sometimes the code would hang on the first iteration because of this:

if (!smooth_ready) { // if working variables have not been allocated we need to allocate...
  SU2_OMP_BARRIER // <- This is the fix.
  SU2_OMP_MASTER // ...only the master thread can do allocation as those variables are shared
  {
    // allocate variables
    ...
    smooth_ready = true; // set this flag so we don't allocate on the next call.
  }
  SU2_OMP_BARRIER // now we need a barrier so all threads "see" the new state of the variables.
  // but because we are "smart" the barrier is inside the "if" so that we don't hit it every time.
  // But if some threads arrive (very) late to the party they might see smooth_ready == true already
  // and they miss the barrier which results in a deadlock (all threads must arrive before execution
  // can continue). One solution is to make sure all threads are inside the "if" (via another barrier)
  // before changing the condition deciding whether that statement is executed.
}

Contributor

Oh! I think I did see this happen a couple of times when I briefly tested the multi-threading. I built SU2 with -Dwith-omp=true

It happened when I didn't specify a number of threads with the -t option. But it ran okay if I specified even -t 1. Does this fix that? Or do you have to specify a thread count if you build with OMP?

Member Author

When you don't use the -t (--threads) option SU2 picks up the number of threads from the environment, which is often the maximum. If you use MPI at the same time the system will probably be oversubscribed; I noticed that OpenMPI 3.1.4 on Ubuntu 16 deadlocks when this happens (oversubscribing with just threads does not seem to deadlock).
This fix was for a more subtle problem.

Another aspect of using MPI+threads is that you need to pay attention to the binding mode: some MPI implementations will "--bind-to core" when the number of ranks is small, which then locks all the threads spawned by each rank to that core. Always use "--bind-to socket/numa" as appropriate.
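
For example (illustrative command only, the rank/thread counts and config name are placeholders):

# 2 ranks bound to sockets, each free to run its 8 OpenMP threads on any core of its socket
mpirun -n 2 --bind-to socket ./SU2_CFD -t 8 config.cfg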

unsigned long IterLimit = min(RestartIter, MaxIter-IterLinSol);
IterLinSol += FGMRES_LinSolver(*LinSysRes_ptr, *LinSysSol_ptr, mat_vec, *precond, SolverTol, IterLimit, residual, ScreenOutput, config);
if ( residual < SolverTol*norm0 ) break;
if ( residual <= SolverTol*norm0 ) break;
Member Author

Small fix: restarted FGMRES was going into an infinite loop when the RHS was zero (happens in mesh deformation cases), because with norm0 equal to zero the strict inequality could never be satisfied.

Comment on lines +2521 to +2522
void CEulerSolver::CommonPreprocessing(CGeometry *geometry, CSolver **solver_container, CConfig *config, unsigned short iMesh,
unsigned short iRKStep, unsigned short RunTime_EqSystem, bool Output) {
Member Author

I've done some cleanup by factoring out the preprocessing steps common to both the Euler and NS solvers.

Comment on lines +2615 to +2619
if(!ReducerStrategy && !Output) {
LinSysRes.SetValZero();
if (implicit && !disc_adjoint) Jacobian.SetValZero();
else {SU2_OMP_BARRIER} // because of "nowait" in LinSysRes
}
Member Author

The Jacobian.SetValZero() call was previously not guarded by !Output, which was a bit wasteful.

Comment on lines -3635 to +3667
void CEulerSolver::SetUndivided_Laplacian(CGeometry *geometry, CConfig *config) {
void CEulerSolver::SetUndivided_Laplacian_And_Centered_Dissipation_Sensor(CGeometry *geometry, CConfig *config) {
Member Author

These routines were transformed into point loops (no need for grid coloring) and fused, since they are always called together. (The code is actually much simpler, especially w.r.t. the boundary_i/boundary_j conditionals.)
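
A minimal sketch of the point-loop pattern (hypothetical accessor names, not the actual SU2 code): each thread writes only to its own point and merely reads from neighbours, so no coloring or locking is required.

#pragma omp parallel for schedule(static)
for (unsigned long iPoint = 0; iPoint < nPointDomain; ++iPoint) {
  su2double lapl = 0.0;
  for (auto jPoint : neighborsOf(iPoint))    /* read-only access to neighbours */
    lapl += solution(jPoint) - solution(iPoint);
  undividedLaplacian(iPoint) = lapl;         /* write only to the owned point */
}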

Member

👍

Comment on lines -832 to +865
SU2_OMP_FOR_DYN(roundUpDiv(OMP_MIN_SIZE, ColorGroupSize)*ColorGroupSize)
SU2_OMP_FOR_DYN(nextMultiple(OMP_MIN_SIZE, color.groupSize))
Member Author

In a number of places I replaced this operation with nextMultiple to make it more expressive.
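
Presumably nextMultiple(n, m) just rounds n up to the nearest multiple of m, i.e. the same value the old roundUpDiv(n, m)*m expression produced (sketch only):

template <class T>
inline T nextMultiple(T n, T m) { return ((n + m - 1) / m) * m; }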

Comment on lines +1336 to +1355
% --------------------- HYBRID PARALLEL (MPI+OpenMP) OPTIONS ---------------------%
%
% An advanced performance parameter for FVM solvers, a large-ish value should be best
% when relatively few threads per MPI rank are in use (~4). However, maximum parallelism
% is obtained with EDGE_COLORING_GROUP_SIZE=1, consider using this value only if SU2
% warns about low coloring efficiency during preprocessing (performance is usually worse).
% Setting the option to 0 disables coloring and a different strategy is used instead,
% that strategy is automatically used when the coloring efficiency is less than 0.875.
% The optimum value/strategy is case-dependent.
EDGE_COLORING_GROUP_SIZE= 512
%
% Independent "threads per MPI rank" setting for LU-SGS and ILU preconditioners.
% For problems where time is spent mostly in the solution of linear systems (e.g. elasticity,
% very high CFL central schemes), AND, if the memory bandwidth of the machine is saturated
% (4 or more cores per memory channel) better performance (via a reduction in linear iterations)
% may be possible by using a smaller value than that defined by the system or in the call to
% SU2_CFD (via the -t/--threads option).
% The default (0) means "same number of threads as for all else".
LINEAR_SOLVER_PREC_THREADS= 0
%
Member Author

New options introduced in #853 and #861 were added to config_template.

Contributor

👍

@pcarruscag pcarruscag changed the title from "Hybrid parallel fallback strategies" to "Hybrid parallel coloring fallback strategies for better strong scaling and user friendliness" Mar 11, 2020
@pcarruscag pcarruscag changed the title from "Hybrid parallel coloring fallback strategies for better strong scaling and user friendliness" to "Hybrid parallel coloring fallback strategies (better strong scaling and user friendliness)" Mar 11, 2020
Comment on lines +391 to +397
/*--- Use the config option as an upper bound on elasticity modulus.
* For RANS meshes the range of element volume or wall distance is
* very large and leads to an ill-conditioned stiffness matrix.
* Absolute values of elasticity modulus are not important for
* mesh deformation, since linear elasticity is used and all
* boundary conditions are essential (Dirichlet). ---*/
const su2double maxE = config->GetDeform_ElasticityMod();
Member Author

@rsanfer I introduced a small change to the mesh elasticity: the config option above is used as an upper bound for the elasticity modulus when using wall-distance or volume based mesh stiffness (for the reason in the comment).
When using constant stiffness the option is ignored and E is 1 everywhere; this change made no difference to the existing test cases.
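
If I read the change correctly, this amounts to clipping the distance/volume based stiffness by the config value (sketch only, with a hypothetical helper name):

su2double E = min(maxE, stiffnessFromDistanceOrVolume(iElem));  // maxE = config->GetDeform_ElasticityMod()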

Member Author

@pcarruscag pcarruscag left a comment

On the subject of the lock fallback strategy for element loops (which I still have to stress test a bit):
A lock is a kind of stop light that can be used to protect against concurrent access to a shared resource (e.g. a memory location, stream object, etc.).
The usage pattern is to set a lock (omp_set_lock) before using the resource and unset it (omp_unset_lock) after. If the lock was already set when a thread calls omp_set_lock, that thread waits there until the thread that set it first unsets it. The OpenMP runtime makes the set/unset process itself thread safe (as it involves more than setting some flag to true or false).
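
A minimal self-contained example of that usage pattern (plain OpenMP, not SU2 code):

#include <omp.h>

int main() {
  omp_lock_t lock;
  omp_init_lock(&lock);       // create the lock before the parallel region

  int shared = 0;
  #pragma omp parallel
  {
    omp_set_lock(&lock);      // wait here if another thread currently holds the lock
    ++shared;                 // safely update the shared resource
    omp_unset_lock(&lock);    // release it so other threads can take their turn
  }

  omp_destroy_lock(&lock);    // clean up after the parallel region
  return 0;
}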

Comment on lines 913 to +915
for (iNode = 0; iNode < nNodes; iNode++) {

if (LockStrategy) omp_set_lock(&UpdateLocks[indexNode[iNode]]);
Member Author

@pcarruscag pcarruscag Mar 11, 2020

In the loops over the points associated with an element, which follow the computation of the element stiffness matrices (for example), we protect against concurrent accesses by setting (acquiring) a lock for the point before writing to shared memory locations. Once done we unset the lock.
One downside of this strategy is that, contrary to coloring, the operations no longer have deterministic behaviour: which thread adds its contribution first depends on order of arrival, and so machine-accuracy differences between otherwise identical runs should be expected.
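
A small illustration of why order of arrival matters at machine accuracy (floating point addition is not associative; the values are chosen only to make the effect visible):

#include <cstdio>

int main() {
  printf("%g\n", (1.0e16 + 0.5) - 1.0e16);  // prints 0
  printf("%g\n", (1.0e16 - 1.0e16) + 0.5);  // prints 0.5
  return 0;
}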

Comment on lines +83 to +94
/*!
* \brief Dummy lock type and associated functions.
*/
struct omp_lock_t {};
struct DummyVectorOfLocks {
omp_lock_t l;
inline omp_lock_t& operator[](int) {return l;}
};
inline void omp_init_lock(omp_lock_t*){}
inline void omp_set_lock(omp_lock_t*){}
inline void omp_unset_lock(omp_lock_t*){}
inline void omp_destroy_lock(omp_lock_t*){}
Member Author

As usual we define "do-nothing" types and functions when compiling without hybrid parallel support, so the same code compiles without having to throw #ifdefs everywhere.

Comment on lines 96 to +104
const bool muscl = config->GetMUSCL_Turb();
const bool limiter = (config->GetKind_SlopeLimit_Turb() != NO_LIMITER);

/*--- Only reconstruct flow variables if MUSCL is on for flow (requires upwind) and turbulence. ---*/
const bool musclFlow = config->GetMUSCL_Flow() && muscl &&
(config->GetKind_ConvNumScheme_Flow() == SPACE_UPWIND);
/*--- Only consider flow limiters for cell-based limiters, edge-based would need to be recomputed. ---*/
const bool limiterFlow = (config->GetKind_SlopeLimit_Flow() != NO_LIMITER) &&
(config->GetKind_SlopeLimit_Flow() != VAN_ALBADA_EDGE);
Member Author

Since we are introducing turbulence-related fixes in #905, I wonder if it is a good time to revisit the reconstruction logic in the turbulent solver (which currently is not adequate for centered schemes); this only affects a couple of regressions.
@jayantmukho , @economon , @clarkpede what do you think?

Contributor

Seems like you have revisited some of the logic here already?

I haven't looked at the reconstructions before. Before these changes the MUSCL reconstruction for all (flow and turbulent) variables was only done when MUSCL_TURB= YES (when MUSCL reconstruction was requested for the turbulence solver as well).

Here you have changed it such that the flow variables get reconstructed if MUSCL_FLOW = YES but the turbulent variables are treated as before, i.e. they get reconstructed only when MUSCL_TURB= YES.

Are you suggesting changing the reconstruction itself? Or having an alternate reconstruction when centered schemes are used? Either way, I am for it, but don't know any better reconstruction schemes. Do you have any references I could check out? CFD wiki is best I've got so far 😛

Contributor

Actually, I am surprised this didn't change more regression tests.

Member Author

Well this is all I think the logic needs. Central schemes never use MUSCL so it did not make sense to reconstruct the flow variables using gradients.
This only affects cases with central schemes + MUSCL turbulence (we seem to have two :) )

In theory what we need here is not a reconstruction, but a recomputation of the mass flux that is consistent with the main solver. I played with that once here #721 #726, by storing the mass flux from the convective residual loop. It made convergence worse since it un-staggers the solution (block Gauss-Seidel vs block Jacobi) so I did not pursue it further...

Contributor

Oh nevermind, I didn't see the && muscl in the declaration of musclFlow.

I am not sure why you would need a recomputation of the mass flux as opposed to flow variables. The reconstructions in the EulerSolver are identical to that in TurbSolver (except for things like the low mach corrections and non-physical points). I can see an argument made for copying all those corrections in the TurbSolver as well.

Member Author

@pcarruscag pcarruscag Mar 25, 2020

If you want the mass fluxes to respect the discretized continuity equation then they should be consistent with what is computed by the flow solver (otherwise you are introducing a source term that is proportional to the mass imbalance). In any case this did not seem to make a huge difference... But it is the recommended approach especially for some incompressible methods (where the mass flux is kept separate anyway for other reasons).

Member

@talbring talbring left a comment

I just looked through everything, nothing to complain about. Thanks @pcarruscag, also for the small (unrelated) improvements in between!


Contributor

@jayantmukho jayantmukho left a comment

LGTM. We can continue the conversation on changing reconstruction on #905 if you want to pull this PR in.


@pcarruscag pcarruscag merged commit 61a3efd into develop Mar 26, 2020
@pcarruscag pcarruscag deleted the hybrid_parallel_fallback_strategies branch March 26, 2020 08:26