Hybrid parallel coloring fallback strategies (better strong scaling and user friendliness) #908
Conversation
pcarruscag left a comment:
General comments:
/*!
 * \brief Destructor of the CAkimaInterpolation class.
 */
~CAkimaInterpolation(){}
This was removed for old compiler compatibility.
/*--- Ideally compute time is proportional to total work over number of threads. ---*/
su2double ideal = coloring.getNumNonZeros() / su2double(numThreads);

/*--- In practice the total work is quantized first by colors and then by chunks. ---*/
Index_t real = 0;
for(Index_t color = 0; color < coloring.getOuterSize(); ++color)
  real += chunkSize * roundUpDiv(roundUpDiv(coloring.getNumNonZeros(color), chunkSize), numThreads);
The computation of coloring efficiency is shown here; it is just a simple heuristic.
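For reference, a standalone sketch of that heuristic (illustrative names, not the SU2 implementation): the efficiency is the ratio of the ideal work per thread to the work per thread after quantization by colors and chunks.

#include <cstddef>
#include <vector>

// Smallest integer quotient that does not truncate downward.
static std::size_t roundUpDiv(std::size_t n, std::size_t d) { return (n + d - 1) / d; }

// efficiency = ideal / real, as in the quoted snippet above.
double coloringEfficiency(const std::vector<std::size_t>& edgesPerColor,
                          std::size_t numThreads, std::size_t chunkSize) {
  std::size_t totalEdges = 0;
  for (std::size_t n : edgesPerColor) totalEdges += n;

  // Ideally the total work is split evenly among threads.
  const double ideal = double(totalEdges) / double(numThreads);

  // In practice each color is split into chunks, and chunks are split among threads.
  std::size_t real = 0;
  for (std::size_t n : edgesPerColor)
    real += chunkSize * roundUpDiv(roundUpDiv(n, chunkSize), numThreads);

  return ideal / double(real);
}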
      }
    }
  }
}
A few indentation fixes here and there.
if (!smooth_ready) {
  SU2_OMP_BARRIER
  SU2_OMP_MASTER
  {
    auto nVar = b.GetNVar();
Bug fix related to #830: sometimes the code would hang on the first iteration because of this:
if (!smooth_ready) { // if working variables have not been allocated we need to allocate...
SU2_OMP_BARRIER // <- This is the fix.
SU2_OMP_MASTER // ...only the master thread can do allocation as those variables are shared
{
// allocate variables
...
smooth_ready = true; // set this flag so we don't allocate on the next call.
}
SU2_OMP_BARRIER // now we need a barrier so all threads "see" the new state of the variables.
// but because we are "smart" the barrier is inside the "if" so that we don't hit it every time.
// But if some threads arrive (very) late to the party they might see smooth_ready == true already
// and they miss the barrier which results in a deadlock (all threads must arrive before execution
// can continue). One solution is to make sure all threads are inside the "if" (via another barrier)
// before changing the condition deciding whether that statement is executed.
}
Oh! I think I did see this happen a couple of times when I briefly tested the multi-threading. I built SU2 with -Dwith-omp=true
It happened when I didn't specify a number of threads with the -t option. But it ran okay if I specified even -t 1. Does this fix that? Or do you have to specify a thread count if you build with OMP?
When you don't use the -t (--threads) option, SU2 picks up the number of threads from the environment, which is often the maximum. If you use MPI at the same time the system will probably be oversubscribed; I noticed that OpenMPI 3.1.4 on Ubuntu 16 deadlocks when this happens (oversubscribing with threads alone does not seem to deadlock).
This fix was for a more subtle problem.
Another aspect of using MPI+threads is that you need to pay attention to the binding mode: some MPI implementations will "--bind-to core" when the number of ranks is small, which pins all threads spawned by each rank to that core. Always use "--bind-to socket" or "--bind-to numa" as appropriate.
 unsigned long IterLimit = min(RestartIter, MaxIter-IterLinSol);
 IterLinSol += FGMRES_LinSolver(*LinSysRes_ptr, *LinSysSol_ptr, mat_vec, *precond, SolverTol, IterLimit, residual, ScreenOutput, config);
-if ( residual < SolverTol*norm0 ) break;
+if ( residual <= SolverTol*norm0 ) break;
Small fix: restarted FGMRES was going into an infinite loop when the RHS was zero (this happens in mesh deformation cases).
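A minimal illustration of the failure mode (not the actual SU2 loop; the names are made up): with a zero RHS both the residual and the convergence target are exactly zero, so a strict "<" never triggers the break, while "<=" does.

#include <cstdio>

int main() {
  const double SolverTol = 1e-8;
  double norm0 = 0.0;     // norm of a zero right-hand side
  double residual = 0.0;  // the solver immediately reaches a zero residual
  int restarts = 0;

  while (restarts < 1000) {                     // stand-in for the restart loop
    ++restarts;                                 // (one FGMRES cycle would run here)
    if (residual <= SolverTol * norm0) break;   // 0 <= 0 -> exits on the first pass
    // with "residual < SolverTol * norm0" the condition 0 < 0 is never true
  }
  printf("restarts = %d\n", restarts);
  return 0;
}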
void CEulerSolver::CommonPreprocessing(CGeometry *geometry, CSolver **solver_container, CConfig *config, unsigned short iMesh,
                                       unsigned short iRKStep, unsigned short RunTime_EqSystem, bool Output) {
I've done some cleanup by factoring out common (to both Euler and NS solvers) preprocessing steps.
if(!ReducerStrategy && !Output) {
  LinSysRes.SetValZero();
  if (implicit && !disc_adjoint) Jacobian.SetValZero();
  else {SU2_OMP_BARRIER} // because of "nowait" in LinSysRes
}
The Jacobian.SetValZero(); bit was previously not guarded by !Output, which was a bit wasteful.
-void CEulerSolver::SetUndivided_Laplacian(CGeometry *geometry, CConfig *config) {
+void CEulerSolver::SetUndivided_Laplacian_And_Centered_Dissipation_Sensor(CGeometry *geometry, CConfig *config) {
These routines were transformed into point loops (no need for grid coloring) and fused, since they are always called together. (The code is actually much simpler, especially w.r.t. the boundary_i/boundary_j conditionals.)
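For context, a sketch of the race-free point-loop pattern, with illustrative names and a plain OpenMP pragma instead of the SU2 macros: every iteration gathers from its neighbors but writes only to its own point, so no coloring is required.

#include <vector>

void undividedLaplacianSketch(const std::vector<std::vector<long> >& neighbors,
                              const std::vector<double>& U,
                              std::vector<double>& undivLapl) {
  const long nPoint = static_cast<long>(U.size());
  #pragma omp parallel for
  for (long iPoint = 0; iPoint < nPoint; ++iPoint) {
    double sum = 0.0;
    for (long jPoint : neighbors[iPoint]) sum += U[jPoint] - U[iPoint];  // read-only gather
    undivLapl[iPoint] = sum;  // each thread writes only to its own iPoint -> no race
  }
}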
👍
-SU2_OMP_FOR_DYN(roundUpDiv(OMP_MIN_SIZE, ColorGroupSize)*ColorGroupSize)
+SU2_OMP_FOR_DYN(nextMultiple(OMP_MIN_SIZE, color.groupSize))
In a number of places I made this kind of replacement to express the operation more clearly.
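A sketch of what the replacement expresses (the actual SU2 helpers may differ slightly): nextMultiple(v, m) is the smallest multiple of m that is greater than or equal to v.

// Round up to the next integer quotient, then scale back: the next multiple of "multiple".
template <class T>
inline T roundUpDiv(T value, T divisor) { return (value + divisor - 1) / divisor; }

template <class T>
inline T nextMultiple(T value, T multiple) { return roundUpDiv(value, multiple) * multiple; }

// e.g. nextMultiple(1000, 512) == 1024, which is then used as the dynamic scheduling chunk size.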
% --------------------- HYBRID PARALLEL (MPI+OpenMP) OPTIONS ---------------------%
%
% An advanced performance parameter for FVM solvers, a large-ish value should be best
% when relatively few threads per MPI rank are in use (~4). However, maximum parallelism
% is obtained with EDGE_COLORING_GROUP_SIZE=1, consider using this value only if SU2
% warns about low coloring efficiency during preprocessing (performance is usually worse).
% Setting the option to 0 disables coloring and a different strategy is used instead,
% that strategy is automatically used when the coloring efficiency is less than 0.875.
% The optimum value/strategy is case-dependent.
EDGE_COLORING_GROUP_SIZE= 512
%
% Independent "threads per MPI rank" setting for LU-SGS and ILU preconditioners.
% For problems where time is spend mostly in the solution of linear systems (e.g. elasticity,
% very high CFL central schemes), AND, if the memory bandwidth of the machine is saturated
% (4 or more cores per memory channel) better performance (via a reduction in linear iterations)
% may be possible by using a smaller value than that defined by the system or in the call to
% SU2_CFD (via the -t/--threads option).
% The default (0) means "same number of threads as for all else".
LINEAR_SOLVER_PREC_THREADS= 0
%
👍
/*--- Use the config option as an upper bound on elasticity modulus.
 * For RANS meshes the range of element volume or wall distance is
 * very large and leads to an ill-conditioned stiffness matrix.
 * Absolute values of elasticity modulus are not important for
 * mesh deformation, since linear elasticity is used and all
 * boundary conditions are essential (Dirichlet). ---*/
const su2double maxE = config->GetDeform_ElasticityMod();
@rsanfer I introduced a small change to the mesh elasticity: the config option above is used as an upper bound on the elasticity modulus when using wall-distance or volume-based mesh stiffness (for the reason given in the comment).
When using constant stiffness the option is ignored and E is 1 everywhere; this change made no difference to the existing test cases.
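Roughly, and only as a sketch with plain doubles and an assumed inverse-volume stiffness (the actual SU2 expression may differ), the config value acts as a clamp rather than as the modulus itself:

#include <algorithm>

// maxE is what config->GetDeform_ElasticityMod() returns in the snippet above.
double stiffnessSketch(double elementVolume, double maxE) {
  double E = 1.0 / elementVolume;  // small elements (e.g. near walls) become stiffer
  return std::min(E, maxE);        // clamp to keep the stiffness matrix well conditioned
}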
pcarruscag left a comment:
On the subject of the lock fallback strategy for element loops (which I still have to stress test a bit):
A lock is a kind of stop light that can be used to protect against concurrent access to a shared resource (e.g. a memory location, a stream object, etc.).
The usage pattern is to set the lock (omp_set_lock) before using the resource and unset it (omp_unset_lock) after. If the lock was already set when a thread calls omp_set_lock, that thread waits there until the thread that set the lock unsets it. The OpenMP runtime makes the set/unset process itself thread safe (it involves more than flipping a flag between true and false).
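A minimal, self-contained usage sketch (plain OpenMP, names made up) of a lock protecting a shared accumulator:

#include <omp.h>
#include <cstdio>

int main() {
  omp_lock_t lock;
  omp_init_lock(&lock);

  double sharedSum = 0.0;
  #pragma omp parallel for
  for (int i = 0; i < 1000; ++i) {
    omp_set_lock(&lock);    // wait here if another thread currently holds the lock
    sharedSum += 1.0;       // only one thread at a time executes this update
    omp_unset_lock(&lock);  // release so waiting threads can proceed
  }

  omp_destroy_lock(&lock);
  printf("%f\n", sharedSum);  // always 1000
  return 0;
}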
for (iNode = 0; iNode < nNodes; iNode++) {

  if (LockStrategy) omp_set_lock(&UpdateLocks[indexNode[iNode]]);
In the loops over the points of an element that follow the computation of the element stiffness matrices (for example), we protect against concurrent accesses by setting (acquiring) a lock for the point before writing to shared memory locations; once done we unset the lock.
One downside of this strategy is that, contrary to coloring, the operations no longer have deterministic behaviour: which thread adds its contribution first depends on the order of arrival, so machine-accuracy differences between otherwise identical runs should be expected.
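A sketch of the per-point lock fallback with illustrative names and connectivity (not the SU2 code): each point owns a lock, so threads only serialize when they scatter to the same point.

#include <omp.h>
#include <vector>

void scatterElementContributions(const std::vector<std::vector<long> >& elemToPoint,
                                 std::vector<double>& pointResidual) {
  std::vector<omp_lock_t> UpdateLocks(pointResidual.size());
  for (auto& l : UpdateLocks) omp_init_lock(&l);

  const long nElem = static_cast<long>(elemToPoint.size());
  #pragma omp parallel for
  for (long iElem = 0; iElem < nElem; ++iElem) {
    const double contribution = 1.0;           // stands in for the computed element quantity
    for (long iPoint : elemToPoint[iElem]) {
      omp_set_lock(&UpdateLocks[iPoint]);      // acquire this point's lock
      pointResidual[iPoint] += contribution;   // race-free scatter to shared storage
      omp_unset_lock(&UpdateLocks[iPoint]);    // release it
    }
  }

  for (auto& l : UpdateLocks) omp_destroy_lock(&l);
}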
/*!
 * \brief Dummy lock type and associated functions.
 */
struct omp_lock_t {};
struct DummyVectorOfLocks {
  omp_lock_t l;
  inline omp_lock_t& operator[](int) {return l;}
};
inline void omp_init_lock(omp_lock_t*){}
inline void omp_set_lock(omp_lock_t*){}
inline void omp_unset_lock(omp_lock_t*){}
inline void omp_destroy_lock(omp_lock_t*){}
As usual we define "do-nothing" types and functions when compiling without hybrid parallel support, so the same code compiles without having to throw #ifdefs everywhere.
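For example (a hypothetical call site), the same function then compiles with the dummy definitions above in serial builds and with <omp.h> in OpenMP builds:

void addContribution(double* dest, double val, omp_lock_t* lock) {
  omp_set_lock(lock);     // a real lock under OpenMP, a no-op in serial builds
  *dest += val;           // protected update of a shared location
  omp_unset_lock(lock);
}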
const bool muscl = config->GetMUSCL_Turb();
const bool limiter = (config->GetKind_SlopeLimit_Turb() != NO_LIMITER);

/*--- Only reconstruct flow variables if MUSCL is on for flow (requires upwind) and turbulence. ---*/
const bool musclFlow = config->GetMUSCL_Flow() && muscl &&
                       (config->GetKind_ConvNumScheme_Flow() == SPACE_UPWIND);
/*--- Only consider flow limiters for cell-based limiters, edge-based would need to be recomputed. ---*/
const bool limiterFlow = (config->GetKind_SlopeLimit_Flow() != NO_LIMITER) &&
                         (config->GetKind_SlopeLimit_Flow() != VAN_ALBADA_EDGE);
Since we are introducing turbulence-related fixes in #905, I wonder if it is a good time to revisit the reconstruction logic in the turbulent solver (which currently is not adequate for centered schemes), this only affects a couple of regressions.
@jayantmukho , @economon , @clarkpede what do you think?
Seems like you have revisited some of the logic here already?
I haven't looked at the reconstructions before. Before these changes the MUSCL reconstruction for all (flow and turbulent) variables was only done when MUSCL_TURB= YES (when MUSCL reconstruction was requested for the turbulence solver as well).
Here you have changed it such that the flow variables get reconstructed if MUSCL_FLOW = YES but the turbulent variables are treated as before, i.e. they get reconstructed only when MUSCL_TURB= YES.
Are you suggesting changing the reconstruction itself? Or having an alternate reconstruction when centered schemes are used? Either way, I am for it, but don't know any better reconstruction schemes. Do you have any references I could check out? CFD wiki is best I've got so far 😛
actually, I am surprised this didn't change more regression tests.
Well this is all I think the logic needs. Central schemes never use MUSCL so it did not make sense to reconstruct the flow variables using gradients.
This only affects cases with central schemes + MUSCL turbulence (we seem to have two :) )
In theory what we need here is not a reconstruction, but a recomputation of the mass flux that is consistent with the main solver. I played with that once here #721 #726, by storing the mass flux from the convective residual loop. It made convergence worse since it un-staggers the solution (block Gauss-Seidel vs block Jacobi) so I did not pursue it further...
Oh, never mind, I didn't see the && muscl in the declaration of musclFlow.
I am not sure why you would need a recomputation of the mass flux as opposed to flow variables. The reconstructions in the EulerSolver are identical to that in TurbSolver (except for things like the low mach corrections and non-physical points). I can see an argument made for copying all those corrections in the TurbSolver as well.
If you want the mass fluxes to respect the discretized continuity equation then they should be consistent with what is computed by the flow solver (otherwise you are introducing a source term that is proportional to the mass imbalance). In any case this did not seem to make a huge difference... But it is the recommended approach especially for some incompressible methods (where the mass flux is kept separate anyway for other reasons).
talbring left a comment:
I just looked through everything, nothing to complain about. Thanks @pcarruscag, also for the small (unrelated) improvements in between!
-void CEulerSolver::SetUndivided_Laplacian(CGeometry *geometry, CConfig *config) {
+void CEulerSolver::SetUndivided_Laplacian_And_Centered_Dissipation_Sensor(CGeometry *geometry, CConfig *config) {
👍
jayantmukho left a comment:
LGTM. We can continue the conversation on changing reconstruction on #905 if you want to pull this PR in.
Proposed Changes
The hybrid parallel implementation uses grid coloring (edge and element) to avoid race conditions when building the residual vector and Jacobian matrix.
Depending on grid size, type, and ordering, it may be difficult to find a good coloring, i.e. one that provides locality and enough parallelism (all threads getting almost the same number of chunks of work per color).
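As an illustration only (hypothetical data structures, not the SU2 implementation), the colored edge loop pattern looks roughly like this: edges within one color share no points, so each color can be processed in parallel without locks or atomics.

#include <utility>
#include <vector>

struct EdgeColor { std::vector<long> edges; };   // indices of the edges in this color

void coloredEdgeLoop(const std::vector<EdgeColor>& colors,
                     const std::vector<std::pair<long,long> >& edgeToPoints,
                     std::vector<double>& residual) {
  for (const auto& color : colors) {             // colors are processed one after another
    const long nEdgeColor = static_cast<long>(color.edges.size());
    #pragma omp parallel for
    for (long k = 0; k < nEdgeColor; ++k) {
      const long iPoint = edgeToPoints[color.edges[k]].first;
      const long jPoint = edgeToPoints[color.edges[k]].second;
      const double flux = 1.0;                   // stands in for the computed edge flux
      residual[iPoint] += flux;                  // safe: no other edge of this color
      residual[jPoint] -= flux;                  // touches iPoint or jPoint
    }
  }
}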
This PR extends the use of the "reducer strategy" to fine grids (introduced and described in #861 for coarse grids, which before #894 I thought would be the only problematic scenarios) as a fallback strategy that is automatically selected when the coloring efficiency (highlighted below) drops below a threshold value. The reducer strategy can be forced by setting EDGE_COLORING_GROUP_SIZE to 0 (for some cases it seems to give better performance, contrary to the initial assessment from #789).
For element loops (CFEASolver and CMeshSolver, modified in #843) an automatic fallback was also introduced, but based on "locks" (explained below), since a "reducer strategy" would use too much memory and element loops are more compute intensive (and therefore less affected by the overhead of using these locks).
Related Work
Part of #789, namely:
Improves scaling of the work from #861 #843 (should help with #894)
Fixes a bug from #830
PR Checklist