
Conversation

@pcarruscag
Member

Proposed Changes

Compute Green-Gauss gradients and limiters via a point loop (as we do for Least Squares); in the limit this allows one thread per point of parallelism. Parallel regions are started inside the functions, and again only the master thread calls the MPI routines.
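
Roughly, the pattern inside each of these functions looks like this (illustrative skeleton only; the macros are the ones from omp_structure.hpp, the communication calls and variable names are simplified assumptions):

SU2_OMP_PARALLEL
{
  /*--- Worksharing point loop: each point writes only its own gradient. ---*/
  SU2_OMP_FOR_DYN(chunkSize)
  for (size_t iPoint = 0; iPoint < nPointDomain; ++iPoint)
  {
    /*--- Accumulate the Green-Gauss contributions of the edges of iPoint. ---*/
  }

  /*--- Only the master thread performs the MPI halo exchange. ---*/
  SU2_OMP_MASTER
  {
    solver->InitiateComms(&geometry, &config, kindMpiComm);
    solver->CompleteComms(&geometry, &config, kindMpiComm);
  }
  SU2_OMP_BARRIER
}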

Move these routines out of CSolver and make them general, to reduce code duplication (compressible/incompressible, solution/primitives) and to improve testability (if key areas of the code depend less on large objects, we can more easily treat them as units).

Introduce a way to create different limiters by defining only small details, such as the limiter function (and its associated constants) or any geometry-based modification.

I don't seem to have "hard" failures in the regressions, but I still need to run tests, especially for periodic cases, as I had to change the "communication model" for the min/max values computed for limiters.
I will start documenting anyway.

Related Work

#789 #824
Depends on #830

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

Comment on lines +52 to +60
void computeGradientsGreenGauss(CSolver* solver,
                                MPI_QUANTITIES kindMpiComm,
                                PERIODIC_QUANTITIES kindPeriodicComm,
                                CGeometry& geometry,
                                CConfig& config,
                                const FieldType& field,
                                size_t varBegin,
                                size_t varEnd,
                                GradientType& gradient)
Member Author

The signature for these functions is a bit long... But by making the solver and MPI arguments optional we can more easily devise unit tests, or use the functions for some other purpose.
I have also made these functions able to compute gradients for only a subset of the variables; the idea is to use that to recompute only the velocity gradients before computing the turbulence source terms.
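
As a purely hypothetical usage sketch (the enum values and accessor names here are assumptions, not part of this diff), recomputing only the velocity gradients of the primitives could look like:

computeGradientsGreenGauss(solver, PRIMITIVE_GRADIENT, PERIODIC_PRIM_GG,
                           *geometry, *config, nodes->GetPrimitive(),
                           1, nDim+1, nodes->GetGradient_Primitive());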

Contributor

I like the ability to choose variables to compute gradients for. Should trim some unnecessary computations.

Comment on lines +35 to +36
template<ENUM_LIMITER LimiterKind>
struct CLimiterDetails
Member Author

The way I came up with to create new limiters, without copy-pasting a similar implementation, is to specialise this class template. I did not want to make this polymorphic.
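
As an illustration of the idea (the member names and the MY_NEW_LIMITER enum value are made up, only the specialisation pattern matters), a new limiter would look roughly like:

template<>
struct CLimiterDetails<MY_NEW_LIMITER>
{
  su2double eps2;

  /*--- Set the constants the limiter function needs. ---*/
  template<class... Ts>
  inline void preprocess(Ts&...) { eps2 = LimiterHelpers::epsilon(); }

  /*--- Optional geometry-based modification (1.0 means "none"). ---*/
  template<class... Ts>
  inline su2double geometricFactor(size_t, Ts&...) const { return 1.0; }

  /*--- The smooth limiter function itself (here the classic Venkatakrishnan
        formula, purely as an example). ---*/
  inline su2double limiterFunction(size_t, su2double proj, su2double delta) const
  {
    return (delta*delta + 2.0*delta*proj + eps2) /
           (delta*delta + delta*proj + 2.0*proj*proj + eps2);
  }
};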

Comment on lines +77 to +81
inline su2double raisedSine(su2double dist)
{
  su2double factor = 0.5*(1.0 + dist + sin(PI_NUMBER*dist)/PI_NUMBER);
  return max(0.0, min(factor, 1.0));
}
Member Author

Initially I took this function from the sharp-edge limiter, but I think it was incorrect (the way the distance was normalized, and the bounds checking, replaced here by min/max, did not seem to be continuous).
Now, for distances below the reference distance the "geometric limiter" is 0, and at 3 times the reference distance the value is 1.
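
For reference, raisedSine maps dist = -1 to 0 and dist = +1 to 1 (since sin(±PI) = 0), so a normalization of the following form (assumed here for illustration, not copied from the code) gives exactly that behaviour:

/*--- 0 at the reference distance, 1 at three times the reference distance. ---*/
su2double dist = wallDistance/referenceDistance - 2.0;
su2double geoFactor = raisedSine(dist);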

Comment on lines +75 to +79
case VENKATAKRISHNAN:
{
  INSTANTIATE(VENKATAKRISHNAN);
  break;
}
Member Author

After specialising that class, adding a new limiter is just a matter of adding the new case to this switch.
I tried to put this in a .cpp with explicit instantiation, to avoid compiling the code for CSolver, CEulerSolver, and CIncEulerSolver, but I was getting undefined references for AD builds.
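
So, hypothetically, wiring in a new limiter amounts to one extra case (MY_NEW_LIMITER is again a made-up enum value):

case MY_NEW_LIMITER:
{
  INSTANTIATE(MY_NEW_LIMITER);
  break;
}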

Comment on lines +61 to +62
template<class FieldType, class GradientType, ENUM_LIMITER LimiterKind>
void computeLimiters_impl(CSolver* solver,
Member Author

And this is the general limiter implementation; the idea is that it can stay independent of the particular details of each method.
For each point we loop over its direct neighbours; as explained in #789 this is faster than looping over edges because the computation is more local (more flops, fewer bytes), and it is more parallelizable.
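
A minimal sketch of that point loop (assuming the usual CGeometry/CPoint accessors of this SU2 version, and illustrative names for the field and min/max containers):

SU2_OMP_FOR_DYN(chunkSize)
for (size_t iPoint = 0; iPoint < nPointDomain; ++iPoint)
{
  for (size_t iNeigh = 0; iNeigh < geometry.node[iPoint]->GetnPoint(); ++iNeigh)
  {
    size_t jPoint = geometry.node[iPoint]->GetPoint(iNeigh);

    /*--- Only data of iPoint is written, so points are independent. ---*/
    for (size_t iVar = varBegin; iVar < varEnd; ++iVar)
    {
      su2double delta = field(jPoint, iVar) - field(iPoint, iVar);
      fieldMin(iPoint, iVar) = min(fieldMin(iPoint, iVar), delta);
      fieldMax(iPoint, iVar) = max(fieldMax(iPoint, iVar), delta);
    }
  }
}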

Comment on lines +170 to +175
/*!
 * \brief Get the reconstruction gradient for variables at all points.
 * \return Reference to reconstruction gradient.
 */
inline VectorOfMatrix& GetGradient_Reconstruction(void) final { return Gradient_Reconstruction; }

Member Author

To allow the new routines to work on primitives / solution, CVariable needs to give access to the entire container... Not because the routines depend on a specific type of container, but to bypass the names of the AddGradient_XX / SubtractGradient_XX methods.
To eventually vectorize the gradient and limiter computations, only the container will need to implement "vector accessors".

Comment on lines +1919 to 1920
PERIODIC_NONE = 99, /*!< \brief No periodic communication required. */
PERIODIC_VOLUME = 1, /*!< \brief Volume communication for summing total CV (periodic only). */
Member Author

@economon the AuxVar did not have periodic gradient communications, so I had to add a do-nothing mode.

@pcarruscag
Member Author

This just turned a month old :) If I could have some reviews please :)
I'll post test results by the end of the week for the failing regressions.

@pcarruscag pcarruscag changed the base branch from hybrid_parallel_linear_algebra to develop January 13, 2020 23:15
@pcarruscag
Member Author

The residuals (and indeed the converged results) change for some cases due to these changes (by very small amounts). Looking at the limiter fields there are some differences that justify this. Those differences are present mostly in smooth flow regions. Why in these regions?
Because this is where the old edge loops had a lot of logic to avoid divisions by zero, which is not required for the point-loop implementation. There was also a change to the epsilon that initializes the min/max calculation (it now comes from the numeric traits of the datatype).
I used the transonic_stator case to check that:

  • The computed limiters are not too different, especially near the periodic boundaries (for which a change in how the min/max over neighbours is computed was required).
  • The results do not depend on the number of ranks / threads.

This is the flow field (for reference):
[image]

This shows the x-momentum limiters near the trailing edge for the develop version:
[image]

And this for this branch:
[image]

You can see a slight "darkening" close to the blue spots, and also ahead of the small spot close to the wall, before the shock. Near the periodic boundaries there are no visible changes.
I ran the case with 1 rank and 1 thread (1/1), 1/4, and 4/8 (32 total); the results are the same (the convergence rate differs due to the behavior of the preconditioner).
files.zip

If you would like a specific test, or you see some weird behavior in one of your cases, let me know; but since only a few regressions changed I will only post comparisons for cases with substantial changes.

@pcarruscag
Member Author

Of the two cases with larger residual changes:

  • contadj_euler_naca0012 - No idea why the residuals changed; neither the primal nor the adjoint computes limiters... The primal residuals are unchanged, and the case converges to the same values (residuals and solution), so I simply updated the reference residuals.
  • transonic_stator_restart - As shown above the case is fine, so I updated the restart file. However, I no longer know how to change the testcases branch :) but I guess once the corresponding PR is merged this will start passing.

I ran some other tests with the Venkatakrishnan-Wang limiter (which requires a global min/max); it does not seem to be covered by the tests at the moment (maybe I'll use that restart case to fix that). Everything looks perfect, same results with different ranks/threads, and so on; the results are tens of MB so I won't upload them unless someone wants to double check.

Contributor

@jayantmukho jayantmukho left a comment

As always, a super clean implementation that'll help trim some code and some unnecessary calculations.

I can't pretend to follow how the OpenMP commands work but if it's faster and the solutions are the same, then I'm sold.

* \brief Set a small epsilon to avoid divisions by 0.
*/
template<class... Ts>
inline void preprocess(Ts&...) {eps2 = LimiterHelpers::epsilon();}
Contributor

Had to look up what the ... was for. Didn't know about parameter packs, very cool.
Also super clean implementation of limiters.

Member Author

Thank you for the review @jayantmukho. To be honest I'm no specialist either; I am only using them to make those methods forward compatible, in case some of the specializations need to be passed extra parameters in the future.
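
For anyone curious, here is a small self-contained toy of that forward-compatibility idea (all names are invented, nothing to do with the actual classes):

#include <iostream>

struct SimpleDetails {
  /*--- Needs nothing, silently ignores whatever it is given. ---*/
  template<class... Ts>
  void preprocess(Ts&...) { std::cout << "uses no arguments\n"; }
};

struct GeometryAwareDetails {
  /*--- Uses the first argument, ignores the rest. ---*/
  template<class Geo, class... Ts>
  void preprocess(Geo& geo, Ts&...) { std::cout << "uses geometry " << geo << "\n"; }
};

template<class Details, class Geo, class Cfg>
void genericCode(Details details, Geo& geo, Cfg& cfg) {
  details.preprocess(geo, cfg); /*--- same call site for every specialisation ---*/
}

int main() {
  int geo = 1; double cfg = 2.0;
  genericCode(SimpleDetails{}, geo, cfg);
  genericCode(GeometryAwareDetails{}, geo, cfg);
  return 0;
}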

void CFluidDriver::Output(unsigned long InnerIter) {

  for (iZone = 0; iZone < nZone; iZone++) {
    const auto inst = config_container[iZone]->GetiInst();
Contributor

Is there a reason to use auto as opposed to a specific type?

Also, were different instances just not being output before you added these lines?

Member Author

@pcarruscag pcarruscag Jan 17, 2020

Mostly because I find it boring to write unsigned blabla all the time. It is also a recommended practice in the book Effective Modern C++ (though for other reasons).

Neither zones nor instances were output before.

Contributor

Is this still a problem in the SingleZoneDriver and MultiZoneDriver as well?

It's not really related to this PR, but it might be worth addressing.

Member Author

To my knowledge only the HBDriver deals with multiple time instances... so it should not be a problem for anything else.

  if (TapeActive) AD::StartRecording();
#endif
  computeLimiters(kindLimiter, this, SOLUTION_LIMITER, PERIODIC_LIM_SOL_1, PERIODIC_LIM_SOL_2,
                  *geometry, *config, 0, nVar, solution, gradient, solMin, solMax, limiter);
Contributor

very satisfying deletion of duplicated code over the solver files 😃

Member Author

@pcarruscag pcarruscag left a comment

Here is some explanation of the OpenMP stuff.


/*--- Start OpenMP parallel section. ---*/

SU2_OMP_PARALLEL
Member Author

This starts a parallel section; the code within the {} that follows is executed simultaneously by all the threads.

Comment on lines +78 to +79
SU2_OMP_FOR_DYN(chunkSize)
for (size_t iPoint = 0; iPoint < nPointDomain; ++iPoint)
Member Author

This divides the loop over however many threads arrive here; without the worksharing construct the entire loop would be executed by every thread.
For reasons of spatial locality it is beneficial to divide the loop into chunks rather than individual iterations. In this case the chunks are assigned to threads dynamically, which helps with load balancing but has some overhead; that is why you also see static work division for light loops (e.g. setting an array to 0).
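
For example (usage sketch only, assuming the static counterpart SU2_OMP_FOR_STAT from omp_structure.hpp and a placeholder computePointContribution):

/*--- Light, perfectly balanced loop: static division, no scheduling overhead. ---*/
SU2_OMP_FOR_STAT(chunkSize)
for (size_t i = 0; i < size; ++i) buffer[i] = 0.0;

/*--- Heavier, irregular loop: dynamic division for load balancing. ---*/
SU2_OMP_FOR_DYN(chunkSize)
for (size_t iPoint = 0; iPoint < nPointDomain; ++iPoint)
  computePointContribution(iPoint);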

Contributor

What is the chunkSize dependent on? Is it hardware? The number of variables?

Does simply adding SU2_OMP_FOR_DYN(chunkSize) before the for loop automatically run the loop in chunks (assuming you add the relevant includes and calls beforehand)?

Does anything new need to be done in how the code is compiled or run to ensure that multiple threads are used?

Member Author

The size is usually based on the amount of work per loop iteration: if the loop is heavy, a small size can be used to improve load balancing; otherwise a relatively large size is needed to amortize the parallelization overheads. Currently most sizes are determined such that each thread gets, on average, the same number of chunks, but each chunk is no larger than a maximum size; there is a utility function in omp_structure.hpp for this.
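
A sketch of the kind of heuristic described (the actual helper in omp_structure.hpp may differ):

inline size_t computeChunkSize(size_t totalWork, size_t numThreads, size_t maxChunk)
{
  /*--- Ceil division: one chunk per thread... ---*/
  size_t chunk = (totalWork + numThreads - 1) / numThreads;
  /*--- ...unless that would exceed the maximum chunk size. ---*/
  return (chunk > maxChunk) ? maxChunk : chunk;
}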

Yes, it distributes the work over the threads that arrive at that location; if only one arrives, the code still executes properly.

The build system was updated in #843, since that introduces the first solver that is completely hybrid parallel; there is no advantage in using the feature for the fluid solvers yet.
Basically you add -Dwith-omp=true to meson; the number of threads per rank can then be specified with the -t option when calling SU2_CFD, e.g. mpirun -n 2 --bind-to numa SU2_CFD -t 8 config.cfg. If -t is omitted, the number of threads is whatever is defined in the environment.


if (solver != nullptr)
{
  SU2_OMP_MASTER
Member Author

This is the OpenMP equivalent of rank == MASTER_NODE.

@pcarruscag pcarruscag merged commit f80bbab into develop Jan 18, 2020
@pcarruscag pcarruscag deleted the hybrid_parallel_gradients_limiters branch January 21, 2020 17:36
@jblueh jblueh mentioned this pull request Mar 1, 2021