MPFAO model initialization takes a long time. #91

Open · jeff-cohere opened this issue Oct 5, 2020 · 26 comments

@jeff-cohere (Collaborator)

This has been on our radar for a while, so I'm creating an issue to track progress. Now that we've instrumented TDycore with timers, I'm able to see where the time is being spent in initialization.

Profiling Notes

I'm doing profiling in the jeff-cohere/mpfao_init_profiling branch. Here's what I'm running to profile the initialization issue:

cd demo/richards
make
./richards_driver -dim 3 -Nx 100 -Ny 100 -Nz 10 -tdy_timers -final_time 300

This runs a short-ish simulation with timers turned on. The resulting profile log, tdycore_profile.csv, can be loaded into a spreadsheet (or Pandas dataframe or whatever). So far, it looks like we're spending a lot of initialization time in these functions:

  • TDyDriverInitializeTDy (74 sec)
  • TDyCreateJacobian (48 sec)
  • DMCreateMat (48 sec)
  • DMPlexPrealloc (48 sec)
  • TDySetDiscretizationMethod (21 sec)
  • TDyMPFAOInitialize (21 sec)

The preallocation entry is telling. If we're not giving PETSc any clues about the matrix's non-zero structure when we preallocate the Jacobian, PETSc is probably doing a lot of work to figure it out on its own. My guess is that we can pass along some information to help it out. I've never used the DMPlex interface, though, so I'll have to look into what that means.
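
For reference, here's a rough sketch (hypothetical, not the current TDyCreateJacobian) of what passing explicit preallocation hints to PETSc could look like; nz_per_row would have to come from the MPFA-O stencil and be verified:

/* Hypothetical sketch, not TDycore's actual code: build the Jacobian by hand
 * with explicit per-row nonzero estimates instead of relying on
 * DMCreateMatrix/DMPlexPrealloc to deduce the pattern. */
#include <petscmat.h>

PetscErrorCode CreateJacobianWithHint(MPI_Comm comm, PetscInt n_local,
                                      PetscInt nz_per_row, Mat *J) {
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatCreate(comm, J); CHKERRQ(ierr);
  ierr = MatSetSizes(*J, n_local, n_local, PETSC_DETERMINE, PETSC_DETERMINE); CHKERRQ(ierr);
  ierr = MatSetType(*J, MATAIJ); CHKERRQ(ierr);
  /* Same per-row estimate for the diagonal and off-diagonal blocks; only the
   * call matching the actual matrix type takes effect. */
  ierr = MatSeqAIJSetPreallocation(*J, nz_per_row, NULL); CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(*J, nz_per_row, NULL, nz_per_row, NULL); CHKERRQ(ierr);
  ierr = MatSetUp(*J); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}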

FYI @bishtgautam

@bishtgautam (Member)

@jeff-cohere It would be interesting to see how the initialization time is scaling with the number of grid cells.

@knepley, @jedbrown, any ideas on how to decrease the time spent in DMCreateMat? The relevant code is at:

PetscErrorCode TDyCreateJacobian(TDy tdy) {

@knepley commented Oct 6, 2020 via email

@jeff-cohere (Collaborator, Author)

Here are some numbers for runs at different resolutions:

Nx    Ny    Nz    TDyDriverInitializeTDy (s)    DMCreateMat (s)    TDyMPFAOInitialize (s)
20    20    10    2.9                           1.9                0.82
40    20    10    5.8                           3.7                1.7
40    40    10    12                            7.5                3.3
80    80    10    49                            32                 13
160   160   10    204                           132                57

Looks like the initialization cost scales linearly with the problem size at these resolutions.

@bishtgautam (Member)

It is great that the initialization cost is scaling linearly, though DMCreateMatrix is about 65% of the cost.

@ghammond86 (Contributor)

@jeff-cohere Any information regarding how the initialization scales in parallel, or did I miss something?

@jeff-cohere (Collaborator, Author)

jeff-cohere commented Oct 7, 2020

I haven't done parallel profiling yet. I think I ran into a problem running this demo in parallel at first. I'll try again and see how it works, and then I'll do a little "scaling study."

That said, it seems to me that we might be able to improve these numbers even in the serial case.

@jeff-cohere (Collaborator, Author)

jeff-cohere commented Oct 7, 2020

(deleted crash to reduce clutter in this issue--see #93 for details)

@bishtgautam (Member)

I will debug the error in parallel.

@knepley commented Oct 9, 2020

@jeff-cohere I just ran this on my laptop. The numbers do not agree with the above. You can see that preallocation takes 6s, and the major expense in SetUp is ComputeAinvB for some reason. Also, there are some bogus events, so I think something is slightly wrong with the profiling setup.

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./richards_driver on a arch-master-opt named MacBook-Pro.fios-router.home with 1 processor, by knepley Fri Oct  9 14:00:40 2020
Using Petsc Development GIT revision: v3.13.4-671-g69d777391a  GIT Date: 2020-08-09 14:52:40 -0400

                         Max       Max/Min     Avg       Total
Time (sec):           4.508e+02     1.000   4.508e+02
Objects:              1.510e+02     1.000   1.510e+02
Flop:                 1.558e+10     1.000   1.558e+10  1.558e+10
Flop/sec:             3.456e+07     1.000   3.456e+07  3.456e+07
Memory:               7.979e+08     1.000   7.979e+08  7.979e+08
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 1.2042e+02  26.7%  1.5562e+10  99.9%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 1:   TDycore Setup: 3.3033e+02  73.3%  1.4400e+07   0.1%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          1 1.0 2.0813e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
DMPlexStratify         1 1.0 8.9059e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
DMPlexSymmetrize       1 1.0 3.2729e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFSetGraph             1 1.0 6.9652e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFSetUp                1 1.0 1.4129e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFBcastOpBegin       175 1.0 2.2303e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFBcastOpEnd         175 1.0 4.2343e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFPack               175 1.0 3.5286e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFUnpack             175 1.0 2.4557e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult              924 1.0 3.5922e+00 1.0 5.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 38  0  0  0   3 38  0  0  0  1631
MatSolve             826 1.0 2.9674e+00 1.0 4.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 26  0  0  0   2 26  0  0  0  1356
MatLUFactorNum        77 1.0 3.0409e+00 1.0 1.66e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   3 11  0  0  0   546
MatILUFactorSym        1 1.0 1.5237e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin     154 1.0 3.0756e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd       154 1.0 4.1962e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 1.2875e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 1.6529e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries        77 1.0 7.1313e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecDot                77 1.0 6.7339e-01 1.0 1.54e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0    23
VecMDot              749 1.0 2.3295e-01 1.0 1.04e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0  4446
VecNorm              995 1.0 3.0445e-02 1.0 1.99e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  6536
VecScale             898 1.0 7.1262e+00 1.0 8.28e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   6  1  0  0  0    12
VecCopy              239 1.0 1.7945e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               292 1.0 1.9163e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               77 1.0 6.7493e-01 1.0 1.54e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0    23
VecWAXPY              78 1.0 8.4722e-03 1.0 7.90e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   932
VecMAXPY             826 1.0 2.2327e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  5310
VecReduceArith       154 1.0 7.7238e-03 1.0 3.08e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3988
VecReduceComm         77 1.0 8.3923e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize         826 1.0 7.1348e+00 1.0 2.48e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0   6  2  0  0  0    35
ComputePlexGeometry    2 1.0 5.2172e-01 1.0 1.81e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    35
TDyResetDiscMethod     1 1.0 6.2351e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
Warning -- total time of event greater than time of entire stage -- something is wrong with the timer
SNESSolve             20 1.0 1.2109e+02 1.0 1.55e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27100  0  0  0 101100  0  0  0   128
SNESSetUp              1 1.0 1.1328e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SNESFunctionEval      98 1.0 1.2278e+01 1.0 1.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 12  0  0  0  10 12  0  0  0   149
SNESJacobianEval      77 1.0 8.6496e+01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 19  9  0  0  0  72  9  0  0  0    17
SNESLineSearch        77 1.0 1.0571e+01 1.0 1.92e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 12  0  0  0   9 12  0  0  0   181
KSPSetUp              77 1.0 4.1515e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve              77 1.0 1.9853e+01 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 0.0e+00  4 76  0  0  0  16 76  0  0  0   595
KSPGMRESOrthog       749 1.0 4.3466e-01 1.0 2.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0 13  0  0  0   0 13  0  0  0  4766
PCSetUp               77 1.0 3.8116e+00 1.0 1.66e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   3 11  0  0  0   436
PCApply              826 1.0 2.9680e+00 1.0 4.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 26  0  0  0   2 26  0  0  0  1356
Warning -- total time of event greater than time of entire stage -- something is wrong with the timer
TDyTimeIntegRunToTime  1 1.0 1.2221e+02 1.0 1.55e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27100  0  0  0 101100  0  0  0   127
TDyPreSolveSNESSolver 20 1.0 9.1538e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
TDyMPFAOSNESPreSolv3D 20 1.0 9.0719e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
TDyMPFAOSNESFunc3D    98 1.0 1.0594e+01 1.0 1.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 12  0  0  0   9 12  0  0  0   173
TDyMPFAOIFuncVert3D   98 1.0 1.7647e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
TDyMPFAOSNESJacob3D   77 1.0 8.5174e+01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 19  9  0  0  0  71  9  0  0  0    17
TDyMPFAOIJacobVert3D  77 1.0 8.1565e+01 1.0 1.42e+09 1.0 0.0e+00 0.0e+00 0.0e+00 18  9  0  0  0  68  9  0  0  0    17
TDyPostSolvSNESSolver 20 1.0 1.2350e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: TDycore Setup

DMCreateMat            2 1.0 7.9277e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
DMPlexPrealloc         2 1.0 6.8801e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatAssemblyBegin       3 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         3 1.0 1.2811e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecCopy                1 1.0 9.3937e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                 7 1.0 4.8368e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSetRandom           1 1.0 1.6370e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyDriverInitializeTDy 1 1.0 3.2491e+02 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00 72  0  0  0  0  98100  0  0  0     0
TDySetDiscMethod       1 1.0 5.5725e+00 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2100  0  0  0     3
TDyResetDiscMethod     1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMPFAOInitialize     1 1.0 5.5725e+00 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2100  0  0  0     3
TDySaveClosures        1 1.0 2.9792e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySaveClosures_Cells  1 1.0 1.0283e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySaveClosures_Elem   3 1.0 2.9781e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySaveClosures_Vert   1 1.0 1.0362e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySaveClosures_Faces  1 1.0 9.1378e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyGetNumCellVertClsr  1 1.0 8.1451e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMaxCellSharVert     2 1.0 1.5831e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMaxElmTypeSharElm   4 1.0 3.2248e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMaxFaceSharVert     1 1.0 8.1780e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMaxEdgeSharVert     1 1.0 8.2569e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyUnitNormalTri 1872000 1.0 3.4206e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyNormalTri     4272000 1.0 2.1912e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyNormalQuad     312000 1.0 3.4739e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyCmpVolTet      800000 1.0 4.6654e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyQuadArea      2400000 1.0 1.4797e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyTriangleArea  9600000 1.0 5.5103e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SetPermFromFunction    1 1.0 6.5720e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMPFAOAllocBdVal     1 1.0 2.5988e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMPFAOAllocSrcSnkVal 1 1.0 2.4414e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMPFAOSetup          1 1.0 3.0668e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68  0  0  0  0  93  0  0  0  0     0
ComputeGMatrix         1 1.0 1.4023e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyCmpGMatrix3DMesh    1 1.0 1.4021e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
CmpTransMat            1 1.0 3.0654e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68  0  0  0  0  93  0  0  0  0     0
TDyCmpTransMat3DMesh   1 1.0 3.0654e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68  0  0  0  0  93  0  0  0  0     0
CmpTransMatBdVert      8 1.0 7.5817e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
CmpTransMatCrnVert112203 1.0 3.0642e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68  0  0  0  0  93  0  0  0  0     0
GetUpDnBdFaces    112203 1.0 8.8074e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
ComputeCandFmat   112203 1.0 2.2279e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
ExtrcsubFMatrices 224406 1.0 3.2624e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
ExtrcsubCMatrices 224406 1.0 8.6959e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
ComputeAinvB      112203 1.0 3.0217e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 67  0  0  0  0  91  0  0  0  0     0
TDyMPFAOSetFromOptions 1 1.0 8.1871e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SetPoroFromFunction    1 1.0 2.8300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyCreateVectors       1 1.0 2.9350e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyCreateJacobian      1 1.0 7.9277e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
TDySetSNESFunction     1 1.0 9.8397e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySetSNESJacobian     1 1.0 1.5974e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyUpdateState       118 1.0 5.4112e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0

--- Event Stage 2: Unknown


--- Event Stage 3: Unknown

@jedbrown (Member)

jedbrown commented Oct 9, 2020

Let me know if you'd like me to profile with more parallelism (I have a 64-core node that's convenient for such things).

@jeff-cohere (Collaborator, Author)

Thanks, guys. Matt, thanks for running this. I've not used PETSc's profiling tools before, so I'm not surprised to learn that I'm doing it wrong. :-) How did you get your performance report?

@knepley commented Oct 9, 2020

You can use -log_view. The names are a bit long, so I chopped them all down to get this view. You might consider changing them in the source.
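
(For context, a minimal sketch of how stages and events like those in the report above are typically registered with PETSc's logging API; "TDyInitEvent" and this standalone main are made up for illustration, not TDycore's actual code. Running the program with -log_view then prints a summary like the one above.)

#include <petscsys.h>

int main(int argc, char **argv) {
  PetscLogStage  setup_stage;
  PetscLogEvent  init_event;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Register a named stage and event; both appear in the -log_view summary. */
  ierr = PetscLogStageRegister("TDycore Setup", &setup_stage); CHKERRQ(ierr);
  ierr = PetscLogEventRegister("TDyInitEvent", PETSC_OBJECT_CLASSID, &init_event); CHKERRQ(ierr);

  ierr = PetscLogStagePush(setup_stage); CHKERRQ(ierr);
  ierr = PetscLogEventBegin(init_event, 0, 0, 0, 0); CHKERRQ(ierr);
  /* ... the setup work being timed ... */
  ierr = PetscLogEventEnd(init_event, 0, 0, 0, 0); CHKERRQ(ierr);
  ierr = PetscLogStagePop(); CHKERRQ(ierr);

  ierr = PetscFinalize();
  return ierr;
}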

@jeff-cohere (Collaborator, Author)

Okay, after updating to PETSc release 3.14 and building optimized(!), I've reproduced results more in line with what Matt has shown, though I don't get the messages about mismatched total times, and some of the times seem pretty different. Maybe I'll put together a script that generates easily-interpreted output that we can all run to make sure we're doing the same thing.

@jeff-cohere (Collaborator, Author)

All right. Here's how things are looking for runs of 20x20x10 up through 160x160x10 using 1 to 4 processes (run until simulation time t = 3, so mostly problem setup). PETSc 3.14, built optimized. It seems like the setup stuff is scaling linearly, though the TDyDriverInitializeTDy call looks like it scales slightly worse.

I can investigate on another box tomorrow with some larger runs. Does this look right to everyone?

@jeff-cohere (Collaborator, Author)

Just some notes here from Gautam for myself, since I'm getting back to this task:

  1. To run a large problem with TDycore:

export N=100
./transient_snes_mpfaof90 \
-tdy_water_density                 exponential \
-tdy_mpfao_gmatrix_method          MPFAO_GMATRIX_TPF \
-tdy_mpfao_boundary_condition_type MPFAO_NEUMANN_BC \
-max_steps 1 -ic_value 90325 \
-nx $N -ny $N -nz $N

  2. PFLOTRAN problem setup: run with -pflotranin performance.in, and modify NXYZ to increase the number of DOFs.

The physical problem setup (i.e., permeability values and boundary conditions) is not the same for the two models. We are only interested in the performance of model setup, so the differences in problem setup can be neglected.

@jeff-cohere (Collaborator, Author)

jeff-cohere commented Jul 23, 2021

Latest update (@bishtgautam, you have these results already--I'm just sticking them here): it seems like we're spending most of our time computing cell geometry. Maybe we can take a look and see whether we can avoid recomputing it where that's unnecessary.

I've added some additional timers in the mpfao-init-profiling branch.

Running:

mpiexec -n 8 ./transient_snes_mpfaof90 \
-tdy_water_density                 exponential \
-tdy_mpfao_gmatrix_method          MPFAO_GMATRIX_TPF \
-tdy_mpfao_boundary_condition_type MPFAO_NEUMANN_BC \
-max_steps 1 -ic_value 90325 \
-nx 100 -ny 100 -nz 10 \
-tdy_timers

Profiling results (running top20 with tdyprof):

          Stage Name                               Event Name             Time             FLOP
          Main Stage                      ComputePlexGeometry           116.53      2.58312e+06
          Main Stage             DMPlexComputeCellGeometryFVM          116.482      2.58312e+06
          Main Stage                 TDySetupNumericalMethods          88.3628      1.92744e+06
       TDycore Setup                       TDyMPFAOInitialize          88.3627      1.92744e+06
       TDycore Setup                             TDyBuildMesh          80.4514      2.05462e+06
       TDycore Setup                   SetupSubcellsFor3DMesh          79.6353      2.04336e+06
       TDycore Setup DMPlexComputeCellGeometryFVM (SetupSubce          78.4626      2.04336e+06
          Main Stage                                SNESSolve          70.9354      3.27254e+10
          Main Stage                                 KSPSolve          63.7017      3.26548e+10
          Main Stage                         DMPlexDistribute          36.8913               0.
          Main Stage                           Mesh Partition          34.7262               0.
          Main Stage                          DMPlexDistOvrlp           32.461               0.
          Main Stage                         DMPlexPartStrtSF          32.1641               0.
          Main Stage                                  MatMult          25.2142      1.00584e+10
          Main Stage                                  PCApply          23.3853      9.60692e+09
          Main Stage                                 MatSolve          22.7797      9.88347e+09
          Main Stage                           KSPGMRESOrthog          14.7776       1.2208e+10
          Main Stage                        TDyCreateJacobian          10.3166               0.
          Main Stage                              DMCreateMat          10.3165               0.
          Main Stage                           DMPlexPrealloc          10.3097               0.

@knepley commented Jul 23, 2021

Hmm, it seems like ComputeCellGeometryFVM() is taking a very long time. That is very strange. Is it being called once? Also, it seems like all the time in distribution comes from stratifying the SF. @jedbrown maybe we need to look at that. I will try to get a clean timing like this in a PETSc example.

@jedbrown (Member)

That would be really helpful, @knepley. We can run tests on Noether (64 cores with good debug env).

@jeff-cohere (Collaborator, Author)

jeff-cohere commented Jul 23, 2021

Here are the calls to DMPlexComputeCellGeometryFVM:

  • One call site within ComputePlexGeometry (the first entry in the timing profile)
  • One call site within SetupSubcellsFor3DMesh (the second entry)

@knepley commented Jul 23, 2021

One thing that would be helpful is to include the event count with the timing summary above.

@jeff-cohere (Collaborator, Author)

> One thing that would be helpful is to include the event count with the timing summary above.

Sure. I'm not near the machine that produced these timings today, but I'll try to get you a report with event counts soon.

@bishtgautam (Member)

@knepley DMPlexComputeCellGeometryFVM is being called cell-by-cell.
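
(Roughly the pattern in question, sketched below with simplified, hypothetical names; this is not the actual TDycore loop.)

/* Sketch of the cell-by-cell pattern: one DMPlexComputeCellGeometryFVM call
 * per cell in the mesh. */
#include <petscdmplex.h>

PetscErrorCode ComputeCellVolumesOneByOne(DM dm, PetscReal *volumes) {
  PetscInt       c, cStart, cEnd;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd); CHKERRQ(ierr);
  for (c = cStart; c < cEnd; ++c) {
    PetscReal centroid[3], normal[3];
    /* Each call traverses the cell's closure and recomputes its geometry. */
    ierr = DMPlexComputeCellGeometryFVM(dm, c, &volumes[c - cStart], centroid, normal); CHKERRQ(ierr);
  }
  /* A possible alternative to investigate: DMPlexComputeGeometryFVM(dm,
   * &cellgeom, &facegeom) computes and caches all cell and face geometry
   * in one pass. */
  PetscFunctionReturn(0);
}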

@knepley commented Jul 27, 2021

But do we only loop over cells once?

@bishtgautam (Member)

We only loop over all cells once.

@bishtgautam (Member)

@jeff-cohere When you get a chance, could you rebase mpfao-init-profiling? thanks.

@jeff-cohere (Collaborator, Author)

> @jeff-cohere When you get a chance, could you rebase mpfao-init-profiling? thanks.

Rebased.

jeff-cohere removed their assignment Jun 25, 2022