
Dynamic Patch Arrays - larger nclmax and nlevleaf #1198

Merged Aug 14, 2024 (24 commits)

Conversation

rgknox (Contributor) commented May 10, 2024

Description:

This set of changes makes a number of arrays attached to the patch structure dynamically allocated. Most of these arrays are dimensioned by number-of-canopy-layers x number-of-pfts x number-of-veg-layers.

Previously, in the interest of keeping these arrays small, we had elected to keep nclmax equal to 2 and nlevleaf equal to 30, and manually increase these values when necessary. We still have these two constants, but they are now used only to allocate stack space or to dimension smaller arrays, so we can bump them up to values that won't affect people's runs.

So now, nclmax = 5 and nlevleaf = 50. We could bump them up higher if need be; I'm keeping them low because they still affect the size of history output arrays. But users can set the history dimensionality to level 1 to remove the large arrays from their output anyway.
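As a rough illustration of the allocation pattern described above (the subroutine and member sizes here are hypothetical sketches, not the verbatim FATES routine):

```fortran
! Hypothetical sketch of per-patch dynamic allocation; subroutine name and
! argument list are illustrative, not the actual FATES API.
subroutine AllocatePatchArrays(this, ncl, npft, nlev)
   class(fates_patch_type), intent(inout) :: this
   integer, intent(in) :: ncl   ! number of canopy layers on this patch
   integer, intent(in) :: npft  ! number of plant functional types
   integer, intent(in) :: nlev  ! number of leaf (vegetation) layers
   ! allocate to the patch's current needs rather than a static
   ! nclmax x maxpft x nlevleaf block
   allocate(this%elai_profile(ncl, npft, nlev))
   this%elai_profile(:,:,:) = 0._r8
end subroutine AllocatePatchArrays
```

Because the arrays are sized per patch, bumping the nclmax and nlevleaf ceilings no longer inflates memory for runs that never reach those limits.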

Collaborators:

This has been a refactor target for years, so lots of people have probably weighed in.

Expectation of Answer Changes:

No answer changes expected, but... the canopy structure part of the code is super sensitive to order-of-operation changes, so I do actually expect very small diffs.

Checklist

If this is your first time contributing, please read the CONTRIBUTING document.

All checklist items must be checked to enable merging this pull request:

Contributor

  • The in-code documentation has been updated with descriptive comments
  • The documentation has been assessed to determine if updates are necessary

Integrator

  • FATES PASS/FAIL regression tests were run
  • Evaluation of test results for answer changes was performed and results provided

Documentation

Test Results:

CTSM (or) E3SM (specify which) test hash-tag:

CTSM (or) E3SM (specify which) baseline hash-tag:

FATES baseline hash-tag:

Test Output:

@rgknox rgknox changed the title Dynamic Patch Arrays - larger nclmax Dynamic Patch Arrays - larger nclmax and nlevleaf May 10, 2024
@rgknox rgknox requested a review from mpaiao May 10, 2024 17:33
@rgknox rgknox added the draft label May 10, 2024
rgknox (Contributor, Author) commented May 10, 2024

There are a few patch-level variables with odd names that I'd also like to change:

patch%ncl_p : The _p suffix indicates patch, but that is implied by being on the patch structure. I'd like to change this to "ncan".

patch%ncan(:,:) : This is not the number of canopy layers! It's the number of vegetation layers in each canopy-layer-by-pft class! I'd like to change this to "nveg(:,:)".

patch%nrad(:,:) : This should be removed; we don't use it!

@mpaiao (Contributor) left a comment

Many thanks for addressing this @rgknox. I went through your changes and they all look good! I am looking forward to testing this new code and seeing if it will help sustain a denser understory in FATES.

else
currentPatch%NCL_p = z
end if

Contributor

It is nice to have the informative error message, but I wonder whether erroring out, as opposed to cohort termination, is something we always want to do from now on. My only concern is that this could trigger too many error messages in global runs or parameter sensitivity experiments.

But we can see if this becomes a problem and address if needed.

Contributor Author

I'm open to allowing termination here. I suppose it would allow the model to continue working when people are testing strange edge-case parameter combinations that generate more than 5 canopy layers.

Contributor Author

@mpaiao , I looked at this again. The calls to termination that precede this section should be ensuring that z <= nclmax. If z is larger than nclmax it should be in error at this point.
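For context, the guard being discussed amounts to something like the following sketch (the logging and abort calls follow common FATES conventions but are illustrative here, not the verbatim source):

```fortran
! Sketch of the nclmax guard: preceding cohort-termination calls should
! ensure z <= nclmax, so exceeding it is treated as a fatal model error.
if (z > nclmax) then
   write(fates_log(),*) 'number of canopy layers exceeded nclmax:', z, nclmax
   call endrun(msg=errMsg(sourcefile, __LINE__))
else
   currentPatch%NCL_p = z
end if
```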

@rgknox rgknox removed the draft label May 13, 2024
@glemieux glemieux assigned rgknox and glemieux and unassigned glemieux Jun 3, 2024
rgknox (Contributor, Author) commented Jun 10, 2024

I compared the timing output for simulations at BCI that use this method and the old method using a cap of 3 canopy layers. The simulations had no noticeable difference in run time.

@glemieux glemieux self-requested a review June 17, 2024 18:45
@glemieux (Contributor) left a comment

I think this looks good. The (re)allocation code was straightforward and well commented. I only had one question below about zeroing some of the dynamics values.

@glemieux (Contributor)

Regression testing underway on derecho

glemieux (Contributor) commented Jul 19, 2024

Regression testing against fates-sci.1.77.1_api.36.0.0-ctsm5.2.013 is showing two RUN failures on derecho:

FAIL ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdPRT2 RUN time=60
FAIL ERS_P128x1_Lm25.f10_f10_mg37.I2000Clm60Fates.derecho_intel.clm-FatesColdNoComp RUN time=573

The FatesColdPRT2 test is failing with the following error per the cesm.log:

 20 dec0207.hsn.de.hpc.ucar.edu 378: forrtl: severe (408): fort: (7): Attempt to use pointer CPATCH when it is not associated with a target
 21 dec0207.hsn.de.hpc.ucar.edu 378:
 22 dec0207.hsn.de.hpc.ucar.edu 378: Image              PC                Routine            Line        Source
 23 dec0207.hsn.de.hpc.ucar.edu 378: cesm.exe           00000000018DFCBA  edphysiologymod_m        3323  EDPhysiologyMod.F90
 24 dec0207.hsn.de.hpc.ucar.edu 378: cesm.exe           00000000017BF521  edmainmod_mp_ed_i         692  EDMainMod.F90
 25 dec0207.hsn.de.hpc.ucar.edu 378: cesm.exe           00000000017B9DC2  edmainmod_mp_ed_e         220  EDMainMod.F90
 26 dec0207.hsn.de.hpc.ucar.edu 378: cesm.exe           0000000000B04EA2  clmfatesinterface        1259  clmfates_interfaceMod.F90
 27 dec0207.hsn.de.hpc.ucar.edu 378: cesm.exe           0000000000A7B8B5  clm_driver_mp_clm        1142  clm_driver.F90
 28 dec0207.hsn.de.hpc.ucar.edu 378: cesm.exe           000000000099F3B7  lnd_comp_nuopc_mp         904  lnd_comp_nuopc.F90

The FatesColdNoComp test cesm.log stack trace is a little more obscure and doesn't point directly to a portion of the fates code:

2370077 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           000000000109E3CD  shr_abort_mod_mp_         114  shr_abort_mod.F90
2370078 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           00000000005E94B1  abortutils_mp_end          98  abortutils.F90
2370079 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           0000000000E29168  ch4mod_mp_ch4_tra        4186  ch4Mod.F90
2370080 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           0000000000E1A6E6  ch4mod_mp_ch4_           2094  ch4Mod.F90
2370081 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           00000000005F8DE8  clm_driver_mp_clm        1203  clm_driver.F90
2370082 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           000000000059E67E  lnd_comp_nuopc_mp         904  lnd_comp_nuopc.F90

The lnd.log file doesn't provide much context aside from the fact that it failed about 2/3 of the way through the initial case.

Test results can be found: /glade/derecho/scratch/glemieux/ctsm-tests/tests_pr1198

@rgknox there are DIFFs, which I think are as expected, although I only spot checked a few. That said, I'm going to rerun the baseline to make sure that the one I was comparing against was generated correctly.

glemieux (Contributor) commented Jul 19, 2024

@rgknox there are DIFFs, which I think are as expected, although I only spot checked a few. That said, I'm going to rerun the baseline to make sure that the one I was comparing against was generated correctly.

It looks like the old baseline I had tested against was not with the latest tags. I've moved that one and regenerated fates-sci.1.77.1_api.36.0.0-ctsm5.2.013 with the appropriate tags checked out. I've got fates suite tests rerunning against the newly generated baseline.

glemieux (Contributor) commented Jul 19, 2024

I realized I had missed a more illuminating error message in the failing FatesColdNoComp test:

26413 dec0379.hsn.de.hpc.ucar.edu 120:  energy balance in canopy           24 , err= -0.973966808913889
26414 dec0379.hsn.de.hpc.ucar.edu 120:  Negative conc. in ch4tran. c,j,deficit (mol):           2           4
26415 dec0379.hsn.de.hpc.ucar.edu 120:   1.119271163016430E-003
26416 dec0379.hsn.de.hpc.ucar.edu 120:  Negative conc. in ch4tran. c,j,deficit (mol):           2           5
26417 dec0379.hsn.de.hpc.ucar.edu 120:   3.536220161397217E-003
26418 dec0379.hsn.de.hpc.ucar.edu 120:  Negative conc. in ch4tran. c,j,deficit (mol):           2           6
26419 dec0379.hsn.de.hpc.ucar.edu 120:   7.176459108002257E-003
26420 dec0379.hsn.de.hpc.ucar.edu 120:  Note: sink > source in ch4_tran, sources are changing  quickly relative to diff
26421 dec0379.hsn.de.hpc.ucar.edu 120:  usion timestep, and/or diffusion is rapid.
26422 dec0379.hsn.de.hpc.ucar.edu 120:  Latdeg,Londeg=   80.0000000000000        285.000000000000
26423 dec0379.hsn.de.hpc.ucar.edu 120:  This typically occurs when there is a larger than normal  diffusive flux.
26424 dec0379.hsn.de.hpc.ucar.edu 120:  If this occurs frequently, consider reducing land model (or  methane model) tim
26425 dec0379.hsn.de.hpc.ucar.edu 120:  estep, or reducing the max. sink per timestep in the methane model.
26426 dec0379.hsn.de.hpc.ucar.edu 120:  Negative conc. in ch4tran. c,j,deficit (mol):           2           7
26427 dec0379.hsn.de.hpc.ucar.edu 120:   1.185680703304796E-002
26428 dec0379.hsn.de.hpc.ucar.edu 120:  Negative conc. in ch4tran. c,j,deficit (mol):           2           8
26429 dec0379.hsn.de.hpc.ucar.edu 120:   8.928985367626268E-003
26430 dec0379.hsn.de.hpc.ucar.edu 120:  CH4 Conservation Error in CH4Mod during diffusion, nstep, c, errch4 (mol /m^2.t
26431 dec0379.hsn.de.hpc.ucar.edu 120:  imestep)       25298           2 -3.813536517318042E-002
26432 dec0379.hsn.de.hpc.ucar.edu 120:  Latdeg,Londeg=   80.0000000000000        285.000000000000
26433 dec0379.hsn.de.hpc.ucar.edu 120: iam = 120: local  column   index = 2
26434 dec0379.hsn.de.hpc.ucar.edu 120: iam = 120: global column   index = 1730
26435 dec0379.hsn.de.hpc.ucar.edu 120: iam = 120: global landunit index = 568
26436 dec0379.hsn.de.hpc.ucar.edu 120: iam = 120: global gridcell index = 249
26437 dec0379.hsn.de.hpc.ucar.edu 120: iam = 120: gridcell longitude    =  285.0000000
26438 dec0379.hsn.de.hpc.ucar.edu 120: iam = 120: gridcell latitude     =   80.0000000
26439 dec0379.hsn.de.hpc.ucar.edu 120: iam = 120: column   type         = 1
26440 dec0379.hsn.de.hpc.ucar.edu 120: iam = 120: landunit type         = 1
26441 dec0379.hsn.de.hpc.ucar.edu 120:  ENDRUN:
26442 dec0379.hsn.de.hpc.ucar.edu 120:  ERROR:
26443 dec0379.hsn.de.hpc.ucar.edu 120:   ERROR: CH4 Conservation Error in CH4Mod during diffusionERROR in ch4Mod.F90 at
26444 dec0379.hsn.de.hpc.ucar.edu 120:   line 4188
26445 dec0379.hsn.de.hpc.ucar.edu 120: Image              PC                Routine            Line        Source
26446 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           000000000109E3CD  shr_abort_mod_mp_         114  shr_abort_mod.F90
26447 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           00000000005E94B1  abortutils_mp_end          98  abortutils.F90
26448 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           0000000000E29168  ch4mod_mp_ch4_tra        4186  ch4Mod.F90
26449 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           0000000000E1A6E6  ch4mod_mp_ch4_           2094  ch4Mod.F90
26450 dec0379.hsn.de.hpc.ucar.edu 120: cesm.exe           00000000005F8DE8  clm_driver_mp_clm        1203  clm_driver.F90

@glemieux (Contributor)

It looks like the old baseline I had tested against was not with the latest tags. I've moved that one and regenerated fates-sci.1.77.1_api.36.0.0-ctsm5.2.013 with the appropriate tags checked out. I've got fates suite tests rerunning against the newly generated baseline.

The updated run with corrected baseline comparison can be found here: /glade/derecho/scratch/glemieux/ctsm-tests/tests_0719-153219de.

The number of DIFFs has been reduced, although this is a little difficult to discern at first: 45 of the reported DIFFs are due only to the dimensions differing, which is as expected.

The list of cprnc.out files with DIFFs that actually have non-zero differences can be found in the rms.out file at the top of the test directory. There are 17 files.

rgknox (Contributor, Author) commented Jul 20, 2024

@glemieux , are there still run fails?

@glemieux (Contributor)

@rgknox yes, I'm still seeing the same failures as noted here: #1198 (comment)

@glemieux (Contributor)

Retesting with nclmax = 2 to check if this impacts the failing tests.

@glemieux (Contributor)

Regression testing the fates suite with nclmax = 2 results in the ERS_P128x1_Lm25.f10_f10_mg37.I2000Clm60Fates.derecho_intel.clm-FatesColdNoComp test now passing its run phase. The PRT2 test still fails.

rgknox (Contributor, Author) commented Jul 24, 2024

Still working out why there is a failure in the ERS_P128x1_Lm25.f10_f10_mg37.I2000Clm60Fates.XX.FatesColdNocomp tests.

I've tried making it a debug test and running with gnu. The model failed at different places; the common thread seems to be related to heat/energy/temperature. This makes me suspect something isn't getting zeroed vis-à-vis the radiation arrays...

I also tried removing the NoComp specification; ERS_D_P128x1_Lm25.f10_f10_mg37.I2000Clm60Fates.derecho_gnu.clm-FatesCold does pass...

rgknox (Contributor, Author) commented Jul 25, 2024

@mpaiao @glemieux and other reviewers:

ERS_P128x1_Lm25.f10_f10_mg37.I2000Clm60Fates.XX.FatesColdNocomp fails with main when I bump nclmax up to 3, so I don't believe the problem is with this pull request; main "should" pass this test when nclmax = 3.

I propose setting nclmax = 2 in this pull request, integrating after re-running the tests, and then creating an issue. I'm happy to prioritize that issue.

rgknox (Contributor, Author) commented Aug 1, 2024

Tests look good, b4b with base: /glade/derecho/scratch/rgknox/tests_0731-094347de
The exception is a new test that does not match base: PVT_Lm3.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesLUPFT; I will open an issue about this.

rgknox (Contributor, Author) commented Aug 14, 2024

Tests look good except for the PVT baseline, which is not passing for other tests as well.

/glade/derecho/scratch/rgknox/tests_0813-195403de

@rgknox rgknox merged commit b469786 into NGEET:main Aug 14, 2024
1 check was pending