Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use small PE counts for aux_clm tests and remove some redundant / unnecessary tests #1688

Merged
merged 10 commits into from
Mar 24, 2022

Conversation

billsacks
Copy link
Member

Description of changes

This PR reduces the PE count for aux_clm tests so that all (I think) tests fit in 4 nodes or less. This should
both (1) reduce queue wait time for tests and (2) reduce aux_clm testing cost (since, for high resolution
tests, most time is spent in initialization, so having more PEs increases cost significantly without actually
making the tests run much, if any, faster). While doing this, I also noticed some tests that felt redundant or unnecessary, and I have removed those.

To facilitate these changes, I have introduced new "S" PE layouts for the f09, f19 and 0.125nldas2 resolutions. These layouts use 4 nodes (but with DATM on just 1 node, because it is slightly faster to run it on one node than four nodes). (I considered running DATM on its own node like we do for production PE layouts, but that turns out to only improve runtime slightly while increasing cost significantly, so I am having DATM share one of the nodes used for LND.)

Here is a detailed list of the changes:

Changes to aux_clm test list

I changed the following:

2c2
< SMS_D_Ld9.f09_g17.I1850Clm50BgcNoAnthro.cheyenne_intel.clm-decStart1851_noinitial                         
---
> SMS_D_Ld3_PS.f09_g17.I1850Clm50BgcNoAnthro.cheyenne_intel.clm-decStart1851_noinitial                      
7c7
< SMS_D_Ld1.f09_g17.I1850Clm50Sp.cheyenne_intel.clm-default                                                  # Include a test of this scientifically-supported compset at a scientifically-supported resolution
---
> SMS_D_Ld1_PS.f09_g17.I1850Clm50Sp.cheyenne_intel.clm-default                                               # Include a test of this scientifically-supported compset at a scientifically-supported resolution
14,17c14,15
< SMS_D_Ln9_P480x3.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline                                    # Run a transient case with standalone settings similar to the FXHIST waccm test
< SMS_D_Ln9_P480x3_Vmct.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline                               # MCT test covering drydep, megan and Gregorian calendar.
---
> SMS_D_Ln9_P36x3.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline                                     # Run a transient case with standalone settings similar to the FXHIST waccm test
> SMS_D_Ln9_P36x3_Vmct.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline                                # MCT test covering drydep, megan and Gregorian calendar.
20c18
< ERI_Ld9.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_intel.clm-drydepnomegan                                    
---
> ERI_Ld9.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_gnu.clm-drydepnomegan                                      
32c30
< SMS_Ln9_P360x2.C96_C96_mg17.IHistClm50BgcCrop.cheyenne_intel.clm-clm50cam6LndTuningMode                    # Want one C96 test in the aux_clm test suite; just a short smoke test to make sure it can get off the ground. Use a PE layout that (1) has threading, because CAM uses threading at this resolution; and (2) has a smaller-than-standard task count in order to get through the queue faster.
---
> SMS_Ln9_P72x2.C96_C96_mg17.IHistClm50BgcCrop.cheyenne_intel.clm-clm50cam6LndTuningMode                     # Want one C96 test in the aux_clm test suite; just a short smoke test to make sure it can get off the ground. Use a PE layout that (1) has threading, because CAM uses threading at this resolution; and (2) has a smaller-than-standard task count in order to get through the queue faster.
36,37c34
< ERP_D_Ld5.f19_g17.I2000Clm50BgcCru.cheyenne_intel.clm-fire_emis                                           
---
> ERP_D_Ld5.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_gnu.clm-fire_emis                                        
42c39
< ERP_D_Ld5.f19_g17.IHistClm50SpCru.cheyenne_intel.clm-drydepnomegan                                        
---
> ERP_D_Ld5.f10_f10_mg37.IHistClm50SpCru.cheyenne_gnu.clm-drydepnomegan                                     
66,67c62
< ERP_Ld5.f10_f10_mg37.I1850Clm50Bgc.cheyenne_intel.clm-drydepnomegan                                       
---
> ERP_D_Ld5.f10_f10_mg37.I1850Clm50Bgc.cheyenne_gnu.clm-drydepnomegan                                       
87,90c78,80
< ERP_P180x2_D.f19_g17.I2000Clm50SpRtmFl.cheyenne_intel.clm-default                                          # include a debug test with flooding on
< ERP_P180x2_D_Ld5.f19_g17_gris4.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_increase                       # cism is not answer preserving across processor changes, but short test length should be ok.
< ERP_P180x2_D_Ld5.f19_g17.I2000Clm50Sp.cheyenne_intel.clm-default                                          
---
> ERP_P36x2_D.f10_f10_mg37.I2000Clm50SpRtmFl.cheyenne_intel.clm-default                                      # include a debug test with flooding on
> ERP_P72x2_D_Ld5.f19_g17_gris4.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_increase                        # cism is not answer preserving across processor changes, but short test length should be ok.
> ERP_P36x2_D_Ld5.f10_f10_mg37.I2000Clm50Sp.cheyenne_gnu.clm-default                                        
106,108c96
< ERS_D_Ld3.f19_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-clm50dynroots                                      
---
> ERS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-clm50dynroots                                 
114c102
< ERS_Ld3.f09_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-rad_hrly_light_res_half                              
---
> ERS_Ld3_D.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-rad_hrly_light_res_half                         
128,129c116,117
< LII_D_Ld3.f19_g17.I2000Clm50BgcCrop.cheyenne_intel.clm-default                                             # Basic LII test, covering the standard range of subgrid heterogeneity - particularly, including crop. Uses a year-2000 restart file so that the restart file has non-zero product pools, so that we exercise the gridcell-level code in init_interp.
< LII2FINIDATAREAS_D_P360x2_Ld1.f09_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default                         # Exercise the init_interp_method='use_finidat_areas' option. See documentation at the top of the python script implementing this test for more details and rationale. This test requires a compatible finidat file (i.e., a file that can be used without interpolation). If no such file is available out-of-the-box, then the test will need to use a testmod that points to a compatible file.
---
> LII_D_Ld3_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel.clm-default                                          # Basic LII test, covering the standard range of subgrid heterogeneity - particularly, including crop. Uses a year-2000 restart file so that the restart file has non-zero product pools, so that we exercise the gridcell-level code in init_interp.
> LII2FINIDATAREAS_D_P72x2_Ld1.f09_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default                          # Exercise the init_interp_method='use_finidat_areas' option. See documentation at the top of the python script implementing this test for more details and rationale. This test requires a compatible finidat file (i.e., a file that can be used without interpolation). If no such file is available out-of-the-box, then the test will need to use a testmod that points to a compatible file.
139c127
< SMS_D_Ld1.f09_g17.I1850Clm50BgcSpinup.cheyenne_intel.clm-cplhist                                          
---
> SMS_D_Ld1_PS.f09_g17.I1850Clm50BgcSpinup.cheyenne_intel.clm-cplhist                                       
141,142c129,130
< SMS_Vmct_D_Ld1.f09_g17.I1850Clm50BgcSpinup.cheyenne_intel.clm-cplhist                                      # pgi test for MCT driver since nuopc is not passing
< SMS_D.f19_f19_mg17.I2010Clm50Sp.cheyenne_intel.clm-clm50cam6LndTuningMode                                 
---
> SMS_Vmct_D_Ld1_PS.f09_g17.I1850Clm50BgcSpinup.cheyenne_intel.clm-cplhist                                   # pgi test for MCT driver since nuopc is not passing
> SMS_D_Ld1_PS.f19_f19_mg17.I2010Clm50Sp.cheyenne_intel.clm-clm50cam6LndTuningMode                          
173c161
< SMS_Ld1.f09_g17.I2000Clm50BgcCru.cheyenne_intel.clm-af_bias_v7                                            
---
> SMS_Ld1_PS.f09_g17.I2000Clm50BgcCru.cheyenne_gnu.clm-af_bias_v7                                           
177,178c165
< SMS_Ld2_D.f09_g17.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic_interp                                   # This gives a short debug test of the cmip6 configuration as well as a test of the cmip6 configuration at the production resolution, both of which we want. This test needs to use init_interp to work, because of adding virtual Antarctica columns.
---
> SMS_Ld2_D_PS.f09_g17.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic_interp                                # This gives a short debug test of the cmip6 configuration as well as a test of the cmip6 configuration at the production resolution, both of which we want. This test needs to use init_interp to work, because of adding virtual Antarctica columns.
186c172
< SMS_Lm13.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-cropMonthOutput                                      # include a relatively long crop test at relatively high resolution
---
> SMS_Lm13_PS.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-cropMonthOutput                                   # include a relatively long crop test at relatively high resolution
190c176
< ERP_D_Ld10.f19_g17.I1850Clm51BgcCrop.cheyenne_intel.clm-ADspinup                                           # Include a restart test for AD spinup mode because of specific logic for spinup_state. Lack of this test did cause a problem.
---
> ERP_D_Ld10.f10_f10_mg37.I1850Clm51BgcCrop.cheyenne_intel.clm-ADspinup                                      # Include a restart test for AD spinup mode because of specific logic for spinup_state. Lack of this test did cause a problem.
195c181
< ERS_D_Ld5.f09_g17.I2000Clm50FatesRs.cheyenne_intel.clm-FatesColdDef                                        # Want one fates test on a large grid: Since FATES has cohorts, it has potential to be a massive memory consumer and netcdf array size maker, so the large grid test will help smoke out these types of issues (and it's a restart test to cover possible memory/netcdf size issues with the restart file).
---
> ERS_D_Ld3_PS.f09_g17.I2000Clm50FatesRs.cheyenne_intel.clm-FatesColdDef                                     # Want one fates test on a large grid: Since FATES has cohorts, it has potential to be a massive memory consumer and netcdf array size maker, so the large grid test will help smoke out these types of issues (and it's a restart test to cover possible memory/netcdf size issues with the restart file).
209c195
< SMS_Ld5.f19_g17.I2000Clm50FatesRs.cheyenne_intel.clm-FatesColdDef                                         
---
> SMS_Ld5_PS.f19_g17.I2000Clm50FatesRs.cheyenne_gnu.clm-FatesColdDef                                        
223,224c209,210
< SMS_Ld1.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldas.cheyenne_gnu.clm-default                              # Include a short smoke test covering the nldas2 grid and the I2000Ctsm50NwpSpNldas compset, which uses NLDAS datm forcing.
< SMS_Ld1.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default                            # Include a short smoke test covering the nldas2 grid and the I2000Ctsm50NwpSpNldasRs compset, which uses NLDAS datm forcing.
---
> SMS_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldas.cheyenne_gnu.clm-default                           # Include a short smoke test covering the nldas2 grid and the I2000Ctsm50NwpSpNldas compset, which uses NLDAS datm forcing.
> SMS_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default                         # Include a short smoke test covering the nldas2 grid and the I2000Ctsm50NwpSpNldasRs compset, which uses NLDAS datm forcing.
232c217
< PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel                                                          # Can use this test to determine if there are significant throughput changes, at least for this common and important configuration. Note that this deliberately doesn't have any testmods in order to (1) avoid doing history output (because the timing of output can be very variable, and mixing output timing with other aspects of model time can be confusing), and (2) generally keep the test replicating a production configuration as closely as possible (so, for example, we do NOT set BFBFLAG=TRUE for this test).
---
> PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel                                                       # Can use this test to determine if there are significant throughput changes, at least for this common and important configuration. Note that this deliberately doesn't have any testmods in order to (1) avoid doing history output (because the timing of output can be very variable, and mixing output timing with other aspects of model time can be confusing), and (2) generally keep the test replicating a production configuration as closely as possible (so, for example, we do NOT set BFBFLAG=TRUE for this test).

In addition, I moved the following tests to ctsm_sci:

SMS_Ln9_P144x3.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline2005Start # Do a test similar to FXHIST starting at a 2005 start date, will interpoalte from the 2003 IC file
SMS_Lm1.f19_g17.I1850Clm50BgcCropCmip6waccm.cheyenne_intel.clm-basic # The main point of this test is simply to make sure that the CMIP6WACCMDECK modifier works for 2-degree since that resolution turns off Carbon isotopes 

and I completely removed the following tests:

SMS_Ld1.f09_g17.I1850Clm50Bgc.cheyenne_intel.clm-drydepnomegan
ERI_Ld9.f10_f10_mg37.I1850Clm50Bgc.cheyenne_gnu.clm-drydepnomegan
ERP_D_Ld5.f19_g17.I2000Clm50BgcCru.cheyenne_intel.clm-default
ERP_Ld5.f19_g17.I1850Clm50Bgc.cheyenne_intel.clm-default
ERP_Ld5.f19_g17.I2000Clm50SpRtmFl.cheyenne_gnu.clm-default
ERP_Ld5.f19_g17.I2000Clm50BgcCru.cheyenne_intel.clm-default
ERP_Ld3.f09_g17.I1850Clm50BgcCropCru.cheyenne_intel.clm-ciso
ERP_P180x2_D_Ld5.f19_g17_gris4.I1850Clm50BgcCropG.cheyenne_intel.clm-default
ERS_D_Ld3.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_intel.clm-default
ERS_D_Ld3.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_gnu.clm-default
SMS_Ld5.f19_g17.IHistClm50Bgc.cheyenne_intel.clm-decStart
SMS_Lm1.f19_g17.I1850Clm50Bgc.cheyenne_intel.clm-clm50dynroots

In general, these removed tests felt redundant with existing tests (including some low-resolution tests that I happened to notice as feeling redundant as I was reworking the testing). However, I am not totally confident about my changes and removal of some drydepnomegan tests: I removed two tests of drydepnomegan and changed some others to lower resolution; is this okay, or is there particular value in having high-resolution tests of drydepnomegan?

Changes to ctsm_sci test list

See notes above for tests moved from the aux_clm test list to the ctsm_sci test list; other than that, I did not make any changes to the ctsm_sci test list.

Changes to prealpha test list

I made the following changes to the prealpha test list to keep it in line with the aux_clm test list:

< prealpha: ERP_Ld5.f19_g17.I2000Clm50SpRtmFl.cheyenne_gnu.clm-default                        
< prealpha: ERP_P180x2_D_Ld5.f19_g17_gris4.I1850Clm50BgcCropG.cheyenne_intel.clm-default      
< prealpha: ERP_P180x2_D_Ld5.f19_g17.I2000Clm50Sp.cheyenne_intel.clm-default                  
---
> prealpha: ERP_P36x2_D.f10_f10_mg37.I2000Clm50SpRtmFl.cheyenne_intel.clm-default               # include a debug test with flooding on
> prealpha: ERP_P72x2_D_Ld5.f19_g17_gris4.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_increase # cism is not answer preserving across processor changes, but short test length should be ok.
> prealpha: ERP_P36x2_D_Ld5.f10_f10_mg37.I2000Clm50Sp.cheyenne_gnu.clm-default                 

Changes to prebeta test list

I made the following changes to the prealpha test list to keep it in line with the aux_clm test list:

1c1
< prebeta: SMS_D_Ld9.f09_g17.I1850Clm50BgcNoAnthro.cheyenne_intel.clm-decStart1851_noinitial          
---
> prebeta: SMS_D_Ld3_PS.f09_g17.I1850Clm50BgcNoAnthro.cheyenne_intel.clm-decStart1851_noinitial       
3,4c3
< prebeta: ERP_D_Ld5.f19_g17.I2000Clm50BgcCru.cheyenne_intel.clm-fire_emis                            
< prebeta: ERP_D_Ld5.f19_g17.IHistClm50SpCru.cheyenne_intel.clm-drydepnomegan                         
---
> prebeta: ERP_D_Ld5.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_gnu.clm-fire_emis
> prebeta: ERP_D_Ld5.f10_f10_mg37.IHistClm50SpCru.cheyenne_gnu.clm-drydepnomegan
9,10c8,9
< prebeta: ERS_D_Ld3.f19_g17.I1850Clm50BgcCrop.izumi_intel.clm-clm50dynroots                          
< prebeta: ERS_D_Ld3.f19_g17.I1850Clm50BgcCrop.izumi_pgi.clm-clm50dynroots                            
---
> prebeta: ERS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.izumi_intel.clm-clm50dynroots                     
> prebeta: ERS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.izumi_pgi.clm-clm50dynroots                       

In addition, I removed the following from the prebeta test list:

ERP_Ld5.f19_g17.I2000Clm50SpRtmFl.cheyenne_gnu.clm-default
SMS_Ld1.f19_g17.I2000Clm50BgcCru.cheyenne_intel.clm-default
SMS_Lm1.f19_g17.I1850Clm50BgcCropCmip6waccm.cheyenne_intel.clm-basic # The main point of this test is simply to make sure that the CMIP6WACCMDECK modifier works for 2-degree since that resolution turns off Carbon isotopes

For the removed tests, we already have similar tests to these removed tests in the prealpha and/or prebeta test suites. (It is unnecessary to have the same tests in both prealpha and prebeta test lists, because the prebeta test list is generally just run on tags that have already been through prealpha testing.)

Other test categories

I did not make any changes to other test categories: fates, aux_cime_baselines, clm_pymods or clm_short. The fates test category has some high resolution tests that should probably be changed similarly to what I did here, but I'd like to leave that up to fates developers.

Some notes on the changed tests

All of the new (changed) tests pass. Nearly all of them run in less than 10 min (and some of them actually run faster than before); the one exception is SMS_Lm13_PS.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-cropMonthOutput, which has a 31 min run time; that is as expected. I feel like that's acceptable: it's in the neighborhood of some of our other long tests, and I feel like it's worth having one more long-running test in order to avoid having anything that might sit in the queue too long. We could consider putting this test on something like 8 nodes instead of 4 if we find that it is consistently the last test to finish, but for now I wanted to try for a goal of having all tests on 4 nodes or fewer.

I compared the detailed timing files from the new PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel test with the original PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel test. For the most part, the relative timings of different pieces are about the same in both – at least for the major culprits, which is what I compared. So this is good: I feel like we can use the new test to still give indications of when the timing of a particular section changes significantly, and be reasonably confident that the same result would hold for an f09 test with higher processor count.

Specific notes

Contributors other than yourself, if any:

CTSM Issues Fixed (include github issue #):

Are answers expected to change (and if so in what way)? NO (but there will be baseline failures)

Any User Interface Changes (namelist or namelist defaults changes)? NO

Testing performed, if any:

Ran all new/changed tests:

SMS_D_Ld3_PS.f09_g17.I1850Clm50BgcNoAnthro.cheyenne_intel.clm-decStart1851_noinitial
SMS_D_Ld1_PS.f09_g17.I1850Clm50Sp.cheyenne_intel.clm-default
SMS_D_Ln9_P36x3.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline
SMS_D_Ln9_P36x3_Vmct.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline
ERI_Ld9.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_gnu.clm-drydepnomegan
SMS_Ln9_P72x2.C96_C96_mg17.IHistClm50BgcCrop.cheyenne_intel.clm-clm50cam6LndTuningMode
ERP_D_Ld5.f10_f10_mg37.I2000Clm50BgcCru.cheyenne_gnu.clm-fire_emis
ERP_D_Ld5.f10_f10_mg37.IHistClm50SpCru.cheyenne_gnu.clm-drydepnomegan
ERP_D_Ld5.f10_f10_mg37.I1850Clm50Bgc.cheyenne_gnu.clm-drydepnomegan
ERP_P36x2_D.f10_f10_mg37.I2000Clm50SpRtmFl.cheyenne_intel.clm-default
ERP_P72x2_D_Ld5.f19_g17_gris4.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_increase
ERP_P36x2_D_Ld5.f10_f10_mg37.I2000Clm50Sp.cheyenne_gnu.clm-default
ERS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_intel.clm-clm50dynroots
ERS_Ld3_D.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-rad_hrly_light_res_half
LII_D_Ld3_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel.clm-default
LII2FINIDATAREAS_D_P72x2_Ld1.f09_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-default
SMS_D_Ld1_PS.f09_g17.I1850Clm50BgcSpinup.cheyenne_intel.clm-cplhist
SMS_Vmct_D_Ld1_PS.f09_g17.I1850Clm50BgcSpinup.cheyenne_intel.clm-cplhist
SMS_D_Ld1_PS.f19_f19_mg17.I2010Clm50Sp.cheyenne_intel.clm-clm50cam6LndTuningMode
SMS_Ld1_PS.f09_g17.I2000Clm50BgcCru.cheyenne_gnu.clm-af_bias_v7
SMS_Ld2_D_PS.f09_g17.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic_interp
SMS_Lm13_PS.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-cropMonthOutput
ERP_D_Ld10.f10_f10_mg37.I1850Clm51BgcCrop.cheyenne_intel.clm-ADspinup
ERS_D_Ld3_PS.f09_g17.I2000Clm50FatesRs.cheyenne_intel.clm-FatesColdDef
SMS_Ld5_PS.f19_g17.I2000Clm50FatesRs.cheyenne_gnu.clm-FatesColdDef
SMS_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldas.cheyenne_gnu.clm-default
SMS_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default
PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel

billsacks and others added 7 commits March 17, 2022 12:53
With this small node count, having datm on its own node only makes
things a little faster, and adds significantly to the cost (and might
lead to slightly longer queue wait times), so I don't think it's worth
it. However, datm is slightly faster with just one node rather than 4,
so I'm maintaining that aspect of the PE layout.
I compared the detailed timing files from this new
PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel test with the
original PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel test. For the
most part, the relative timings of different pieces are about the same
in both - at least for the major culprits, which is what I compared. So
this is good: I feel like we can use the new test to still give
indications of when the timing of a particular section changes
significantly, and be reasonably confident that the same result would
hold for an f09 test with higher processor count.
Main goal is to reduce the PE count for aux_clm tests so that all tests
fit in 4 nodes or less. This should both (1) reduce queue wait time for
tests and (2) reduce aux_clm testing cost (since, for high resolution
tests, most time is spent in initialization, so having more PEs
increases cost significantly without actually making the tests run much,
if any, faster).

While doing this, I also noticed some tests that felt redundant or
unnecessary, and I have removed those.

I will give more complete documentation of the changes in the associated
Pull Request.
@billsacks billsacks added testing additions or changes to tests PR status: ready PR: this is ready to merge in, with all tests satisfactory and reviews complete labels Mar 21, 2022
@billsacks billsacks requested a review from ekluzek March 21, 2022 14:16
@billsacks billsacks self-assigned this Mar 21, 2022
@billsacks
Copy link
Member Author

@ekluzek I'd welcome your sign-off on this before I make a tag with these changes, if you'd like to give it a quick look. I'd suggest just looking at the detailed comment I made above rather than trying to look at the diffs in the testlist xml, which can be hard to look at. Note that I have a specific question about the drydepnomegan tests: I removed two tests of drydepnomegan and changed some others to lower resolution; is this okay, or is there particular value in having high-resolution tests of drydepnomegan?

@glemieux @rgknox note that I have NOT changed any fates category tests; I wanted to leave it up to you to make any changes to that test category. The main changes needed would be adding a _PS qualifier to any f09 or f19 tests to use the new, smaller PE counts for these tests.

@ekluzek
Copy link
Collaborator

ekluzek commented Mar 21, 2022

@billsacks the main reason the drydep test is at f09 is that the time when drydep matters are for cases coupled to CAM with chemistry which are going to be done at f09 or f19 and never at lower resolution. However, we haven't run into trouble with drydep, so I think it's OK to relax the resolution here for our standard testing. Especially with the idea that we'll keep the higher resolution testing in the ctsm_sci test list which we'll run at a lower frequency. When we had everything in aux_clm I would be more reluctant to lower it.

@ekluzek
Copy link
Collaborator

ekluzek commented Mar 21, 2022

The other thing I looked into in terms of resolution was to look at the VOC raw dataset, which we have at half degree resolution, and it's got really sharp gradients in the dataset. This comes into play for most of our tests as we by default turn MEGAN on for most. But, the main time that VOC's matter is again when coupled to CAM with chemistry on, so will apply to higher resolution cases. A half degree case is considered the standard resolution for full chemistry. But, the question for our testing is can we detect a problem with drydep or VOC that would impact higher resolution chemistry simulations for an I compset? And probably the main issue is going to be dataset issues that could be found at any resolution. That coupled with the fact that we haven't run into trouble with drydep or VOC's, makes me think it would be OK to relax the resolution back to something more course. We should just make sure the ctsm_sci test list has a good list of resolutions for these full chemistry simulations.

Copy link
Collaborator

@ekluzek ekluzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really glad you are doing this. In practice I know how difficult it can be to make these assessments, which is probably why we tend to avoid it. But, improving the performance of our testing can really improve our entire workflow of everyone, so it's worth doing.

I agree with most of what you are doing here. I think two of your PE layout changes are wrong, but we should go over that. I also recommend adding back the flooding test, but at a lower resolution.

There's one test you moved to ctsm_sci and I just think there should be a comment on why it's there.

I added a checkbox for things that I think something needs to happen. For other cases I just agree with your assessment, so you don't need to do anything.

<rootpe_glc>-1</rootpe_glc>
<rootpe_wav>-1</rootpe_wav>
<rootpe_cpl>-1</rootpe_cpl>
<rootpe_lnd>0</rootpe_lnd>
Copy link
Collaborator

@ekluzek ekluzek Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong to me as it looks like the f19 layout for izumi/hobart is concurrent with one node for datm, and this is starting everything on the datm root node.

  • Make sure this is right.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@billsacks explains below, but basically datm doesn't scale well so it's better off running on only one node, but sequential with CTSM. The higher node count setups rely on datm running on 1 node running about as fast as many nodes for CTSM, with smaller node counts we can't do that.

@@ -2411,8 +2313,7 @@
</test>
<test name="SMS_Lm1" grid="f19_g17" compset="I1850Clm50BgcCropCmip6waccm" testmods="clm/basic">
<machines>
<machine name="cheyenne" compiler="intel" category="aux_clm"/>
<machine name="cheyenne" compiler="intel" category="prebeta"/>
<machine name="cheyenne" compiler="intel" category="ctsm_sci"/>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Add a note here about why this test is in the ctsm_sci test list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I just said that I'd do this, but I realized that there already is a comment on this test (you need to expand the diff below to see it: "The main point of this test is simply to make sure that the CMIP6WACCMDECK modifier works for 2-degree since that resolution turns off Carbon isotopes"). I'm not sure what else to say besides that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I looked at the file to see what this looks like. My point is that the other ctsm_sci tests are all for running the science supported compsets/resolutions. This isn't listed as a science_supported compset or resolution so I think it warrants saying why we have this in ctsm_sci. So I suggest adding something like this...

This isn't a science supported compset, but it's critical to run this test at 2-degree resolution which is higher than our standard testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the clarification; I'll add a note like that.

@billsacks
Copy link
Member Author

Thanks a lot for your careful look at this, Erik, and for sharing your thoughts. I agree with the thoughts you shared in your initial comments on the PR – that, although some of these tests are mainly / only relevant at higher resolutions, we probably have good enough code coverage of the relevant code by running lower resolution tests in aux_clm, saving the higher resolution versions for ctsm_sci. (Really, the same can probably be said of most of our configurations.)

Regarding the PE layout changes: I initially started with layouts with datm concurrent with lnd, as we do for our production PE layouts. However, I found that that doesn't seem to make sense for these small PE layouts. The reason why it makes sense for the large PE layouts is that the datm run time is roughly similar to the lnd run time. However, with these small PE layouts for f09 and f19, the time is dominated by the lnd run time. So making datm concurrent with lnd leads to a situation where the datm node is sitting idle for most of the run. With the current PE layout, the non-datm nodes sit idle while datm runs, but that turns out to be less wasteful than having the datm node sit idle for most of the run.

To put some specific numbers on it: For the f19 case, datm runs at 0.474 seconds/mday, and lnd runs at 3.857 seconds/mday. So the PE layout I have in place means that 3/4 nodes sit idle for about 11% of the run time while datm runs. If we instead made datm concurrent with lnd, we would have a situation where 1/5 node sits idle for about 89% of the run time. So the present PE layout is somewhat lower cost, at a slight additional cost in run time (but potentially less queue wait time since we're only asking for 4 nodes rather than 5). The difference is bigger in the f09 case, where datm runs at 0.864 seconds/mday and lnd at 10.518 seconds/mday, so the 3/4 nodes are just sitting idle for 8% of the run time, whereas a concurrent layout would lead to 1/5 node sitting idle for 92% of the run time. The differences here aren't huge, and if you'd rather have tests that are slightly faster but also slightly more expensive, I'm fine changing these layouts so datm is running concurrently. But I just wanted to point out that what makes sense for high PE counts doesn't make sense in all the same ways for these lower PE counts.

Regarding the flooding tests: we originally had two, a non-debug and a debug one:

ERP_Ld5.f19_g17.I2000Clm50SpRtmFl.cheyenne_gnu.clm-default
ERP_P180x2_D.f19_g17.I2000Clm50SpRtmFl.cheyenne_intel.clm-default

I moved the debug one to lower resolution (ERP_P36x2_D.f10_f10_mg37.I2000Clm50SpRtmFl.cheyenne_intel.clm-default) and got rid of the non-debug one. My general feeling is that it isn't worth having a non-debug test of a configuration if we already have a debug test of that configuration, except for maybe a few exceptions of our most important configurations (and maybe more exceptions in the ctsm_sci test list): if these non-debug tests pick up anything unique, it seems to mostly be compiler issues rather than code issues. That said, if you feel it's important to maintain a non-debug test as well as a debug test of flooding, I can do so (probably in the ctsm_sci test list). I couldn't tell when you made your comment if you realized that there is still a RtmFl test at coarse resolution.

Finally, I will add a comment about the new test in the ctsm_sci test list: thanks for that suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR status: ready PR: this is ready to merge in, with all tests satisfactory and reviews complete testing additions or changes to tests
Projects
Status: Done (non release/external)
Development

Successfully merging this pull request may close these issues.

3 participants