Rework CLM test suite: miscellaneous things to do to shorten turnaround time and increase coverage #275

Open
6 of 15 tasks
billsacks opened this issue Feb 13, 2018 · 26 comments
Labels: performance (idea or PR to improve performance, e.g., throughput, memory); testing (additions or changes to tests)

Comments

@billsacks
Member

billsacks commented Feb 13, 2018

There are a number of changes we could make to shorten the turnaround time of the test suite while increasing coverage. This issue documents some ideas.

I feel like our goal should be a sub-4-hour test suite, or at the very least a sub-4-hour turnaround for the subset of the test suite for which baseline comparisons are important (see a comment below for thoughts on possibly separating out this subset). Edit (2019-01-11): I'd also like to get the core-hour requirement down to something like 2,000, and ideally even lower.

Here are some ideas:

  • Change a bunch of tests from 1850 to transient: this covers more code (e.g., http://bugs.cgd.ucar.edu/show_bug.cgi?id=2207 slipped through because we didn't test ciso in combination with a transient case)

  • Rework test suite to get better balance between different compilers, to shorten turnaround time. If you fire off a different test suite for each machine/compiler (which is the standard practice), then the ideal is to have roughly equal balance between the different machine_compiler combinations, at least for cheyenne_intel, cheyenne_gnu and hobart_nag. (However, gnu is less good at picking up debug issues, so it's not good to have important configurations only covered by gnu debug.)

  • Remove or rework tests that take a large number of processor-hours.

    • high-res cases should just be short smoke tests, non-debug or very short debug
    • VIC tests don't really need to be at f09 (I think they were originally put at f09 because that was the scientifically-supported resolution for VIC, but for software testing, coarse-resolution is sufficient, with maybe just one very short f09 smoke test - e.g., a few time steps)
    • we shouldn't have high-resolution ERI tests, since ERI tests are expensive
    • However: it does seem important to have at least one long (e.g., 5-year) global restart test: our 5-year ERS test at f10 resolution has often picked up problems that shorter tests miss (I suspect that a somewhat shorter test at higher resolution would do the same job: I think the key is having a test that runs over enough gridcells x timesteps that it covers most code paths for at least the most important configuration(s))
    • (2020-09-18) Could probably shorten the 1-degree PFS test to 10 days rather than 20
    • (2020-09-30) Can move some non-critical tests to the new "ctsm_sci" test list. e.g., SMS_Ln9.ne30pg2_ne30pg2_mg17.I2000Clm50BgcCrop.cheyenne_intel.clm-clm50cam6LndTuningMode takes a long time, due to time needed for init_interp.
    • (2022-03-09) Would it help to significantly decrease the processor count for high-resolution tests, at least the short ones? A lot of the time in these tests is spent in model initialization, so we could probably decrease the core-hour cost significantly, with minimal impact on turnaround time, by decreasing the processor count for these tests. (A better solution is to change these to coarse-resolution tests, but some tests need to be at production resolution.)
      • (2022-03-14) Yes, this seems to be a good solution. I did this for a couple of tests in 3244887 and found that the total test run time is about the same with 1/10 the processors, and in some cases even faster, because most of the total run time is spent in model initialization (see the rough cost sketch after this list).
    • (2022-03-25) I have done this pretty well in Use small PE counts for aux_clm tests and remove some redundant / unnecessary tests #1688 . There's probably more that could be done, but we're in a pretty good place now in this respect.
  • (2018-10-27) Make more tests use SGLC rather than CISM, to shorten build times

  • (2019-01-23) Remove / rework some tests that have a long wallclock time, unless the long time is really necessary for the test.

    • Example: SMS_D_Ly2.1x1_numaIA.IHistClm50BgcCropGs.cheyenne_intel.clm-ciso_bombspike1963 (currently takes > 2 hours).
    • Rationale: These long tests can delay the completion of the full test suite, unless they start early on. And if they fail due to system issues, it can take a long time to wait for them to rerun.
  • (2019-04-29) Consider changing some multi-year tests to short, decStart tests. In particular: we have at least one or two (and possibly more) multi-year tests because of Carbon balance error in decStart test just after new year #404 . With ctsm1.0.dev036, that issue seems to be resolved. We should look critically at our global tests that cross the year boundary and consider if any of them can/should be changed to decStart tests.

    • One in particular is the new LCISO test: Erik made that an Lm13 test rather than a short, decStart test because of Carbon balance error in decStart test just after new year #404 . We could consider making it a short, decStart test. However, note that the case with ciso would then fail in the wrong way if we tested this test by reintroducing the bug that was fixed in ctsm1.0.dev036 (it would fail with a C balance error rather than the LCISO test failing in the comparison between cases). We should think about whether that's okay, or if that suggests that we should keep this as a 13-month test. My initial inclination is that that's okay, and it's fine to change this to a decStart test. At the very least, though, we should confirm that the decStart LCISO test would still pick up problems by (1) reintroducing the bug that was fixed in ctsm1.0.dev036, and (2) commenting out the endrun associated with the C balance check: confirm that the LCISO test fails the comparison between cases, as expected.
  • (2019-05-01) I don't think we're getting much benefit from long single-point tests like ERS_Ly20_Mmpi-serial.1x1_numaIA.I2000Clm50BgcDvCropQianGs.cheyenne_intel.clm-cropMonthOutput (and maybe others?), unless they are constructed specially to target particular things (like ERS_Ly6_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianGs.cheyenne_intel.clm-cropMonthOutput # restart is right before increasing natural veg to > 0 while also shifting PCT_CFT). I think that, at one point, these long tests were important to test crop more fully, since most crop tests started from cold start and only tested the first few years of crop, at most. Now that nearly everything uses spun-up initial conditions, this is no longer the case. One way to keep these long single-point tests valuable would be to make them use cold start: then they would be testing the first couple of decades of run from cold start, which isn't otherwise tested.

    • (2022-02-24) In Rework single point testing #1660 I have changed these long single point tests to be cold start, so at least we're getting some unique benefit from them
  • (2020-05-20) Substantially trim the PTS mode testing (ptsRLA, ptsRLB, ptsROA tests). Erik says:

    You are certainly right we have plenty of PTS_MODE testing. And maybe too much. Although it might be good to see what testing CAM does for SCAM and make sure our testing lines up that way. RLA and RLB are just two different points, so we don't need everything to run with both. ROA is a point over ocean to make sure that works when running SCAM. So we probably only need one of those.

    Then:

    I just looked at the CAM testing for SCAM and it looks like they just test with cheyenne_intel, and only have three tests for it. Originally I thought PTS_MODE could be useful for CLM developers -- but it hasn't worked out that way. It's much easier in CLM (than in CAM) to just subset the files for a single point rather than having a special configuration for it. So I'd say we could cut back our PTS_MODE testing in CTSM from what it is now.

  • (2020-06-07) We could change a bunch of our tests to use Qian datm forcing in order to speed up datm, since this likely wouldn't decrease the test coverage of the CTSM code. (However, this is only helpful if datm is a limitation in tests; this is true for single-point cases, but I'm not sure if it's true for global cases. We should check whether datm is a limiting factor in our typical f10 cases and other tests; if so, we should consider making this or a similar change.)

  • (2020-09-18) For tests that use a lot of processors: make some changes so they get through the queue faster:

    • try to reduce the walltime limit to something pretty small so they get through the queue faster
    • consider using an alternative, smaller PE layout for these tests, either by hard-coding the PE layout in the test (e.g., _P720x1) or using the mechanism to give multiple PE layouts for a given configuration in config_pes.xml
      • (2020-11-04) I changed our C96 test to P360x2, reducing its TOTALPES from 3528 to 720. This reduced its queue wait time from hours to seconds. I thought about introducing an alternative PE layout in config_pes.xml, but that felt overly complex given that we only need this for testing. (At first I liked that an alternative PE layout could allow us to keep datm on its own processors, but from looking at the test timing, this wouldn't make much, if any, difference.)
      • (2021-04-02) I could imagine having all tests in the aux_clm test suite have small PE counts, with production-level PE counts restricted to the clm_sci test list, which we should run every month or two (without baseline comparisons)
      • (2022-03-14) I started applying this idea to some tests in 3244887. This seems like a good approach: the test run time is about the same as before with about 1/10 the number of processors.
    • (2020-11-16) Note that, in the latest run of the test suite, everything except the f19 and f09 tests finished, and those higher-resolution tests were still waiting in the queue, highlighting the importance of this.
    • Could we add a feature to the test list that lets you specify the queue for a single test? Then we could specify the premium queue for the large-processor-count-but-short tests.
    • (2022-03-25) I have done this in Use small PE counts for aux_clm tests and remove some redundant / unnecessary tests #1688 by having all tests run on 5 nodes or fewer.
  • (2020-10-06) Single point tests on cheyenne have seemed flaky lately, frequently dying due to system issues. My guess is that this is because they are on shared nodes. Is there something we can do about this? We could move them to full nodes on izumi, or maybe even full nodes on cheyenne if they are short enough... though it tends to be the longer tests that are dying. So maybe the solution is just to get rid of the long single-point tests, as noted above.

  • (2020-10-19) Make sure that all tests on izumi use reduced output, since i/o is often a huge runtime cost of the izumi tests.

  • (2020-11-04) Consider making all izumi tests single-node, since multi-node tests may be more prone to system issues.

  • (2021-03-29) Consider setting the co2 coupling flag so that lnd -> atm co2 fluxes are sent in all tests where this is possible, in order to test this coupling.

    • Originally I had been thinking of a separate testmod to enable this and possibly other optional lnd -> atm couplings, but it seems better to just do this for all tests.
    • However, this isn't crucial to do, since it would only cover a small number of lines beyond what's tested by adding FCO2 to the CTSM history file.
    • If we do this, we can remove the addition of the FCO2 history field in the default test mod.
    • Note that this might require having a separate "defaultbgc" testmod directory that inherits from default and that is included instead of default in at least some tests that include BGC; this "defaultbgc" testmod directory could add settings like this (and possibly some other history fields) that are unique to BGC.
  • (2021-04-15) I think we could decrease our ERI testing: It's important to have some ERI testing, but it's probably sufficient to just have a few such tests. I would especially like to reduce or eliminate our ERI testing on izumi: with the multiple runs of this test, it is more prone to the periodic system failures we see on izumi, and there may be problems with rerunning ERI tests (as seen in HLM-side changes to allow FATES snow occlusion of LAI #1324).
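To make the processor-count ideas above concrete (see the 2022-03-09 and later bullets), here is a rough back-of-the-envelope sketch, not part of CTSM or CIME, of how the charged cost of a test scales with its PE layout. The 51-node vs. 5-node comparison comes from a later comment in this thread; 36 cores per node is cheyenne's node size, and the half-hour wall time is an assumed, purely illustrative value.

```python
# Rough cost model: charged core-hours ~ nodes * cores_per_node * wall time.
# For a test whose wall time is dominated by model initialization, shrinking the
# PE layout leaves the wall time roughly unchanged, so the cost drops nearly in
# proportion to the node count.

CORES_PER_NODE = 36  # cheyenne nodes


def core_hours(nodes, wallclock_hours):
    """Approximate charged cost of a single test."""
    return nodes * CORES_PER_NODE * wallclock_hours


# Assume the same (illustrative) 0.5-hour wall time before and after the change:
full_layout = core_hours(nodes=51, wallclock_hours=0.5)   # production-style f09 layout
small_layout = core_hours(nodes=5, wallclock_hours=0.5)   # _PS-style reduced layout

print(f"production layout: {full_layout:.0f} core-hours")
print(f"small layout:      {small_layout:.0f} core-hours "
      f"({full_layout / small_layout:.0f}x cheaper)")
```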

@billsacks added the "enhancement" (new capability or improved behavior of existing capability) label on Feb 13, 2018
@billsacks
Member Author

Up until now, the majority of our testing has been on cheyenne (or before that, yellowstone), with a smaller set of tests on hobart. But recently, hobart seems to be more stable than cheyenne. So I'm wondering if we should move most tests to hobart, mainly keeping some longer and higher-resolution tests on cheyenne.

@billsacks changed the title from "Rework CLM test suite" to "Rework CLM test suite: miscellaneous things to do" on Feb 13, 2018
@jhamman
Contributor

jhamman commented Feb 13, 2018

Would it be possible at some point to get a subset of the CTSM tests that can be run on an arbitrary UNIX machine, and to have those tests run on a continuous integration service like TravisCI or CircleCI? Just something to consider going forward; this tends to be standard for open-source projects these days and really helps maintain the portability and usability of the model. A starting point may just be to build a standard compset with generic libraries.

@serbinsh
Contributor

serbinsh commented Feb 13, 2018 via email

@billsacks
Member Author

@jhamman and @serbinsh good points - thanks. I have moved the discussion of a CI server to #278 because I feel that warrants its own issue. Regarding a short test suite: we do have one (clm_short), but I haven't tried running it on a different machine recently to see if the mechanics for doing so are the same as before. @bandre-ucar might have some experience here.

@billsacks changed the title from "Rework CLM test suite: miscellaneous things to do" to "Rework CLM test suite: miscellaneous things to do to shorten turnaround time and increase coverage" on Feb 13, 2018
@billsacks
Member Author

Regarding this:

Up until now, the majority of our testing has been on cheyenne (or before that, yellowstone), with a smaller set of tests on hobart. But recently, hobart seems to be more stable than cheyenne. So I'm wondering if we should move most tests to hobart, mainly keeping some longer and higher-resolution tests on cheyenne.

I'm still feeling it would be helpful to move more tests to hobart. At a minimum, we might as well move enough tests from cheyenne to hobart that the cheyenne and hobart turnaround times are about the same, in order to decrease total time to completion of the test suite.

@billsacks
Member Author

billsacks commented May 4, 2018

It might be worth splitting the aux_clm test category into two (e.g., aux_clm1, aux_clm2), particularly for the cheyenne testing. This way we could shorten the long time it takes to set up and build all of the tests in the test suite: we could use separate create_test invocations for the different categories, each built on a different node. A shortened build time wouldn't necessarily translate into a shorter overall turnaround time, but in many cases it probably would (we should test this before moving forward with the idea).

This would make it slightly more complicated to run and check the test results, though that could be mitigated if we had a wrapper script that helps with these things, which would be good to have anyway.

The tests could be separated arbitrarily, or along some logical division (e.g., debug vs. non-debug tests). Note that splitting along build-related lines could help a bit because of the shared builds used in testing: otherwise, both test suites would end up doing some of the same builds, which would slow things down a bit, though probably not enough to be a huge deal.

Another way to do this split is if we separate out the tests for which we want to generate/compare baselines (as I think I have discussed elsewhere): We could have a separate test category for which we do baseline comparison/generation, which is a subset of the full test suite.

[Update (2020-03-15)] With the run_sys_tests wrapper, this becomes more feasible. We could have some argument to this wrapper script that says "run all of the standard testing suites" (which is mutually exclusive with the existing --suite-name, --testfile and --testname arguments). It could help if one or more of the suites can use the exact same build for every test in the suite: then we can use the new cime feature that reuses a single build for all tests in a test suite. One possible new category of tests would be a category that is solely for the sake of having short smoke tests of the various grids we need to support (see recent meeting notes); this category would lend itself to using the same build for each test (e.g., all SMS_D single-threaded tests with the same set of components).
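As a rough illustration of the "separate create_test invocations" idea, something along the following lines could drive the split. The aux_clm1/aux_clm2 category names are hypothetical, and in practice each invocation would be submitted to its own build node rather than launched concurrently from a single script; this is just a sketch of the mechanics.

```python
# Hypothetical driver for the category-split idea: one create_test invocation per
# (hypothetical) sub-category, launched concurrently so the builds proceed in parallel.
import subprocess

procs = []
for category in ("aux_clm1", "aux_clm2"):   # hypothetical split of aux_clm
    procs.append(subprocess.Popen(
        ["./create_test",
         "--xml-category", category,
         "--xml-machine", "cheyenne",
         "--xml-compiler", "intel",
         "--test-id", f"{category}_example"]))

for p in procs:
    p.wait()
```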

@billsacks added the "testing" (additions or changes to tests) label and removed the "enhancement" (new capability or improved behavior of existing capability) label on Jun 14, 2018
@billsacks
Member Author

billsacks commented Jan 10, 2019

Looking at some recent runs of the test suite: Our full cheyenne test suite takes about 7700 core-hours, plus about 200 core-hours to do all of the test builds. I am attaching a spreadsheet with the tests sorted by core-hour usage, along with the cumulative time. Out of the 188 tests, 18 are responsible for > 50% of the total test suite time. I'm going to look into cutting or changing those tests to reduce our cheyenne usage.

aux_clm_sorted.csv.txt

(Note: In August, 2016, the test suite cost about 4000 core-hours. It would be great if we could get back down to about that level or lower.)
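For reference, the sorting and cumulative accounting above can be sketched in a few lines of Python. This is illustrative only and assumes a CSV with header columns named "test" and "core_hours", which may not match the attached file's exact layout.

```python
# Sort tests by core-hour cost and report how many of the most expensive tests
# account for more than half of the total cost of the suite.
import csv

with open("aux_clm_sorted.csv") as f:
    rows = sorted(csv.DictReader(f),
                  key=lambda r: float(r["core_hours"]), reverse=True)

total = sum(float(r["core_hours"]) for r in rows)
cumulative = 0.0
for i, row in enumerate(rows, start=1):
    cumulative += float(row["core_hours"])
    if cumulative > 0.5 * total:
        print(f"{i} of {len(rows)} tests account for >50% "
              f"of the {total:.0f} core-hour total")
        break
```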

@rgknox
Collaborator

rgknox commented Jan 10, 2019

@billsacks, we test fates with one test using an f09_g16 grid, which seems to be very expensive according to your chart (which makes sense given how large that grid is). It does not seem to be a bottleneck in terms of holding up the whole test suite, but if it's impacting the testing quota then I could find ways to reduce or remove this one.

@billsacks
Member Author

@rgknox FATES doesn't seem to be a big culprit here (or maybe you're referring to the fates test list, which I haven't looked at). ERS_D_Ld5.f09_g17.I2000Clm50Fates.cheyenne_intel.clm-FatesColdDef takes 85 core-hours and ERS_D_Ld5.f09_g17.I2000Clm45Fates.cheyenne_intel.clm-FatesColdDef takes 80 core-hours. Those aren't expensive enough to be targets of my first pass, though in a later pass I'd like to more aggressively remove things from the test suite that aren't pulling their weight. For the sake of that later pass, I'd be happy to hear any thoughts about (1) Whether the Clm45Fates f09_g17 test can be cut out (does anyone run Fates with Clm45? We still do have coarser-resolution Clm45Fates tests.) (2) Whether the Clm50Fates f09_g17 test can be changed to non-debug, and/or coarser resolution, and/or shortened (e.g., to _Ld3).

@rgknox
Collaborator

rgknox commented Jan 10, 2019

(1) Whether the Clm45Fates f09_g17 test can be cut out (does anyone run Fates with Clm45? We still do have coarser-resolution Clm45Fates tests.)

I don't think so... we could at least get rid of Clm45Fates f09_g17

(2) Whether the Clm50Fates f09_g17 test can be changed to non-debug, and/or coarser resolution, and/or shortened (e.g., to _Ld3).

Yes to non-debug and shortened; good ideas. In general, if FATES tests are not a huge time culprit, I would just like to maintain one FATES test on a large grid (f09-ish). Since FATES has cohorts, it has the potential to be a massive memory consumer and to produce very large netcdf arrays, so the large-grid test will help smoke out these types of issues.

@billsacks
Member Author

Thanks, @rgknox - I'll go with that plan, and add a comment to the test list with your notes about the rationale for keeping a high-res Fates test (and I'll keep it a restart test to cover possible memory/netcdf size issues with the restart file).

@billsacks
Member Author

@ekluzek looking at the SSP tests: We have three expensive tests:

  • SSP_D_Ld4.f09_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-ciso_rtmColdSSP (285 core-hours)

  • SSP_D_Ld10.f19_g17.I1850Clm50Bgc.cheyenne_intel.clm-rtmColdSSP (160 core-hours)

  • SSP_Ld10.f19_g17.I1850Clm50Bgc.cheyenne_intel.clm-rtmColdSSP (127 core-hours)

Do we need all three of these? Can one or more be done at coarse resolution? (Just because we do spinup at higher resolution in practice doesn't mean that we need a higher-resolution test to exercise the code.) Can they be shortened from _Ld10?

billsacks added a commit to billsacks/ctsm that referenced this issue Jan 11, 2019
In an effort to reduce the cost of the test suite, I have looked at some
of the most expensive tests
(ESCOMP#275 (comment)).

Some general things I did:
- Changing ERI tests to short SMS tests or coarser-resolution
- Shortening some tests
- Changing some tests to lower-resolution where it doesn't seem
  important to use high-resolution
- Removed some expensive tests for which there were very similar tests
  (or multiple tests that individually covered the various things covered
  in this one test) already in the test suite

Some specific notes:

- ERI_D_Ld9.f09_g16.I1850Clm45BgcCruGs.cheyenne_intel.clm-default: This
  is the only test of this compset (which is scientifically supported),
  and the only ERI Clm45 test. I changed
  ERP_P36x2_D_Ld5.f10_f10_musgs.I1850Clm45Bgc.cheyenne_gnu.clm-default
  (which is a duplicate of an intel test) to an ERI_D_Ld9 test, so we
  have an ERI Clm45 test. Then I changed the above ERI test to
  SMS_D_Ld1.f09_g17

- SMS_Lm1.f09_g17_gl4.I1850Clm50Bgc.cheyenne_intel.clm-clm50KitchenSink,
  ERS_D_Ld3.f09_g17_gl4.I1850Clm50BgcCrop.cheyenne_intel.clm-clm50KitchenSink,
  ERS_Ly3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-clm50KSinkMOut:
  We don't need clm50KitchenSink or clm50KSinkMOut any more, since they
  just have the clm50 options.  However, I need to then have a test of
  human stress indices. Introduced a new "extra_outputs" testmod, and
  have an ERP_P36x2_D_Ld3 test of this at coarse resolution.

- ERI_D_Ld9.f09_g16.I1850Clm50Sp.cheyenne_intel.clm-default: Want a test
  of this scientifically-supported compset. Changed to
  SMS_D_Ld1.f09_g17. Similarly for
  ERP_D_Ld9.f09_g16.I1850Clm50SpCru.cheyenne_intel.clm-default.

- ERI_Ld9.f09_g17.I1850Clm50Bgc.cheyenne_intel.clm-drydepnomegan:
  Changed this to have a coarse-resolution ERI test and a short SMS
  test: Changed
  ERP_Ld5.f10_f10_musgs.I1850Clm50Bgc.cheyenne_gnu.clm-drydepnomegan to
  ERI_Ld9 (there's still an intel version of that ERP test), changed the
  above to SMS_Ld1.

- ERP_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_changeFlags
  Redundant with
  ERP_D_Ld5.f19_g17_gl4.I1850Clm50BgcCrop.cheyenne_intel.clm-glcMEC_changeFlags
  Looking through the history, I think it was an accident that we have
  both. I'm keeping the one without evolving ice sheet, because I don't
  think we gain anything from having an evolving ice sheet in this short
  test.

Addresses ESCOMP#275
billsacks added a commit to billsacks/ctsm that referenced this issue Jan 11, 2019
As per recent discussion with Ryan Knox in
ESCOMP#275
@billsacks
Member Author

billsacks commented Feb 26, 2019

In my latest run of the test suite, with the changes in #622 (which include changing / removing expensive tests as well as interpolating out-of-the-box initial conditions files), the test suite cost was about 4300 core-hours (down from the previous 7700 core-hours). This is getting close to what I'd consider an acceptable cost. (Though note that there might be some machine variability: these numbers are each from a single run of the test suite.)

billsacks added a commit that referenced this issue Feb 26, 2019
Interpolate out-of-the-box initial conditions and remove expensive tests

Two main changes (plus some small additional changes):

1. Removed / reworked some expensive tests

2. Interpolated all out-of-the-box initial conditions, so that the
   out-of-the-box version is now compatible with our current
   configuration. The changes from before were (a) our standard
   configuration now uses the gx1v7 rather than gx1v6 land mask; (b)
   many inactive points are now absent in memory.

See #622 for details.

- Resolves #312
- Partially addresses #275 (just a bit)
slevis-lmwg added a commit to slevis-lmwg/ctsm that referenced this issue Feb 26, 2019
Interpolate out-of-the-box initial conditions and remove expensive tests

Two main changes (plus some small additional changes):

1. Removed / reworked some expensive tests

2. Interpolated all out-of-the-box initial conditions, so that the
   out-of-the-box version is now compatible with our current
   configuration. The changes from before were (a) our standard
   configuration now uses the gx1v7 rather than gx1v6 land mask; (b)
   many inactive points are now absent in memory.

See ESCOMP#622 for details.

- Resolves ESCOMP#312
- Partially addresses ESCOMP#275 (just a bit)

Conflicts resolved:
	doc/ChangeLog
	doc/ChangeSum
mariuslam pushed a commit to NordicESMhub/ctsm that referenced this issue Aug 26, 2019
ncar compatible previous history merger
@billsacks
Member Author

At the urging of @mvertens I started to look into what I thought would be the lowest-hanging fruit here: changing CISM2%NOEVOLVE tests to use SGLC.

The problem is that we have 34 I compsets that use CISM2%NOEVOLVE that are (or at least were within the last couple of years) deemed useful scientifically. (These represent a mix of CTSM options, time periods, datm modes, etc.) So this leads to two issues:

(1) We'd need to add probably 20 – 30 more I compsets that are just used for testing (not quite 34, because we already have some of the relevant SGLC compsets), further exacerbating our compset explosion.

(2) If we want all of the CISM2%NOEVOLVE compsets to be considered to be tested, then we'd need to keep at least one test for each of these 34 compsets. In some sense this makes our testing problem worse rather than better... or at least, we'd be improving our testing turnaround at the expense of greater complexity in our test list (the need to ensure that all of these compsets are tested).

So I feel like the solution may be to do one of the following:

(a) Live with the fact that many of the compsets used for production runs won't have any tests (since we'd only be testing the SGLC variation instead)

(b) Stop using CISM in most production runs, basically doing away with the CISM2%NOEVOLVE configuration and replacing it with SGLC. The implications of this are laid out here: https://escomp.github.io/cism-docs/cism-in-cesm/versions/master/html/clm-cism-coupling.html#stub-glc-model-cism-absent. We could nearly erase the effect of point 1 there if we updated CLM's glacier raw dataset to match what is currently used in CISM. (This would have some long-term maintenance cost to keep the two in sync, but CISM doesn't seem to update its datasets all that often.) Point 3 there may be hard for the LIWG to swallow.

(c) Introduce a data glc model and use that in place of CISM2%NOEVOLVE, both for science and for testing. (See also ESMCI/cime#2574.)

I personally feel that (c) is probably the best option. As a side note: Erik has been asked to make a data glc model for his land ice work for Miren Vizcaino; although this is a relatively low priority for his work, it demonstrates that a data glc model would have some scientific as well as software value.

@billsacks
Member Author

We are thinking of moving ahead with the removal of CISM2%NOEVOLVE after all. Referring to the above list of options: In the short-term we are going to go with (b) (#1135 ), and in the longer-term we are going to go with (c) (#1136 ).

@billsacks
Member Author

Data point from my most recent run of the test suite on cheyenne:

  • The full test suite took about 5100 core-hours for the test runs, plus 230 additional core-hours for the build tasks (intel & gnu)
  • The intel build task took 5 hours 53 min to complete; the gnu build took 47 min to complete
  • The full test suite was done running around the time the intel build completed – so the limiting factor is the build time
  • We have two tests that take over an hour to run, both of which are single-point tests (ERS_Ly20_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRsGs.cheyenne_intel.clm-cropMonthOutput and SMS_D_Ly6_Mmpi-serial.1x1_smallvilleIA.IHistClm45BgcCropQianRsGs.cheyenne_intel.clm-cropMonthOutput); other than those, all tests took less than 45 min

@billsacks
Member Author

Somehow it seems like my previous accounting missed a few very expensive tests at C96 resolution, and possibly a few others as well. In my most recent run:

  • The full test suite took about 7200 core-hours for the test runs, plus 230 additional core-hours for the build tasks (intel & gnu)
  • The intel build task took 5 hours 47 min to complete; the gnu build took 44 min to complete
  • This time the machine was apparently more heavily loaded, because many of the f19 and f09 tests did not finish running until up to 7 hours after the intel build task completed. The new ne0 and C96 tests did not finish until about 24 hours later!
  • This time, SMS_D_Ly6_Mmpi-serial.1x1_smallvilleIA.IHistClm45BgcCropQianRsGs.cheyenne_intel.clm-cropMonthOutput took only 42 min; ERS_Ly20_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRsGs.cheyenne_intel.clm-cropMonthOutput still took over an hour; all other tests still took less than 45 min.

@billsacks
Member Author

With the changes for #1135 the test suite runs in about 2.5 hours, thanks to the cheyenne-intel build time being reduced to just over 2 hours! (#1135 (comment)). I'd still like to do the things mentioned in this issue, as well as a general test suite overhaul, in part because I foresee a continued gradual expansion of the test suite as we support more and more configurations. But I'm happy that, for now, we have returned to a sub-4-hour test suite; I think this will help significantly with our testing and tagging workflow.

@billsacks
Member Author

Regarding PTS_MODE testing: I am planning to reduce PTS_MODE testing somewhat in my next tag. @ekluzek points out:

PTS_MODE is likely to become more important after we are able to bring in John T's changes that get restarts working in PTS_MODE and SCAM. That will make PTS_MODE a more useful option. But, when that happens will be the time to possibly add a few more tests. We'll need to add some exact restart tests for example. But, we should wait until that happens to do so. In those previous notes I was thinking that maybe PTS_MODE was going away, but with restarts it'll become more important.

My reply:

I'd suggest that, once those fixes are in place, we simply change the existing PTS_MODE tests to exact restart tests. I actually think that, even with the removals I made, we still have sufficient coverage of PTS_MODE: I mainly removed non-debug tests, and my experience is that it's very rare for non-debug tests to provide additional testing value beyond debug-mode tests: it's good for us to have a handful of non-debug tests, but we don't need widespread coverage of them. (Times when I can remember issues in non-debug tests that didn't show up in debug tests were ones where we ended up deciding there was some weird compiler behavior rather than any actual issues in our code.)

billsacks added a commit to billsacks/ctsm that referenced this issue Feb 24, 2022
Main motivation is to avoid the share-queue-related issues we keep
hitting on cheyenne, and also move to slightly better balance between
our izumi and cheyenne testing, reducing our overall test turnaround
time. I am also combining or cutting single point tests that feel
particularly redundant or unnecessary.

Partially addresses ESCOMP#275
@billsacks
Member Author

After the changes in #1660 my testing last night (which I started around 10 pm, so was probably using relatively lightly loaded machines) took the following time:

  • cheyenne_gnu: Build finished 35 minutes after starting (I think); last test finished 46 minutes after starting testing
  • cheyenne_intel: Build finished 2 hours, 27 minutes after starting (I think); last test also finished 2 hours, 27 minutes after starting testing
  • izumi_pgi: Last test finished 14 minutes after starting testing
  • izumi_gnu: Last test finished 38 minutes after starting testing
  • izumi_nag: Last test finished 43 minutes after starting testing
  • izumi_intel: Last test finished 1 hour, 14 minutes after starting testing

I'm reasonably happy with this, though it would be great if we could still move some cheyenne_intel tests to elsewhere, bringing the total test time to more like 1.5 or 2 hours.

@billsacks
Member Author

I looked into the total core-hour requirement from recent runs of the aux_clm test suite. The total core-hour requirement is about 6200 core-hours (there was close agreement in my two most recent runs of the full aux_clm test suite), plus about another 100 core-hours for the create_test tasks to do all of the builds. It would be great if we could reduce this somewhat: I think a good goal would be to get this down to about 4000 core-hours.

There is some low-hanging fruit here: the 5 most expensive tests account for nearly 25% of the total core-hour cost of our test suite: see attached spreadsheet for details:
aux_clm_sorted-0224-155017ch.csv

These five tests are:

SMS_C2_D_Lh12.f09_g17.I2000Clm50Sp.cheyenne_intel.clm-pauseResume
ERI_C2_Ld9.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-default
ERP_D_Ld3.f09_g17.I2000Clm50Sp.cheyenne_intel.clm-prescribed
ERP_D_Ld5.f09_g17.I2000Clm50Vic.cheyenne_intel.clm-vrtlay
SSP_D_Ld4.f09_g17.I1850Clm50BgcCrop.cheyenne_intel.clm-ciso_rtmColdSSP

@ekluzek other than the "prescribed" test (which I know needs to be at f09), do you know if some or all of these others can safely be moved to coarse resolution to help save about 20% of our aux_clm testing cost?

For tests that need to stay at high resolution, one partial solution may be to use a smaller PE layout. For example, I notice that the prescribed test has about 4x the cost with nuopc/cmeps/cdeps as it did with mct. It looks like nearly all of this additional cost is in model initialization: model initialization takes much longer than the model run time for this test. My guess is that the model initialization time would not change much (and might even get faster) if we reduce this to a relatively small PE count, which could give a significant cost savings.
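One way to check whether a test really is initialization-dominated before shrinking its PE layout is to compare the init and run times reported in the case's timing summary. The sketch below just greps a timing file for "Init Time" and "Run Time" lines; those exact labels are an assumption on my part, so the patterns may need adjusting to whatever the file actually contains.

```python
# Report the fraction of a test's wall time spent in initialization, based on a
# timing summary file from the case's timing/ directory. The "Init Time" and
# "Run Time" labels are assumed, not verified against current CIME output.
import re
import sys


def init_fraction(timing_file):
    text = open(timing_file).read()
    init = float(re.search(r"Init Time\s*:\s*([\d.]+)", text).group(1))
    run = float(re.search(r"Run Time\s*:\s*([\d.]+)", text).group(1))
    print(f"init = {init:.1f} s, run = {run:.1f} s, "
          f"init fraction = {init / (init + run):.0%}")


if __name__ == "__main__":
    init_fraction(sys.argv[1])
```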

@ekluzek
Collaborator

ekluzek commented Mar 10, 2022

Yes, the prescribed test has to be at f09. You could maybe make it shorter, though; you'd probably have to make it something like 9 steps.

The I2000Clm50Vic test uses VIC, for which we only have a few surface datasets to work with. But it does look like you could use 2-degree or even 10x15, so I'd recommend using 10x15, since we did add that capability for low resolution.

In principle, one reason for having some f09 and f19 tests was that those are the workhorse resolutions for science. But I think it would be OK to say that the infrequent running of the ctsm_sci test list is sufficient to count for that. So lower-resolution testing for our standard testing sounds like the right way to go.

@billsacks
Member Author

What about the SSP test, the pauseResume test and ERI_C2_Ld9.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-default? Is there any reason why those need to be at f09 / f19 resolution?

@ekluzek
Collaborator

ekluzek commented Mar 10, 2022

What about the SSP test, the pauseResume test and ERI_C2_Ld9.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-default? Is there any reason why those need to be at f09 / f19 resolution?

I couldn't see any reason they had to be at f09 or f19. I think it was just to make sure we had some tests at those resolutions. So I would try to move them to f10 and see if it works. We'll just rely on the ctsm_sci test list to make sure that f09/f19 keep working.

billsacks added a commit to billsacks/ctsm that referenced this issue Mar 14, 2022
A few tests were responsible for a large fraction of the total cheyenne
core-hour cost of aux_clm testing (see
ESCOMP#275 (comment)). This
commit reduces or eliminates some of these most expensive tests by:

(1) Eliminating some redundant tests

(2) Changing the resolution to a coarse resolution where possible

(3) Using a small processor count (_PS) where we need to maintain a high
    resolution. (The run time of these changed tests is about the same
    as before, since most of the time was being spent in model
    initialization, but now the cost is much lower.)

In some cases, I have moved tests to the ctsm_sci test list, where it
felt helpful to have occasional tests of these production resolutions.
@billsacks
Member Author

The comments from the last few days are addressed in 3244887 . See the diff and commit message in that commit for details. I have verified that all of the changed tests pass.

Using a small PE count for f09 tests seems to be a good approach that we could apply to other tests. For now I have just applied it to these two particularly expensive tests:

ERP_D_Ld3_PS.f09_g17.I2000Clm50Sp.cheyenne_intel.clm-prescribed
ERP_D_Ld3_Vmct_PS.f09_g17.I2000Clm50Sp.cheyenne_intel.clm-prescribed

This "small" f09 PE layout uses 5 nodes instead of 51, yet gives about the same test run time as before (because most of the time spent in these tests is in model initialization) – thus cutting the expense by about a factor of 10, and generally leading to lower queue wait times. (Actually, the nuopc version of this test is faster with a much smaller processor count.) I haven't tried doing this for other tests yet, but I think we should strongly consider this moving forward.
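As a side note, a quick way to scan a test list for tests that still request large layouts is to look at the PE-related modifier in each test name. The decoder below reflects my reading of the conventions used in this thread (_P<tasks>x<threads> hard-codes a layout, while _PS and friends request a predefined size); it is illustrative only, not CIME's own parsing code.

```python
# Illustrative decoder for the PE-layout modifier embedded in CIME-style test names.
import re


def describe_pe_modifier(test_name):
    """Describe the _P... modifier in a test name, or note that none is present."""
    modifiers = test_name.split(".")[0].split("_")[1:]   # drop the test type (ERP, SMS, ...)
    for mod in modifiers:
        m = re.fullmatch(r"P(\d+)(?:x(\d+))?", mod)
        if m:
            tasks, threads = int(m.group(1)), int(m.group(2) or 1)
            return f"hard-coded layout: {tasks} tasks x {threads} threads"
        if re.fullmatch(r"P[SMLX]", mod):
            return f"predefined '{mod[1:]}' PE size"
    return "default PE layout for this grid/compset"


for name in (
    "ERP_D_Ld3_PS.f09_g17.I2000Clm50Sp.cheyenne_intel.clm-prescribed",
    "ERS_Ly5_P144x1.f10_f10_mg37.IHistClm51BgcCrop.cheyenne_intel.clm-cropMonthOutput",
    "SMS_D_Ld1.f09_g17.I1850Clm50Sp.cheyenne_intel.clm-default",
):
    print(f"{name}\n  -> {describe_pe_modifier(name)}")
```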

@billsacks
Member Author

With the changes in #1688, where I changed high resolution tests to use only 4 nodes and removed some redundant / unnecessary tests, I have managed to reduce the core-hour cost of the test suite substantially, and reduced the turnaround time a bit.

The total core-hour cost is now just 1774 core-hours, plus another 101 core-hours for the builds. This is down from 6200 core-hours as of a few weeks ago (#275 (comment) ). We have a few 5-node tests – the f45 tests; otherwise, all tests use 4 or fewer nodes. Therefore, in addition to the cost being lower, it should almost always be reasonable to run in the regular queue rather than premium, further reducing the cost of our testing.

Total testing time from start to finish from my last run of the test suite was:

  • cheyenne_intel: 2:10
  • cheyenne_gnu: 1:06
  • izumi_intel: 1:46
  • izumi_nag: 1:31
  • izumi_pgi: 1:27
  • izumi_gnu: 1:09

Note that I think the izumi tests would have completed sooner if not for the fact that we are limited in how many jobs we can run simultaneously on izumi, so jobs tend to stay stuck in the queue for a while.

The test that I was concerned about taking a longer time to run, SMS_Lm13_PS.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-cropMonthOutput, was the second-to-last cheyenne test to finish, but a bunch of other tests finished just one minute before it (and ERS_Ly5_P144x1.f10_f10_mg37.IHistClm51BgcCrop.cheyenne_intel.clm-cropMonthOutput, the last test to finish, finished one minute after it).

I still feel like it would be great if we could continue to remove some redundant tests, and/or move some more testing to gnu to get better balance between the different compilers (though I've been reluctant to move too much debug testing to gnu due to ESMCI/ccs_config_cesm#4 ). A reliably sub-2-hour turnaround time would be great, and the lower the better (as long as we're still covering everything important). But I'm much happier with how things are now.

@samsrabin added the "performance" (idea or PR to improve performance, e.g., throughput, memory) label on Feb 8, 2024
samsrabin pushed a commit to samsrabin/CTSM that referenced this issue May 9, 2024
…ter_test_data_list

Revert "make the amount of data required to download minimal for testing"
@ekluzek moved this to "In progress" in the "CTSM: Rework test list" project on Aug 15, 2024