Rework CLM test suite: miscellaneous things to do to shorten turnaround time and increase coverage #275
Up until now, the majority of our testing has been on cheyenne (or before that, yellowstone), with a smaller set of tests on hobart. But recently, hobart seems to be more stable than cheyenne. So I'm wondering if we should move most tests to hobart, mainly keeping some longer and higher-resolution tests on cheyenne.
Would it be possible at some point to get a subset of the CTSM tests that can be run on an arbitrary UNIX machine, and to have those tests run on a Continuous Integration service like TravisCI or CircleCI? Just something to consider going forward; this tends to be standard for open-source projects these days and really helps maintain portability and usability of the model. A starting point may just be to build a standard compset with generic libraries.
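(A minimal sketch of what such a CI job might do, assuming a generic Linux machine port, called `linux-generic` here, had been added to CIME's config_machines.xml; that machine name, and the compset and resolution, are only illustrative:)

```sh
# Hypothetical CI step: configure and build a standard I compset against generic
# libraries, without submitting a run. "linux-generic" is an assumed machine port,
# not something CTSM provides out of the box.
./cime/scripts/create_newcase --case ci_build_only \
    --compset I2000Clm50SpGs --res f10_f10_musgs \
    --machine linux-generic --run-unsupported
cd ci_build_only
./case.setup
./case.build   # build-only smoke check; short runs could be layered on later
```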
I would strongly second this! When working with the NGEE-Tropics version of FATES, I was running the test suite after each new PR (e.g., clm_45_short, clm_5_short, fates tests). I would really like to continue this with CTSM tests on our machine (modex) for CLM and CLM-FATES.
Perhaps this is already available with just some simple tweaks of the call to the test exe, but I am still trying to learn my way around this code base.
@jhamman and @serbinsh good points - thanks. I have moved the discussion of a CI server to #278 because I feel that warrants its own issue. Regarding a short test suite: we do have one (…).
Regarding this:
I still feel it would be helpful to move more tests to hobart. At a minimum, we might as well move enough tests from cheyenne to hobart that the cheyenne and hobart turnaround times are about the same, in order to decrease the total time to completion of the test suite.
It might be worth splitting the test suite in two. This would make it slightly more complicated to run and check the test results, though that could be mitigated if we had a wrapper script that helps with these things, which would be good to have anyway. The tests could be separated arbitrarily, or along some logical division (e.g., debug vs. non-debug tests; note that separating along build-related aspects could help a bit due to the shared builds used in testing – otherwise, both test suites would end up doing some of the same builds, which would slow things down a bit, though probably wouldn't be a huge deal). Another way to do this split would be to separate out the tests for which we want to generate/compare baselines (as I think I have discussed elsewhere): we could have a separate test category for which we do baseline comparison/generation, which is a subset of the full test suite.

[Update (2020-03-15)] With the …
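As a rough illustration of what that baseline-subset split might look like when launching tests (the category name `aux_clm_baselines` is hypothetical, and the baseline tags are placeholders), each category could be fired off with its own CIME create_test invocation:

```sh
# Hypothetical split: run the baseline subset with compare/generate flags,
# and the rest of the suite without any baseline handling.
./cime/scripts/create_test --xml-category aux_clm_baselines \
    --xml-machine cheyenne --xml-compiler intel \
    --compare <previous_tag> --generate <new_tag>

./cime/scripts/create_test --xml-category aux_clm \
    --xml-machine cheyenne --xml-compiler intel
```

The wrapper script mentioned above could then issue both invocations (and the corresponding gnu and nag suites) and summarize the results in one place.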
Looking at some recent runs of the test suite: Our full cheyenne test suite takes about 7700 core-hours, plus about 200 core-hours to do all of the test builds. I am attaching a spreadsheet with the tests sorted by core-hour usage, along with the cumulative time. Out of the 188 tests, 18 are responsible for > 50% of the total test suite time. I'm going to look into cutting or changing those tests to reduce our cheyenne usage. (Note: In August 2016, the test suite cost about 4000 core-hours. It would be great if we could get back down to about that level or lower.)
@billsacks, we test fates with 1 test using an f09_g16 grid, which seems to be very expensive according to your chart (which makes sense, given how large it is). It does not seem to be a bottleneck in terms of holding up the whole test suite, but if it's impacting the testing quota then I could find ways to reduce/remove this one.
@rgknox FATES doesn't seem to be a big culprit here (or maybe you're referring to the fates test list, which I haven't looked at).
I don't think so... we could at least get rid of Clm45Fates f09_g17
Yes to non-debug and shortened, good ideas. In general, if fates tests are not a huge time culprit, I would just like to maintain one fates test on a large grid (f09-ish). Since FATES has cohorts, it has the potential to be a massive memory consumer and to generate very large netcdf arrays, so the large grid test will help smoke out these types of issues.
Thanks, @rgknox - I'll go with that plan, and add a comment to the test list with your notes about the rationale for keeping a high-res Fates test (and I'll keep it a restart test to cover possible memory/netcdf size issues with the restart file).
@ekluzek looking at the SSP tests: We have three expensive tests:
Do we need all three of these? Can one or more be done at coarse resolution? (Just because we do spinup at higher resolution in practice doesn't mean that we need a higher-resolution test to test the code, it seems.) Can they be shortened down from _Ld10?
In an effort to reduce the cost of the test suite, I have looked at some of the most expensive tests (ESCOMP#275 (comment)). Some general things I did:
- Changing ERI tests to short SMS tests or coarser resolution
- Shortening some tests
- Changing some tests to lower resolution where it doesn't seem important to use high resolution
- Removed some expensive tests for which there were very similar tests (or multiple tests that individually covered the various things covered in this one test) already in the test suite

Some specific notes:
- ERI_D_Ld9.f09_g16.I1850Clm45BgcCruGs.cheyenne_intel.clm-default: This is the only test of this compset (which is scientifically supported), and the only ERI Clm45 test. I changed ERP_P36x2_D_Ld5.f10_f10_musgs.I1850Clm45Bgc.cheyenne_gnu.clm-default (which is a duplicate of an intel test) to an ERI_D_Ld9 test, so we have an ERI Clm45 test. Then I changed the above ERI test to SMS_D_Ld1.f09_g17.
- SMS_Lm1.f09_g17_gl4.I1850Clm50Bgc.cheyenne_intel.clm-clm50KitchenSink, ERS_D_Ld3.f09_g17_gl4.I1850Clm50BgcCrop.cheyenne_intel.clm-clm50KitchenSink, ERS_Ly3.f10_f10_musgs.I1850Clm50BgcCrop.cheyenne_intel.clm-clm50KSinkMOut: We don't need clm50KitchenSink or clm50KSinkMOut any more, since they just have the clm50 options. However, I then need a test of human stress indices. Introduced a new "extra_outputs" testmod, and have an ERP_P36x2_D_Ld3 test of this at coarse resolution.
- ERI_D_Ld9.f09_g16.I1850Clm50Sp.cheyenne_intel.clm-default: Want a test of this scientifically supported compset. Changed to SMS_D_Ld1.f09_g17. Similarly for ERP_D_Ld9.f09_g16.I1850Clm50SpCru.cheyenne_intel.clm-default.
- ERI_Ld9.f09_g17.I1850Clm50Bgc.cheyenne_intel.clm-drydepnomegan: Changed this to have a coarse-resolution ERI test and a short SMS test: Changed ERP_Ld5.f10_f10_musgs.I1850Clm50Bgc.cheyenne_gnu.clm-drydepnomegan to ERI_Ld9 (there's still an intel version of that ERP test), and changed the above to SMS_Ld1.
- ERP_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.cheyenne_intel.clm-glcMEC_changeFlags: Redundant with ERP_D_Ld5.f19_g17_gl4.I1850Clm50BgcCrop.cheyenne_intel.clm-glcMEC_changeFlags. Looking through the history, I think it was an accident that we have both. I'm keeping the one without evolving ice sheet, because I don't think we gain anything from having an evolving ice sheet in this short test.

Addresses ESCOMP#275
As per recent discussion with Ryan Knox in ESCOMP#275
In my latest run of the test suite, with the changes in #622 (which include changing / removing expensive tests as well as interpolating out-of-the-box initial conditions files), the test suite cost was about 4300 core-hours (down from the previous 7700 core-hours). This is getting close to what I'd consider an acceptable cost. (Though note that there might be some machine variability: these numbers are each from a single run of the test suite.)
Interpolate out-of-the-box initial conditions and remove expensive tests

Two main changes (plus some small additional changes):
1. Removed / reworked some expensive tests
2. Interpolated all out-of-the-box initial conditions, so that the out-of-the-box version is now compatible with our current configuration. The changes from before were (a) our standard configuration now uses the gx1v7 rather than gx1v6 land mask; (b) many inactive points are now absent in memory. See #622 for details.

- Resolves #312
- Partially addresses #275 (just a bit)
At the urging of @mvertens I started to look into what I thought would be the lowest-hanging fruit here: changing CISM2%NOEVOLVE tests to use SGLC. The problem is that we have 34 I compsets that use CISM2%NOEVOLVE that are (or at least were within the last couple of years) deemed useful scientifically. (These represent a mix of CTSM options, time periods, datm modes, etc.) So this leads to two issues:

(1) We'd need to add probably 20 – 30 more I compsets that are just used for testing (not quite 34, because we already have some of the relevant SGLC compsets), further exacerbating our compset explosion.

(2) If we want all of the CISM2%NOEVOLVE compsets to be considered tested, then we'd need to keep at least one test for each of these 34 compsets. In some sense this makes our testing problem worse rather than better... or at least, we'd be improving our testing turnaround at the expense of greater complexity in our test list (the need to ensure that all of these compsets are tested).

So I feel like the solution may be to do one of the following:

(a) Live with the fact that many of the compsets used for production runs won't have any tests (since we'd only be testing the SGLC variation instead).

(b) Stop using CISM in most production runs – basically, doing away with the CISM2%NOEVOLVE configuration and replacing it with SGLC. The implications of this are laid out at https://escomp.github.io/cism-docs/cism-in-cesm/versions/master/html/clm-cism-coupling.html#stub-glc-model-cism-absent. We could nearly erase the effect of point 1 there if we updated CLM's glacier raw dataset to match what is currently used in CISM. (This would have some long-term maintenance cost to keep the two in sync, but CISM doesn't seem to update its datasets all that often.) Point 3 there may be hard for the LIWG to swallow.

(c) Introduce a data glc model and use that in place of CISM2%NOEVOLVE, both for science and for testing. (See also ESMCI/cime#2574.)

I personally feel that (c) is probably the best option. As a side note: Erik has been asked to make a data glc model for his land ice work for Miren Vizcaino; although this is a relatively low priority for his work, it demonstrates that a data glc model would have some scientific as well as software value.
Data point from my most recent run of the test suite on cheyenne:
Somehow it seems like my previous accounting missed a few very expensive tests at C96 resolution, and possibly a few others as well. In my most recent run:
With the changes for #1135 the test suite runs in about 2.5 hours, thanks to the cheyenne-intel build time being reduced to just over 2 hours! (#1135 (comment)). I'd still like to do the things mentioned in this issue, as well as a general test suite overhaul, in part because I foresee a continued gradual expansion of the test suite as we support more and more configurations. But I'm happy that, for now, we have returned to a sub-4-hour test suite; I think this will help significantly with our testing and tagging workflow.
Regarding PTS_MODE testing: I am planning to reduce PTS_MODE testing somewhat in my next tag. @ekluzek points out:
My reply:
Main motivation is to avoid the share-queue-related issues we keep hitting on cheyenne, and also move to slightly better balance between our izumi and cheyenne testing, reducing our overall test turnaround time. I am also combining or cutting single point tests that feel particularly redundant or unnecessary. Partially addresses ESCOMP#275
After the changes in #1660 my testing last night (which I started around 10 pm, so was probably using relatively lightly loaded machines) took the following time:
I'm reasonably happy with this, though it would be great if we could still move some cheyenne_intel tests elsewhere, bringing the total test time to more like 1.5 or 2 hours.
I looked into the total core-hour requirement from recent runs of the aux_clm test suite. It is about 6200 core-hours (there was close agreement between my two most recent runs of the full aux_clm test suite), plus about another 100 core-hours for the create_test tasks to do all of the builds. It would be great if we could reduce this somewhat: I think a good goal would be to get it down to about 4000 core-hours. There is some low-hanging fruit here: the 5 most expensive tests account for nearly 25% of the total core-hour cost of our test suite (see the attached spreadsheet for details). These five tests are:
@ekluzek other than the "prescribed" test (which I know needs to be at f09), do you know if some or all of these others can safely be moved to coarse resolution to help save about 20% of our aux_clm testing cost? For tests that need to stay at high resolution, one partial solution may be to use a smaller PE layout. For example, I notice that the prescribed test has about 4x the cost with nuopc/cmeps/cdeps as it did with mct. It looks like nearly all of this additional cost is in model initialization: model initialization takes much longer than the model run time for this test. My guess is that the model initialization time would not change much (and might even get faster) if we reduce this to a relatively small PE count, which could give a significant cost savings.
Yes, the prescribed test has to be at f09. You could maybe make it shorter though; you'd probably have to make it something like 9 steps. The I2000Clm50Vic test uses VIC, for which we only have a few surface datasets to work with. But it does look like you could use 2-degree or even 10x15, so I'd recommend using 10x15 since we did add that capability for low resolution. In principle, one reason for having some f09 and f19 tests was that those are the workhorse resolutions for science. But I think it would be OK to say that the infrequent running of the ctsm_sci test list is sufficient to count for that. So lower-resolution testing for our standard testing sounds like the right way to go.
What about the SSP test, the pauseResume test, and …?
I couldn't see any reason they had to be at f09 or f19. I think it was just to make sure we had some tests at those resolutions. So I would try to move them to f10 and see if it works. We'll just rely on the ctsm_sci test list to make sure that f09/f19 keep working.
A few tests were responsible for a large fraction of the total cheyenne core-hour cost of aux_clm testing (see ESCOMP#275 (comment)). This commit reduces or eliminates some of these most expensive tests by:
(1) Eliminating some redundant tests
(2) Changing the resolution to a coarse resolution where possible
(3) Using a small processor count (_PS) where we need to maintain a high resolution. (The run time of these changed tests is about the same as before, since most of the time was being spent in model initialization, but now the cost is much lower.)
In some cases, I have moved tests to the ctsm_sci test list, where it felt helpful to have occasional tests of these production resolutions.
The comments from the last few days are addressed in 3244887. See the diff and commit message in that commit for details. I have verified that all of the changed tests pass. Using a small PE count for f09 tests seems to be a good approach that we could apply to other tests. For now I have just applied it to these two particularly expensive tests:
This "small" f09 PE layout uses 5 nodes instead of 51, yet gives about the same test run time as before (because most of the time spent in these tests is in model initialization) – thus cutting the expense by about a factor of 10, and generally leading to lower queue wait times. (Actually, the nuopc version of this test is faster with a much smaller processor count.) I haven't tried doing this for other tests yet, but I think we should strongly consider this moving forward.
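(For reference, a small layout can also be pinned directly in a test's name via the `_P<tasks>x<threads>` modifier; the counts below are illustrative rather than the exact layout used for these tests:)

```sh
# Default f09 PE layout for this short SMS test (several dozen cheyenne nodes):
./cime/scripts/create_test SMS_D_Ld1.f09_g17.I1850Clm50Sp.cheyenne_intel.clm-default

# Same test forced onto 144 MPI tasks x 1 thread (4 cheyenne nodes). Run time stays
# similar because initialization dominates, but the core-hour cost drops sharply.
./cime/scripts/create_test SMS_P144x1_D_Ld1.f09_g17.I1850Clm50Sp.cheyenne_intel.clm-default
```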
With the changes in #1688, where I changed high resolution tests to use only 4 nodes and removed some redundant / unnecessary tests, I have managed to reduce the core-hour cost of the test suite substantially, and reduced the turnaround time a bit. The total core-hour cost is now just 1774 core-hours, plus another 101 core-hours for the builds. This is down from 6200 core-hours as of a few weeks ago (#275 (comment)). We have a few 5-node tests – the f45 tests; otherwise, all tests use 4 or fewer nodes. Therefore, in addition to the cost being lower, it should almost always be reasonable to run in the regular queue rather than premium, further reducing the cost of our testing. Total testing time from start to finish from my last run of the test suite was:
Note that I think the izumi tests would have completed sooner if not for the fact that we are limited in how many jobs we can run simultaneously on izumi, so jobs tend to stay stuck in the queue for a while. The test that I was concerned about taking a longer time to run…

I still feel like it would be great if we could continue to remove some redundant tests, and/or move some more testing to gnu to get better balance between the different compilers (though I've been reluctant to move too much debug testing to gnu due to ESMCI/ccs_config_cesm#4). A reliably sub-2-hour turnaround time would be great, and the lower the better (as long as we're still covering everything important). But I'm much happier with how things are now.
There are a number of changes we could make to shorten the turnaround time of the test suite while increasing coverage. This issue documents some ideas.
I feel like our goal should be a sub-4-hour test suite – or at the very least, sub-4 hours for the subset of the test suite for which baseline comparisons are important (see a comment below for thoughts on possibly separating this subset of the test suite). Edit (2019-01-11): I'd also like to get the core-hour requirement down to something like 2,000, and ideally even lower.
Here are some ideas:
- Change a bunch of tests from 1850 to transient: this covers more code (e.g., http://bugs.cgd.ucar.edu/show_bug.cgi?id=2207 slipped through because we didn't test ciso in combination with a transient case).
- Rework the test suite to get better balance between different compilers, to shorten turnaround time. If you fire off a different test suite for each machine/compiler (which is the standard practice), then the ideal is to have roughly equal balance between the different `machine_compiler` combinations, at least for `cheyenne_intel`, `cheyenne_gnu` and `hobart_nag`. (However, gnu is less good at picking up debug issues, so it's not good to have important configurations only covered by gnu debug.)
- Remove or rework tests that take a large number of processor-hours. `SMS_Ln9.ne30pg2_ne30pg2_mg17.I2000Clm50BgcCrop.cheyenne_intel.clm-clm50cam6LndTuningMode` takes a long time, due to the time needed for init_interp.
- (2018-10-27) Make more tests use SGLC rather than CISM, to shorten build times.
- (2019-01-23) Remove / rework some tests that have a long wallclock time, unless the long time is really necessary for the test. For example, `SMS_D_Ly2.1x1_numaIA.IHistClm50BgcCropGs.cheyenne_intel.clm-ciso_bombspike1963` currently takes > 2 hours.
- (2019-04-29) Consider changing some multi-year tests to short, decStart tests. In particular: we have at least one or two (and possibly more) multi-year tests because of "Carbon balance error in decStart test just after new year" (#404). With ctsm1.0.dev036, that issue seems to be resolved. We should look critically at our global tests that cross the year boundary and consider if any of them can/should be changed to decStart tests.
- (2019-05-01) I don't think we're getting much benefit from long single-point tests like `ERS_Ly20_Mmpi-serial.1x1_numaIA.I2000Clm50BgcDvCropQianGs.cheyenne_intel.clm-cropMonthOutput` (and maybe others?), unless they are constructed specially to target particular things (like `ERS_Ly6_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianGs.cheyenne_intel.clm-cropMonthOutput  # restart is right before increasing natural veg to > 0 while also shifting PCT_CFT`). I think that, at one point, these long tests were important to test crop more fully, since most crop tests started from cold start and only tested the first few years of crop, at most. Now that nearly everything uses spun-up initial conditions, this is no longer the case. One way to keep these long single-point tests valuable would be to make them use cold start: then they would be testing the first couple of decades of a run from cold start, which isn't otherwise tested.
- (2020-05-20) Substantially trim the PTS mode testing (ptsRLA, ptsRLB, ptsROA tests). Erik says: … Then: …
- (2020-06-07) We could change a bunch of our tests to use Qian datm forcing in order to speed up datm, since this likely wouldn't decrease the test coverage of the CTSM code. (However, this is only helpful if datm is a limitation in tests; this is true for single-point cases, but I'm not sure if it's true for global cases. We should check whether datm is a limiting factor in our typical f10 cases and other tests; if so, we should consider making this or a similar change.)
- (2020-09-18) For tests that use a lot of processors: make some changes so they get through the queue faster, either by setting the processor count in the test name (e.g., `_P720x1`) or by using the mechanism to give multiple PE layouts for a given configuration in `config_pes.xml`. One of these tests has since been changed to `P360x2`, reducing its TOTALPES from 3528 to 720. This reduced its queue wait time from hours to seconds. I thought about introducing an alternative PE layout in `config_pes.xml`, but that felt overly complex given that we only need this for testing. (At first I liked that an alternative PE layout could allow us to keep datm on its own processors, but from looking at the test timing, this wouldn't make much, if any, difference.)
- (2020-10-06) Single point tests on cheyenne have seemed flaky lately, frequently dying due to system issues. My guess is that this is because they are on shared nodes. Is there something we can do about this? We could move them to full nodes on izumi, or maybe even full nodes on cheyenne if they are short enough... though it tends to be the longer tests that are dying. So maybe the solution is just to get rid of the long single-point tests, as noted above.
- (2020-10-19) Make sure that all tests on izumi use reduced output, since i/o is often a huge runtime cost of the izumi tests. (A sketch of what such a testmod might look like follows this list.)
- (2020-11-04) Consider making all izumi tests single-node, since multi-node tests may be more prone to system issues.
- (2021-03-29) Consider setting the co2 coupling flag so that lnd -> atm co2 fluxes are sent in all tests where this is possible, in order to test this coupling.
- (2021-04-15) I think we could decrease our ERI testing: it's important to have some ERI testing, but it's probably sufficient to just have a few such tests. I would especially like to reduce or eliminate our ERI testing on izumi: with the multiple runs of this test type, it is more prone to the periodic system failures we see on izumi, and there may be problems with rerunning ERI tests (as seen in "HLM-side changes to allow FATES snow occlusion of LAI" (#1324)).
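Regarding the reduced-output item above, here is a minimal sketch of what such a testmod could look like (the testmod name `reducedOutput` is hypothetical; CTSM testmods live under cime_config/testdefs/testmods_dirs/clm/):

```sh
# Hypothetical testmod that trims history output so that i/o does not dominate
# izumi test run times; tests opt in via their testmods (e.g., clm-reducedOutput).
mkdir -p cime_config/testdefs/testmods_dirs/clm/reducedOutput
cat >> cime_config/testdefs/testmods_dirs/clm/reducedOutput/user_nl_clm << 'EOF'
 hist_empty_htapes = .true.   ! drop all default fields from the primary history tape
 hist_fincl1 = 'TSA','GPP'    ! keep a couple of fields so history i/o is still exercised
 hist_nhtfrq = 0              ! monthly averages for the remaining fields
EOF
```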