Use small PE counts for aux_clm tests and remove some redundant / unnecessary tests #1688
Conversation
With this small node count, having datm on its own node only makes things a little faster, and adds significantly to the cost (and might lead to slightly longer queue wait times), so I don't think it's worth it. However, datm is slightly faster with just one node rather than 4, so I'm maintaining that aspect of the PE layout.
I compared the detailed timing files from this new PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel test with the original PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel test. For the most part, the relative timings of different pieces are about the same in both - at least for the major culprits, which is what I compared. So this is good: I feel like we can use the new test to still give indications of when the timing of a particular section changes significantly, and be reasonably confident that the same result would hold for an f09 test with higher processor count.
Main goal is to reduce the PE count for aux_clm tests so that all tests fit in 4 nodes or less. This should both (1) reduce queue wait time for tests and (2) reduce aux_clm testing cost (since, for high resolution tests, most time is spent in initialization, so having more PEs increases cost significantly without actually making the tests run much, if any, faster). While doing this, I also noticed some tests that felt redundant or unnecessary, and I have removed those. I will give more complete documentation of the changes in the associated Pull Request.
@ekluzek I'd welcome your sign-off on this before I make a tag with these changes, if you'd like to give it a quick look. I'd suggest just looking at the detailed comment I made above rather than trying to look at the diffs in the testlist xml, which can be hard to read. Note that I have a specific question about the drydepnomegan tests: I removed two tests of drydepnomegan and changed some others to lower resolution; is this okay, or is there particular value in having high-resolution tests of drydepnomegan? @glemieux @rgknox note that I have NOT changed any fates category tests; I wanted to leave it up to you to make any changes to that test category. The main changes needed would be adding a
@billsacks the main reason the drydep test is at f09 is that the times when drydep matters are for cases coupled to CAM with chemistry, which are going to be done at f09 or f19 and never at lower resolution. However, we haven't run into trouble with drydep, so I think it's OK to relax the resolution here for our standard testing, especially with the idea that we'll keep the higher resolution testing in the ctsm_sci test list, which we'll run at a lower frequency. When we had everything in aux_clm I would have been more reluctant to lower it.
The other thing I looked into in terms of resolution was the VOC raw dataset, which we have at half degree resolution; it has really sharp gradients. This comes into play for most of our tests, since we turn MEGAN on by default for most of them. But the main time that VOCs matter is again when coupled to CAM with chemistry on, so it will apply to higher resolution cases. A half degree case is considered the standard resolution for full chemistry. But the question for our testing is: can we detect a problem with drydep or VOCs that would impact higher resolution chemistry simulations from an I compset? Probably the main issue is going to be dataset issues that could be found at any resolution. That, coupled with the fact that we haven't run into trouble with drydep or VOCs, makes me think it would be OK to relax the resolution back to something coarser. We should just make sure the ctsm_sci test list has a good list of resolutions for these full chemistry simulations.
I'm really glad you are doing this. In practice I know how difficult it can be to make these assessments, which is probably why we tend to avoid it. But improving the performance of our testing can really improve the entire workflow for everyone, so it's worth doing.
I agree with most of what you are doing here. I think two of your PE layout changes are wrong, but we should go over that. I also recommend adding back the flooding test, but at a lower resolution.
There's one test you moved to ctsm_sci and I just think there should be a comment on why it's there.
I added a checkbox for the items where I think something needs to happen. For the other cases I just agree with your assessment, so you don't need to do anything.
<rootpe_glc>-1</rootpe_glc>
<rootpe_wav>-1</rootpe_wav>
<rootpe_cpl>-1</rootpe_cpl>
<rootpe_lnd>0</rootpe_lnd>
This looks wrong to me: the f19 layout for izumi/hobart looks like it is concurrent, with one node for datm, but this is starting everything on the datm root node.
- [ ] Make sure this is right.
@billsacks explains below, but basically datm doesn't scale well, so it's better off running on only one node, but sequentially with CTSM. The higher node count setups rely on datm on one node running about as fast as CTSM on many nodes; with smaller node counts we can't do that.
@@ -2411,8 +2313,7 @@
 </test>
 <test name="SMS_Lm1" grid="f19_g17" compset="I1850Clm50BgcCropCmip6waccm" testmods="clm/basic">
   <machines>
-    <machine name="cheyenne" compiler="intel" category="aux_clm"/>
-    <machine name="cheyenne" compiler="intel" category="prebeta"/>
+    <machine name="cheyenne" compiler="intel" category="ctsm_sci"/>
- [ ] Add a note here about why this test is in the ctsm_sci test list.
I know I just said that I'd do this, but I realized that there already is a comment on this test (you need to expand the diff below to see it: "The main point of this test is simply to make sure that the CMIP6WACCMDECK modifier works for 2-degree since that resolution turns off Carbon isotopes"). I'm not sure what else to say besides that.
OK, I looked at the file to see what this looks like. My point is that the other ctsm_sci tests are all for running the science supported compsets/resolutions. This isn't listed as a science_supported compset or resolution so I think it warrants saying why we have this in ctsm_sci. So I suggest adding something like this...
This isn't a science supported compset, but it's critical to run this test at 2-degree resolution which is higher than our standard testing.
Got it, thanks for the clarification; I'll add a note like that.
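For reference, here is a sketch of how the combined note might read on the test entry shown in the diff above. The wording merges the suggestion above with the existing comment on this test; the `<option name="comment">` placement is an assumption about how comments are attached in this test list (a plain XML comment would serve the same purpose):

```xml
<test name="SMS_Lm1" grid="f19_g17" compset="I1850Clm50BgcCropCmip6waccm" testmods="clm/basic">
  <machines>
    <machine name="cheyenne" compiler="intel" category="ctsm_sci">
      <options>
        <!-- Placement is illustrative; the existing comment on this test could simply be extended -->
        <option name="comment">
          This isn't a science supported compset, but it's critical to run this test at 2-degree
          resolution, which is higher than our standard testing: the main point is to make sure
          that the CMIP6WACCMDECK modifier works for 2-degree, since that resolution turns off
          Carbon isotopes.
        </option>
      </options>
    </machine>
  </machines>
</test>
```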
Thanks a lot for your careful look at this, Erik, and for sharing your thoughts. I agree with the thoughts you shared in your initial comments on the PR – that, although some of these tests are mainly / only relevant at higher resolutions, we probably have good enough code coverage of the relevant code by running lower resolution tests in aux_clm, saving the higher resolution versions for ctsm_sci. (Really, the same can probably be said of most of our configurations.)

Regarding the PE layout changes: I initially started with layouts with datm concurrent with lnd, as we do for our production PE layouts. However, I found that that doesn't seem to make sense for these small PE layouts. The reason it makes sense for the large PE layouts is that the datm run time is roughly similar to the lnd run time. With these small PE layouts for f09 and f19, however, the time is dominated by the lnd run time. So making datm concurrent with lnd leads to a situation where the datm node is sitting idle for most of the run. With the current PE layout, the non-datm nodes sit idle while datm runs, but that turns out to be less wasteful than having the datm node sit idle for most of the run.

To put some specific numbers on it (the arithmetic is spelled out just below): for the f19 case, datm runs at 0.474 seconds/mday and lnd runs at 3.857 seconds/mday. So the PE layout I have in place means that 3 of the 4 nodes sit idle for about 11% of the run time while datm runs. If we instead made datm concurrent with lnd, we would have a situation where 1 of the 5 nodes sits idle for about 89% of the run time. So the present PE layout is somewhat lower cost, at a slight additional cost in run time (but potentially less queue wait time, since we're only asking for 4 nodes rather than 5). The difference is bigger in the f09 case, where datm runs at 0.864 seconds/mday and lnd at 10.518 seconds/mday, so 3 of the 4 nodes sit idle for only about 8% of the run time, whereas a concurrent layout would lead to 1 of the 5 nodes sitting idle for about 92% of the run time.

The differences here aren't huge, and if you'd rather have tests that are slightly faster but also slightly more expensive, I'm fine changing these layouts so datm runs concurrently. But I just wanted to point out that what makes sense for high PE counts doesn't make sense in all the same ways for these lower PE counts.

Regarding the flooding tests: we originally had two, a non-debug one and a debug one:

ERP_Ld5.f19_g17.I2000Clm50SpRtmFl.cheyenne_gnu.clm-default
ERP_P180x2_D.f19_g17.I2000Clm50SpRtmFl.cheyenne_intel.clm-default

I moved the debug one to lower resolution.

Finally, I will add a comment about the new test in the ctsm_sci test list: thanks for that suggestion.
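The idle-node percentages above come out of the following arithmetic (just a worked restatement of the timings already quoted, nothing new):

```latex
% f19 case: t_datm = 0.474 s/mday, t_lnd = 3.857 s/mday
% Sequential 4-node layout: the 3 non-datm nodes sit idle only while datm runs
\[ \frac{t_{\text{datm}}}{t_{\text{datm}} + t_{\text{lnd}}} = \frac{0.474}{0.474 + 3.857} \approx 0.11 \]
% Concurrent 5-node layout: the datm node sits idle while lnd finishes
\[ \frac{t_{\text{lnd}} - t_{\text{datm}}}{t_{\text{lnd}}} = \frac{3.857 - 0.474}{3.857} \approx 0.88 \]
% f09 case (t_datm = 0.864, t_lnd = 10.518):
%   sequential: 0.864 / 11.382 is about 0.08;  concurrent: 9.654 / 10.518 is about 0.92
```

(The 0.88 here versus the ~89% quoted above is just rounding.)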
Description of changes
This PR reduces the PE count for aux_clm tests so that all (I think) tests fit in 4 nodes or less. This should
both (1) reduce queue wait time for tests and (2) reduce aux_clm testing cost (since, for high resolution
tests, most time is spent in initialization, so having more PEs increases cost significantly without actually
making the tests run much, if any, faster). While doing this, I also noticed some tests that felt redundant or unnecessary, and I have removed those.
To facilitate these changes, I have introduced new "S" PE layouts for the f09, f19 and 0.125nldas2 resolutions. These layouts use 4 nodes (but with DATM on just 1 node, because it is slightly faster to run it on one node than four nodes). (I considered running DATM on its own node like we do for production PE layouts, but that turns out to only improve runtime slightly while increasing cost significantly, so I am having DATM share one of the nodes used for LND.)
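To make the shape of these layouts concrete, here is a rough sketch of what such an "S" entry could look like in config_pes.xml. The ntasks/rootpe element names follow the rootpe_* entries visible in the diff, but the specific components and values shown are illustrative assumptions, not the exact layouts added in this PR (in CIME, a negative ntasks value means that many whole nodes):

```xml
<!-- Illustrative sketch only: a small 4-node layout with LND on all 4 nodes and DATM on
     just 1 node, sharing PEs with LND so the two run sequentially rather than concurrently. -->
<pes pesize="S" compset="any">
  <comment>aux_clm small layout: 4 nodes for lnd; datm on 1 node, overlapping lnd</comment>
  <ntasks>
    <ntasks_atm>-1</ntasks_atm>  <!-- datm: 1 node (slightly faster on 1 node than on 4) -->
    <ntasks_lnd>-4</ntasks_lnd>  <!-- ctsm: 4 nodes -->
    <ntasks_rof>-4</ntasks_rof>
    <ntasks_cpl>-4</ntasks_cpl>
  </ntasks>
  <rootpe>
    <rootpe_atm>0</rootpe_atm>   <!-- same root as lnd, so datm runs sequentially with lnd -->
    <rootpe_lnd>0</rootpe_lnd>
    <rootpe_rof>0</rootpe_rof>
    <rootpe_cpl>0</rootpe_cpl>
  </rootpe>
</pes>
```

A concurrent variant (like the production layouts) would instead root datm on PEs outside the lnd range so it gets its own node; as discussed in the conversation above, for these small PE counts that buys only a little run time at noticeably higher cost.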
Here is a detailed list of the changes:
Changes to aux_clm test list
I changed the following:
In addition, I moved the following tests to ctsm_sci:
and I completely removed the following tests:
In general, these removed tests felt redundant with existing tests (including some low-resolution tests that I happened to notice as feeling redundant as I was reworking the testing). However, I am not totally confident about my changes and removal of some drydepnomegan tests: I removed two tests of drydepnomegan and changed some others to lower resolution; is this okay, or is there particular value in having high-resolution tests of drydepnomegan?
Changes to ctsm_sci test list
See notes above for tests moved from the aux_clm test list to the ctsm_sci test list; other than that, I did not make any changes to the ctsm_sci test list.
Changes to prealpha test list
I made the following changes to the prealpha test list to keep it in line with the aux_clm test list:
Changes to prebeta test list
I made the following changes to the prebeta test list to keep it in line with the aux_clm test list:
In addition, I removed the following from the prebeta test list:
ERP_Ld5.f19_g17.I2000Clm50SpRtmFl.cheyenne_gnu.clm-default
SMS_Ld1.f19_g17.I2000Clm50BgcCru.cheyenne_intel.clm-default
SMS_Lm1.f19_g17.I1850Clm50BgcCropCmip6waccm.cheyenne_intel.clm-basic # The main point of this test is simply to make sure that the CMIP6WACCMDECK modifier works for 2-degree since that resolution turns off Carbon isotopes
For the removed tests, we already have similar coverage in the prealpha and/or prebeta test suites. (It is unnecessary to have the same tests in both the prealpha and prebeta test lists, because the prebeta test list is generally only run on tags that have already been through prealpha testing.)
Other test categories
I did not make any changes to other test categories: fates, aux_cime_baselines, clm_pymods or clm_short. The fates test category has some high resolution tests that should probably be changed similarly to what I did here, but I'd like to leave that up to fates developers.
Some notes on the changed tests
All of the new (changed) tests pass. Nearly all of them run in less than 10 min (and some of them actually run faster than before); the one exception is SMS_Lm13_PS.f19_g17.I2000Clm51BgcCrop.cheyenne_intel.clm-cropMonthOutput, which has a 31 min run time; that is as expected. I feel like that's acceptable: it's in the neighborhood of some of our other long tests, and I feel like it's worth having one more long-running test in order to avoid having anything that might sit in the queue too long. We could consider putting this test on something like 8 nodes instead of 4 if we find that it is consistently the last test to finish, but for now I wanted to try for a goal of having all tests on 4 nodes or fewer.

I compared the detailed timing files from the new PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel test with the original PFS_Ld20.f09_g17.I2000Clm50BgcCrop.cheyenne_intel test. For the most part, the relative timings of different pieces are about the same in both – at least for the major culprits, which is what I compared. So this is good: I feel like we can use the new test to still give indications of when the timing of a particular section changes significantly, and be reasonably confident that the same result would hold for an f09 test with higher processor count.

Specific notes
Contributors other than yourself, if any:
CTSM Issues Fixed (include github issue #):
Are answers expected to change (and if so in what way)? NO (but there will be baseline failures)
Any User Interface Changes (namelist or namelist defaults changes)? NO
Testing performed, if any:
Ran all new/changed tests: