Current master (c9903bde) not BFB on Edison: 143 vs 265 nodes #1467
Comments
@golaz our nightly testing has revealed a similar problem. We think it was introduced this week and are currently tracking it down (on Slack).
@rljacob - good to hear that this was caught and is being tracked down.
We found the source of our testing problem (PR #1272). It has been removed from master, which itself will change answers (that was a non-BFB PR). Hopefully that also solves your problem, but we're not sure, so please try again with the latest version of master.
Thanks, @rljacob and @bishtgautam. Trying the latest version now.
Ok. What was the last version of master where this worked for you?
Would it be appropriate to try this with DEBUG?
According to my notes, the last time I checked was with b1c676f and it worked. But I don't routinely check, as I was assuming this was part of the standard ACME testing procedure.
Based on redsky testing of master (started 2017-04-28 03:52:49), there must be something subtle such that A_WCYCL1850S at resolution ne30_oECv3_ICG is not reproducible, while A_WCYCL2000 at ne30_oEC is reproducible.
This is helpful. It's unlikely that this is because of 2000 vs 1850 forcing, so most likely it is due to the use of spun-up ocean and sea-ice initial conditions.
@golaz, I also see this in my PE layout experiments (now that I look). In particular, only changing the number of OCN processes was sufficient.
That would suggest the code that does the parallel read/init of the spun-up IC's for the ocean may be an issue.
Tagging @mark-petersen
@worleyph - can you check and see if the test fails for A_WCYCL1850 and ne30_oEC60to30v3? That would definitely point to something in reading the initial condition files...
On Titan, with current master, the Intel compiler, and MPI-only PE layouts that differ only in the number of MPI processes in OCN (512 -> 256), output in atm.log differs at 'nstep, te 5'.
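For reference, one quick way to locate where two runs diverge is to diff the 'nstep, te' diagnostic lines in the two atm.log files. The helper below is a hypothetical sketch, not part of the ACME scripts; the log paths are simply whatever the two PE-layout runs produced.

```python
# Hypothetical helper (not part of ACME): compare the 'nstep, te' energy
# diagnostic lines from two atm.log files and report the first divergence.
import re
import sys

def te_lines(path):
    """Return the 'nstep, te' diagnostic lines from an atm.log file."""
    with open(path) as f:
        return [line.rstrip() for line in f if re.match(r"\s*nstep, te", line)]

def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing line, or None."""
    for i, (a, b) in enumerate(zip(te_lines(log_a), te_lines(log_b))):
        if a != b:
            return i, a, b
    return None

if __name__ == "__main__":
    # Placeholder usage: python compare_te.py atm_512.log atm_256.log
    result = first_divergence(sys.argv[1], sys.argv[2])
    if result is None:
        print("'nstep, te' diagnostics are identical (BFB)")
    else:
        i, a, b = result
        print(f"first differing 'nstep, te' line (index {i}):\n  {a}\n  {b}")
```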
That's a compset/res we test all the time.
I'm trying it.
My version of master had a "bug fix" in ocn_comp_mct.F and ice_comp_mct.F, but I also found other recent A_WCYCL cases that show non-reproducible results when changing the process count in OCN (and that don't have this change).
ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed with hash c9903bd from master. ERP is supposed to change the MPI task counts in all components in the middle of a restart and test BFB.
My tests all have OCN on its own nodes (as do @golaz's experiments). I am building a job with components stacked, to see whether this makes a difference.
This issue isn't involved, right? ESMCI/cime#1433
Doubtful.
Pat, try your test with c9903bd. That version had a passing ERP test on redsky but it failed for Chris.
On Redsky, the ERP_Ld3.ne30_oEC.A_WCYCL2000 test that passed starts with everything stacked on 512 tasks. It then halves them to 256. That should find this bug if it was present in that compset/resolution.
My recent experiments have all been very small node count layouts, on Titan (32 nodes and 16 nodes). I did see the problem on Edison (when I went back to look) for both 173 node and 133 node PE layouts (my attempt to improve on @jonbob's work).
If it starts in the ocean and shows up in the atmosphere, it must be in the coupler. For the ERP_Ld3.ne30_oEC.A_WCYCL2000 test, skybridge tests 128 and 64 MPI task counts. Redsky tests 512 and 256. Blues tests 1024 and 512. Do those 3 pairs of task counts and their associated decompositions not provoke this bug?
@rljacob - they should. Can you compare cpl restart files instead of history files and see if that catches it? Though I agree, if it shows up in the atm it should be in cpl history as well. Any chance the test is not working as it should?
This is the fix to issue #1467, where different block partitions of the ocean did not match bit-for-bit between runs. This PR adds a halo update within the barotropic subcycling of the ocean split explicit timestep. The actual cause of the problem is most likely a few locations with a two-wide halo, where boundary noise is getting into the domain. There may be a more elegant fix to halo creation later on, but this PR provides a fix with the correct result. The change to the elegant halo fix would be bit-for-bit with this PR. This closes #1467
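As a schematic illustration of why a halo update inside the subcycling loop removes the partition dependence, the toy below subcycles a 1-D, 3-point stencil on one block versus two blocks with one-cell halos. It is not the MPAS code; the field, stencil, block count, and halo width are invented for the sketch.

```python
# Schematic illustration (not the MPAS code): why a halo exchange is needed
# inside a subcycling loop for results to be independent of the block
# partition.  A 1-D periodic field is advanced with a 3-point stencil,
# either on a single block or on two blocks with one-cell halos.
import numpy as np

def step(u):
    """One subcycle of a simple 3-point smoothing stencil (periodic)."""
    return 0.5 * u + 0.25 * (np.roll(u, 1) + np.roll(u, -1))

def run_single_block(u, nsub):
    for _ in range(nsub):
        u = step(u)
    return u

def run_two_blocks(u, nsub, exchange_every_subcycle):
    n = u.size // 2
    # Each block stores its owned cells plus a one-cell halo on each side.
    left = np.concatenate(([u[-1]], u[:n], [u[n]]))
    right = np.concatenate(([u[n - 1]], u[n:], [u[0]]))
    for _ in range(nsub):
        # Update owned cells only; halo cells are filled by communication.
        left[1:-1] = 0.5 * left[1:-1] + 0.25 * (left[:-2] + left[2:])
        right[1:-1] = 0.5 * right[1:-1] + 0.25 * (right[:-2] + right[2:])
        if exchange_every_subcycle:
            left[0], left[-1] = right[-2], right[1]
            right[0], right[-1] = left[-2], left[1]
    return np.concatenate((left[1:-1], right[1:-1]))

u0 = np.random.default_rng(0).random(16)
ref = run_single_block(u0, nsub=4)
print(np.allclose(ref, run_two_blocks(u0, 4, True)))   # True: matches single block
print(np.allclose(ref, run_two_blocks(u0, 4, False)))  # False: stale halo values
```

With the exchange inside the loop, the two-block run reproduces the single-block run exactly; without it, the halo cells go stale after the first subcycle and the answer depends on the partition.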
@mark-petersen, I've run out of time to do any more testing at the moment. Hopefully @jonbob and others can take a look at this. Thanks for figuring this out.
cpl history actually has more fields than restart. Pat said 32 and 64 will fail, so I'll try an ERP test with that and check the nstep values too.
The pairs of testing configs I mentioned above were MPI task counts. Pat, did you mean that 32 and 64 MPI tasks show a difference? According to this: #1387 (comment), edison had 960 vs. 720 MPI tasks for the ocean showing a difference. I'd like something smaller that definitely fails.
No - I was using 256 and 512. That doesn't mean that smaller won't show it, but this was a balance between getting something that would run under 30 minutes for 1 day and getting through the debug queue quickly. I didn't spend any time looking for anything smaller.
In MPAS stand-alone, the EC60to30 failed to match between 8 and 16 partitions after one step. It should fail to match in ACME as well, if you want something that small. The graph files are in the repo:
That is, they failed before this PR, and match after this PR.
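A sketch of that kind of stand-alone check: require exact, bitwise agreement between the one-step outputs of the 8- and 16-partition runs. The file names and the use of the netCDF4 Python package are assumptions for illustration, not the actual MPAS comparison harness.

```python
# Hypothetical bit-for-bit check between two MPAS stand-alone outputs
# (e.g. the 8- and 16-partition EC60to30 runs after one time step).
import numpy as np
from netCDF4 import Dataset

def bfb_compare(file_a, file_b):
    """Return the names of variables whose values are not bitwise identical."""
    mismatches = []
    with Dataset(file_a) as a, Dataset(file_b) as b:
        for name in a.variables:
            if name not in b.variables:
                continue
            va = np.asarray(a.variables[name][:])
            vb = np.asarray(b.variables[name][:])
            if not np.array_equal(va, vb):
                mismatches.append(name)
    return mismatches

# Placeholder file names for the two runs.
diffs = bfb_compare("output_8pe.nc", "output_16pe.nc")
print("BFB" if not diffs else f"non-BFB fields: {diffs}")
```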
…1487) This is the fix to issue #1467, where different block partitions of the ocean did not match bit-for-bit between runs. This PR adds a halo update within the barotropic subcycling of the ocean split explicit timestep. The actual cause of the problem is most likely a few locations with a two-wide halo, where boundary noise is getting into the domain. There may be a more elegant fix to halo creation later on, but this PR provides a fix with the correct result. The change to the elegant halo fix would be bit-for-bit with this PR. Tested with anvil GMPAS-IAF T62_oEC60to30v3 runs using 640 and 1120 ocean pes. Fixes #1467 [non-BFB]
Thanks, Mark. I assume you are doing 2 initial runs. If the 8 pe run started with a restart from the 16 pe run, would the results be different (without the bug fix)?
Yes, my test was to run two simulations, an 8 pe and a 16 pe, from the same initial condition, both for a single time step, and compare the output. In your case, the run is: 16 pe -> restart with 8 pe. What are you comparing to? If it is a full-duration run at either 8 or 16 pe, they should indeed not match.
Suppose you run 10 steps with 16 pe and write a restart at step 5. Pick up that restart with 8 pe and run 5 steps. The values at the end of each run should match. But with the bug, they won't match?
Correct. They should: mismatch before this PR, match after.
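Purely to make the bookkeeping of that protocol concrete, here is a toy version in Python. advance() is a stand-in for the model, and nothing here changes task counts or touches the real ACME restart machinery, so it only shows the expected "restart reproduces the reference" behavior that the bug broke.

```python
# Toy outline of the restart protocol above (the real check restarts ACME
# with a different OCN task count; this toy only shows the expected shape).
import numpy as np

def advance(state, steps):
    """Stand-in for the model: a deterministic update applied 'steps' times."""
    for _ in range(steps):
        state = np.sin(state) + 0.1 * np.roll(state, 1)
    return state

state0 = np.linspace(0.0, 1.0, 32)

# Reference run: 10 steps, with the step-5 state kept as the "restart file".
restart = advance(state0, 5)
reference_end = advance(restart, 5)

# Branch run: pick up the restart and run the remaining 5 steps.  In ACME the
# branch changes the OCN task count; with the halo bug the end states differ,
# and with the fix they are bit-for-bit identical again.
branch_end = advance(restart.copy(), 5)

print(np.array_equal(reference_end, branch_end))  # True: exact restart
```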
One more clarification: will any of the oEC60to30 grids show this or only oEC60to30v3?
Any. @worleyph tried a previous one, and it had a BFB mismatch between different partitions. The versions only differ in how they treat single-cell-wide channels.
FYI, I added this change and … However, …
Has … BTW, the documentation has the wrong comparison case: … the last word should be modpes. Same with …
Yes, that would be a different problem. What machine, compiler, and hash did that test fail on?
This failure is on cori-knl, hash=
And now I'm seeing that this test is passing where it was failing before the change. Actually, I did make one other change: I added an entry for cori-knl to have a more reasonable PE layout for the ne30 case (otherwise the default was something like 4 nodes). I used one similar to edison and updated it slightly to account for 64 cores per node.
@ndkeen ok. Thanks for finding PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S! I was looking for a test that could catch this bug.
Note a run just finished on edison (4 days in the queue) and I updated an old comment where I was trying to organize this; the GNU build of
Yes please.
OK, verified that this fix also allows
I'm afraid I have some bad news to report (or maybe I did something wrong; always a possibility).
I ran some tests with a recent version of master (c9903bd) using compset A_WCYCL1850S and resolution ne30_oECv3_ICG on Edison. I tried the new PE layouts provided by @jonbob for 143 and 265 nodes. Both 5-day tests ran successfully; unfortunately, the results diverge between the two simulations after a few time steps (based on the atm.log files). I verified that BFBFLAG is set to true, so my understanding is that I should get the same results.
Here is the run_acme script:
run_acme.20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison
Output on Edison is under:
/global/cscratch1/sd/golaz/ACME_simulations/20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison/test???