Current master (c9903bde) not BFB on Edison: 143 vs 265 nodes #1467

Closed
golaz opened this issue Apr 27, 2017 · 144 comments · Fixed by #1487

golaz commented Apr 27, 2017

I'm afraid I have some bad news to report (or maybe I did something wrong; always a possibility).

I ran some tests with a recent version of master (c9903bd) using compset A_WCYCL1850S and resolution ne30_oECv3_ICG on Edison. I tried the new PE layouts provided by @jonbob for 143 and 265 nodes. Both 5-day tests ran successfully; unfortunately, the results diverge between the two simulations after a few time steps (based on the atm.log files). I verified that BFBFLAG is set to TRUE, so my understanding is that I should get the same results.

Here is the run_acme script:
run_acme.20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison

Output on Edison is under:
/global/cscratch1/sd/golaz/ACME_simulations/20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison/test???
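
For reference, a minimal sketch of how the divergence can be checked from the two case directories above (case names and log file names here are illustrative, not the actual ones):

# Sketch only: case and log names are illustrative; adjust if logs are gzipped.
# 1) Confirm both cases request bit-for-bit (reproducible) coupler sums.
(cd test143 && ./xmlquery BFBFLAG)
(cd test265 && ./xmlquery BFBFLAG)

# 2) Compare the atmosphere energy diagnostics ('nstep, te' lines) step by step;
#    the first differing line shows roughly where the runs diverge.
grep 'nstep, te' test143/run/atm.log.* > te_143.txt
grep 'nstep, te' test265/run/atm.log.* > te_265.txt
diff te_143.txt te_265.txt | head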

rljacob commented Apr 27, 2017

@golaz our nightly testing has revealed a similar problem. We think it was introduced this week and are currently tracking it down (on Slack).

golaz commented Apr 27, 2017

@rljacob - good to hear that this was caught and is being tracked down.

rljacob commented Apr 27, 2017

We found the source of our testing problem (PR #1272). It has been removed from master, which will itself change answers (that was a non-BFB PR). Hopefully that also solves your problem, but we're not sure, so please try again with the latest version of master.

golaz commented Apr 27, 2017

Thanks, @rljacob and @bishtgautam. Trying now the latest version.

golaz commented Apr 28, 2017

@rljacob: I tried again with beea721, and unfortunately the results are still not BFB when comparing 133, 143, and 265 nodes.

rljacob commented Apr 28, 2017

Ok. What was the last version of master where this worked for you?

ndkeen commented Apr 28, 2017

Would it be appropriate to try this with DEBUG?

golaz commented Apr 28, 2017

According to my notes, the last time I checked was with b1c676f and it worked. But I don't routinely check, as I was assuming this was part of the standard ACME testing procedure.

mt5555 commented Apr 28, 2017

Based on redsky testing of master started 2017-04-28 03:52:49, ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed.

So there must be something subtle such that A_WCYCL1850S with resolution ne30_oECv3_ICG is not reproducible, while A_WCYCL2000 with ne30_oEC is.

golaz commented Apr 28, 2017

This is helpful. It's unlikely that this is because of 2000 vs 1850 forcing, so most likely it is due to the use of spun-up ocean and sea-ice initial conditions.

@worleyph

@golaz, I also see this in my PE layout experiments (now that I look). In particular, changing only the number of OCN processes was sufficient.

rljacob commented Apr 28, 2017

That would suggest the code that does the parallel read/initialization of the spun-up ICs for the ocean may be the issue.

rljacob commented Apr 28, 2017

Tagging @mark-petersen

jonbob commented Apr 28, 2017

@worleyph - can you check and see if the test fails for A_WCYCL1850 and ne30_oEC60to30v3? That would definitely point to something in reading the initial condition files...

@worleyph

On Titan, with current master and the Intel compiler, using

 -compset A_WCYCL2000 -res ne30_oEC

and MPI-only PE layouts that differ only in the number of MPI processes in the OCN (512 -> 256), the output in atm.log differs at 'nstep, te' 5.

rljacob commented Apr 28, 2017

That's a compset/resolution we test all the time.

ndkeen commented Apr 28, 2017

I'm trying

./create_test PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S
./create_test ERP_Ld3.ne30_oEC.A_WCYCL2000
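
As a usage note (a sketch; flag support depends on the CIME version in this tree), the machine and compiler can also be pinned in the test name, the same way the redsky_intel test above was:

# Sketch: pin machine/compiler via the test-name suffix; the test-id is arbitrary.
./create_test PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S.edison_intel --test-id bfb1467
./create_test ERP_Ld3.ne30_oEC.A_WCYCL2000.edison_intel --test-id bfb1467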

@worleyph

My version of master had a "bug fix" in ocn_comp_mct.F and ice_comp_mct.F:

 300c300
 <     call MPAS_io_set_iotype(domain % iocontext, pio_iotype)
 ---
 > !pw    call MPAS_io_set_iotype(domain % iocontext, pio_iotype)

but I also found other recent A_WCYCL cases that show non-reproducible results when changing the OCN process count (and those do not have this change).

rljacob commented Apr 28, 2017

ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed with hash c9903bd from master. ERP is supposed to change the MPI task counts in all components in the middle of a restart and test for BFB.

@worleyph

My tests all have OCN on its own nodes (as do @golaz's experiments). I am building a job with the components stacked to see whether this makes a difference.

ndkeen commented Apr 28, 2017

This issue isn't involved, right? ESMCI/cime#1433

rljacob commented Apr 28, 2017

Doubtful.

rljacob commented Apr 28, 2017

Pat, try your test with c9903bd. That version had a passing ERP test on redsky but it failed for Chris.

rljacob commented Apr 28, 2017

On Redsky, the ERP_Ld3.ne30_oEC.A_WCYCL2000 test that passed starts with everything stacked on 512 tasks. It then halves them to 256. That should find this bug if it was present in that compset/resolution.

worleyph commented May 3, 2017

My recent experiments have all been very small node-count layouts on Titan (32 nodes and 16 nodes). I did see the problem on Edison (when I went back to look) for both 173-node and 133-node PE layouts (my attempt to improve on @jonbob's work).

rljacob commented May 3, 2017

If it starts in the ocean and shows up in the atmosphere, it must be in the coupler.

For the ERP_Ld3.ne30_oEC.A_WCYCL2000 test, skybridge tests 128 and 64 MPI task counts, Redsky tests 512 and 256, and Blues tests 1024 and 512. Do those three pairs of task counts and their associated decompositions not provoke this bug?

jonbob commented May 3, 2017

@rljacob - they should. Can you compare cpl restart files instead of history files and see if that catches it? Though I agree, if it shows up in the atm it should be in cpl history as well. Any chance the test is not working as it should?
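
One way to do that comparison directly is with cprnc; a sketch, where the file names and the CCSM_CPRNC lookup are assumptions about the local setup:

# Sketch: compare coupler history (or restart) files from two runs with cprnc.
# Run from one of the case directories; file names are illustrative and the real
# ones carry the case name and a date stamp.
CPRNC=$(./xmlquery --value CCSM_CPRNC)    # path to the machine's cprnc build
$CPRNC run_a/caseA.cpl.hi.0001-01-02-00000.nc \
       run_b/caseB.cpl.hi.0001-01-02-00000.nc | tail -n 20
# The final diff_test summary reports whether the files are identical or differ.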

mark-petersen added a commit that referenced this issue May 3, 2017
This is the fix to issue #1467, where different block partitions of
the ocean did not match bit-for-bit between runs. This PR adds a halo
update within the barotropic subcycling of the ocean split-explicit
timestep. The actual cause of the problem is most likely a few
locations with a two-wide halo, where boundary noise is getting into the
domain. There may be a more elegant fix to halo creation later on, but
this PR provides a fix with the correct result. The change to the more
elegant halo fix would be bit-for-bit with this PR.

This closes #1467
@mark-petersen

See #1487. @worleyph and @jonbob, please do any further testing on that PR, which should be the same as what you have been testing today.

worleyph commented May 3, 2017

@mark-petersen , I've run out of time to do any more testing at the moment. Hopefully @jonbob and others can take a look at this. Thanks for figuring this out.

rljacob commented May 3, 2017

The cpl history actually has more fields than the restart. Pat said 32 and 64 will fail, so I'll try an ERP test with that and check the nstep values too.

rljacob commented May 3, 2017

The pairs of testing configs I mentioned above were MPI task counts. Pat, did you mean that 32 and 64 MPI tasks show a difference?

According to this comment, #1387 (comment), Edison had 960 vs. 720 MPI tasks for the ocean showing a difference. I'd like something smaller that definitely fails.

worleyph commented May 3, 2017

No, I was using 256 and 512. That doesn't mean that smaller won't show it, but this was a balance between something that would run 1 day in under 30 minutes and getting through the debug queue quickly. I didn't spend any time looking for anything smaller.

mark-petersen commented May 3, 2017

In stand-alone MPAS, the EC60to30 mesh failed to match between 8 and 16 partitions after one step. It should fail to match in ACME as well, if you want something that small. The graph files are in the repo:

inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222.part.8
inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222.part.16

That is, they failed before this PR, and match after this PR.
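
For anyone reproducing the stand-alone check at other process counts, a partition file can be generated from the same graph file with METIS (a sketch; the gpmetis executable location is machine-dependent):

# Sketch: build an MPAS block partition for N processes from the graph file.
# gpmetis writes <graphfile>.part.<N>, matching the file names listed above.
N=12
gpmetis inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222 $N
# -> inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222.part.12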

jonbob added a commit that referenced this issue May 3, 2017
…1487)

This is the fix to issue #1467, where different block partitions of
the ocean did not match bit-for-bit between runs. This PR adds a halo
update within the barotropic subcycling of the ocean split-explicit
timestep. The actual cause of the problem is most likely a few
locations with a two-wide halo, where boundary noise is getting into the
domain. There may be a more elegant fix to halo creation later on, but
this PR provides a fix with the correct result. The change to the more
elegant halo fix would be bit-for-bit with this PR.

Tested with anvil GMPAS-IAF T62_oEC60to30v3 runs using 640 and 1120
ocean pes

Fixes #1467

[non-BFB]
rljacob commented May 3, 2017

Thanks, Mark. I assume you are doing two initial runs. If the 8-pe run started with a restart from the 16-pe run, would the results be different (without the bug fix)?

@mark-petersen

Yes, my test was to run two simulations, one with 8 pes and one with 16 pes, from the same initial condition, both for a single time step, and compare the output.

In your case, you run with 16 pes and then restart with 8 pes. What are you comparing to? If it is a full-duration run at either 8 or 16 pes, they should indeed not match.

rljacob commented May 3, 2017

Suppose you run 10 steps with 16 pes and write a restart at step 5. Pick up that restart with 8 pes and run 5 steps. The values at the end of each run should match. But with the bug, they won't match?

@mark-petersen

Correct. They should: mismatch before this PR, match after.
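
Spelled out as case settings, the restart check described above looks roughly like this (a sketch only; variable names follow CIME conventions, and the exact setup/rebuild steps for a PE-layout change vary by version):

# Sketch of the manual restart check; details vary by CIME version.
# Run 1: 16 ocean tasks, 10 steps, restart written at step 5.
./xmlchange NTASKS_OCN=16
./xmlchange STOP_OPTION=nsteps,STOP_N=10,REST_OPTION=nsteps,REST_N=5
./case.submit

# Run 2: resume from the step-5 restart with 8 ocean tasks for 5 more steps.
# A PE-layout change generally requires re-running case.setup and rebuilding.
./xmlchange NTASKS_OCN=8,CONTINUE_RUN=TRUE,STOP_N=5
./case.submit

# Compare end-of-run output from the two runs (cprnc, or the 'nstep, te' lines):
# before the #1487 fix they differ; after it they should match bit for bit.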

rljacob commented May 4, 2017

One more clarification: will any of the oEC60to30 grids show this or only oEC60to30v3?

@mark-petersen

Any. @worleyph tried a previous one, and it had a BFB mismatch between different partitions. The versions differ only in how they treat single-cell-wide channels.

ndkeen commented May 4, 2017

FYI, I added this change and PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S now passes on cori-knl (it was failing the compare before).

However, PEM_Ln9.ne4_oQU240.A_WCYCL2000 still fails the compare.

@mark-petersen

Has PEM_Ln9.ne4_oQU240.A_WCYCL2000 ever passed the compare? I suspect it is a different problem than the one we just fixed here.

BTW, the documentation has the wrong comparison case. In cime/scripts/lib/CIME/SystemTests/README:

PEM    modified pe counts mpi bfb test (seq tests)
       do an initial run with default pe layout                               (suffix: base)
       do another initial run with modified pes (NTASKS_XXX => NTASKS_XXX/2)  (suffix: modpes)
       compare base and single_thread

The last word should be modpes. The same error is in cime/config/config_tests.xml (line 87), but it is correct in the actual test implementation, cime/scripts/lib/CIME/SystemTests/pem.py.

rljacob commented May 4, 2017

Yes, that would be a different problem. What machine, compiler, and hash did that test fail on?

ndkeen commented May 4, 2017

The failure is on cori-knl with hash beea721. The only changes I've made are the ones above.

PEM_D_Ld3.T62_oQU120.CMPASO-NYF passes (as it always has).

And I'm now seeing that PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S passes where it was failing before the change.

Actually, I did make one other change: I added an entry for cori-knl with a more reasonable PE layout for the ne30 case (otherwise the default was only about 4 nodes). I used one similar to Edison's, updated slightly to account for 64 cores per node.

+  <grid name="a%ne30np4">
+    <mach name="cori-knl">
+      <pes compset="CAM5.+CLM45.+MPASCICE.+MPASO.+MOSART.+SGLC.+SWAV" pesize="any">
+        <comment>"first try"</comment>
+        <ntasks>
+          <ntasks_atm>5400</ntasks_atm>
+          <ntasks_lnd>640</ntasks_lnd>
+          <ntasks_rof>640</ntasks_rof>
+          <ntasks_ice>4800</ntasks_ice>
+          <ntasks_ocn>3600</ntasks_ocn>
+          <ntasks_glc>640</ntasks_glc>
+          <ntasks_wav>4800</ntasks_wav>
+          <ntasks_cpl>4800</ntasks_cpl>
+        </ntasks>
+        <nthrds>
+          <nthrds_atm>1</nthrds_atm>
+          <nthrds_lnd>1</nthrds_lnd>
+          <nthrds_rof>1</nthrds_rof>
+          <nthrds_ice>1</nthrds_ice>
+          <nthrds_ocn>1</nthrds_ocn>
+          <nthrds_glc>1</nthrds_glc>
+          <nthrds_wav>1</nthrds_wav>
+          <nthrds_cpl>1</nthrds_cpl>
+        </nthrds>
+        <rootpe>
+          <rootpe_atm>0</rootpe_atm>
+          <rootpe_lnd>4800</rootpe_lnd>
+          <rootpe_rof>4800</rootpe_rof>
+          <rootpe_ice>0</rootpe_ice>
+          <rootpe_ocn>5440</rootpe_ocn>
+          <rootpe_glc>4800</rootpe_glc>
+          <rootpe_wav>0</rootpe_wav>
+          <rootpe_cpl>0</rootpe_cpl>
+        </rootpe>
+      </pes>
+    </mach>
+  </grid>
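
The same layout can also be applied to an existing case with xmlchange instead of editing config_pes.xml; a minimal sketch (abbreviated to a few components, and the setup/build commands depend on the CIME version):

# Sketch: apply part of the layout above to an existing case.
./xmlchange NTASKS_ATM=5400,NTASKS_OCN=3600,NTASKS_ICE=4800,NTASKS_CPL=4800
./xmlchange ROOTPE_OCN=5440,ROOTPE_LND=4800,ROOTPE_ROF=4800,ROOTPE_GLC=4800
./case.setup --reset && ./case.build    # PE-layout changes need setup and a rebuild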

rljacob commented May 4, 2017

@ndkeen, OK. Thanks for finding PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S! I was looking for a test that could catch this bug.

ndkeen commented May 4, 2017

Note: a run just finished on Edison (after 4 days in the queue), and I updated an old comment where I was trying to organize this: the GNU build of PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S is failing the compare. That was before this change. Should I try it again with the change?

rljacob commented May 4, 2017

Yes please.

jonbob added a commit that referenced this issue May 4, 2017
This is the fix to issue #1467, where different block partitions of
the ocean did not match bit-for-bit between runs. This PR adds a halo
update within the barotropic subcycling of the ocean split-explicit
timestep. The actual cause of the problem is most likely a few
locations with a two-wide halo, where boundary noise is getting into the
domain. There may be a more elegant fix to halo creation later on, but
this PR provides a fix with the correct result. The change to the more
elegant halo fix would be bit-for-bit with this PR.

Tested with anvil GMPAS-IAF T62_oEC60to30v3 runs using 640 and 1120
ocean pes

Fixes #1467

[non-BFB]
ndkeen commented May 8, 2017

OK, verified that this fix also allows PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S to pass on Edison with GNU.

jgfouca added a commit that referenced this issue Jun 2, 2017
env_mach_specific.xsd schema was incomplete

The env_mach_specific.xsd file introduced yesterday was incomplete and caused errors on some platforms.
Test suite: scripts_regression_tests.py (on cheyenne)
Test status: bit for bit