Current master (c9903bde) not BFB on Edison: 143 vs 265 nodes #1467

Closed
golaz opened this issue Apr 27, 2017 · 144 comments · Fixed by #1487

golaz commented Apr 27, 2017

I'm afraid I have some bad news to report (or maybe I did something wrong; always a possibility).

I ran some tests with a recent version of master (c9903bd) using compset A_WCYCL1850S and resolution ne30_oECv3_ICG on Edison. I tried the new PE layouts provided by @jonbob for 143 and 265 nodes. Both 5-day tests ran successfully; unfortunately, the results diverge between the two simulations after a few time steps (based on the atm.log files). I verified that BFBFLAG is set to TRUE, so my understanding is that I should get the same results.

Here is the run_acme script:
run_acme.20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison

Output on Edison is under:
/global/cscratch1/sd/golaz/ACME_simulations/20170426.beta1_05.A_WCYCL1850S.ne30_oECv3_ICG.edison/test???
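
For reference, a minimal sketch of how the divergence can be checked from the two case directories above (case names and log file names here are illustrative, not the actual ones):

# Sketch only: case and log names are illustrative; adjust if logs are gzipped.
# 1) Confirm both cases request bit-for-bit (reproducible) coupler sums.
(cd test143 && ./xmlquery BFBFLAG)
(cd test265 && ./xmlquery BFBFLAG)

# 2) Compare the atmosphere energy diagnostics ('nstep, te' lines) step by step;
#    the first differing line shows roughly where the runs diverge.
grep 'nstep, te' test143/run/atm.log.* > te_143.txt
grep 'nstep, te' test265/run/atm.log.* > te_265.txt
diff te_143.txt te_265.txt | head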

rljacob commented Apr 27, 2017

@golaz our nightly testing has revealed a similar problem. We think it was introduced this week and are currently tracking it down (on Slack).

golaz commented Apr 27, 2017

@rljacob - good to hear that this was caught and is being tracked down.

rljacob commented Apr 27, 2017

We found the source of our testing problem (PR #1272). It has been removed from master, which will itself change answers (that was a non-BFB PR). Hopefully that also solves your problem, but we're not sure, so please try again with the latest version of master.

golaz commented Apr 27, 2017

Thanks, @rljacob and @bishtgautam. Trying now the latest version.

golaz commented Apr 28, 2017

@rljacob: I tried again with beea721, and unfortunately the results are still not BFB when comparing 133, 143, and 265 nodes.

rljacob commented Apr 28, 2017

Ok. What was the last version of master where this worked for you?

ndkeen commented Apr 28, 2017

Would it be appropriate to try this with DEBUG?

golaz commented Apr 28, 2017

According to my notes, the last time I checked was with b1c676f and it worked. But I don't routinely check, as I was assuming this was part of the standard ACME testing procedure.

mt5555 commented Apr 28, 2017

Based on redsky testing of master started 2017-04-28 03:52:49, ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed.

So there must be something subtle such that A_WCYCL1850S with resolution ne30_oECv3_ICG is not reproducible, while A_WCYCL2000 with ne30_oEC is.

golaz commented Apr 28, 2017

This is helpful. It's unlikely that this is because of 2000 vs 1850 forcing, so most likely it is due to the use of spun-up ocean and sea-ice initial conditions.

@worleyph

@golaz, I also see this in my PE layout experiments (now that I look). In particular, changing only the number of OCN processes was sufficient.

rljacob commented Apr 28, 2017

That would suggest the code that does the parallel read/initialization of the spun-up ICs for the ocean may be the issue.

rljacob commented Apr 28, 2017

Tagging @mark-petersen

jonbob commented Apr 28, 2017

@worleyph - can you check and see if the test fails for A_WCYCL1850 and ne30_oEC60to30v3? That would definitely point to something in reading the initial condition files...

@worleyph

On Titan, with current master and the Intel compiler, using

 -compset A_WCYCL2000 -res ne30_oEC

and MPI-only PE layouts that differ only in the number of MPI processes in the OCN (512 -> 256), the output in atm.log differs at 'nstep, te' 5.

rljacob commented Apr 28, 2017

That's a compset/resolution we test all the time.

ndkeen commented Apr 28, 2017

I'm trying

./create_test PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S
./create_test ERP_Ld3.ne30_oEC.A_WCYCL2000
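
As a usage note (a sketch; flag support depends on the CIME version in this tree), the machine and compiler can also be pinned in the test name, the same way the redsky_intel test above was:

# Sketch: pin machine/compiler via the test-name suffix; the test-id is arbitrary.
./create_test PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S.edison_intel --test-id bfb1467
./create_test ERP_Ld3.ne30_oEC.A_WCYCL2000.edison_intel --test-id bfb1467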

@worleyph

My version of master had a "bug fix" in ocn_comp_mct.F and ice_comp_mct.F:

 300c300
 <     call MPAS_io_set_iotype(domain % iocontext, pio_iotype)
 ---
 > !pw    call MPAS_io_set_iotype(domain % iocontext, pio_iotype)

but I also found other recent A_WCYCL cases that show non-reproducible results when changing the OCN process count (and those do not have this change).

rljacob commented Apr 28, 2017

ERP_Ld3.ne30_oEC.A_WCYCL2000.redsky_intel passed with hash c9903bd from master. ERP is supposed to change the MPI task counts in all components in the middle of a restart and test for BFB.

@worleyph

My tests all have OCN on its own nodes (as do @golaz's experiments). I am building a job with the components stacked to see whether this makes a difference.

ndkeen commented Apr 28, 2017

This issue isn't involved, right? ESMCI/cime#1433

rljacob commented Apr 28, 2017

Doubtful.

rljacob commented Apr 28, 2017

Pat, try your test with c9903bd. That version had a passing ERP test on redsky but it failed for Chris.

rljacob commented Apr 28, 2017

On Redsky, the ERP_Ld3.ne30_oEC.A_WCYCL2000 test that passed starts with everything stacked on 512 tasks. It then halves them to 256. That should find this bug if it was present in that compset/resolution.

worleyph commented May 3, 2017

My recent experiments have all been very small node-count layouts on Titan (32 nodes and 16 nodes). I did see the problem on Edison (when I went back to look) for both 173-node and 133-node PE layouts (my attempt to improve on @jonbob's work).

rljacob commented May 3, 2017

If it starts in the ocean and shows up in the atmosphere, it must be in the coupler.

For the ERP_Ld3.ne30_oEC.A_WCYCL2000 test, skybridge tests 128 and 64 MPI task counts, Redsky tests 512 and 256, and Blues tests 1024 and 512. Do those three pairs of task counts and their associated decompositions not provoke this bug?

jonbob commented May 3, 2017

@rljacob - they should. Can you compare cpl restart files instead of history files and see if that catches it? Though I agree, if it shows up in the atm it should be in cpl history as well. Any chance the test is not working as it should?
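
One way to do that comparison directly is with cprnc; a sketch, where the file names and the CCSM_CPRNC lookup are assumptions about the local setup:

# Sketch: compare coupler history (or restart) files from two runs with cprnc.
# Run from one of the case directories; file names are illustrative and the real
# ones carry the case name and a date stamp.
CPRNC=$(./xmlquery --value CCSM_CPRNC)    # path to the machine's cprnc build
$CPRNC run_a/caseA.cpl.hi.0001-01-02-00000.nc \
       run_b/caseB.cpl.hi.0001-01-02-00000.nc | tail -n 20
# The final diff_test summary reports whether the files are identical or differ.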

mark-petersen added a commit that referenced this issue May 3, 2017
This is the fix to issue #1467, where different block partitions of
the ocean did not match bit-for-bit between runs. This PR adds a halo
update within the barotropic subcycling of the ocean split-explicit
timestep. The actual cause of the problem is most likely a few
locations with a two-wide halo, where boundary noise is getting into the
domain. There may be a more elegant fix to halo creation later on, but
this PR provides a fix with the correct result. The change to the more
elegant halo fix would be bit-for-bit with this PR.

This closes #1467
@mark-petersen

See #1487. @worleyph and @jonbob, please do any further testing on that PR, which should be the same as what you have been testing today.

worleyph commented May 3, 2017

@mark-petersen , I've run out of time to do any more testing at the moment. Hopefully @jonbob and others can take a look at this. Thanks for figuring this out.

rljacob commented May 3, 2017

The cpl history actually has more fields than the restart. Pat said 32 and 64 will fail, so I'll try an ERP test with that and check the nstep values too.

rljacob commented May 3, 2017

The pairs of testing configs I mentioned above were MPI task counts. Pat, did you mean that 32 and 64 MPI tasks show a difference?

According to this comment, #1387 (comment), Edison had 960 vs. 720 MPI tasks for the ocean showing a difference. I'd like something smaller that definitely fails.

worleyph commented May 3, 2017

No, I was using 256 and 512. That doesn't mean that smaller won't show it, but this was a balance between something that would run 1 day in under 30 minutes and getting through the debug queue quickly. I didn't spend any time looking for anything smaller.

mark-petersen commented May 3, 2017

In stand-alone MPAS, the EC60to30 mesh failed to match between 8 and 16 partitions after one step. It should fail to match in ACME as well, if you want something that small. The graph files are in the repo:

inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222.part.8
inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222.part.16

That is, they failed before this PR, and match after this PR.
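
For anyone reproducing the stand-alone check at other process counts, a partition file can be generated from the same graph file with METIS (a sketch; the gpmetis executable location is machine-dependent):

# Sketch: build an MPAS block partition for N processes from the graph file.
# gpmetis writes <graphfile>.part.<N>, matching the file names listed above.
N=12
gpmetis inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222 $N
# -> inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222.part.12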

jonbob added a commit that referenced this issue May 3, 2017
…1487)

This is the fix to issue #1467, where different block partitions of
the ocean did not match bit-for-bit between runs. This PR adds a halo
update within the barotropic subcycling of the ocean split-explicit
timestep. The actual cause of the problem is most likely a few
locations with a two-wide halo, where boundary noise is getting into the
domain. There may be a more elegant fix to halo creation later on, but
this PR provides a fix with the correct result. The change to the more
elegant halo fix would be bit-for-bit with this PR.

Tested with anvil GMPAS-IAF T62_oEC60to30v3 runs using 640 and 1120
ocean pes

Fixes #1467

[non-BFB]
rljacob commented May 3, 2017

Thanks, Mark. I assume you are doing two initial runs. If the 8-pe run started with a restart from the 16-pe run, would the results be different (without the bug fix)?

@mark-petersen

Yes, my test was to run two simulations, one with 8 pes and one with 16 pes, from the same initial condition, both for a single time step, and compare the output.

In your case, you run with 16 pes and then restart with 8 pes. What are you comparing to? If it is a full-duration run at either 8 or 16 pes, they should indeed not match.

rljacob commented May 3, 2017

Suppose you run 10 steps with 16 pes and write a restart at step 5. Pick up that restart with 8 pes and run 5 steps. The values at the end of each run should match. But with the bug, they won't match?

@mark-petersen

Correct. They should: mismatch before this PR, match after.
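
Spelled out as case settings, the restart check described above looks roughly like this (a sketch only; variable names follow CIME conventions, and the exact setup/rebuild steps for a PE-layout change vary by version):

# Sketch of the manual restart check; details vary by CIME version.
# Run 1: 16 ocean tasks, 10 steps, restart written at step 5.
./xmlchange NTASKS_OCN=16
./xmlchange STOP_OPTION=nsteps,STOP_N=10,REST_OPTION=nsteps,REST_N=5
./case.submit

# Run 2: resume from the step-5 restart with 8 ocean tasks for 5 more steps.
# A PE-layout change generally requires re-running case.setup and rebuilding.
./xmlchange NTASKS_OCN=8,CONTINUE_RUN=TRUE,STOP_N=5
./case.submit

# Compare end-of-run output from the two runs (cprnc, or the 'nstep, te' lines):
# before the #1487 fix they differ; after it they should match bit for bit.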

rljacob commented May 4, 2017

One more clarification: will any of the oEC60to30 grids show this or only oEC60to30v3?

@mark-petersen

Any. @worleyph tried a previous one, and it had a BFB mismatch between different partitions. The versions differ only in how they treat single-cell-wide channels.

ndkeen commented May 4, 2017

FYI, I added this change and PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S now passes on cori-knl (it was failing the compare before).

However, PEM_Ln9.ne4_oQU240.A_WCYCL2000 still fails the compare.

@mark-petersen

Has PEM_Ln9.ne4_oQU240.A_WCYCL2000 ever passed the compare? I suspect it is a different problem than the one we just fixed here.

BTW, the documentation has the wrong comparison case. In cime/scripts/lib/CIME/SystemTests/README:

PEM    modified pe counts mpi bfb test (seq tests)
       do an initial run with default pe layout                               (suffix: base)
       do another initial run with modified pes (NTASKS_XXX => NTASKS_XXX/2)  (suffix: modpes)
       compare base and single_thread

The last word should be modpes. The same error is in cime/config/config_tests.xml (line 87), but it is correct in the actual test implementation, cime/scripts/lib/CIME/SystemTests/pem.py.

rljacob commented May 4, 2017

Yes, that would be a different problem. What machine, compiler, and hash did that test fail on?

ndkeen commented May 4, 2017

The failure is on cori-knl with hash beea721. The only changes I've made are the ones above.

PEM_D_Ld3.T62_oQU120.CMPASO-NYF passes (as it always has).

And I'm now seeing that PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S passes where it was failing before the change.

Actually, I did make one other change: I added an entry for cori-knl with a more reasonable PE layout for the ne30 case (otherwise the default was only about 4 nodes). I used one similar to Edison's, updated slightly to account for 64 cores per node.

+  <grid name="a%ne30np4">
+    <mach name="cori-knl">
+      <pes compset="CAM5.+CLM45.+MPASCICE.+MPASO.+MOSART.+SGLC.+SWAV" pesize="any">
+        <comment>"first try"</comment>
+        <ntasks>
+          <ntasks_atm>5400</ntasks_atm>
+          <ntasks_lnd>640</ntasks_lnd>
+          <ntasks_rof>640</ntasks_rof>
+          <ntasks_ice>4800</ntasks_ice>
+          <ntasks_ocn>3600</ntasks_ocn>
+          <ntasks_glc>640</ntasks_glc>
+          <ntasks_wav>4800</ntasks_wav>
+          <ntasks_cpl>4800</ntasks_cpl>
+        </ntasks>
+        <nthrds>
+          <nthrds_atm>1</nthrds_atm>
+          <nthrds_lnd>1</nthrds_lnd>
+          <nthrds_rof>1</nthrds_rof>
+          <nthrds_ice>1</nthrds_ice>
+          <nthrds_ocn>1</nthrds_ocn>
+          <nthrds_glc>1</nthrds_glc>
+          <nthrds_wav>1</nthrds_wav>
+          <nthrds_cpl>1</nthrds_cpl>
+        </nthrds>
+        <rootpe>
+          <rootpe_atm>0</rootpe_atm>
+          <rootpe_lnd>4800</rootpe_lnd>
+          <rootpe_rof>4800</rootpe_rof>
+          <rootpe_ice>0</rootpe_ice>
+          <rootpe_ocn>5440</rootpe_ocn>
+          <rootpe_glc>4800</rootpe_glc>
+          <rootpe_wav>0</rootpe_wav>
+          <rootpe_cpl>0</rootpe_cpl>
+        </rootpe>
+      </pes>
+    </mach>
+  </grid>
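
The same layout can also be applied to an existing case with xmlchange instead of editing config_pes.xml; a minimal sketch (abbreviated to a few components, and the setup/build commands depend on the CIME version):

# Sketch: apply part of the layout above to an existing case.
./xmlchange NTASKS_ATM=5400,NTASKS_OCN=3600,NTASKS_ICE=4800,NTASKS_CPL=4800
./xmlchange ROOTPE_OCN=5440,ROOTPE_LND=4800,ROOTPE_ROF=4800,ROOTPE_GLC=4800
./case.setup --reset && ./case.build    # PE-layout changes need setup and a rebuild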

rljacob commented May 4, 2017

@ndkeen, OK. Thanks for finding PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S! I was looking for a test that could catch this bug.

ndkeen commented May 4, 2017

Note: a run just finished on Edison (after 4 days in the queue), and I updated an old comment where I was trying to organize this: the GNU build of PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S is failing the compare. That was before this change. Should I try it again with the change?

rljacob commented May 4, 2017

Yes please.

jonbob added a commit that referenced this issue May 4, 2017
This is the fix to issue #1467, where different block partitions of
the ocean did not match bit-for-bit between runs. This PR adds a halo
update within the barotropic subcycling of the ocean split-explicit
timestep. The actual cause of the problem is most likely a few
locations with a two-wide halo, where boundary noise is getting into the
domain. There may be a more elegant fix to halo creation later on, but
this PR provides a fix with the correct result. The change to the more
elegant halo fix would be bit-for-bit with this PR.

Tested with anvil GMPAS-IAF T62_oEC60to30v3 runs using 640 and 1120
ocean pes

Fixes #1467

[non-BFB]
ndkeen commented May 8, 2017

OK, verified that this fix also allows PEM_Ld3.ne30_oECv3_ICG.A_WCYCL1850S to pass on Edison with GNU.

jgfouca added a commit that referenced this issue Jun 2, 2017
env_mach_specific.xsd schema was incomplete

The env_mach_specific.xsd file introduced yesterday was incomplete and caused errors on some platforms.
Test suite: scripts_regression_tests.py (on cheyenne)
Test status: bit for bit