Restart reproducibility broken for GSL suite - fix and add test to rt.conf #703

Closed
climbfuji opened this issue Jul 22, 2021 · 5 comments · Fixed by #937
Labels
bug Something isn't working

Comments

@climbfuji
Collaborator

Description

Once again, the updates to the authoritative ufs-weather-model repository and its submodules broke the restart reproducibility for the GSL suite (FV3_GSD_v0). This happens every few months. We need to add restart tests (that already exist in rt_ccpp_dev.conf) to rt.conf to catch these issues at the time the "offending" code gets tested and merged.

We should consider updating this restart test, which currently runs 0-48h for the continuous run, 0-24h for coldstart and 24-48h for restart (called warmstart; name should be changed, too).
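For readers unfamiliar with the test layout, the idea behind the three runs can be sketched with a toy model. This is only an illustration under stated assumptions: `step`, `run`, and `checksum` are hypothetical stand-ins, not part of rt.sh or the ufs-weather-model, and the "incomplete restart" case is a made-up example of how a failure shows up, not a diagnosis of this issue.

```python
import hashlib
import struct


def step(state):
    """Advance a toy two-variable model by one 'hour' (hypothetical stand-in)."""
    t, q = state
    return (t + 0.1 * q, 0.99 * q + 0.01 * t)


def run(state, hours):
    """Integrate the toy model for the given number of hours."""
    for _ in range(hours):
        state = step(state)
    return state


def checksum(state):
    """Bitwise checksum of the state, so any difference in the last bit is caught."""
    return hashlib.sha256(struct.pack("<2d", *state)).hexdigest()


initial = (280.0, 5.0)

# Continuous run: 0-48h in one go.
continuous = run(initial, 48)

# Coldstart leg: 0-24h, whose final state plays the role of the restart files.
restart_state = run(initial, 24)

# Restart leg: 24-48h, started from the saved state.
restarted = run(restart_state, 24)

# Restart reproducibility: both paths must give bitwise identical results.
reproducible = checksum(continuous) == checksum(restarted)

# Hypothetical failure mode: one field is not carried through the restart
# and falls back to its initial value.
incomplete_restart = (restart_state[0], initial[1])
lossy_reproducible = checksum(continuous) == checksum(run(incomplete_restart, 24))

print("complete restart reproducible:  ", reproducible)
print("incomplete restart reproducible:", lossy_reproducible)
```

The comparison is deliberately bitwise rather than tolerance-based, since that is what a reproducibility test is meant to guarantee.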

To Reproduce:

On Hera, with Intel or GNU:

./rt.sh -l rt_ccpp_dev.conf -c -e 2>&1 | tee rt_ccpp_dev_create.log
./rt.sh -l rt_ccpp_dev.conf -m -e 2>&1 | tee rt_ccpp_dev_verify.log
climbfuji added the bug label on Jul 22, 2021
@MinsukJi-NOAA
Contributor

Is it fv3_gsd_warmstart that is failing?

@climbfuji
Collaborator Author

> Is it fv3_gsd_warmstart that is failing?

Yes

@climbfuji
Collaborator Author

It looks like this:

baseline dir = /scratch1/NCEPDEV/stmp4/Dom.Heinzeller/FV3_RT/REGRESSION_TEST_GSL_DEVELOP_GNU/fv3_gsd_repro
working dir  = /scratch1/NCEPDEV/stmp2/Dom.Heinzeller/FV3_RT/rt_235124/fv3_gsd_warmstart_repro
Checking test 005 fv3_gsd_warmstart results ....
 Comparing sfcf027.tile1.nc .........OK
 Comparing sfcf027.tile2.nc .........OK
 Comparing sfcf027.tile3.nc .........OK
 Comparing sfcf027.tile4.nc .........OK
 Comparing sfcf027.tile5.nc .........OK
 Comparing sfcf027.tile6.nc .........OK
 Comparing sfcf048.tile1.nc .........OK
 Comparing sfcf048.tile2.nc .........OK
 Comparing sfcf048.tile3.nc .........OK
 Comparing sfcf048.tile4.nc .........OK
 Comparing sfcf048.tile5.nc .........OK
 Comparing sfcf048.tile6.nc .........OK
 Comparing atmf027.tile1.nc .........OK
 Comparing atmf027.tile2.nc .........OK
 Comparing atmf027.tile3.nc .........OK
 Comparing atmf027.tile4.nc .........OK
 Comparing atmf027.tile5.nc .........OK
 Comparing atmf027.tile6.nc .........OK
 Comparing atmf048.tile1.nc .........OK
 Comparing atmf048.tile2.nc .........OK
 Comparing atmf048.tile3.nc .........OK
 Comparing atmf048.tile4.nc .........OK
 Comparing atmf048.tile5.nc .........OK
 Comparing atmf048.tile6.nc .........OK
 Comparing RESTART/coupler.res .........OK
 Comparing RESTART/fv_core.res.nc ............SKIP for gnu compilers
 Comparing RESTART/fv_core.res.tile1.nc .........OK
 Comparing RESTART/fv_core.res.tile2.nc .........OK
 Comparing RESTART/fv_core.res.tile3.nc .........OK
 Comparing RESTART/fv_core.res.tile4.nc .........OK
 Comparing RESTART/fv_core.res.tile5.nc .........OK
 Comparing RESTART/fv_core.res.tile6.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile1.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile2.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile3.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile4.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile5.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile6.nc .........OK
 Comparing RESTART/fv_tracer.res.tile1.nc .........OK
 Comparing RESTART/fv_tracer.res.tile2.nc .........OK
 Comparing RESTART/fv_tracer.res.tile3.nc .........OK
 Comparing RESTART/fv_tracer.res.tile4.nc .........OK
 Comparing RESTART/fv_tracer.res.tile5.nc .........OK
 Comparing RESTART/fv_tracer.res.tile6.nc .........OK
 Comparing RESTART/sfc_data.tile1.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile2.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile3.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile4.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile5.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile6.nc ............ALT CHECK......NOT OK
 Comparing RESTART/phy_data.tile1.nc .........OK
 Comparing RESTART/phy_data.tile2.nc .........OK
 Comparing RESTART/phy_data.tile3.nc .........OK
 Comparing RESTART/phy_data.tile4.nc .........OK
 Comparing RESTART/phy_data.tile5.nc .........OK
 Comparing RESTART/phy_data.tile6.nc .........OK

  0: The total amount of wall time                        = 800.019065

Test 005 fv3_gsd_warmstart FAIL

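That means the forecast output of the runs is still identical (all sfcf and atmf files compare OK). Only the RESTART/sfc_data files differ: the checksums for the arrays weasd and tiice are different. See the screenshot for an example.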

[Image: Screen Shot 2021-07-22 at 11 53 53 AM]
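With this many lines, it can help to pull the non-OK files out of the log automatically. A minimal sketch, assuming the comparison lines always have the `Comparing <file> ....<status>` shape seen above; the `summarize` helper and the embedded sample are hypothetical and not part of rt.sh:

```python
import re

# Matches lines such as " Comparing RESTART/sfc_data.tile1.nc ......ALT CHECK......NOT OK".
# Group 1 is the file name, group 2 is everything after the first run of dots.
LINE_RE = re.compile(r"^\s*Comparing\s+(\S+)\s+\.+(.*)$")

# A few lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
Checking test 005 fv3_gsd_warmstart results ....
 Comparing sfcf027.tile1.nc .........OK
 Comparing RESTART/fv_core.res.nc ............SKIP for gnu compilers
 Comparing RESTART/sfc_data.tile1.nc ............ALT CHECK......NOT OK
 Comparing RESTART/phy_data.tile1.nc .........OK
Test 005 fv3_gsd_warmstart FAIL
"""


def summarize(log_text):
    """Group compared files by outcome: passed, skipped, or failed."""
    summary = {"passed": [], "skipped": [], "failed": []}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a comparison line
        name, tail = match.group(1), match.group(2)
        # The final status is whatever follows the last run of dots.
        status = re.split(r"\.{2,}", tail)[-1].strip()
        if status == "OK":
            summary["passed"].append(name)
        elif status.startswith("SKIP"):
            summary["skipped"].append(name)
        else:
            summary["failed"].append(name)
    return summary


summary = summarize(SAMPLE_LOG)
for name in summary["failed"]:
    print("NOT OK:", name)
```

Run against the full log, this should list only the six RESTART/sfc_data tiles, which matches what the screenshot shows.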

@climbfuji
Collaborator Author

This may be due to issue NOAA-EMC/fv3atm#348, which will be fixed in my current round of PRs to the authoritative repositories (not sure though, need to check more thoroughly).

@climbfuji
Collaborator Author

Fixing NOAA-EMC/fv3atm#348 removes the differences in variable weasd, as expected. Differences in tiice are still there.
