VP solver robustness issues ("bad departure points") (was: Verify b4b-ness of different MPI decompositions for the VP solver / performance evaluation of repro-vp branch)
#40
Tried setting |
Tried with |
OK, looking at the differences between 1x1 and {2x1, 4x1, 8x1}: there are some differences, bigger for 4x1 and 8x1.
|
With 5000 nonlinear iterations, differences are in the same range for 2, 4, 8 procs (vs 1), i.e. [-1E-6, 1E-6]. |
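A minimal sketch of the kind of comparison behind the difference ranges quoted above; the helper and the field arrays are hypothetical stand-ins (real runs would read a field like `aice` from the two decompositions' output files), not code from the test scripts.

```python
# Hypothetical sketch: compute the signed (min, max) difference range of
# a field between two decompositions; arrays here are fabricated stand-ins.
import numpy as np

def diff_range(a, b):
    """Signed (min, max) of a - b, as used to compare two runs."""
    d = np.asarray(a) - np.asarray(b)
    return float(d.min()), float(d.max())

# stand-ins for a field like aice from the 1x1 and 2x1 runs
field_1x1 = np.linspace(0.0, 1.0, 100)
rng = np.random.default_rng(1)
field_2x1 = field_1x1 + rng.normal(scale=1e-7, size=field_1x1.shape)

lo, hi = diff_range(field_1x1, field_2x1)
print(lo, hi)  # both within roughly [-1e-6, 1e-6]
```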
note: this test runs 1 day, and of these 24 time steps, only 4 need over 500 iterations to reach |
With interestingly :
Note that we do not nearly reach 5000 iterations with this value of |
To get the number of iterations to reach the required tolerance:
this is with |
and now aice and hi start to not be b4b, but are still small |
|
With
|
With `precond='diag'`, with thermo, with transport, we get a model abort:
Weird, as I did this test two years ago (#33 (comment)), although with @JFLemieux73 I'm keeping a record of my MPI experiments in this issue if you want to stay in the loop. |
OK, it's because I forgot to also re-enable ridging; the model did not like that (is that expected?...) EDIT: after discussing with JF, yes, it is expected; convergence issues can cause that. |
OK, with ridging, advection, and transport, reltol=1E-12, precond diag:
|
Same with pgmres:
I don't think it's only the preconditioner, since these results are similar to those with the 'diag' precond. |
So I did side-by-side, step-by-step debugging of 1x1 vs 2x1. The values are the same on both sides until the first normalization in the FGMRES algorithm. Since we do a global sum of different numbers, in a different order (numbers that mathematically sum to the same result on all decompositions), floating point arithmetic gives us a different norm (of the residual), and we then propagate that difference to the whole vector by normalizing. So in the end it is not surprising that we get different results. We will run a QC test of different decompositions against each other to ensure we get the same climate. |
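A toy illustration (not CICE code) of the effect described above: floating-point addition is not associative, so summing the same local contributions in a different order, as different MPI decompositions do, can round to a different result.

```python
# Sum the same set of doubles in two different orders, mimicking the
# per-process summation order of two MPI decompositions.
import random

random.seed(1234)
# contributions with widely varying magnitudes, as in a global sum
values = [random.uniform(-1.0, 1.0) * 10.0**random.randint(-8, 8)
          for _ in range(10_000)]

order_a = sum(values)       # one decomposition's summation order
shuffled = values[:]
random.shuffle(shuffled)
order_b = sum(shuffled)     # another decomposition's order

print(order_a == order_b)   # typically False
print(abs(order_a - order_b))
```

Both orders sum exactly the same numbers; any nonzero difference is purely rounding, which is then amplified by the normalization step in FGMRES.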
Mixed results: 80x1 vs 40x1:
40x1 vs 24x1:
80x1 vs 24x1:
|
OK, that was a false alarm: the 24x1 run hit the walltime and was killed. I re-ran the 24x1 with a longer walltime and both comparisons (with 40x1 and 80x1) now pass. |
I put some more thought into the problem of reproducibility for the global sums, after a comment by @dupontf regarding performing the global sum using quadruple precision. It turns out we already have that capability in CICE, and even better algorithms: https://cice-consortium-cice.readthedocs.io/en/master/developer_guide/dg_other.html?highlight=reprosum#reproducible-sums I looked more closely at the code and realized I could leverage this capability with only slight modifications. With these modifications done, running 1x1 and 2x1 side by side, I can verify that the global sums done in the dynamics solver are the same on both sides, at least for this configuration:
With these settings the restarts are b4b! Note that I had to also add |
Also passes (b4b) after 1 day (24 time steps) |
And as expected, with |
And it passes with |
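The idea behind the reproducible sums used above can be illustrated on a small scale. This is only an analogy: CICE's reprosum implements a fixed-point integer-vector algorithm, not Python's `math.fsum`, but both share the key property that an exactly rounded sum is independent of summation order, so every decomposition gets the bit-for-bit same result.

```python
# Naive summation is order-dependent; an exactly rounded sum
# (math.fsum here) is not.
import math
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) * 10.0**random.randint(-6, 6)
          for _ in range(5_000)]
shuffled = values[:]
random.shuffle(shuffled)

naive_equal = (sum(values) == sum(shuffled))              # may be False
exact_equal = (math.fsum(values) == math.fsum(shuffled))  # always True

print(naive_equal)
print(exact_equal)
```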
So in preparation for a PR with these changes (CICE-Consortium/CICE@main...phil-blain:CICE:repro-vp), I'm noticing the new code is noticeably slower than the old. EDIT: the original version is https://github.com/phil-blain/CICE/commits/repro-vp@%7B2022-07-14%7D This is a little bit surprising ...
Note that this is without
|
OK, so I played with Intel Trace Analyzer and Collector (ITAC) and Intel VTune, for both versions (old and new) of the code, following this tutorial: First, running Application Performance Snapshot reveals both versions are MPI-bound and have very poor vectorization (note that both runs are 40x1): new This reveals, however, that it's not only the added communications that slow down the new code, since "MPI time" is 31%, vs. 43% for the old code. I then ran the VTune "HPC Performance Characterization" analysis for both versions and used the "Compare" feature. This is a ranking of the hotspots by time difference between the new and old versions (right column,
|
Getting back to the performance regression after finally getting rid of all the bugs (famous last words) in my new code (see #39 (comment) and following comments). I re-ran the QC test cases on
The listings show that at least
EDIT I re-ran this from the restart of 2008-01-01, and tweaking the namelist so restarts are written every month. It failed at the same date in the same way. I re-ran it from the restart of 2008-11-01, it again failed at the same date in the same way. I set I changed |
I checked back in my case directory for my earlier long run ( This second time, I also got "bad departure point", at the same exact location (
note that the local indices are different since that run was 40x1... EDIT: this is reassuring in a way, because it means:
|
Thinking about it, this probably means that it's something in the forcing, as the QC test cycles the 2005 forcing: $ \grep -E 'fyear_init|ycycle' configuration/scripts/options/set_nml.qc
fyear_init = 2005
ycycle = 1 |
In DDT, the range of
But the range of
|
OK, I refactored again following comments in CICE-Consortium#763. New timings are very encouraging, no changes in non I've re-run the QC test with this new version of the code (CICE-Consortium/CICE@main...phil-blain:CICE:repro-vp), in the two modes:
ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc.test.221006-115019: "bad departure points" on 2006-04-15:
ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc_reprosum.test.lsum8.221006-115329: "bad departure points" on 2008-11-13 (same date as above):
For both cases, bump |
In both cases, restarting from the time step before the abort, and setting |
In both cases, the cell where it fails is right on the ice edge. |
In both cases, bumping |
In both cases, dropping the linear tolerance ( |
The change of default parameters was implemented in CICE-Consortium#774. I'm keeping this open since the underlying robustness issue is not solved. Discussing with JF, it seems the preconditioner is probably not doing a good enough job, which leads to the FGMRES solver having trouble converging... |
Tried a new suite with dynpicard, no OpenMP: none of the three MPI cases are b4b with the 1x1 case.