Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update OMP #680

Merged
merged 15 commits into from
Feb 18, 2022
Merged

Update OMP #680

merged 15 commits into from
Feb 18, 2022

Conversation

apcraig
Copy link
Contributor

@apcraig apcraig commented Jan 19, 2022

PR checklist

  • Short (1 sentence) summary of your PR:
    Update OpenMP implementation

  • Developer(s):
    apcraig

  • Suggest PR reviewers from list in the column to the right.

  • Please copy the PR test results link or provide a summary of testing completed below.
    The evp1d has changed answers, some unreliable OMP directives were commented out until further debugging can be done. Tested on several platforms/compiler, Cheyenne, Narwhal, Onyx, Mustang, Gaffney, Cori, Conda, Izumi (which was having system issues),

  • https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#c8b9f17ff8f08536da0e3bc33230f89adc430a22

  • https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#d56737c27d4aa0b114b89d1d2dcba97f0319ec59

  • https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#0d4c2e39f4308c89a1905b7e05ba2dd723828b2d

  • https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#6bb8e22efa8fec5846d5434c519d5274996a9a6e

  • How much do the PR code changes differ from the unmodified code?

    • bit for bit (except evp1d)
    • different at roundoff level
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?

    • Yes
    • No
  • Does this PR add any new test cases?

    • Yes, add omp_suite which tests bit-for-bit with different thread counts
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)

    • Yes
    • No, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please provide any additional information or relevant details below:

  • Update OpenMP directives as needed including validation via new omp_suite. Fixed OpenMP in dynamics.

    • Add SCHEDULE(runtime) to many OMP directive to allow env to set OpenMP schedule. Add ICE_OMPSCHED to cice.settings to allow the schedule to be user set more easily. Add set_env test options for a few schedule choices.
    • Add output to log file diagnosing thread schedule on master task
    • Update OMP_STACKSIZE on many platforms to allow OpenMP to run properly. Default setting was too small causing problems with OpenMP even working
    • VP dynamics still has some debugging to be done (Investigate decomp_suite failures with dynpicard option phil-blain/CICE#39, VP exact restart and other nonBFB problems #518), so OpenMP validation was defered
    • A problem was detected in EVP-1d, in ice_dyn_evp_1d_kernel and OpenMP was turned off there. An issue will be created.
  • Refactored eap puny/pi lookups to improve scalar performance

  • Update Tsfc implementation to make sure land blocks don't set Tsfc to freezing temp (want it set to zero for consistency across different decompositions/restart files). Particularly a problem in upwind advection where icepack_compute_tracers is called for every gridpoint (unclear why, but calling it only where tmask is true causes problems).

  • Update timers

    • Add ability to use tmp timers, partly commented out.
    • Added timer_stats namelist variable to turn on timer stats
    • Fixed some timer start/stop calls in iblk loops (need to include iblk argument)
    • Add timer updstate (update state calls)
  • Ported to Narwhal intel, gnu, cray, aocc

  • Add perf_suite (to evaluate CICE performance) and omp_suite (to validate OpenMP)

  • Refactored the compare feature in the test scripts and renamed some tests as a result

    • Add ICE_BFBTYPE to cice.settings to define compare process (log, restart, qc, or combinations)
    • Get rid of logbfb test (change to smoke with cmplog option)
    • Get rid of qcchkf and qcnonbfb (use qcchk with qcchkf option)
    • Add several new test options
  • Update documentation

- Fix call to timers in block loops
- Fix some OMP Private variables
- Test OMP Scheduling, add SCHEDULE(runtime) to some OMP loops
- Review column and advection OMP implementation
- ADD OMP_TIMERS CPP option (temporary) to time threaded sections
- Add timer_tmp timers (temporary)
- Add omp_suite.ts test suite
- Add ability to set OMP_SCHEDULE via options (ompscheds, ompscheds1, ompschedd1)
- Add timer_stats namelist to turn on extra timer output information
- Add ICE_BFBTYPE and update bit-for-bit comparison logic in scripts
- Update qc and logbfb testing
- Remove logbfb and qchkf tests, add cmplog, cmplogrest, cmprest set_env files to set ICE_BFBTYPE
- Update documentation
Update documentation
…ting testing. Specifically fix upwind advection.

- Comment out OMP in ice_dyn_evp_1d_kernel, was producing non bit-for-bit results with different thread counts
@apcraig
Copy link
Contributor Author

apcraig commented Jan 19, 2022

Closes #114.

Copy link
Contributor

@dabail10 dabail10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am impressed by the amount of work here. I can't see any obvious problems here. I am a bit concerned about the Tsfc values on land. However, I agree with what you have done.

@apcraig
Copy link
Contributor Author

apcraig commented Jan 20, 2022

CICE_OMP_perf_220119.xlsx

CICE performance on several machines is summarized in the attached Excel spreadsheet for this version of the code. In particular, the three plots show Dynamics, Advection, and Column times for a fixed problem (gx1, 64 cores, 16x16 block size) varying tasks and threads (always running on 64 cores). OpenMP is working reasonably OK. It's still not generally faster than all MPI (but is in a few cases and is close in many others). This also shows the relative difference across a few machines and compilers. Things are working better now than they were before, especially now that dynamics OMP is turned on.

The other thing to note in the attached spreadsheet is the difference in time between (scheds, scheds1, schedd1). These are different OMP schedules. scheds="STATIC", scheds1="STATIC,1", schedd1="DYNAMIC,1". You can see cases where the dynamic scheduling overhead is not good (onyx_intel) and other cases where the dynamic scheduling does relatively better. In general, the scheduling has a greater impact on advection and column timers probably because load balance across work is more important in those parts of the code.

A couple other notes. 64x0 is 64 MPI tasks with threading turned off in compilation. 64x1 is 64 MPI tasks with 1 thread per task compiled with threading on.

@phil-blain
Copy link
Member

Just a note that I'm on vacation this coming week, but I'd like to take a look at this when I'm back.

Copy link
Contributor

@eclare108213 eclare108213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I'm impressed. This looks like an enormous amount of work! I'd really like to know what your workflow was for identifying OMP problems and fixing them.

Just a few questions, including this one: you refactored the compare feature in the test scripts and renamed some tests as a result -- maybe I missed it, but I didn't see relevant changes to the documentation. Are your mods, e.g. for running QC tests, invisible for users?

! timer_tmp2, &! for temporary timings
! timer_tmp3, &! for temporary timings
! timer_tmp4, &! for temporary timings
! timer_tmp5, &! for temporary timings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you planning to clean these tmp timers up later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I go back and forth on this. I was thinking I'd just leave them there but commented out. I find when I'm debugging any performance thing, I end up implementing the same temporary timers and dropping them in where I need a finer detail look. Then removing them at the end. It might be nice to just have them readily available anytime anyone needs to use them. I can remove them if that's preferred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ambivalent. Sometimes it's nice to have commented-out code sitting around for ease of use. Comments/opinions from others are welcome!

! Diagnose OpenMP thread schedule, force order in output
!-----------------------------------------------------------------

#if defined (_OPENMP)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this done in ice_grid.F90?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question. It has to be done after the decomposition is established. There is no obvious place for something like this in the code, but I'm happy to move it elsewhere. I wanted to keep it out of the driver code because then it would need to be duplicated in all of them. And it needed to be somewhere that is always called, so it couldn't be a "first call" thing in the column physics or dynamics. I am open to moving this to another part of the code. I think it's handy to have this information diagnosed, there is no way to tell otherwise. In fact, I added this stuff because I wanted to verify the OMP schedule was working. That's how I learned it wasn't and how we need to add the SCHEDULE(runtype) argument to the OMP pragma. This is very confusing. I pointed the problems out to the cheyenne help desk since they were helping me with some of this. I proposed a change to the OpenMP standards to fix this and preserve backwards compatibility, but don't know if it'll go anywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that it belonged in the module where decompositions are done (ice_blocks?). But I guess that necessitates calling it from every driver.

Interesting that you proposed a change to the OpenMP standards! We should track that in Consortium reporting docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a link for your proposition @apcraig ? I'd like to follow any developments:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forwarded my input to NCAR who has a person on the OpenMP committee. Nothing more formal than that.

@@ -33,6 +33,8 @@ module load cray-hdf5/1.12.0.4
setenv NETCDF_PATH ${NETCDF_DIR}
limit coredumpsize unlimited
limit stacksize unlimited
setenv OMP_STACKSIZE 128M
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you know what to set OMP_STACKSIZE to? Do you provide guidance on this in the user docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add that information to the documentation (see https://github.com/CICE-Consortium/CICE/pull/680/files#diff-ea684240af50d1d378a32f8805b71db3773d3eb5df8342f1bf262637bc929529). In fact, that was one of the things that prevented the OpenMP from working on several platforms. Many platforms had the default set to 0 or 4M which caused seg faults in OpenMP. When I increased the value, things worked. There wasn't always a problem in the OpenMP implementation in CICE that caused failures. Cheyenne has the default set to 64M which is why things were often working there but not elsewhere and gave me a clue. This took a while to figure out. I don't know what value is the minimum needed for CICE. It will depend on the problem size, number of tasks and threads, decomposition, etc. I'm pretty sure the tests I was running will work with 16M. So I set the defaults higher than that on some of the platforms where the default was too small.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it set to at least 16M for all currently supported machines? Why are some machines as high as 128M, if 16M is enough for the current test suites?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only changed the setting for the machines I have access to and tested. I'm reluctant to set values for machines I don't have access to. They may already have reasonable defaults and/or settings that the systems folks prefer you use. I recognize that I set some machines to 32M, 64M, and 128M. That was not a careful process. For other machines, I just left the default value because it worked. I tried different values on different machines to see what would work and if it worked, I just stuck with it. I did not try to figure out the minimum value needed for testing or by machine/compiler. The compilers/implementation may also change in the future, so it makes sense to set it above the minimum required value anyway. All the values are pretty reasonable as is. If the OMP_STACKSIZE needed to be set so high that if affected other parts of the memory use, then we'd want to be more careful. As it is, I think we have a pretty large range that OMP_STACKSIZE could be set without affecting other aspects of the memory/performance (say 16M to 256M) for our testing. I don't think we need to be particularly precise here.

@TillRasmussen
Copy link
Contributor

@apcraig This looks really nice. Approved from here.

@apcraig
Copy link
Contributor Author

apcraig commented Jan 24, 2022

I will review the qc documentation. You're right that I probably need to make a few minor updates there.

With regard to workflow, it was a bit of everything. I was planning to use debugging/performance tools if needed. But I started with just testing things carefully. I added some new test suites to check the OpenMP results. I reviewed each OMP pragma line carefully (for private variables) and turned them on and off to check we were getting identical results. I found problems (like the timer iblk implementation) as I went and fixed those things, that helped quite a bit. I tested on multiple platforms and compilers which is a big pain. But that was a key. I used "setenv OMP_DISPLAY_ENV TRUE" to write out the OMP settings at run time. This helped me pinpoint differences in defaults for machines where the OpenMP was working and machines where it wasn't. There was a lot of trial and error and confusion. I spent a good part of 3 weeks on this, so it took a bit of time. It was good I had a window to focus. I don't think I could have made the same progress working part-time on it. Lots of googling too. In the end, I didn't need the debugging and performance tools. Our timers were working well. I also didn't know how I would detect missing private OMP variables via a debugger. I guess you track where things diverge and then start digging there, but understanding that the problem was partly the default OpenMP setting was an important piece why we were getting inconsistent results on different machines and compilers.

@apcraig
Copy link
Contributor Author

apcraig commented Jan 24, 2022

I reviewed the qc documentation and it's fine. It's only the automated testing via test suites that is changed. The manual process is not.

@apcraig
Copy link
Contributor Author

apcraig commented Jan 24, 2022

@dabail10 regarding the Tsfc values on land. This was a problem with testing, not results. The exact restart comparisons were failing in some cases with different decomps/tasks/threads because Tsfc was being set to the freezing temperature on land in some cases. With different decomps/tasks/threads, the land block elimination changes, and eliminated land blocks had values of zero for Tsfc but freezing temperature for land that was not eliminated making the restarts not identical across different test cases. It turns out, this was caused primarily by the upwind advection scheme and showed up in just a few tests. I tried to simply stop computation of the upwind advection on the land cells, but that changed the answers (which is scary, but I decided not to dig deeper). So I needed to mask the Tsfc in the upwind advection after the computation was complete. That's the change in line 1745 of ice_dyn_transport.F90. The other changes are not necessarily required, but I added them for good measure. There could be cases where a restart file has non-zero Tsfc on the land in the restart file and then values are never reset to zero. Anyway, the eliminated land always has values of zero when restart/history files are written. For the most part, all fields in CICE on land are also zero, so that's fine. Tsfc is one field where this wasn't entirely the case.

@rgrumbine
Copy link
Contributor

rgrumbine commented Jan 25, 2022 via email

@TillRasmussen
Copy link
Contributor

The issue with the OMP within the 1d has been tracked. See #681.

@eclare108213
Copy link
Contributor

@phil-blain just pinging you again on this one, when you have a chance. thx!

@apcraig
Copy link
Contributor Author

apcraig commented Feb 10, 2022

It would be good to merge this today or tomorrow if we can..... Would like to get the weekend testing operating on these updates as well as possible merge to the C/CD branch. Thanks.

@phil-blain
Copy link
Member

Sorry, I'll review it first thing tomorrow morning.

Copy link
Member

@phil-blain phil-blain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a thorough look at most of the commits, and skimmed through the following ones:

It mostly looks good, thanks a lot for this work. It's still on my list to go back and fix the noted issues in the VP module, including OpenMP validation. This here PR should be a good start.

Comment on lines 479 to 483
!--- tcraig, tcx, this omp loop leads to a seg fault in gnu
!--- need to check private variables and debug further
!$TCXOMP PARALLEL DO PRIVATE(iblk,ilo,ihi,jlo,jhi,this_block,n,m, &
!$TCXOMP indxinc,indxjnc,mmask,tmask,istop,jstop,l_stop)
!$OMP PARALLEL DO PRIVATE(iblk,ilo,ihi,jlo,jhi,this_block,n, &
!$OMP indxinc,indxjnc,mmask,tmask,istop,jstop,l_stop) &
!$OMP SCHEDULE(runtime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it still lead to a segfault with GCC ? if not the comment could be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those comment lines are already removed. Wonder why you're not seeing that. Check the current file diffs again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I realized that after submitting my review, It's because I'm reviewing commit by commit... sorry for the noise.

Comment on lines 691 to 696
!--- need to check private variables and debug further
!$TCXOMP PARALLEL DO PRIVATE(iblk,i,j,ilo,ihi,jlo,jhi,this_block,n,m, &
!$TCXOMP edgearea_e,edgearea_n,edge,iflux,jflux, &
!$TCXOMP xp,yp,indxing,indxjng,mflxe,mflxn, &
!$TCXOMP mtflxe,mtflxn,triarea,istop,jstop,l_stop)
!$OMP PARALLEL DO PRIVATE(iblk,i,j,ilo,ihi,jlo,jhi,this_block,n, &
!$OMP edgearea_e,edgearea_n,edge,iflux,jflux, &
!$OMP xp,yp,indxing,indxjng,mflxe,mflxn, &
!$OMP mtflxe,mtflxn,triarea,istop,jstop,l_stop) &
!$OMP SCHEDULE(runtime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, if it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. I check both the file diffs in the PR and the code in my sandbox.

Comment on lines 166 to 168
#ifdef OMP_TIMERS
use OMP_LIB
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it work with omp_lib (I'm thinking it should) ? that would make case more consistent with our own modules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code does not exist in this version. I'm confused what's happening here.

Comment on lines 386 to 388
#ifdef OMP_TIMERS
call ice_barrier()
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curous why we need a barrier at a few places ? this is an MPI barrier, no? How does it interact with OpenMP ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@@ -386,6 +480,7 @@ subroutine ice_step
call accum_hist (dt) ! history file
call ice_timer_stop(timer_hist) ! history

call ice_barrier()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this one is not protected by OPM_TIMERS. is that what we want ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@@ -43,7 +41,9 @@ set ICE_RUNLOG_FILE = "cice.runlog.\${stamp}"
#--------------------------------------------
cd \${ICE_RUNDIR}

setenv OMP_NUM_THREADS ${nthrds}
setenv OMP_NUM_THREADS \${ICE_NTHRDS}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On some systems, it's only via the job scheduler that you can specify the number of OpenMP threads, changing OMP_NUM_THREADS at runtime has no effect. I think it's the case at least for PBS Pro, the number of threads is set from the ompthreads directive in the select statement. Maybe that would be worth to mention in the documentation ? (maybe not, just thinking out loud...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware of that. I guess that's part of setting up the port properly.

Comment on lines 1092 to 1098
| 6 | Thermo | vertical thermodynamics |
| 6 | Thermo | vertical thermodynamics, part of Column timer |
+--------------+-------------+----------------------------------------------------+
| 7 | Shortwave | SW radiation and albedo |
| 7 | Shortwave | SW radiation and albedo, part of Thermo timer |
+--------------+-------------+----------------------------------------------------+
| 8 | Ridging | mechanical redistribution |
| 8 | Ridging | mechanical redistribution, part of Column timer |
+--------------+-------------+----------------------------------------------------+
| 9 | FloeSize | flow size |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice additions!

Copy link
Contributor

@TillRasmussen TillRasmussen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.
I would like to include OMP directives in evp_1d. A QC should be run to check differences. It is binary identical when -no-vec flag is added.

@CICE-Consortium CICE-Consortium deleted a comment from dabail10 Feb 17, 2022
@eclare108213
Copy link
Contributor

@apcraig please go ahead and merge this PR whenever you are happy with it. @TillRasmussen suggested running QC to check the non-BFB-ness in evp1d, which can be done separately as part of addressing #681.

@apcraig apcraig merged commit d1e972a into CICE-Consortium:main Feb 18, 2022
apcraig added a commit to apcraig/CICE that referenced this pull request Mar 8, 2022
* update icepack, rename snwITDrdg to snwitdrdg (CICE-Consortium#658)

* Change max_blocks for rake tests on izumi (nothread). (CICE-Consortium#665)

* Fix some raketests for izumi

* fix some rake tests

* Makefile: make Fortran object files depend on their dependency files (CICE-Consortium#667)

When 'make' is invoked on the CICE Makefile, the first thing it does is
to try to make the included dependency files (*.d) (which are in fact
Makefiles themselves) [1], in alphabetical order.

The rule to make the dep files have the dependency generator, 'makdep',
as a prerequisite, so when processing the first dep file, make notices
'makdep' does not exist and proceeds to build it. If for whatever reason
this compilation fails, make will then proceed to the second dep file,
notice that it recently tried and failed to build its dependency
'makdep', give up on the second dep file, proceed to the third, and so
on.

In the end, no dep file is produced. Make then restarts itself and
proceeds to build the code, which of course fails catastrophically
because the Fortran source files are not compiled in the right order
because the dependency files are missing.

To avoid that, add a dependency on the dep file to the rules that make
the object file out of the Fortran source files. Since old-fashioned
suffix rules cannot have their own prerequisites [2], migrate the rules
for the Fortran source files to use pattern rules [3] instead. While at
it, also migrate the rule for the C source files.

With this new dependency, the builds abort early, before trying to
compile the Fortran sources, making it easier to understand what
has gone wrong.

Since we do not use suffix rules anymore, remove the '.SUFFIXES' line
that indicates which extension to use suffix rules for (but keep the
line that eliminates all default suffix rules).

[1] https://www.gnu.org/software/make/manual/html_node/Remaking-Makefiles.html
[2] https://www.gnu.org/software/make/manual/html_node/Suffix-Rules.html
[3] https://www.gnu.org/software/make/manual/html_node/Pattern-Rules.html#Pattern-Rules

* Fix multi-pe advection=none bug (CICE-Consortium#664)

* update parsing scripts to improve robustness, fix multi-pe advection=none

* Update cice script to improve performance including
minor refactoring of parse_namelist and parse_settings
to reduce cost and ability to use already setup ice_in file
from a prior case in the suite.

Added commented out timing ability in cice.setup.

Change test default to PEND from FAIL.

* fix cice.setup for case

* add sedbak implementation to support Mac sed

* s/spend/spent

* nuopc/cmeps driver updates (CICE-Consortium#668)


* add debug_model feature
* add required variables and calls for tr_snow

* Main namelist debug (CICE-Consortium#671)

* Adding method to write erroneous namelist options

* Remove erroneous comma in abort_ice for namelist check

* Added check for zbgc_nml. I missed that namelist in this file.

* Added space and colons for namelist error output

* Added space and colons for namelist error output

Co-authored-by: David A. Hebert <dhebert@nrlssc.navy.mil>

* NUOPC/CMEPS cap updates (CICE-Consortium#670)

* updated orbital calculations needed for cesm

* fixed problems in updated orbital calculations needed for cesm

* update CICE6 to support coupling with UFS

* put in changes so that both ufsatm and cesm requirements for potential temperature and density are satisfied

* update icepack submodule

* Revert "update icepack submodule"

This reverts commit e70d1ab.

* update comp_ice.backend with temporary ice_timers fix

* Fix threading problem in init_bgc

* Fix additional OMP problems

* changes for coldstart running

* Move the forapps directory

* remove cesmcoupled ifdefs

* Fix logging issues for NUOPC

* removal of many cpp-ifdefs

* fix compile errors

* fixes to get cesm working

* fixed white space issue

* Add restart_coszen namelist option

* Update NUOPC cap to work with latest CICE6 master

* nuopc,cmeps or s2s build updates

* fixes for dbug_flag

* Update nuopc2 to latest CICE master

* Fix some merge problems

* Fix dbug variable

* Manual merge of UFS changes

* fixes to get CESM B1850 compset working

* refactored ice_prescribed_mod.F90 to work with cdeps rather than the mct data models

* Fix use_restart_time

* changes for creating masks at runtime

* added ice_mesh_mod

* implemented area correction factors as option

* more cleanup

* Fix dragio

* Fix mushy bug

* updates to nuopc cap to resolve inconsistency between driver inputs and cice namelists

* changed error message

* added icepack_warnings_flush

* updates to get ice categories working

* updates to have F compset almost working with cice6 - still problems in polar regions - need to resolve 253K/cice6 versus 273K/cice5 differences

* changed tolerance of mesh/grid comparison

* added issues raised in PR

* Update CESM-CICE sync with new time manager

* Add back in latlongrid

* Add new advanced snow physics to driver

* Fix restart issue with land blocks

* Update mesh check in cap

* fix scam problems

* reintroduced imesh_eps check

* Put dragio in the namelist instead

* Remove redundant code

* Fix some indents

Co-authored-by: Mariana Vertenstein <mvertens@ucar.edu>
Co-authored-by: apcraig <anthony.p.craig@gmail.com>
Co-authored-by: Denise Worthen <denise.worthen@noaa.gov>

* Add CESM1_PIO for fill value check (CICE-Consortium#675)

* Add CESM1_PIO for fill value check

* Revert PIO_FILL_DOUBLE change for now

* - Update the namelist read to make the group order flexible. (CICE-Consortium#677)

- Remove recent update to namelist read that traps bad lines, it conflicts
  with flexibility to read groups in random order picked up by NAG.
- change print* statements to write(nu_diag,*)

* Port to Narwhal and add perf_suite (CICE-Consortium#678)

* Add narwhal intel, gnu, cray, aocc
Add perf_suite.ts

* update narwhal_cray and perf_suite

* Update OMP (CICE-Consortium#680)

* Add narwhal intel, gnu, cray, aocc
Add perf_suite.ts

* update narwhal_cray and perf_suite

* Review and update OMP implementation

- Fix call to timers in block loops
- Fix some OMP Private variables
- Test OMP Scheduling, add SCHEDULE(runtime) to some OMP loops
- Review column and advection OMP implementation
- ADD OMP_TIMERS CPP option (temporary) to time threaded sections
- Add timer_tmp timers (temporary)
- Add omp_suite.ts test suite
- Add ability to set OMP_SCHEDULE via options (ompscheds, ompscheds1, ompschedd1)

* - Review diagnostics OMP implementation
- Add timer_stats namelist to turn on extra timer output information
- Add ICE_BFBTYPE and update bit-for-bit comparison logic in scripts
- Update qc and logbfb testing
- Remove logbfb and qchkf tests, add cmplog, cmplogrest, cmprest set_env files to set ICE_BFBTYPE
- Update documentation

* Update EVP OMP implementation

* - Refactor puny/pi scalars in eap dynamics to improve performance
- Update OMP in evp and eap

* Clean up

* Comment out temporary timers

* Update OMP env variables on Narwhal

* Update gaffney OMP_STACKSIZE

* update OMP_STACKSIZE on cori

* Update Onyx OMP_STACKSIZE
Update documentation

* Update OMP_STACKSIZE on mustang

* - Update Tsfc values on land in various places in the code, was affecting testing.  Specifically fix upwind advection.
- Comment out OMP in ice_dyn_evp_1d_kernel, was producing non bit-for-bit results with different thread counts

* updating LICENSE.pdf for 2022

* seabed stress - remove if statements (CICE-Consortium#673)

* refactor seabed_stress. Bit for bit

* Removed if statement from stepu. Results are binary identical, however taubx and tauby is updated on all iterations instead of just the last one. Not used within iteration

* changed capping from logical to numeric in order to remove if statement. Moved call to deformation out of loop

* clean dyn_finish, correct intent(inout) to intent(in) for Cw, resse Cb in stepu, remove if from seabed_stress_LKD

* Reolve conflicts after updating main

* modified environment for Freya to accomodate for additional OMP commands

* Requested changes after review. Only changed in seabed stress and not bit for bit if cor=0.0

added legacy comment in ice_dyn_finish

* move deformation to subcycling

* - Update version and copyright. (CICE-Consortium#691)

- Remove gordon and conrad machines.
- Add setenv OMP_STACKSIZE commented out in env files
- Update Icepack to fc4b809

* add OMP_STACKSIZE for koehr (CICE-Consortium#693)

* Update C/CD deformations calls to be consistent with main B changes
Update tauxbx, tauxby calculations on C/CD to be consistent with main B changes

* Update OpenMP in C/CD implementation
Extend omp_suite to include C/CD tests

* reconcile recent merge problem

* set default value of capping to 0. in vp cases for backwards compatibility

* Set capping to 1.0 in vp consistent with evp, changes answers for vp configurations

Co-authored-by: David A. Bailey <dbailey@ucar.edu>
Co-authored-by: Philippe Blain <philippe.blain@canada.ca>
Co-authored-by: Denise Worthen <denise.worthen@noaa.gov>
Co-authored-by: daveh150 <david.hebert@nrlssc.navy.mil>
Co-authored-by: David A. Hebert <dhebert@nrlssc.navy.mil>
Co-authored-by: Mariana Vertenstein <mvertens@ucar.edu>
Co-authored-by: Elizabeth Hunke <eclare@lanl.gov>
Co-authored-by: TRasmussen <33480590+TillRasmussen@users.noreply.github.com>
phil-blain added a commit to phil-blain/CICE that referenced this pull request May 24, 2022
Since d1e972a (Update OMP (CICE-Consortium#680), 2022-02-18), OpenMP
threading is active in 'ice_transport_remap.F90', and the default OpenMP
stack size needs to be adjusted to avoid stack overflows.

Add that variable for the new machines, setting it to a 64 Mb size
as used for other machines.
phil-blain added a commit to phil-blain/CICE that referenced this pull request May 24, 2022
Since d1e972a (Update OMP (CICE-Consortium#680), 2022-02-18), OpenMP
threading is active in 'ice_transport_remap.F90', and the default OpenMP
stack size needs to be adjusted to avoid stack overflows.

Add that variable for the new machines, setting it to a 64 Mb size
as used for other machines.
phil-blain added a commit to phil-blain/CICE that referenced this pull request May 27, 2022
A few notes:
- We do need to request memory even on machines where we have excluseive
node access. 20 GB was chosen rather arbitrarily.
- We set umask to 022 to make the <jobname>.o and <jobname>.e files
readable by group and others.
- We use the minimal SSM package for the compiler and Intel MPI, but we
keep the setup using environment modules commented if ever we need to
weak things (i.e. I_MPI_LIBRARY_KIND)
- We set OMP_STACKSIZE. Since d1e972a (Update OMP (CICE-Consortium#680),
2022-02-18), OpenMP threading is active in 'ice_transport_remap.F90',
and the default OpenMP stack size needs to be adjusted to avoid stack
overflows. We set it to a 64 Mb size as used for other machines.

Also, remove dead code setting 'CICE_ACCT'. This variable was last used
in 98e0307 (Update scripts, rename variables from CICE_ to ICE_ to be
more reusable in icepack., 2017-09-15), and so did not do anything for
any of the machines that were using it after that commit. Remove code in
machines env files that was setting it based on '~/.cice_proj'.

A few notes specific to 'gpsc3':

- Since we use an '--export' directive to choose which environment
variables are exported to the job environment by SLURM, SSMUSE_BASE and
SSMUSE_PATH are not present in the environnement and loading domains
without their full paths fails on csh, so use a full path.
- We use the compiler package from main/opt instead of eccc/all/opt
since we do not need the EC-specific variables to be set (and it also
leads to job failures since BASE_ARCH is not defined).
phil-blain added a commit to phil-blain/CICE that referenced this pull request May 30, 2022
A few notes:
- We do need to request memory even on machines where we have excluseive
node access. 20 GB was chosen rather arbitrarily.
- We set umask to 022 to make the <jobname>.o and <jobname>.e files
readable by group and others.
- We use the minimal SSM package for the compiler and Intel MPI, but we
keep the setup using environment modules commented if ever we need to
weak things (i.e. I_MPI_LIBRARY_KIND)
- We set OMP_STACKSIZE. Since d1e972a (Update OMP (CICE-Consortium#680),
2022-02-18), OpenMP threading is active in 'ice_transport_remap.F90',
and the default OpenMP stack size needs to be adjusted to avoid stack
overflows. We set it to a 64 Mb size as used for other machines.

Also, remove dead code setting 'CICE_ACCT'. This variable was last used
in 98e0307 (Update scripts, rename variables from CICE_ to ICE_ to be
more reusable in icepack., 2017-09-15), and so did not do anything for
any of the machines that were using it after that commit. Remove code in
machines env files that was setting it based on '~/.cice_proj'.

A few notes specific to 'gpsc3':

- Since we use an '--export' directive to choose which environment
variables are exported to the job environment by SLURM, SSMUSE_BASE and
SSMUSE_PATH are not present in the environnement and loading domains
without their full paths fails on csh, so use a full path.
- We use the compiler package from main/opt instead of eccc/all/opt
since we do not need the EC-specific variables to be set (and it also
leads to job failures since BASE_ARCH is not defined).
phil-blain added a commit to phil-blain/CICE that referenced this pull request May 30, 2022
A few notes:
- We do need to request memory even on machines where we have excluseive
node access. 20 GB was chosen rather arbitrarily.
- We set umask to 022 to make the <jobname>.o and <jobname>.e files
readable by group and others.
- We use the minimal SSM package for the compiler and Intel MPI, but we
keep the setup using environment modules commented if ever we need to
weak things (i.e. I_MPI_LIBRARY_KIND)
- We set OMP_STACKSIZE. Since d1e972a (Update OMP (CICE-Consortium#680),
2022-02-18), OpenMP threading is active in 'ice_transport_remap.F90',
and the default OpenMP stack size needs to be adjusted to avoid stack
overflows. We set it to a 64 Mb size as used for other machines.

Also, remove dead code setting 'CICE_ACCT'. This variable was last used
in 98e0307 (Update scripts, rename variables from CICE_ to ICE_ to be
more reusable in icepack., 2017-09-15), and so did not do anything for
any of the machines that were using it after that commit. Remove code in
machines env files that was setting it based on '~/.cice_proj'.

A few notes specific to 'gpsc3':

- Since we use an '--export' directive to choose which environment
variables are exported to the job environment by SLURM, SSMUSE_BASE and
SSMUSE_PATH are not present in the environnement and loading domains
without their full paths fails on csh, so use a full path.
- We use the compiler package from main/opt instead of eccc/all/opt
since we do not need the EC-specific variables to be set (and it also
leads to job failures since BASE_ARCH is not defined).
phil-blain added a commit to phil-blain/CICE that referenced this pull request May 30, 2022
A few notes:
- We do need to request memory even on machines where we have excluseive
node access. 20 GB was chosen rather arbitrarily.
- We set umask to 022 to make the <jobname>.o and <jobname>.e files
readable by group and others.
- We use the minimal SSM package for the compiler and Intel MPI, but we
keep the setup using environment modules commented if ever we need to
weak things (i.e. I_MPI_LIBRARY_KIND)
- We set OMP_STACKSIZE. Since d1e972a (Update OMP (CICE-Consortium#680),
2022-02-18), OpenMP threading is active in 'ice_transport_remap.F90',
and the default OpenMP stack size needs to be adjusted to avoid stack
overflows. We set it to a 64 Mb size as used for other machines.

Also, remove dead code setting 'CICE_ACCT'. This variable was last used
in 98e0307 (Update scripts, rename variables from CICE_ to ICE_ to be
more reusable in icepack., 2017-09-15), and so did not do anything for
any of the machines that were using it after that commit. Remove code in
machines env files that was setting it based on '~/.cice_proj'.

A few notes specific to 'gpsc3':

- Since we use an '--export' directive to choose which environment
variables are exported to the job environment by SLURM, SSMUSE_BASE and
SSMUSE_PATH are not present in the environnement and loading domains
without their full paths fails on csh, so use a full path.
- We use the compiler package from main/opt instead of eccc/all/opt
since we do not need the EC-specific variables to be set (and it also
leads to job failures since BASE_ARCH is not defined).
apcraig pushed a commit that referenced this pull request Jun 2, 2022
A few notes:
- We do need to request memory even on machines where we have excluseive
node access. 20 GB was chosen rather arbitrarily.
- We set umask to 022 to make the <jobname>.o and <jobname>.e files
readable by group and others.
- We use the minimal SSM package for the compiler and Intel MPI, but we
keep the setup using environment modules commented if ever we need to
weak things (i.e. I_MPI_LIBRARY_KIND)
- We set OMP_STACKSIZE. Since d1e972a (Update OMP (#680),
2022-02-18), OpenMP threading is active in 'ice_transport_remap.F90',
and the default OpenMP stack size needs to be adjusted to avoid stack
overflows. We set it to a 64 Mb size as used for other machines.

Also, remove dead code setting 'CICE_ACCT'. This variable was last used
in 98e0307 (Update scripts, rename variables from CICE_ to ICE_ to be
more reusable in icepack., 2017-09-15), and so did not do anything for
any of the machines that were using it after that commit. Remove code in
machines env files that was setting it based on '~/.cice_proj'.

A few notes specific to 'gpsc3':

- Since we use an '--export' directive to choose which environment
variables are exported to the job environment by SLURM, SSMUSE_BASE and
SSMUSE_PATH are not present in the environnement and loading domains
without their full paths fails on csh, so use a full path.
- We use the compiler package from main/opt instead of eccc/all/opt
since we do not need the EC-specific variables to be set (and it also
leads to job failures since BASE_ARCH is not defined).
phil-blain added a commit to phil-blain/CICE that referenced this pull request Aug 25, 2022
When the implicit VP solver was added in f7fd063 (dynamics: add implicit
VP solver (CICE-Consortium#491), 2020-09-22), it had not yet been tested with OpenMP
enabled.

The OpenMP implementation was carefully reviewed and then fixed in
d1e972a (Update OMP (CICE-Consortium#680), 2022-02-18), which lead to all runs of the
'decomp' suite completing and all restart tests passing. The 'bfbcomp'
tests are still failing, but this is due to the code not using the CICE
global sum implementation correctly, which will be fixed in the next
commits.

Update the documentation accordingly.
dabail10 pushed a commit to ESCOMP/CICE that referenced this pull request Oct 4, 2022
A few notes:
- We do need to request memory even on machines where we have excluseive
node access. 20 GB was chosen rather arbitrarily.
- We set umask to 022 to make the <jobname>.o and <jobname>.e files
readable by group and others.
- We use the minimal SSM package for the compiler and Intel MPI, but we
keep the setup using environment modules commented if ever we need to
weak things (i.e. I_MPI_LIBRARY_KIND)
- We set OMP_STACKSIZE. Since d1e972a (Update OMP (CICE-Consortium#680),
2022-02-18), OpenMP threading is active in 'ice_transport_remap.F90',
and the default OpenMP stack size needs to be adjusted to avoid stack
overflows. We set it to a 64 Mb size as used for other machines.

Also, remove dead code setting 'CICE_ACCT'. This variable was last used
in 98e0307 (Update scripts, rename variables from CICE_ to ICE_ to be
more reusable in icepack., 2017-09-15), and so did not do anything for
any of the machines that were using it after that commit. Remove code in
machines env files that was setting it based on '~/.cice_proj'.

A few notes specific to 'gpsc3':

- Since we use an '--export' directive to choose which environment
variables are exported to the job environment by SLURM, SSMUSE_BASE and
SSMUSE_PATH are not present in the environnement and loading domains
without their full paths fails on csh, so use a full path.
- We use the compiler package from main/opt instead of eccc/all/opt
since we do not need the EC-specific variables to be set (and it also
leads to job failures since BASE_ARCH is not defined).
phil-blain added a commit to phil-blain/CICE that referenced this pull request Oct 5, 2022
When the implicit VP solver was added in f7fd063 (dynamics: add implicit
VP solver (CICE-Consortium#491), 2020-09-22), it had not yet been tested with OpenMP
enabled.

The OpenMP implementation was carefully reviewed and then fixed in
d1e972a (Update OMP (CICE-Consortium#680), 2022-02-18), which lead to all runs of the
'decomp' suite completing and all restart tests passing. The 'bfbcomp'
tests are still failing, but this is due to the code not using the CICE
global sum implementation correctly, which will be fixed in the next
commits.

Update the documentation accordingly.
phil-blain added a commit to phil-blain/CICE that referenced this pull request Oct 17, 2022
When the implicit VP solver was added in f7fd063 (dynamics: add implicit
VP solver (CICE-Consortium#491), 2020-09-22), it had not yet been tested with OpenMP
enabled.

The OpenMP implementation was carefully reviewed and then fixed in
d1e972a (Update OMP (CICE-Consortium#680), 2022-02-18), which lead to all runs of the
'decomp' suite completing and all restart tests passing. The 'bfbcomp'
tests are still failing, but this is due to the code not using the CICE
global sum implementation correctly, which will be fixed in the next
commits.

Update the documentation accordingly.
phil-blain added a commit to phil-blain/CICE that referenced this pull request Oct 17, 2022
When the OpenMP implementation was reviewed and fixed in d1e972a (Update
OMP (CICE-Consortium#680), 2022-02-18), the 'PRIVATE' clause of the OpenMP directive
for the loop where 'dyn_prep2' is called in 'implicit_solver' was
corrected in line with what was done in 'ice_dyn_evp', but OpenMP was
left unactivated for this loop (the 'TCXOMP' was not changed to a real
'OMP' directive).

Activate OpenMP for this loop. All runs and restart tests of the
'decomp_suite' still pass with this change.
apcraig pushed a commit that referenced this pull request Oct 20, 2022
* doc: fix typo in index (bfbflag)

* doc: correct default value of 'maxits_nonlin'

The "Table of namelist options" in the user guide lists 'maxits_nonlin'
as having a default value of 1000, whereas its actual default is 4, both
in the namelist and in 'ice_init.F90'. This has been the case since the
original implementation of the implicit solver in f7fd063 (dynamics: add
implicit VP solver (#491), 2020-09-22).

Fix the documentation.

* doc: VP solver is validated with OpenMP

When the implicit VP solver was added in f7fd063 (dynamics: add implicit
VP solver (#491), 2020-09-22), it had not yet been tested with OpenMP
enabled.

The OpenMP implementation was carefully reviewed and then fixed in
d1e972a (Update OMP (#680), 2022-02-18), which lead to all runs of the
'decomp' suite completing and all restart tests passing. The 'bfbcomp'
tests are still failing, but this is due to the code not using the CICE
global sum implementation correctly, which will be fixed in the next
commits.

Update the documentation accordingly.

* ice_dyn_vp: activate OpenMP in 'dyn_prep2' loop

When the OpenMP implementation was reviewed and fixed in d1e972a (Update
OMP (#680), 2022-02-18), the 'PRIVATE' clause of the OpenMP directive
for the loop where 'dyn_prep2' is called in 'implicit_solver' was
corrected in line with what was done in 'ice_dyn_evp', but OpenMP was
left unactivated for this loop (the 'TCXOMP' was not changed to a real
'OMP' directive).

Activate OpenMP for this loop. All runs and restart tests of the
'decomp_suite' still pass with this change.

* machines: eccc : add ICE_MACHINE_MAXRUNLENGTH to ppp[56]

* machines: eccc: use PBS-enabled OpenMPI for 'ppp6_gnu'

The system installation of OpenMPI at /usr/mpi/gcc/openmpi-4.1.2a1/ is
not compiled with support for PBS. This leads to failures as the MPI
runtime does not have the same view of the number of available processors
as the job scheduler.

Use our own build of OpenMPI, compiled with PBS support, for the
'ppp6_gnu'  environment, which uses OpenMPI.

* machines: eccc: set I_MPI_FABRICS=ofi

Intel MPI 2021.5.1, which comes with oneAPI 2022.1.2, seems to have an
intermittent bug where a call to 'MPI_Waitall' fails with:

    Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code

and no core dump is produced. This affects at least these cases of the
'decomp' suite:

- *_*_restart_gx3_16x2x1x1x800_droundrobin
- *_*_restart_gx3_16x2x2x2x200_droundrobin

This was reported to Intel and they suggested setting the variable
'I_MPI_FABRICS' to 'ofi' (the default being 'shm:ofi' [1]). This
disables shared memory transport and indeeds fixes the failures.

Set this variable for all ECCC machine files using Intel MPI.

[1] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-fabrics-control/communication-fabrics-control.html

* machines: eccc: set I_MPI_CBWR for BASEGEN/BASECOM runs

Intel MPI, in contrast to OpenMPI (as far as I was able to test, and see
[1], [2]), does not (by default) guarantee that repeated runs of the same
code on the same machine with the same number of MPI ranks yield the
same results when collective operations (e.g. 'MPI_ALLREDUCE') are used.

Since the VP solver uses MPI_ALLREDUCE in its algorithm, this leads to
repeated runs of the code giving different answers, and baseline
comparing runs with code built from the same commit failing.

When generating a baseline or comparing against an existing baseline,
set the environment variable 'I_MPI_CBWR' to 1 for ECCC machine files
using Intel MPI [3], so that (processor) topology-aware collective
algorithms are not used and results are reproducible.

Note that we do not need to set this variable on robert or underhill, on
which jobs have exclusive node access and thus job placement (on
processors) is guaranteed to be reproducible.

[1] https://stackoverflow.com/a/45916859/
[2] https://scicomp.stackexchange.com/a/2386/
[3] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html#i-mpi-adjust-family-environment-variables_GUID-A5119508-5588-4CF5-9979-8D60831D1411

* ice_dyn_vp: fgmres: exit early if right-hand-side vector is zero

If starting a run with with "ice_ic='none'" (no ice), the linearized
problem for the ice velocity A x = b will have b = 0, since all terms in
the right hand side vector will be zero:

- strint[xy] is zero because the velocity is zero
- tau[xy] is zero because the ocean velocity is also zero
- [uv]vel_init is zero
- strair[xy] is zero because the concentration is zero
- strtlt[xy] is zero because the ocean velocity is zero

We thus have a linear system A x = b with b=0, so we
must have x=0.

In the FGMRES linear solver, this special case is not taken into
account, and so we end up with an all-zero initial residual since
workspace_[xy] is also zero because of the all-zero initial guess
'sol[xy]', which corresponds to the initial ice velocity. This then
leads to a division by zero when normalizing the first Arnoldi vector.

Fix this special case by computing the norm of the right-hand-side
vector before starting the iterations, and exiting early if it is zero.
This is in line with the GMRES implementation in SciPy [1].

[1] https://github.com/scipy/scipy/blob/651a9b717deb68adde9416072c1e1d5aa14a58a1/scipy/sparse/linalg/_isolve/iterative.py#L620-L628

Close: phil-blain#42

* ice_dyn_vp: add global_norm, global_dot_product functions

The VP solver uses a linear solver, FGMRES, as part of the non-linear
iteration. The FGMRES algorithm involves computing the norm of a
distributed vector field, thus performing global sums.

These norms are computed by first summing the squared X and Y components
of a vector field in subroutine 'calc_L2norm_squared', summing these
over the local blocks, and then doing a global (MPI) sum using
'global_sum'.

This approach does not lead to reproducible results when the MPI
distribution, or the number of local blocks, is changed, for reasons
explained in the "Reproducible sums" section of the Developer Guide
(mostly, floating point addition is not associative). This was partly
pointed out in [1] but I failed to realize it at the time.

Introduce a new function, 'global_dot_product', to encapsulate the
computation of the dot product of two grid vectors, each split into two
arrays (for the X and Y components).

Compute the reduction locally as is done in 'calc_L2norm_squared', but
throw away the result and use the existing 'global_sum' function when
'bfbflag' is active, passing it the temporary array used to compute the
element-by-element product.

This approach avoids a performance regression from the added work done
in 'global_sum', such that non-bfbflag runs are as fast as before.

Note that since 'global_sum' loops on the whole array (and not just ice
points as 'global_dot_product'), make sure to zero-initialize the 'prod'
local array.

Also add a 'global_norm' function implemented using
'global_dot_product'. Both functions will be used in subsequent commits
to ensure bit-for-bit reproducibility.

* ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit output reproducibility

Make the results of the VP solver reproducible if desired by refactoring
the code to use the subroutines 'global_norm' and 'global_dot_product'
added in the previous commit.

The same pattern appears in the FGMRES solver (subroutine 'fgmres'), the
preconditioner 'pgmres' which uses the same algorithm, and the
Classical and Modified Gram-Schmidt algorithms in 'orthogonalize'.

These modifications do not change the number of global sums in the
fgmres, pgmres and the MGS algorithm. For the CGS algorithm, there is
(in theory) a slight performance impact as 'global_dot_product' is
called inside the loop, whereas previously we called
'global_allreduce_sum' after the loop to compute all 'initer' sums at
the same time.

To keep that optimization, we would have to implement a new interface
'global_allreduce_sum' which would take an array of shape
(nx_block,ny_block,max_blocks,k) and sum over their first three
dimensions before performing the global reduction over the k dimension.

We choose to not go that route for now mostly because anyway the CGS
algorithm is (by default) only used for the PGMRES preconditioner, and
so the cost should be relatively low as 'initer' corresponds to
'dim_pgmres' in the namelist, which should be kept low for efficiency
(default 5).

These changes lead to bit-for-bit reproducibility (the decomp_suite
passes) when using 'precond=ident' and 'precond=diag' along with
'bfbflag=reprosum'. 'precond=pgmres' is still not bit-for-bit because
some halo updates are skipped for efficiency. This will be addressed in
a following commit.

[1] #491 (comment)

* ice_dyn_vp: do not skip halo updates in 'pgmres' under 'bfbflag'

The 'pgmres' subroutine implements a separate GMRES solver and is used
as a preconditioner for the FGMRES linear solver. Since it is only a
preconditioner, it was decided to skip the halo updates after computing
the matrix-vector product (in 'matvec'), for efficiency.

This leads to non-reproducibility since the content of the non-updated
halos depend on the block / MPI distribution.

Add the required halo updates, but only perform them when we are
explicitely asking for bit-for-bit global sums, i.e. when 'bfbflag' is
set to something else than 'not'.

Adjust the interfaces of 'pgmres' and 'precondition' (from which
'pgmres' is called) to accept 'halo_info_mask', since it is needed for
masked updates.

Closes #518

* ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit log reproducibility

In the previous commits we ensured bit-for-bit reproducibility of the
outputs when using the VP solver.

Some global norms computed during the nonlinear iteration still use the
same non-reproducible pattern of summing over blocks locally before
performing the reduction. However, these norms are used only to monitor
the convergence in the log file, as well as to exit the iteration when
the required convergence level is reached ('nlres_norm'). Only
'nlres_norm' could (in theory) influence the output, but it is unlikely
that a difference due to floating point errors would influence the 'if
(nlres_norm < tol_nl)' condition used to exist the nonlinear iteration.

Change these remaining cases to also use 'global_norm', leading to
bit-for-bit log reproducibility.

* ice_dyn_vp: remove unused subroutine and cleanup interfaces

The previous commit removed the last caller of 'calc_L2norm_squared'.
Remove the subroutine.

Also, do not compute 'sum_squared' in 'residual_vec', since the variable
'L2norm' which receives this value is also unused in 'anderson_solver'
since the previous commit. Remove that variable, and adjust the
interface of 'residual_vec' accordingly.

* ice_global_reductions: remove 'global_allreduce_sum'

In a previous commit, we removed the sole caller of
'global_allreduce_sum' (in ice_dyn_vp::orthogonalize). We do not
anticipate that function to be ued elsewhere in the code, so remove it
from ice_global_reductions. Update the 'sumchk' unit test accordingly.

* doc: mention VP solver is only reproducible using 'bfbflag'

The previous commits made sure that the model outputs as well as the log
file output are bit-for-bit reproducible when using the VP solver by
refactoring the code to use the existing 'global_sum' subroutine.

Add a note in the documentation mentioning that 'bfbflag' is required to
get bit-for-bit reproducible results under different decompositions /
MPI counts when using the VP solver.

Also, adjust the doc about 'bfbflag=lsum8' being the same as
'bfbflag=off' since this is not the case for the VP solver: in the first
case we use the scalar version of 'global_sum', in the second case we
use the array version.

* ice_dyn_vp: improve default parameters for VP solver

During QC testing of the previous commit, the 5 years QC test with the
updated VP solver failed twice with "bad departure points" after a few
years of simulation. Simply bumping the number of nonlinear iterations
(maxits_nonlin) from 4 to 5 makes these failures disappear and allow the
simulations to run to completion, suggesting the solution is not
converged enough with 4 iterations.

We also noticed that in these failing cases, the relative tolerance for
the linear solver (reltol_fmgres = 1E-2) is too small to be reached in
less than 50 iterations (maxits_fgmres), and that's the case at each
nonlinear iteration. Other papers mention a relative tolerance of 1E-1
for the linear solver, and using this value also allows both cases to
run to completion (even without changing maxits_nonlin).

Let's set the default tolerance for the linear solver to 1E-1, and let's
be conservative and bump the number of nonlinear iterations to 10. This
should give us a more converged solution and add robustness to the
default settings.
phil-blain added a commit to phil-blain/CICE that referenced this pull request Dec 12, 2022
A few notes:
- We do need to request memory even on machines where we have excluseive
node access. 20 GB was chosen rather arbitrarily.
- We set umask to 022 to make the <jobname>.o and <jobname>.e files
readable by group and others.
- We use the minimal SSM package for the compiler and Intel MPI, but we
keep the setup using environment modules commented if ever we need to
weak things (i.e. I_MPI_LIBRARY_KIND)
- We set OMP_STACKSIZE. Since d1e972a (Update OMP (CICE-Consortium#680),
2022-02-18), OpenMP threading is active in 'ice_transport_remap.F90',
and the default OpenMP stack size needs to be adjusted to avoid stack
overflows. We set it to a 64 Mb size as used for other machines.

Also, remove dead code setting 'CICE_ACCT'. This variable was last used
in 98e0307 (Update scripts, rename variables from CICE_ to ICE_ to be
more reusable in icepack., 2017-09-15), and so did not do anything for
any of the machines that were using it after that commit. Remove code in
machines env files that was setting it based on '~/.cice_proj'.

A few notes specific to 'gpsc3':

- Since we use an '--export' directive to choose which environment
variables are exported to the job environment by SLURM, SSMUSE_BASE and
SSMUSE_PATH are not present in the environnement and loading domains
without their full paths fails on csh, so use a full path.
- We use the compiler package from main/opt instead of eccc/all/opt
since we do not need the EC-specific variables to be set (and it also
leads to job failures since BASE_ARCH is not defined).

(cherry picked from commit c334aee)
phil-blain added a commit to phil-blain/CICE that referenced this pull request Feb 2, 2024
A few notes:
- We do need to request memory even on machines where we have excluseive
node access. 20 GB was chosen rather arbitrarily.
- We set umask to 022 to make the <jobname>.o and <jobname>.e files
readable by group and others.
- We use the minimal SSM package for the compiler and Intel MPI, but we
keep the setup using environment modules commented if ever we need to
weak things (i.e. I_MPI_LIBRARY_KIND)
- We set OMP_STACKSIZE. Since d1e972a (Update OMP (CICE-Consortium#680),
2022-02-18), OpenMP threading is active in 'ice_transport_remap.F90',
and the default OpenMP stack size needs to be adjusted to avoid stack
overflows. We set it to a 64 Mb size as used for other machines.

Also, remove dead code setting 'CICE_ACCT'. This variable was last used
in 98e0307 (Update scripts, rename variables from CICE_ to ICE_ to be
more reusable in icepack., 2017-09-15), and so did not do anything for
any of the machines that were using it after that commit. Remove code in
machines env files that was setting it based on '~/.cice_proj'.

A few notes specific to 'gpsc3':

- Since we use an '--export' directive to choose which environment
variables are exported to the job environment by SLURM, SSMUSE_BASE and
SSMUSE_PATH are not present in the environnement and loading domains
without their full paths fails on csh, so use a full path.
- We use the compiler package from main/opt instead of eccc/all/opt
since we do not need the EC-specific variables to be set (and it also
leads to job failures since BASE_ARCH is not defined).

(cherry picked from commit c334aee)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants