
Jgfouca/branch for acme split 2019 05 28 #3123

Merged: 49 commits, Jun 3, 2019

Conversation

@jgfouca (Contributor) commented May 28, 2019

Change List:

  • Add first stage of ocean/ice shelf coupling
  • Bring COMPOSE semi-Lagrangian tracer transport code into HOMME and E3SM (required minor changes to build.py)
  • Add checkpoint of timing data immediately after initialization
  • bless_test_results: Be able to handle build-only tests

Test suite: scripts_regression_tests
Test baseline:
Test namelist changes:
Test status: BFB

Fixes [CIME Github issue #]

User interface changes?:

Update gh-pages html (Y/N)?:

Code review: @jedwards4b

Jeremy Fyke and others added 30 commits February 6, 2019 09:27
This commit introduces infrastructure changes needed for adding coupling
between the ocean and ice shelves in the GLC component.  It is a squash
of work done primarily by Jeremy Fyke between 2015 and 2017, with
updates by Jon Wolfe to make it compatible with changes to E3SM
since then. It is BFB for existing compsets.

This commit introduces the CIME changes, but does not touch the
components.

[BFB] - Bit-For-Bit
A GG case now runs to completion and appears to be calculating fluxes.
This ensures there are 0-values where floating ice is not present and
removes NaNs that were present previously.
A previous commit was using ocean state fields on the ocean grid,
presumably as a holdover from when this melt calculation was done on the
ocean grid.  This commit fixes this by switching to the ocean state
fields remapped to the GLC grid.
This commit makes minor adjustments to the method for solving the
ocean-iceshelf interface temperature, following changes that had been
implemented in standalone MPAS-O in the last few years.  This makes
finding the root slightly more stable.
This commit makes a slight modification of the calculation of the ocean
freezing temperature beneath ice shelf cavities to account for
dependence on salinity.  This adjustment matches what is currently done
in MPAS-Ocean.

Also update parameter values in shr_const_mod to match exactly what is
in MPAS-Ocean.  Descriptions of those parameters are also added.
The melt flux calculation was using an incorrect variable for the ice
shelf pressure.
This forces the attribute vectors to be updated by the time that
prep_glc_calculate_subshelf_boundary_fluxes() is called.  Without this,
Sg_tbot and Sg_dztbot were still 0 when that routine was executed,
leading to a divide by zero.

There may be a subtler way to force the update, but this has the added
benefit of allowing these fields to be visualized on the ocean grid in
the coupler history file for debugging purposes.
Adds commented note in prep_glc_calculate_subshelf_boundary_fluxes
explaining the change made in 4092b4e

Also, comment out the rhoeff index for now - it is not currently used,
but will be implemented later.
Now both mass fluxes are positive for an addition of mass to the
component in question.
The call to cime_run_glc_setup_send was inadvertently deleted in
commit b24d24f

This commit reconciles changes in this branch with the contents of
cime_run_glc_setup_send.  Specifically, cime_run_glc_setup_send is
modified to separate steps needed for coupling from LND and OCN.  This
separation is also applied to prep_glc_accum_avg in prep_glc_mod.F90.
Finally, prep_glc_mrg and the routine it calls have been renamed to
indicate they are specific to forcing coming from LND.
This commit moves the remaining details of the new OCN-GLC coupling from
the main cime_run() routine into subroutines.

A routine cime_run_ocnglc_coupling() does the actual calculations and is
called from cime_run_ocn_recv_post() to ensure it runs each time OCN
completes.

This commit also creates separate prep_glc_accum_*() routines for
fields coming from LND and OCN so they can be called independently.
They are called from the respective cime_run_*_recv_post() routines for
the component from which the fields are coming.

prep_glc_accum_avg() is still combined for fields coming from the two
components because this step happens in the same place for fields coming
from either LND or OCN.
The call from prep_ocn_mod.F90 is not needed.
This capability will be disabled for CESM.
This commit also corrects usage of glcshelf_c2_ocn instead of glc_c2_ocn in a couple of places.
Whitespace, indentation, removal of unneeded comments, keyword arguments
for long argument lists.
In prep_glc_mod, there were two places where a list of coupling fields
from OCN were hardcoded in calls to seq_map_map.  This commit replaces
those hard-coded lists with a variable that is generated in seq_flds_mod
from the definitions of the coupling fields that are required.
These new coupling fields are disabled for CESM.
The OCN-GLC time averaging accumulators should only be read/written to
coupler restart files if OCN-GLC are actually coupled.  Otherwise there
will be an attempt to read/write unallocated things.

This required passing the ocn_c2_glc flag to the i/o routines, because
that information is not in infodata.
Currently a timing checkpoint is saved at the end of the first
simulation day, capturing both initialization cost and possibly
anomalous first-simulation-day performance. However, when
evaluating initialization performance for high-resolution
cases, this may not be sufficiently fine-grained, especially if the
evaluation job runs out of time before completing the first
simulation day. Here the env_run.xml variable TPROF_IN_INIT is
introduced, enabling timing checkpoints to be saved at
two locations within the initialization. (This adds a new namelist
variable to seq_infodata_inparm in drv_in.) TPROF_IN_INIT is FALSE
by default, in which case no additional checkpoint data is output.
However, existing timer names associated with writing out
performance data are modified slightly, to better attribute time
associated with the new performance-data writes.

[BFB] - Bit-For-Bit
[NML] - Namelist Changing
The addition of a namelist variable and env_run.xml support for enabling
optional checkpointing of performance data during initialization in the
previous commit is backed out here. Instead, a new checkpoint timing file is
now output right before the beginning of the run loop, with a name consistent
with the other checkpoint timing files, i.e. including the simulation
timestamp in the suffix. This is not optional.

The internal reworking of the checkpoint write logic and tweaking of timer
names in cime_comp_mod.F90 is preserved, and missing documentation for the
TPROF_TOTAL env_run.xml variable is also added to config_component_e3sm.xml.

[BFB] - Bit-For-Bit
…heta-l.

Update E3SM namelist files with COMPOSE options.

Make the theta-l E3SM target build with COMPOSE.
@rljacob rljacob requested a review from billsacks May 28, 2019 17:01
```diff
@@ -233,7 +233,7 @@ def _build_libraries(case, exeroot, sharedpath, caseroot, cimeroot, libroot, lid
     if mpilib == "mpi-serial":
         libs.insert(0, mpilib)

-    if cam_target == "preqx_kokkos":
+    if cam_target in ("preqx_kokkos", "theta-l"):
```
Contributor

Can we combine this and line 499 so that it only appears in one place?

Contributor Author

Good idea, will do
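The suggested consolidation could be sketched as follows (Python, matching build.py). The names KOKKOS_CAM_TARGETS and uses_kokkos are hypothetical, and the second occurrence at "line 499" is not shown in this diff:

```python
# Hypothetical refactor of the duplicated cam_target check in build.py:
# keep the tuple of Kokkos-based HOMME targets in one place so adding a
# new target (as this PR does with "theta-l") touches a single line.
KOKKOS_CAM_TARGETS = ("preqx_kokkos", "theta-l")

def uses_kokkos(cam_target):
    """Return True if this HOMME target requires the Kokkos build."""
    return cam_target in KOKKOS_CAM_TARGETS
```

Both call sites would then call uses_kokkos(cam_target) instead of repeating the tuple.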

```fortran
call t_adj_detailf(+1)

call t_startf("sync1_tprf")
call mpi_barrier(ckpt_mpicom,ierr)
```
Contributor

This code introduces an mpi_barrier to the cime_run loop - it could have serious performance implications. In other cases timing barriers are inside if (barrier_alarm) code blocks. It seems like this should be the same.

Member

@matthewhoffman was that barrier needed?

Contributor

@rljacob, I'm not familiar with this code; it is not part of my PR.

Contributor Author

@worleyph made this change I think (7b1e744 in E3SM)

@worleyph (Contributor) commented May 28, 2019

This same logic existed in-line previously. I just encapsulated it in a subroutine to cut down on duplicate code. It is called in one new location, just before the start of the run loop (but outside), immediately preceding the existing RUN_LOOP_BSTART barrier (which is annoying to have both, but seemed preferable to the other options I considered).
In summary, it is executed immediately after initialization (so once - a new feature, and the point of the PR), after the end of the first day (also once, has been this way for a long time), and whenever tprof_alarm is true (also for a long time). The barriers also occur during cime_final.

@worleyph (Contributor) commented May 28, 2019

There has always been the possibility that components would write out their own checkpoints (or even do their timing independently). We have never supported that, but the design has always been in the back of my head, thus the use of a formal parameter and not hardwiring mpicom_GLOID. This is a subroutine, and the communicator is clearly a passed parameter. I don't find this confusing :-). You just have to check the call sites? So... my opinion is to leave it as is. However, do what you feel that you need to do.

Contributor

Actually, I'd strongly prefer that this remain a formal parameter. There are a lot of communicators floating around in cime_comp_mod.F90. It seems a strange design to hardwire GLOID in a subroutine that is otherwise pretty generic.

Member

I see why we might want to change it back to GLOID. If we wanted to search the code for everywhere a barrier is called on GLOID, this one would be missed. It's not obvious from the name of the function that a barrier is going to be called inside it.

Contributor

t_drvstartf and t_drvstopf also have an (optional) parameter to specify the communicator to be used in a barrier, though that goes through t_barrierf and so is not normally enabled. They also never use GLOID (currently), but rather CPLID, omp(eci)%mpicom_compid, and mpicom_barrier (as called from component_exch, so one more step removed).

seq_domain_check_grid also has a barrier (and an allreduce) using an mpicom formal parameter.

Could make this an optional parameter that defaults to GLOID if the parameter is not specified, so grep would pick it up? But since all of the calls (all two of them) are local to the file cime_comp_mod.F90 at the moment, they would not be hard to hunt down.

In general, if there is a need to identify all barrier (and allreduce?) calls using GLOID, I wouldn't depend on grep without further investigation of all barrier and allreduce calls, unless we eliminate all communicator parameters. So my preference is the same, to leave as is, but I will make changes if people request it (or you can make the changes yourself).

Contributor Author
Telecon: we are fine with this as is.
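The optional-parameter alternative raised in this thread (default the communicator to the global one so a search still finds it) can be sketched in Python. GLOID and checkpoint_timing are illustrative stand-ins, not the actual CIME Fortran API:

```python
# Sketch of an optional communicator parameter that defaults to the
# global one (a stand-in for mpicom_GLOID). Call sites that want the
# global barrier omit the argument, so the default appears in exactly
# one greppable place; components could pass their own communicator.
GLOID = "mpicom_GLOID"

def checkpoint_timing(tag, mpicom=None):
    # Default to the global communicator when none is specified.
    mpicom = GLOID if mpicom is None else mpicom
    # A real implementation would barrier on mpicom and then write a
    # timing checkpoint file tagged with `tag`; here we just return
    # what was resolved so the behavior is easy to inspect.
    return (tag, mpicom)
```

For example, checkpoint_timing("run_loop_start") resolves to the global communicator, while checkpoint_timing("ocn", "mpicom_OCNID") uses the component-specific one.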

```diff
@@ -2399,6 +2410,155 @@ subroutine seq_flds_set(nmlfile, ID, infodata)
     call set_glc_elevclass_field(name, attname, longname, stdname, units, x2l_fluxes_from_glc, &
          additional_list = .true.)

+    if (trim(cime_model) == 'e3sm') then
```
Contributor

Is this really the appropriate test here or should it be something about active glc model?

Contributor

@jedwards4b , originally @billsacks had wanted much of the OCN-GLC functionality enabled only in E3SM. In the end we settled on an implementation that avoids needing to know which model is being used and instead functionality is enabled by compset definition via the OCN init routine. As such, this test could potentially be removed if you and Bill prefer.

Member

(I'm sort of back from vacation today, but still taking most of the day off, so may not respond to further discussion until tomorrow.)

As @matthewhoffman says, I requested this change after consultation with @mvertens . We wanted to ensure that these new fields don't impact CESM performance. I suspect that the new flag that @matthewhoffman is referring to isn't available at the time of seq_flds_set, so I don't think it will work to use that logical here (but I may be wrong).

Contributor

@billsacks is correct that the needed new flags are not available when seq_flds_set is called. seq_flds_set is called from cime_pre_init2, which occurs before cime_init, the routine from which the new flags are set (in ocn_init). So I think the current implementation here is the best we can do to prevent these new coupler fields from affecting CESM without significant restructuring.
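The ordering constraint described here can be illustrated with a toy sketch; the phase and field names below are simplified placeholders, not the real driver code:

```python
# seq_flds_set runs in a pre-init phase and can only consult static
# information such as the model name; ocn_c2_glc-style flags are set
# later, during component init, which is too late to change the
# already-frozen coupler field list.
def pre_init2(cime_model):
    # The field list is decided here, before any component initializes.
    return ["Sg_tbot", "Sg_dztbot"] if cime_model == "e3sm" else []

def component_init():
    # Coupling flags become known only now (set by ocn_init).
    return {"ocn_c2_glc": True}

coupling_fields = pre_init2("e3sm")  # flags not yet available
flags = component_init()             # field list already frozen
```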

@jedwards4b jedwards4b requested review from Katetc and removed request for billsacks May 28, 2019 17:39
@jedwards4b (Contributor)

@billsacks is on vacation this week, so I've asked @Katetc to review the glc changes.

@rljacob (Member) commented May 28, 2019

@Katetc FYI, Bill looked carefully at the glc-ocean coupling code in its E3SM PR here: E3SM-Project/E3SM#2726

@billsacks (Member)

@Katetc There is no need for you to review these code changes unless you want to, since (as @rljacob says) they have already undergone a pretty careful review by myself and others. However, before this comes to master, can you please run testing to ensure that CESM tests pass and are bit-for-bit? I think we discussed this before, but I forget what testing we came up with. I'm now thinking aux_glc and prealpha; prebeta is probably unnecessary (unless we said differently before).

@jgfouca (Contributor Author) commented May 30, 2019

@billsacks , @jedwards4b , did anyone kick off tests? I'm getting some pressure to do a CIME merge to E3SM soon and it can't happen until this PR is merged.

@jedwards4b (Contributor)

@fischer-ncar can you please kick off tests with this PR merged to master?

@fischer-ncar (Contributor)

I'm getting the following error for all of my tests.

ERROR: Namelist default for variable glc2ocn_fmapname refers to unknown XML variable GLC2OCN_FMAPNAME.'

billsacks added 2 commits May 31, 2019 10:57
These are needed for some new drv namelist variables. When they were
just in config_component_e3sm.xml, CESM tests failed with a message like
this:

ERROR: Namelist default for variable glc2ocn_fmapname refers to unknown
XML variable GLC2OCN_FMAPNAME.'
@billsacks (Member)

@jgfouca @Katetc - The plan I worked out with @fischer-ncar is: I am doing our aux_glc testing today. Once I get that all passing, @fischer-ncar will run CESM prealpha tests.

I think I have fixed the problem @fischer-ncar ran into. I have pushed this and some other minor changes (fixing whitespace / indentation) to a branch on my fork (https://github.com/billsacks/cime/tree/jgfouca/branch-for-acme-split-2019-05-28). Once I get tests passing there, I'll hand off these changes.

@jgfouca how do you prefer that I hand these changes back to you? Should I simply merge these changes back to your branch here, or do you prefer something different?

@jgfouca (Contributor Author) commented May 31, 2019

@billsacks , yes, just merge them into this PR.

@billsacks (Member)

Okay, I have pushed two commits. The first just cleans up whitespace. The second fixes the problem @fischer-ncar ran into, by moving a bunch of xml variables from the e3sm-specific file to the shared file.

This now passes all aux_glc tests on cheyenne, and results are bit-for-bit. (Also, performance doesn't seem to be affected significantly from a few spot-checks.) This test suite covers I and T compsets. @fischer-ncar can you go ahead and retry prealpha tests with this branch?

@fischer-ncar (Contributor)

The prealpha testing found

ERROR: Need to provide valid mapping file between glc and ocn in xml variable glc2ocn_fmapname

For B, F, and X compsets.

These mapping files are only needed in certain configurations. We need
to use idmap_ignore so that the driver's buildnml won't abort for
configurations where these mapping files aren't provided and aren't
needed.
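The idmap_ignore convention described in this commit message can be sketched as follows; validate_mapname and its error text are illustrative stand-ins for the driver's buildnml logic, not the real code:

```python
# Treat the idmap_ignore sentinel as "no mapping file required" rather
# than aborting, so configurations (e.g. B, F, X compsets without an
# active glc-ocn coupling) that never use the mapping can proceed.
IDMAP_IGNORE = "idmap_ignore"

def validate_mapname(name, var):
    """Return the mapping filename, None if ignorable, or raise."""
    if name == IDMAP_IGNORE:
        return None  # mapping not needed in this configuration; skip it
    if not name:
        raise RuntimeError(
            f"Need to provide valid mapping file in xml variable {var}")
    return name
```

With this in place, a compset that sets GLC2OCN_FMAPNAME to idmap_ignore skips validation, while an empty value still fails loudly.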
@billsacks (Member)

@fischer-ncar - I just pushed a fix for your issue

@jgfouca (Contributor Author) commented Jun 3, 2019

@billsacks , @fischer-ncar , how are we looking?

@billsacks billsacks requested review from billsacks and removed request for Katetc June 3, 2019 22:10
@billsacks (Member)

@jgfouca Chris's testing is still underway. Cheyenne (the machine, not the whole town, AFAIK) just went down due to weather problems, so it may be a while before full test results are in. But it looks like we're past the major issues at this point. @fischer-ncar and I agree that, if you're wanting to get this to master, you can go ahead and do so, and we'll follow up with additional issues/PRs if Chris runs into any more problems in his testing.

@jgfouca (Contributor Author) commented Jun 3, 2019

@billsacks , thanks.

@jgfouca jgfouca merged commit 5c43a15 into master Jun 3, 2019
@jgfouca jgfouca deleted the jgfouca/branch-for-acme-split-2019-05-28 branch June 3, 2019 22:20