update CMEPS to allow bilinear ATM<->WAV mapping for global coupled application; utilize custom restart names for WW3 (was #1684) #1692

Merged 15 commits on Apr 3, 2023

Conversation

DeniseWorthen
Collaborator

@DeniseWorthen DeniseWorthen commented Apr 1, 2023

Description

Changes the mapping of state fields between ATM and WAV for the coupled model to use bilinear with nearest-source-to-destination filling.
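To make the idea concrete, here is a minimal 1-D sketch of "bilinear with nearest-source-to-destination fill". It is an illustrative analogue only, not the CMEPS or ESMF implementation: the function name `bilinear_with_nstod`, the validity mask, and the sample values are all hypothetical. A destination point is interpolated linearly when both bracketing source points are valid, and otherwise takes the value of the nearest valid source point.

```python
from bisect import bisect_right


def bilinear_with_nstod(src_x, src_val, src_valid, dst_x):
    """1-D analogue (hypothetical helper, not the CMEPS/ESMF code).

    Interpolate linearly where both bracketing source points are valid;
    otherwise fall back to the nearest valid source point.
    src_x must be sorted in ascending order.
    """
    valid_idx = [i for i, ok in enumerate(src_valid) if ok]
    if not valid_idx:
        raise ValueError("no valid source points")
    out = []
    for x in dst_x:
        hi = bisect_right(src_x, x)  # first source point strictly right of x
        lo = hi - 1
        if lo >= 0 and hi < len(src_x) and src_valid[lo] and src_valid[hi]:
            # Both neighbours usable: ordinary linear interpolation.
            w = (x - src_x[lo]) / (src_x[hi] - src_x[lo])
            out.append((1.0 - w) * src_val[lo] + w * src_val[hi])
        else:
            # Unmapped destination point: fill from the nearest valid source.
            nearest = min(valid_idx, key=lambda i: abs(src_x[i] - x))
            out.append(src_val[nearest])
    return out


# Made-up sample: the source point at x=2 is masked out (e.g. land).
src_x = [0.0, 1.0, 2.0, 3.0]
src_val = [10.0, 20.0, 0.0, 40.0]
src_valid = [True, True, False, True]
mapped = bilinear_with_nstod(src_x, src_val, src_valid, [0.5, 1.5, 2.6])
print(mapped)
```

In this toy case only the first destination point is interpolated; the other two sit next to the masked source point and are filled from their nearest valid neighbour.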

A test case was run using the cpld_control_p8 test for 24 hours with and without this change. The following figure shows the difference in U10M along ~62S, from 90W to 40W, as imported by the WAV model with the current mapping (mapnstod_consf, black line) vs. the mapping in this PR (mapbilnr_nstod, red line).

Screen Shot 2023-03-29 at 10 03 35 AM

The impact on the Z0 imported by the ATM at ~0.5S,6E (on the coast of Brazil, where a large difference in Z0 is seen) for the current mapping (red) vs this PR (black) is shown below

Screen Shot 2023-03-29 at 10 25 09 AM

In the bmark test, the difference in Z0 imported by the ATM on tile 1 (scaled by 1.0e4) after 6 hours is shown below

Screen Shot 2023-03-29 at 11 20 05 AM

Top of commit queue on: TBD

Input data additions/changes

  • No changes are expected to input data.
  • There will be new input data.
  • Input data will be updated.

Anticipated changes to regression tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:

This will change baselines for all coupled tests using ATM-WAV coupling. HAFS wave tests do not change since their mapping is mapfillv_bilnr and does not change.

Full RTs on cheyenne show the following:

GNU:

FAILED TESTS:
Test cpld_control_p8 047 failed in check_result failed
Test cpld_control_p8 047 failed in run_test failed
Test cpld_debug_p8 049 failed in check_result failed
Test cpld_debug_p8 049 failed in run_test failed

INTEL:

FAILED TESTS:
Test cpld_control_p8_mixedmode 001 failed in check_result failed
Test cpld_control_p8_mixedmode 001 failed in run_test failed
Test cpld_control_gfsv17 002 failed in check_result failed
Test cpld_control_gfsv17 002 failed in run_test failed
Test cpld_control_p8 003 failed in check_result failed
Test cpld_control_p8 003 failed in run_test failed
Test cpld_control_qr_p8 005 failed in check_result failed
Test cpld_control_qr_p8 005 failed in run_test failed
Test cpld_2threads_p8 007 failed in check_result failed
Test cpld_2threads_p8 007 failed in run_test failed
Test cpld_decomp_p8 008 failed in check_result failed
Test cpld_decomp_p8 008 failed in run_test failed
Test cpld_mpi_p8 009 failed in check_result failed
Test cpld_mpi_p8 009 failed in run_test failed
Test cpld_control_ciceC_p8 010 failed in check_result failed
Test cpld_control_ciceC_p8 010 failed in run_test failed
Test cpld_control_c192_p8 011 failed in check_result failed
Test cpld_control_c192_p8 011 failed in run_test failed
Test cpld_control_noaero_p8 013 failed in check_result failed
Test cpld_control_noaero_p8 013 failed in run_test failed
Test cpld_debug_p8 015 failed in check_result failed
Test cpld_debug_p8 015 failed in run_test failed
Test cpld_debug_noaero_p8 016 failed in check_result failed
Test cpld_debug_noaero_p8 016 failed in run_test failed

RegressionTests_cheyenne.gnu.log
RegressionTests_cheyenne.intel.log
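Each failing test appears twice in the lists above, once for check_result and once for run_test. As a convenience, a small sketch like the following could collapse such lines into one entry per test. The regular expression is inferred only from the lines quoted in this PR, and the helper name `failed_tests` is hypothetical; the real log files may contain other line formats.

```python
import re

# Pattern inferred from the "FAILED TESTS" lines quoted above (an assumption).
LOG_LINE = re.compile(r"^Test (\S+) (\d+) failed in (\S+) failed$")


def failed_tests(log_text):
    """Return {test_name: sorted list of failing stages}."""
    stages_by_test = {}
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            name, _number, stage = match.groups()
            stages_by_test.setdefault(name, set()).add(stage)
    return {name: sorted(stages) for name, stages in stages_by_test.items()}


sample = """\
FAILED TESTS:
Test cpld_control_p8 047 failed in check_result failed
Test cpld_control_p8 047 failed in run_test failed
Test cpld_debug_p8 049 failed in check_result failed
Test cpld_debug_p8 049 failed in run_test failed
"""
summary = failed_tests(sample)
for name, stages in summary.items():
    print(name, ", ".join(stages))
```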

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

DeniseWorthen and others added 4 commits March 28, 2023 22:27
* set configuration variable true to use non-default restart file names in WW3
* change name of WW3 used for restart tests
* change name of WW3 in hafs wav tests
@DeniseWorthen DeniseWorthen changed the title from "update CMEPS to allow bilinear ATM<->WAV mapping for global coupled application" to "update CMEPS to allow bilinear ATM<->WAV mapping for global coupled application; utilize custom restart names for WW3 (was #1684)" on Apr 1, 2023
@DeniseWorthen
Collaborator Author

@jkbk2004 I had to recreate my CMEPS update PR and then re-merged the ww3 restart name change. This PR is equivalent to the previous PR #1685

@jkbk2004
Collaborator

jkbk2004 commented Apr 1, 2023

Sure! @zach1221 @BrianCurtis-NOAA this pr replaces #1685. I am adding new bl date.

@jkbk2004 jkbk2004 added the labels Baseline Updates (Current baselines will be updated.), Ready for Commit Queue (The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.), and jenkins-ci (Jenkins CI: ORT build/test on docker container) on Apr 1, 2023
@jkbk2004
Collaborator

jkbk2004 commented Apr 2, 2023

@DeniseWorthen two cases fail on jet: hafs_regional_datm_cdeps and regional_noquilt. I think regional_noquilt is a time-out issue, but hafs_regional_datm_cdeps shows something on the hycom side in the out file: error in zaiopf - can't open unit 13. Can you check /lfs4/HFIP/h-nems/Jong.Kim/RT_RUNDIRS/Jong.Kim/FV3_RT/rt_52646/hafs_regional_datm_cdeps/out? All ran OK on hera though.

@zach1221
Collaborator

zach1221 commented Apr 2, 2023

Please see jenkins-ci logs attached. ORTs passed.
ufs-weather-model » ort-docker-pipeline » PR-1692 #1 Console [Jenkins].pdf

@DeniseWorthen
Collaborator Author

@jkbk2004 I will look, but I suspect a jet system issue. This PR does not touch anything in hycom or cdeps.

@DeniseWorthen
Collaborator Author

There is a timeout message in the hafs_regional_datm_cdeps err log also. It seems to have hung. But I will run the test independently to check.

@jkbk2004
Collaborator

jkbk2004 commented Apr 3, 2023

There is a timeout message in the hafs_regional_datm_cdeps err log also. It seems to have hung. But I will run the test independently to check.

I agree. The time-out cases usually end up with that hycom error message. The same thing occasionally happens even with the develop branch. I think we need to turn those cases off on jet.

@BrianCurtis-NOAA
Collaborator

We can't keep turning off tests that don't work. The issues need to be addressed with the machine admins and/or fixes in the UFSWM.

@DeniseWorthen
Collaborator Author

@BrianCurtis-NOAA It does seem that jet is a flakey platform in general and it is difficult for either us or sys admins to debug intermittent issues. Generally we have the "2 tries" but even that doesn't seem to be enough for jet.

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Apr 3, 2023

I don't think this causes the time-outs, but we should reduce the debug flags here. There is no reason for those to be anything other than 0 as a default. The higher settings simply report min/max values for import/export states for example. They do not actually turn on any sort of compiler-debug options.

nems.configure.hafs_atm_docn.IN:93:  dbug_flag = 20
nems.configure.hafs_atm_ocn.IN:122:  dbug_flag = 20
nems.configure.hafs_atm_ocn_wav.IN:134:  dbug_flag = 6
nems.configure.hafs_atm_wav.IN:96:  dbug_flag = 6
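For anyone wanting to audit the templates, a rough sketch of a check for non-zero `dbug_flag` defaults is below. It parses grep-style `file:line: dbug_flag = N` text exactly as quoted above; the helper name `nonzero_dbug_flags` is hypothetical and nothing like it exists in the repository as far as this PR shows.

```python
import re

# Matches grep -n style output: "<file>:<line>:  dbug_flag = <value>"
GREP_LINE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+):\s*dbug_flag\s*=\s*(?P<value>\d+)\s*$"
)


def nonzero_dbug_flags(grep_output):
    """Return (file, line_number, value) for every dbug_flag that is not 0."""
    hits = []
    for line in grep_output.splitlines():
        match = GREP_LINE.match(line)
        if match and int(match.group("value")) != 0:
            hits.append(
                (match.group("file"), int(match.group("line")), int(match.group("value")))
            )
    return hits


sample = """\
nems.configure.hafs_atm_docn.IN:93:  dbug_flag = 20
nems.configure.hafs_atm_ocn.IN:122:  dbug_flag = 20
nems.configure.hafs_atm_ocn_wav.IN:134:  dbug_flag = 6
nems.configure.hafs_atm_wav.IN:96:  dbug_flag = 6
"""
hits = nonzero_dbug_flags(sample)
for file_name, line_number, value in hits:
    print(f"{file_name}:{line_number} has dbug_flag = {value}")
```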

@DeniseWorthen
Collaborator Author

The current jet log shows this test taking almost 24 minutes. On Cheyenne.intel, it takes 21 minutes. I would suggest we reduce the fhmax for this test to 12 or even 6. It is currently 24 but all the other HAFS tests are only 6 hours.
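As a back-of-the-envelope check on that suggestion, the sketch below scales wall-clock time linearly with forecast length. This is a rough assumption: it ignores start-up and I/O cost and any system slowdowns, so it is optimistic on a loaded machine. The helper name `estimate_wall_minutes` is hypothetical, and 24 minutes is used as a round figure for the jet timing quoted above.

```python
def estimate_wall_minutes(baseline_minutes, baseline_fhmax, new_fhmax):
    """Scale wall-clock minutes linearly with forecast hours (rough assumption)."""
    if baseline_fhmax <= 0 or new_fhmax <= 0:
        raise ValueError("forecast lengths must be positive")
    return baseline_minutes * new_fhmax / baseline_fhmax


# Roughly 24 wall-clock minutes for fhmax=24 on jet, per the log cited above.
estimates = {fhmax: estimate_wall_minutes(24.0, 24, fhmax) for fhmax in (12, 6)}
for fhmax, minutes in estimates.items():
    print(f"fhmax={fhmax}: about {minutes:.0f} minutes")
```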

@jkbk2004
Collaborator

jkbk2004 commented Apr 3, 2023

The current jet log shows this test taking almost 24 minutes. On Cheyenne.intel, it takes 21 minutes. I would suggest we reduce the fhmax for this test to 12 or even 6. It is currently 24 but all the other HAFS tests are only 6 hours.

A consistent 6 hours across the hafs tests makes sense. Same for regional_noquilt, right?

@DeniseWorthen
Collaborator Author

I think reducing fhmax may work sometimes, but I just ran the hafs_regional_datm_cdeps case on Jet w/ fhmax=12 and it didn't even finish 12 hours. It was a half-hour short when it ran out of wall clock. I think there are just system issues w/ Jet.

@jkbk2004
Collaborator

jkbk2004 commented Apr 3, 2023

On jet, these 5 cases keep hitting the time-out issue: regional_noquilt, hafs_regional_datm_cdeps, regional_wofs, regional_atmaq, regional_atmaq_faster. @DeniseWorthen @BrianCurtis-NOAA would turning off all regional aqm cases on jet be too much? Other than that, the Jet RT log is ready to push.

@DeniseWorthen
Collaborator Author

I already pushed a commit to remove regional_noquilt and hafs_regional_datm_cdeps for jet.intel. Now the other regional tests are timing-out?

@jkbk2004
Collaborator

jkbk2004 commented Apr 3, 2023

I already pushed a commit to remove regional_noquilt and hafs_regional_datm_cdeps for jet.intel. Now the other regional tests are timing-out?

Yes, I keep seeing those 3 other cases hit the time limit. I vote to turn those off if possible. We will create an issue and revisit them there.

@DeniseWorthen
Collaborator Author

I will push a commit to remove regional_wofs, regional_atmaq, regional_atmaq_faster from jet.intel.

DeniseWorthen and others added 2 commits April 3, 2023 15:34
* turn off regional_wofs, regional_atmaq and regional_atmaq_faster on jet.intel
@jkbk2004
Collaborator

jkbk2004 commented Apr 3, 2023

I am writing an issue to address the jet time-out cases. We can start the merging process. @DeniseWorthen @BrianCurtis-NOAA Can you go ahead and merge the CMEPS PR?

@jkbk2004
Collaborator

jkbk2004 commented Apr 3, 2023

Issue #1695 was created.

@jkbk2004 jkbk2004 self-requested a review April 3, 2023 20:42
@jkbk2004
Collaborator

jkbk2004 commented Apr 3, 2023

All set! @BrianCurtis-NOAA @SadeghTabas-NOAA please go ahead and approve the PR.

Labels
  • Baseline Updates (Current baselines will be updated.)
  • jenkins-ci (Jenkins CI: ORT build/test on docker container)
  • Ready for Commit Queue (The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.)
Projects
None yet
5 participants