Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rt hires #1471

Merged
merged 30 commits into from
Nov 10, 2022
Merged

Rt hires #1471

merged 30 commits into from
Nov 10, 2022

Conversation

RatkoVasic-NOAA
Copy link
Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA commented Oct 25, 2022

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR
    are specified below.
    Issue Replacing regional regression tests with more realistic ones #1461

  • Results for one or more of the regression tests change and the reasons for the changes are understood and explained below.
    New tests make new results.
    Should create new develop-2022XXXX directory

  • New input data is required and created on Hera machine. Should be resynched to other platforms.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsibility to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

Replaced old regional regression tests with new, using latest configuration from LAMDX parallel. It is in 3km resolution over small domain.
@DusanJovic-NOAA removed variable "calendar" from namelists (model_configure*.IN) and replaced calendar variable values in input directory in files coupler.res from 2 --> 3
(Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)

Issue(s) addressed

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • acorn.intel
  • opnReqTest for newly added/changed feature
  • CI

Dependencies

No dependences

@DeniseWorthen DeniseWorthen added Baseline Updates Current baselines will be updated. New Input Data Req'd This PR requires new data to be sync across platforms labels Oct 25, 2022
@DusanJovic-NOAA
Copy link
Collaborator

DusanJovic-NOAA commented Oct 26, 2022

I think we can remove the following directories from the new input-data:

FV3_input_data_gsd/FV3_input_data_C96_with_aerosols/
FV3_input_data_regional_esg/
FV3_input_data_sar/
GOCART/p7/
GOCART/ExtData/
WW3_input_data_20211113/
WW3_input_data_20220418/

most of the files from AQM/NEXUS - confirmed by @BrianCurtis-NOAA only NEXUS_Expt.nc is needed

@RatkoVasic-NOAA
Copy link
Collaborator Author

I an remove those directories from my input-data-20221101 directory on Hera and run full tests.
Who can confirm that directories GOCART and WW3 are not in use?

@RatkoVasic-NOAA
Copy link
Collaborator Author

I removed those directories and ran tests on Hera. Tests PASSED.

Copy link
Collaborator

@DusanJovic-NOAA DusanJovic-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/rt.sh file permissions has been changed from 755 to 644. Please revert.

@RatkoVasic-NOAA
Copy link
Collaborator Author

tests/rt.sh file permissions has been changed from 755 to 644. Please revert.

I guess it happens when you solve conflict using github's web based tools.

@jkbk2004
Copy link
Collaborator

jkbk2004 commented Nov 4, 2022

@RatkoVasic-NOAA I may start rsync if input-data-20221101 is ready on hera.

@RatkoVasic-NOAA
Copy link
Collaborator Author

@jkbk2004 I see two new directories in input-data-20220414 :
DATM_GSWP3_input_data
NOAHMP_IC
Are they added recently?
Should I add those to input-data-20221101?

@RatkoVasic-NOAA
Copy link
Collaborator Author

@jkbk2004
OK, I see they are used in datm_cdeps_lnd_gswp3 and datm_cdeps_lnd_gswp3_rst.
I'll copy those directories on Hera and WCOSS2, and you can rsync to othermachines.

@uturuncoglu
Copy link
Collaborator

@RatkoVasic-NOAA Yes, those are added as a part of external land component PR.

@DeniseWorthen DeniseWorthen mentioned this pull request Nov 7, 2022
@DusanJovic-NOAA
Copy link
Collaborator

With the above change in fv3_slurm.IN_jet the regional_2threads run finished in ~260sec.

@DusanJovic-NOAA
Copy link
Collaborator

I pushed the change to Ratko's branch. Please rerun the test on Jet.

@jkbk2004
Copy link
Collaborator

jkbk2004 commented Nov 9, 2022

I pushed the change to Ratko's branch. Please rerun the test on Jet.

@DusanJovic-NOAA Great! do you think --npernode will same thing for cheyenne's mpiexec_mpt? let me try.

@BrianCurtis-NOAA
Copy link
Collaborator

I pushed the change to Ratko's branch. Please rerun the test on Jet.

@DusanJovic-NOAA is this something we should be doing on all machines we run that test on?

@DusanJovic-NOAA
Copy link
Collaborator

I pushed the change to Ratko's branch. Please rerun the test on Jet.

@DusanJovic-NOAA is this something we should be doing on all machines we run that test on?

We can test this change on other machines that are using slurm scheduler.

@DusanJovic-NOAA
Copy link
Collaborator

I pushed the change to Ratko's branch. Please rerun the test on Jet.

@DusanJovic-NOAA is this something we should be doing on all machines we run that test on?

We can test this change on other machines that are using slurm scheduler.

I added --cpus-per-task=@[THRD] to hera job card template and ran 4 tests, control, control_2threads, regional and regional_2threads, and all tests passed.

@jkbk2004
Copy link
Collaborator

I am wondering one issue about the setup of TASKS/INPES/JNPES of regional_control/regional_2thread. On cheyenne, regional_control job_card sets #PBS -l select=4:ncpus=36:mpiprocs=36 and mpiexec_mpt -p %g: -np 144 ./fv3.exe and
regional_2thread job_card sets #PBS -l select=7:ncpus=18:mpiprocs=18 and mpiexec_mpt -p %g: -np 126 ./fv3.exe. This doesn't makes a sense. Another problem is both cases set with export TASKS=120. The numbers don't match. @RatkoVasic-NOAA @DusanJovic-NOAA @BrianCurtis-NOAA It could be a bug on setting the cases on cheyenne.

@RatkoVasic-NOAA
Copy link
Collaborator Author

I don't know who set up cheyenne parameters originally. Most of us here don't have access to that machine.
Yes, those numbers look odd.
Can we just skip that test on cheyenne in rt.conf?

@jkbk2004
Copy link
Collaborator

I agree to turn off regional_2thread on cheyenne and move on for merging. I am going to create an issue about this. If all agrees, then probably we can start merging tomorrow morning. @DusanJovic-NOAA @BrianCurtis-NOAA what do you think?

@DusanJovic-NOAA
Copy link
Collaborator

DusanJovic-NOAA commented Nov 10, 2022

I am wondering one issue about the setup of TASKS/INPES/JNPES of regional_control/regional_2thread. On cheyenne, regional_control job_card sets #PBS -l select=4:ncpus=36:mpiprocs=36 and mpiexec_mpt -p %g: -np 144 ./fv3.exe and regional_2thread job_card sets #PBS -l select=7:ncpus=18:mpiprocs=18 and mpiexec_mpt -p %g: -np 126 ./fv3.exe. This doesn't makes a sense. Another problem is both cases set with export TASKS=120. The numbers don't match. @RatkoVasic-NOAA @DusanJovic-NOAA @BrianCurtis-NOAA It could be a bug on setting the cases on cheyenne.

The number of processes (-np) is calculated by multiplying number of nodes with number_of_tasks_per_node. So for regional_control that is 4*36 = 144, for regional_2threads it is 7*18 = 126.

wcoss2, is also using PBS. Try to set some of the pbs options specified there, specifically 'place=vscatter' and mpiexec options '-ppn @[TPN] -depth @[THRD]'. I do not know if mpiexec_mpt ob cheyenne is the same as mpiexec on wcoss.

I also do not have access to cheyenne.

@BrianCurtis-NOAA
Copy link
Collaborator

Let me give it a go on Cheyenne before we write off the RT on Cheyenne.

@BrianCurtis-NOAA
Copy link
Collaborator

diff --git a/tests/fv3_conf/fv3_qsub.IN_cheyenne b/tests/fv3_conf/fv3_qsub.IN_cheyenne
index 2cc648b8..b87293ce 100644
--- a/tests/fv3_conf/fv3_qsub.IN_cheyenne
+++ b/tests/fv3_conf/fv3_qsub.IN_cheyenne
@@ -4,7 +4,7 @@
 #PBS -N @[JBNME]
 #PBS -A @[ACCNR]
 #PBS -q @[QUEUE]
-#PBS -l select=@[NODES]:ncpus=@[TPN]:mpiprocs=@[TPN]
+#PBS -l select=@[NODES]:ncpus=@[TPN]:mpiprocs=@[TPN]:ompthreads=@[THRD]
 #PBS -l walltime=00:@[WLCLK]:00
 
 set -eux
@@ -34,7 +34,7 @@ export ESMF_RUNTIME_PROFILE_OUTPUT="SUMMARY"
 # Avoid job errors because of filesystem synchronization delays
 sync && sleep 1
 
-mpiexec_mpt -p %g: -np @[TASKS] ./fv3.exe
+mpiexec_mpt -p %g: -np @[TASKS] omplace ./fv3.exe
 
 echo "Model ended:    " `date`
 echo -n " $( date +%s )," >> job_timestamp.txt

@BrianCurtis-NOAA
Copy link
Collaborator

@jkbk2004 Please try those changes with a few tests and re-run the regional_2threads as well for the logs.

@jkbk2004
Copy link
Collaborator

@jkbk2004 Please try those changes with a few tests and re-run the regional_2threads as well for the logs.

sure!

@jkbk2004
Copy link
Collaborator

With @BrianCurtis-NOAA contribution of omplace, regional_2threads passes ok now. @DusanJovic-NOAA @RatkoVasic-NOAA @ChunxiZhang-NOAA this pr is ready.

@jkbk2004 jkbk2004 merged commit ad8da86 into ufs-community:develop Nov 10, 2022
@RatkoVasic-NOAA RatkoVasic-NOAA deleted the rt-hires branch November 10, 2022 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Baseline Updates Current baselines will be updated. jenkins-ci Jenkins CI: ORT build/test on docker container New Input Data Req'd This PR requires new data to be sync across platforms
Projects
None yet
7 participants