
Bugfix/cam fixes 3 #597

Merged 26 commits into NOAA-EMC:develop on Nov 12, 2024

Conversation

MarcelCaron-NOAA
Contributor

@MarcelCaron-NOAA commented Oct 25, 2024

Note to developers: You must use this PR template!

Description of Changes

Please include a summary of the changes and the related GitHub issue(s). Please also include relevant motivation and context.

  • Bugfix: EVS Bugzilla 1547: all MPMD processes now write to separate working directories, and output is merged into the main DATA directory after parallelization completes. Outcome: safer parallelization and alignment with NCO code standards.
  • Bugfix: a bug causing duplicate lines in final stats files. Outcome: substantially reduced stats file sizes.
  • Bugfix: incorrect HRRR TCDC field configuration. Outcome: an accurate TCDC stat.
  • Bugfix: Various minor resource adjustments
  • Enhancement: differentiated HRRR ASNOW and SNOD verification (already done in the upstream EVS-RRFS release branch)
  • Enhancement: added CAPE and ASNOW graphics
  • Enhancement: adjusted 3-h precipitation and 6-h snowfall (WEASD, SNOD, ASNOW) graphics so that 24-hour periods are covered more accurately, and limited extended-forecast verification to models with leads out to F60.

Developer Questions and Checklist

  • Is this a high priority PR? If so, why and is there a date it needs to be merged by?

No

  • Do you have any planned upcoming annual leave/PTO?

No

  • Are there any changes needed for when the jobs are supposed to run?

No

  • The code changes follow NCO's EE2 Standards.
  • Developer's name is removed throughout the code, with ${USER} used where necessary.
  • References to the feature branch for HOMEevs are removed from the code.
  • J-Job environment variables, COMIN and COMOUT directories, and output follow what has been defined for EVS.
  • Jobs over 15 minutes in runtime have restart capability.
  • If applicable, changes in the dev/drivers/scripts or dev/modulefiles have been made in the corresponding ecf/scripts and ecf/defs/evs-nco.def.
  • Jobs contain the appropriate file checking and don't run METplus for any missing data.
  • Code is using METplus wrappers structure and not calling MET executables directly.
  • Log is free of any ERRORs or WARNINGs.

Testing Instructions

Please include testing instructions for the PR assignee. Include all relevant input datasets needed to run the tests.

(1) Set up jobs

  • symlink the EVS_fix directory locally as "fix"
  • In each driver script, edit the following environment variables:
    HOMEevs - set to your test EVS directory
    COMIN - set to /lfs/h2/emc/vpppg/noscrub/emc.vpppg/${NET}_beta5/$evs_ver_2d
    COMOUT - set to your test output directory
    SENDMAIL - (optional) set to "NO"
    DATAROOT - (optional) set to your test DATAROOT directory
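As a sketch, the driver-header edits above might look like this (the HOMEevs, COMOUT, and DATAROOT paths are placeholders; NET and evs_ver_2d normally come from the driver itself and are stubbed here only so the snippet is self-contained):

```shell
# Illustrative edits to a dev driver script header.
# NET and evs_ver_2d are stubbed for this sketch; in a real driver
# they are already defined before COMIN is set.
export NET=evs
export evs_ver_2d=v2.0
export HOMEevs=/path/to/your/test/EVS
export COMIN=/lfs/h2/emc/vpppg/noscrub/emc.vpppg/${NET}_beta5/${evs_ver_2d}
export COMOUT=/path/to/your/test/output
export SENDMAIL=NO                            # optional
export DATAROOT=/path/to/your/test/dataroot   # optional
```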

(2) Running jobs

I recommend testing the following jobs:

jevs_cam_hireswarw_severe_prep
jevs_cam_hireswarwmem2_severe_prep
jevs_cam_hireswfv3_severe_prep
jevs_cam_hrrr_severe_prep
jevs_cam_namnest_severe_prep
jevs_cam_hireswarw_precip_prep
jevs_cam_hireswarwmem2_precip_prep
jevs_cam_hireswfv3_precip_prep
jevs_cam_hrrr_precip_prep
jevs_cam_namnest_precip_prep
jevs_cam_hireswarw_grid2obs_stats
jevs_cam_hireswarwmem2_grid2obs_stats
jevs_cam_hireswfv3_grid2obs_stats
jevs_cam_hrrr_grid2obs_stats
jevs_cam_namnest_grid2obs_stats
jevs_cam_hireswarw_precip_stats
jevs_cam_hireswarwmem2_precip_stats
jevs_cam_hireswfv3_precip_stats
jevs_cam_hrrr_precip_stats
jevs_cam_namnest_precip_stats
jevs_cam_hireswarw_snowfall_stats
jevs_cam_hireswarwmem2_snowfall_stats
jevs_cam_hireswfv3_snowfall_stats
jevs_cam_hrrr_snowfall_stats
jevs_cam_namnest_snowfall_stats
jevs_cam_hireswarw_radar_stats
jevs_cam_hireswarwmem2_radar_stats
jevs_cam_hireswfv3_radar_stats
jevs_cam_hrrr_radar_stats
jevs_cam_namnest_radar_stats
jevs_cam_hireswarw_severe_stats
jevs_cam_hireswarwmem2_severe_stats
jevs_cam_hireswfv3_severe_stats
jevs_cam_hrrr_severe_stats
jevs_cam_href_severe_stats
jevs_cam_namnest_severe_stats
jevs_cam_headline_plots
jevs_cam_grid2obs_last31days_plots
jevs_cam_grid2obs_last90days_plots
jevs_cam_precip_last31days_plots
jevs_cam_precip_last90days_plots
jevs_cam_snowfall_plots

[Total: 10 prep jobs; 26 stats jobs; 6 plots jobs]

  • All precip stats driver scripts can be submitted using qsub -v vhr=21,VDATE=$(date -d "-3 days" +"%Y%m%d") $driver
  • All other driver scripts can be submitted using qsub -v vhr=00 $driver
  • All jobs can be submitted at any time
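The two submission patterns above can be sketched as a dry run (driver names here are illustrative placeholders; qsub and GNU date are assumed available, as on WCOSS2):

```shell
# Dry run: print the qsub command each driver would be submitted with.
# Precip stats drivers use vhr=21 and a VDATE three days back; all other
# drivers use vhr=00. Driver names below are placeholders.
VDATE=$(date -d "-3 days" +"%Y%m%d")
for driver in jevs_cam_hrrr_precip_stats jevs_cam_namnest_precip_stats; do
    echo "qsub -v vhr=21,VDATE=${VDATE} dev/drivers/scripts/stats/cam/${driver}.sh"
done
for driver in jevs_cam_hrrr_grid2obs_stats jevs_cam_hrrr_snowfall_stats; do
    echo "qsub -v vhr=00 dev/drivers/scripts/stats/cam/${driver}.sh"
done
```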

(3) Checking jobs

  • Log files should be checked by the developer for the following keywords:
check="FATAL\|WARNING\|error\|Killed\|Cgroup\|argument expected\|No such file\|cannot\|failed\|unexpected\|exceeded"
grep "$check" $logfile
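Assuming the logs follow the usual <jobname>.o<jobid> naming, the per-file check can be wrapped in a small loop that flags only the logs needing review (a sketch, not part of the PR; the keyword list matches the one above):

```shell
# Scan PBS logs in a directory for failure keywords and report only the
# files that need review.
check="FATAL\|WARNING\|error\|Killed\|Cgroup\|argument expected\|No such file\|cannot\|failed\|unexpected\|exceeded"

scan_logs() {
    for logfile in "$1"/*.o*; do
        [ -e "$logfile" ] || continue   # glob matched nothing
        if grep -q "$check" "$logfile"; then
            echo "REVIEW: $logfile"
        fi
    done
}

scan_logs .
```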

@malloryprow malloryprow self-assigned this Oct 28, 2024
@malloryprow malloryprow added the enhancement New feature or request label Oct 28, 2024
@malloryprow malloryprow added this to the EVS v2.0.0 milestone Oct 28, 2024
@malloryprow
Contributor

I started commenting on individual lines but realized the same change applies in quite a few places, so I'm leaving a general comment here instead.

In the ex-scripts, the child directories' output is copied to a main output directory using cp -ru. Could we use cp -ruv instead? The -v is the verbose option.

@malloryprow
Contributor

I see the testing for the prep involves the severe dev drivers but nothing for the precip. The precip prep jobs are suspected of using MPMD, which this PR says it is addressing (#551). Are these jobs using MPMD?

@MarcelCaron-NOAA
Contributor Author

I see the testing for the prep involves the severe dev drivers but nothing for the precip. The precip prep jobs are suspected of using MPMD, which this PR says it is addressing (#551). Are these jobs using MPMD?

@malloryprow You're right, and yes these jobs use MPMD. I must have missed them when grepping for the mpi command. I'll add and test the fix for those five jobs and then update this thread.

@MarcelCaron-NOAA
Contributor Author

I started commenting on individual lines but realized the same change applies in quite a few places, so I'm leaving a general comment here instead.

In the ex-scripts, the child directories' output is copied to a main output directory using cp -ru. Could we use cp -ruv instead? The -v is the verbose option.

Yes, just added that change in 01928c7

@malloryprow
Contributor

Thank you!

@malloryprow
Contributor

stats - radar

Submitted jobs with vhr=00. COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/evs/v2.0/stats/cam.

jevs_cam_hireswarw_radar_stats

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswarw_radar_stats.o160002551
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_hireswarw_radar_stats.160002551.cbqs01

jevs_cam_hireswarwmem2_radar_stats

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswarwmem2_radar_stats.o160002553
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_hireswarwmem2_radar_stats.160002553.cbqs01

jevs_cam_hireswfv3_radar_stats

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswfv3_radar_stats.o160002554
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_hireswfv3_radar_stats.160002554.cbqs01

jevs_cam_hrrr_radar_stats

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hrrr_radar_stats.o160002561
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_hrrr_radar_stats.160002561.cbqs01

jevs_cam_namnest_radar_stats

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_namnest_radar_stats.o160002563
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_namnest_radar_stats.160002563.cbqs01

@MarcelCaron-NOAA
Contributor Author

@malloryprow All of the snowfall and radar jobs completed cleanly and the output looks normal:
✔️ jevs_cam_hireswarw_snowfall_stats
✔️ jevs_cam_hireswarwmem2_snowfall_stats
✔️ jevs_cam_hireswfv3_snowfall_stats
✔️ jevs_cam_hrrr_snowfall_stats
✔️ jevs_cam_namnest_snowfall_stats
✔️ jevs_cam_hireswarw_radar_stats
✔️ jevs_cam_hireswarwmem2_radar_stats
✔️ jevs_cam_hireswfv3_radar_stats
✔️ jevs_cam_hrrr_radar_stats
✔️ jevs_cam_namnest_radar_stats

👍 No other concerns from me about stats jobs tested in this PR

@malloryprow
Contributor

Resources look good for snowfall, but I'd like to adjust the radar ones.

They are all requesting 500 GB of memory, so each job takes a whole node for memory while only using ncpus=3, and the most a job actually uses is ~21 GB. If we set it to 50 GB we can also set place=vscatter so other jobs in the queue can share the node. I don't want to take the whole node if it isn't needed!
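In PBS driver terms, the resource change described here would look roughly like this (the values come from this comment; the exact directive layout in the EVS drivers may differ):

```shell
# Before: requesting 500 GB effectively reserves a whole node
##PBS -l select=1:ncpus=3:mem=500GB
# After: request only what the job actually uses (~21 GB peak, with
# headroom) and use scattered placement so the node can be shared
#PBS -l select=1:ncpus=3:mem=50GB
#PBS -l place=vscatter
```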

@MarcelCaron-NOAA
Contributor Author

Thanks @malloryprow! Done 👍

@malloryprow
Contributor

plots

Submitted jobs with vhr=00. COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/evs/v2.0/stats/cam.

jevs_cam_headline_plots

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_headline_plots.o160010034
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_headline_plots.160010034.cbqs01

jevs_cam_grid2obs_last31days_plots

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_grid2obs_plots_last31days.o160010036
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_grid2obs_plots_last31days.160010036.cbqs01

jevs_cam_grid2obs_last90days_plots

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_grid2obs_plots_last90days.o160010037
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_grid2obs_plots_last90days.160010037.cbqs01

jevs_cam_precip_last31days_plots

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_precip_plots_last31days.o160010039
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_precip_plots_last31days.160010039.cbqs01

jevs_cam_precip_last90days_plots

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_precip_plots_last90days.o160010040
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_precip_plots_last90days.160010040.cbqs01

jevs_cam_snowfall_plots

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_snowfall_plots.o160010046
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_snowfall_plots.160010046.cbqs01

@malloryprow
Contributor

I'm looking at /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_headline_plots.160010034.cbqs01. Are the jobs each writing a log file to /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_headline_plots.160010034.cbqs01/headline/out?

@malloryprow
Contributor

I'm looking at /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_headline_plots.160010034.cbqs01. Are the jobs each writing a log file to /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_headline_plots.160010034.cbqs01/headline/out?

Oh wait, are they getting copied from their working directory to that location via this block in exevs_cam_headline_plots.sh (lines 77-81)?

# Copy Plots Output to Main Directory
for CHILD_DIR in ${DATA}/${VERIF_CASE}/out/workdirs/*; do
    cp -ruv $CHILD_DIR/* ${DATA}/${VERIF_CASE}/out/.
    export err=$?; err_chk
done

@MarcelCaron-NOAA
Contributor Author

@malloryprow yes that's right, that block of code copies the log files and graphics from "workdirs" to the main directory after plotting is complete.

Child processes write to both $DATA/$VERIF_CASE/out/workdirs/jobX and $DATA/$VERIF_CASE/data/etc/etc/. The out/ directories are needed in the main directory and are each copied back after all processes finish via that block of code. The data/ directories are temporary and are not copied back.

@MarcelCaron-NOAA
Contributor Author

@malloryprow The following completed cleanly and the output is as expected:
✔️ jevs_cam_headline_plots
✔️ jevs_cam_precip_plots_last31days
✔️ jevs_cam_precip_plots_last90days
✔️ jevs_cam_snowfall_plots

The following are still running:
⏳ jevs_cam_grid2obs_plots_last31days
⏳ jevs_cam_grid2obs_plots_last90days

Note on recent commits: I adjusted some ecf resource configs to match dev. I also added a mem spec for jevs_cam_headline_plots.

@MarcelCaron-NOAA
Contributor Author

@malloryprow The remaining grid2obs plots jobs have completed cleanly and output files look good:
✔️ jevs_cam_grid2obs_plots_last31days
✔️ jevs_cam_grid2obs_plots_last90days

Note: there was a small Cgroup memory limit exceedance on the last31days job. I increased its memory allocation closer to what is allocated for the last90days job, so it should be good!

👍 no other concerns from me about any of the test jobs listed for this PR

@malloryprow
Contributor

Great news!

Since this PR addresses the NCO MPMD Bugzilla, can you confirm that it fixes that for all the jobs in #551? If so, that issue can be closed when this is merged.

@AndrewBenjamin-NOAA left a comment

Barring any additional issues, I approve this PR. Thank you for your diligence in getting this important PR in and for adjusting resources as needed.

@MarcelCaron-NOAA
Contributor Author

MarcelCaron-NOAA commented Nov 4, 2024

Hi @malloryprow I can confirm this fixes the MPMD Bugzilla in all of the listed jobs except for the following:

stats/cam/jevs_cam_hireswarw_severe_stats.sh
stats/cam/jevs_cam_hireswarwmem2_severe_stats.sh
stats/cam/jevs_cam_hireswfv3_severe_stats.sh
stats/cam/jevs_cam_hrrr_severe_stats.sh
stats/cam/jevs_cam_namnest_severe_stats.sh

I didn't find that these jobs were running any processes in parallel!

@malloryprow
Contributor

Thanks for checking! We may want to fix the resources on those:

stats/cam/jevs_cam_hireswarw_severe_stats.sh: select=1:ncpus=5
stats/cam/jevs_cam_hireswarwmem2_severe_stats.sh: select=1:ncpus=5
stats/cam/jevs_cam_hireswfv3_severe_stats.sh: select=1:ncpus=5
stats/cam/jevs_cam_hrrr_severe_stats.sh: select=1:ncpus=5
stats/cam/jevs_cam_namnest_severe_stats.sh: select=1:ncpus=5

They should be select=1:ncpus=1 if they aren't running anything in parallel.

@malloryprow
Contributor

With the resource change, I'll want to test those jobs when we get WCOSS2 back.

@MarcelCaron-NOAA
Contributor Author

Ah, right. I've changed ncpus to 1.

Yes agreed we should test those jobs. I'm adding them to the PR instructions. Thanks

@malloryprow
Contributor

malloryprow commented Nov 12, 2024

jevs_cam_hireswarw_severe_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswarw_severe_stats_00.o160997877
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_hireswarw_severe_stats_00.160997877.cbqs01

jevs_cam_hireswarwmem2_severe_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswarwmem2_severe_stats_00.o160997869
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_hireswarwmem2_severe_stats_00.160997869.cbqs01

jevs_cam_hireswfv3_severe_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswfv3_severe_stats_00.o160997896
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_hireswfv3_severe_stats_00.160997896.cbqs01

jevs_cam_hrrr_severe_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hrrr_severe_stats_00.o160997917
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_hrrr_severe_stats_00.160997917.cbqs01

jevs_cam_namnest_severe_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_namnest_severe_stats_00.o160997925
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_namnest_severe_stats_00.160997925.cbqs01

@MarcelCaron-NOAA
Contributor Author

Hi @malloryprow. All jobs ran as expected and the output files look normal!
✔️ jevs_cam_hireswarw_severe_stats
✔️ jevs_cam_hireswarwmem2_severe_stats
✔️ jevs_cam_hireswfv3_severe_stats
✔️ jevs_cam_hrrr_severe_stats
✔️ jevs_cam_namnest_severe_stats

@malloryprow
Contributor

Awesome! I believe that this PR is all wrapped up!

@malloryprow malloryprow merged commit c2451c5 into NOAA-EMC:develop Nov 12, 2024
@malloryprow
Contributor

Thanks for these changes and for working with me on adjustments, @MarcelCaron-NOAA! Cross off anything on the Fixes and Additions document that this fixes!

Labels: bug (Something isn't working), enhancement (New feature or request), NCO Bugzilla
Development

Successfully merging this pull request may close these issues.

cam - det: Address Bugzilla 1547 - MPMD processes share the same working directory
4 participants