Bugfix/cam fixes 3 #597
(severe and radar plots jobs)
I started commenting on the lines but realized it was in quite a few places so leaving a general comment here for a change. In the ex-scripts, there is the copying of the child directories output to a main output directory using |
I see the prep testing involves the severe dev drivers but nothing for precip. The precip prep jobs are among the jobs suspected of using MPMD, which this PR says it is addressing (#551). Are these jobs using MPMD? |
@malloryprow You're right, and yes these jobs use MPMD. I must have missed them when grepping for the mpi command. I'll add and test the fix for those five jobs and then update this thread. |
Yes, just added that change in 01928c7 |
Thank you! |
(precip prep jobs)
stats - radar
Submitted jobs with
jevs_cam_hireswarw_radar_stats
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswarw_radar_stats.o160002551
jevs_cam_hireswarwmem2_radar_stats
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswarwmem2_radar_stats.o160002553
jevs_cam_hireswfv3_radar_stats
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswfv3_radar_stats.o160002554
jevs_cam_hrrr_radar_stats
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hrrr_radar_stats.o160002561
jevs_cam_namnest_radar_stats
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_namnest_radar_stats.o160002563 |
@malloryprow All of the snowfall and radar jobs completed cleanly and the output looks normal: 👍 No other concerns from me about the stats jobs tested in this PR |
Resources look good for snowfall, but I'd like to adjust the radar ones. They are all requesting 500 GB of memory, which takes the whole node's memory while only using ncpus=3, yet the most any job uses is ~21 GB. If we set it to 50 GB, we can set place=vscatter too so other jobs in the queue can share the node. I don't want to take the whole node if it isn't needed! |
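For reference, the requested radar resource change could look roughly like the PBS directive below. This is only a sketch: the thread gives the values (ncpus=3, 500 GB reduced to 50 GB, place=vscatter), but the exact directive layout in the radar drivers is an assumption here.

```shell
# Sketch of the adjusted resource request for the radar stats drivers.
# Assumed layout; only ncpus=3, mem=50GB, and place=vscatter come from the thread.
# Before (per the review comment): ncpus=3 with mem=500GB, which claims the whole node.
# After: a 50 GB request with vscatter placement so other jobs can share the node.
#PBS -l place=vscatter,select=1:ncpus=3:mem=50GB
```

With a reported peak usage of ~21 GB, a 50 GB request still leaves roughly twice the observed footprint as headroom.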
Thanks @malloryprow! Done 👍 |
plots
Submitted jobs with vhr=00. COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/evs/v2.0/stats/cam.
jevs_cam_headline_plots
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_headline_plots.o160010034
jevs_cam_grid2obs_last31days_plots
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_grid2obs_plots_last31days.o160010036
jevs_cam_grid2obs_last90days_plots
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_grid2obs_plots_last90days.o160010037
jevs_cam_precip_last31days_plots
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_precip_plots_last31days.o160010039
jevs_cam_precip_last90days_plots
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_precip_plots_last90days.o160010040
jevs_cam_snowfall_plots
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/plots/cam/jevs_cam_snowfall_plots.o160010046 |
I'm looking at /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_headline_plots.160010034.cbqs01. Are the jobs each writing a log file to /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_cam_headline_plots.160010034.cbqs01/headline/out? |
Oh wait, are they getting copied from their working directory to that location via this? exevs_cam_headline_plots.sh lines 77-81.
|
@malloryprow yes that's right, that block of code copies the log files and graphics from "workdirs" to the main directory after plotting is complete. Child processes write to both $DATA/$VERIF_CASE/out/workdirs/jobX and $DATA/$VERIF_CASE/data/etc/etc/. The out/ directories are needed in main and are each copied back at the end of all processes via that block of code. The data/ directories are temp directories and are not copied back. |
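To illustrate the copy-back pattern described above, here is a minimal, self-contained sketch. It is not the actual block from exevs_cam_headline_plots.sh; the temporary `DATA` directory, the `job0`/`job1` names, and the `logs/` subdirectory are hypothetical stand-ins.

```shell
# Hypothetical stand-ins for the real $DATA and $VERIF_CASE values
DATA=$(mktemp -d)
VERIF_CASE=headline

# Simulate two child processes, each writing into its own working directory
for job in job0 job1; do
    mkdir -p "$DATA/$VERIF_CASE/out/workdirs/$job/logs"
    echo "log from $job" > "$DATA/$VERIF_CASE/out/workdirs/$job/logs/$job.log"
done

# After all child processes finish, copy each child's out/ contents
# back into the main out/ directory
for child in "$DATA/$VERIF_CASE/out/workdirs"/*/; do
    cp -r "${child}." "$DATA/$VERIF_CASE/out/"
done

# The main directory now holds the logs from every child
ls "$DATA/$VERIF_CASE/out/logs"
```

Because each child only ever writes inside its own `workdirs/jobX` tree, parallel processes do not collide, and the main `out/` directory is populated in a single serial step at the end.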
@malloryprow The following completed cleanly and the output is as expected: The following are still running: Note on recent commits: I adjusted some ecf resource configs to match dev. I also added a mem spec for jevs_cam_headline_plots. |
@malloryprow The remaining grid2obs plots jobs have completed cleanly and the output files look good: Note: there was a small Cgroup mem limit exceedance on the last31days job. I increased the memory allocation closer to what is allocated for the last90days job, so it should be good! 👍 No other concerns from me about any of the test jobs listed for this PR |
Great news! Since this PR addresses the NCO MPMD Bugzilla, can you confirm that this fixes that for all the jobs in #551? If so, that can be closed when this is merged. |
Barring any additional issues, I approve this PR. Thank you for your diligence in getting this important PR in and for adjusting resources as needed.
Hi @malloryprow I can confirm this fixes the MPMD Bugzilla in all of the listed jobs except for the following:
I didn't find that these jobs were running any processes in parallel! |
Thanks for checking! We may want to fix the resources on those: stats/cam/jevs_cam_hireswarw_severe_stats.sh: They should be |
With the resource change, I'll want to test those jobs when we get WCOSS2 back. |
Ah right, I've changed the ncpus to one. Yes, agreed, we should test those jobs. I'm adding them to the PR instructions. Thanks |
jevs_cam_hireswarw_severe_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswarw_severe_stats_00.o160997877
jevs_cam_hireswarwmem2_severe_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswarwmem2_severe_stats_00.o160997869
jevs_cam_hireswfv3_severe_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hireswfv3_severe_stats_00.o160997896
jevs_cam_hrrr_severe_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_hrrr_severe_stats_00.o160997917
jevs_cam_namnest_severe_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr597/EVS/dev/drivers/scripts/stats/cam/jevs_cam_namnest_severe_stats_00.o160997925 |
Hi @malloryprow. All jobs ran as expected and output files look normal! |
Awesome! I believe that this PR is all wrapped up! |
Thanks for these changes and for working with me on adjustments, @MarcelCaron-NOAA! Cross off anything this fixes on the Fixes and Additions document! |
Note to developers: You must use this PR template!
Description of Changes
Developer Questions and Checklist
No
No
No
${USER} where necessary throughout the code.
HOMEevs are removed from the code.
dev/drivers/scripts or dev/modulefiles have been made in the corresponding ecf/scripts and ecf/defs/evs-nco.def?
Testing Instructions
(1) Set up jobs
HOMEevs - set to your test EVS directory
COMIN - set to /lfs/h2/emc/vpppg/noscrub/emc.vpppg/${NET}_beta5/$evs_ver_2d
COMOUT - set to your test output directory
SENDMAIL - (optional) set to "NO"
DATAROOT - (optional) set to your test DATAROOT directory
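As a rough illustration, the settings above might look like this inside a driver script. The `/path/to/...` values are placeholders to replace with your own directories, and the defaults for `NET` and `evs_ver_2d` are assumptions (the real drivers set them elsewhere).

```shell
# Assumed defaults for illustration only; the real drivers define these elsewhere
NET=${NET:-evs}
evs_ver_2d=${evs_ver_2d:-v2.0}

export HOMEevs=/path/to/your/test/EVS            # placeholder: your test EVS directory
export COMIN=/lfs/h2/emc/vpppg/noscrub/emc.vpppg/${NET}_beta5/$evs_ver_2d
export COMOUT=/path/to/your/test/output          # placeholder: your test output directory
export SENDMAIL="NO"                             # optional
export DATAROOT=/path/to/your/test/DATAROOT      # optional placeholder

# Print the resolved values as a quick sanity check before submitting
echo "HOMEevs=$HOMEevs COMIN=$COMIN COMOUT=$COMOUT"
```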
(2) Running jobs
I recommend testing the following jobs:
[Total: 10 prep jobs; 26 stats jobs; 6 plots jobs]
qsub -v vhr=21,VDATE=$(date -d "-3 days" +"%Y%m%d") $driver
qsub -v vhr=00 $driver
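If you are submitting many drivers, a small wrapper can help you preview the commands first. This is a sketch only: the `submit` function and `DRY_RUN` switch are hypothetical helpers, and the driver file name is assumed from the log file paths in this thread. It uses GNU `date`, as the commands above do.

```shell
# DRY_RUN=1 (the default here) prints each qsub command instead of submitting it
DRY_RUN=${DRY_RUN:-1}

# Validation date three days back, matching the VDATE expression above
VDATE=$(date -d "-3 days" +"%Y%m%d")

# Hypothetical helper: pass qsub arguments followed by the driver script
submit() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "qsub $*"
    else
        qsub "$@"
    fi
}

# Example calls; the driver name is an assumed file name
driver=jevs_cam_headline_plots.sh
submit -v "vhr=21,VDATE=$VDATE" "$driver"
submit -v vhr=00 "$driver"
```

Set `DRY_RUN=0` once the printed commands look right.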
(3) Checking jobs