C192L127 forecast hung with corrupted RESTART tiles (Hera) #709
Comments
@RussTreadon-NOAA @HaixiaLiu-NOAA In order to implement a complete
Thanks, @aerorahul, for replying to this issue. It is not clear whether the file corruption points to a machine (Hera) issue, a supporting library, or something in the forecast model. I agree: as this case demonstrates, file existence is an insufficient validator. Perhaps UFS modelers have ideas for an efficient and computationally inexpensive validator. Please feel free to close this issue if resolution lies outside g-w.
The issue is specific to a RESTART tile, but should be resolved with a general file validator.
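To make the "computationally inexpensive validator" idea concrete, here is a minimal sketch of a check that goes one step beyond file existence. It assumes the RESTART files are netCDF (classic or netCDF-4/HDF5) and that the file signature sits at byte offset 0; the function name `check_restart_file` and the `min_bytes` threshold are hypothetical, not part of g-w or the forecast model.

```python
import os

# File signatures: netCDF classic / 64-bit offset / CDF-5, and HDF5 (used by netCDF-4).
_NETCDF_CLASSIC = (b"CDF\x01", b"CDF\x02", b"CDF\x05")
_HDF5 = b"\x89HDF\r\n\x1a\n"


def check_restart_file(path, min_bytes=8):
    """Return (ok, reason) for one file. Hypothetical helper, not a g-w API.

    Checks existence, a minimum size, and the leading signature bytes only.
    """
    if not os.path.isfile(path):
        return False, "missing"
    size = os.path.getsize(path)
    if size < min_bytes:
        return False, f"too small ({size} bytes)"
    with open(path, "rb") as f:
        head = f.read(8)
    if head[:4] in _NETCDF_CLASSIC or head == _HDF5:
        return True, "ok"
    return False, "unrecognized header"
```

This only reads eight bytes per file, so it is cheap, but it would not catch a file that is truncated or damaged after the header. Whether that is enough for the corruption seen here is unknown.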
* new global_nest_v1 suite * switch to ugwpv1 in global_nest_v1 suite * update suite_FV3_global_nest_v1.xml for scheme rename/rearrangement * point to lisa/C3_updates --------- Co-authored-by: Lisa Bengtsson <Lisa.Bengtsson@noaa.gov>
Expected behavior
global_fv3gfs.x should successfully read warm start files from the RESTART directory.
Current behavior
The 2022031612 forecast for mem004 hung while attempting to read RESTART files from the previous cycle. The following is taken from
/scratch1/NCEPDEV/stmp2/Haixia.Liu/ROTDIRS/v16_ctl_march/logs/2022031612/gdasefcs02.log.2
Tracer cld_amt initialized with surface value of 0.100000E+31 and vertical multiplier of 1.000000
Warm starting, calling fv_io_restart
ptop & ks 0.9990000 39
NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 201168.
in fv_restart ncnst= 9
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 30012219.0 ON h4c22 CANCELLED AT 2022-03-30T23:50:15 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 30012219 ON h4c22 CANCELLED AT 2022-03-30T23:50:15 DUE TO TIME LIMIT ***
The job was killed by the system after reaching the specified 40-minute wall clock limit. Normally each forecast takes about 5 minutes to run. Each efcs job in v16_ctl_march processes two enkf members, so 40 minutes is more than enough time for two forecasts.
Repeated submissions of efcs02 yielded the same mem004 hang behavior.
Machines affected
Hera
To Reproduce
Submit the gdasefcs02 job for 2022031612 using the original 2022031606 RESTART files. mem004 is the second member run by efcs02.
Context
Given that the model hung reading RESTART files from 2022031606, the efcs02 job for 2022031606 was rerun, which wrote new files to the RESTART directory. After this, the 2022031612 efcs02 job was submitted again and ran to completion.
Detailed Description
It is unknown why the forecast model could not read the original set of RESTART files from 2022031606.
Additional Information
None
Possible Implementation
Examine the forecast model and scripts for possible causes of RESTART tile file corruption. If any issues are identified, resolve them.
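While the root cause is investigated, a pre-forecast check of the RESTART directory could at least turn a 40-minute hang into an immediate failure. The sketch below is one hypothetical approach: it groups files whose names end in `.tile1.nc` through `.tile6.nc` and flags any tile that is missing or whose size differs from the most common size in its group. The assumptions that sibling tiles of the same file type are equal in size (likely only for uncompressed output) and that a corrupted tile shows up as a size difference are unverified for this case; `find_suspect_tiles` is an illustrative name.

```python
import os
import re
from collections import Counter, defaultdict

# Matches names such as "<anything>.tile3.nc"; the naming pattern is an assumption.
TILE_RE = re.compile(r"^(?P<prefix>.*\.tile)(?P<tile>[1-6])(?P<suffix>\.nc)$")


def find_suspect_tiles(restart_dir, ntiles=6):
    """Return a list of (group, tile, reason) for tiles that look inconsistent."""
    groups = defaultdict(dict)  # group name -> {tile number: size in bytes}
    for name in os.listdir(restart_dir):
        m = TILE_RE.match(name)
        if m:
            key = m.group("prefix") + "*" + m.group("suffix")
            size = os.path.getsize(os.path.join(restart_dir, name))
            groups[key][int(m.group("tile"))] = size

    problems = []
    for key in sorted(groups):
        sizes = groups[key]
        # Report tiles that are absent from an otherwise present group.
        for tile in range(1, ntiles + 1):
            if tile not in sizes:
                problems.append((key, tile, "missing"))
        # Treat the most common size in the group as the expected size.
        expected = Counter(sizes.values()).most_common(1)[0][0]
        for tile, size in sorted(sizes.items()):
            if size != expected:
                problems.append((key, tile, f"size {size} != {expected}"))
    return problems
```

A job script could call `find_suspect_tiles` on the previous cycle's RESTART directory and exit with an error if the returned list is non-empty. A size check cannot detect corruption that leaves the file length unchanged.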
Note: @HaixiaLiu-NOAA is running the v16_ctl_march. The issue is opened on her behalf. Please direct inquiries to her.