C192L127 forecast hung with corrupted RESTART tiles (Hera) #709

RussTreadon-NOAA · 2022-04-04T13:56:15Z

Expected behavior
global_fv3gfs.x should successfully read warm start files from the RESTART directory.

Current behavior
The 2022031612 forecast for mem004 hung while attempting to read RESTART files from the previous cycle. The excerpt below is taken from /scratch1/NCEPDEV/stmp2/Haixia.Liu/ROTDIRS/v16_ctl_march/logs/2022031612/gdasefcs02.log.2

Tracer cld_amt initialized with surface value of 0.100000E+31 and vertical multiplier of 1.000000
Warm starting, calling fv_io_restart
ptop & ks 0.9990000 39
NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 201168.
in fv_restart ncnst= 9
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 30012219.0 ON h4c22 CANCELLED AT 2022-03-30T23:50:15 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 30012219 ON h4c22 CANCELLED AT 2022-03-30T23:50:15 DUE TO TIME LIMIT ***

The job was killed by the system after reaching the specified 40-minute wall clock limit. Normally each forecast takes about 5 minutes to run. Each efcs job in v16_ctl_march processes two enkf members, so 40 minutes is more than enough time for two forecasts.

Repeated submissions of efcs02 yielded the same mem004 hang behavior.

Machines affected
Hera

To Reproduce
Submit the gdasefcs02 job for 2022031612 using the original 2022031606 RESTART files. mem004 is the second member run by efcs02.

Context
Because the model hung reading RESTART files from 2022031606, the efcs02 job for 2022031606 was rerun, which wrote new files to the RESTART directory. The 2022031612 efcs02 job was then resubmitted and ran to completion.

Detailed Description
It is unknown why the forecast model could not read the original set of RESTART files from 2022031606.

Additional Information
None

Possible Implementation
Examine the forecast model and scripts for anything that could corrupt RESTART tile files. If any issues are identified, resolve them.

Note: @HaixiaLiu-NOAA is running the v16_ctl_march. The issue is opened on her behalf. Please direct inquiries to her.

aerorahul · 2022-04-05T14:56:26Z

@RussTreadon-NOAA @HaixiaLiu-NOAA
It appears that re-running the model produced valid files.
Is it possible that the model created corrupt files during the initial run?
Is this a result of a machine glitch?

Implementing a complete validator of restart files will need considerable development.
How does one validate a corrupt restart file?
What do the acceptance criteria for such a validator look like?
Does that add time to the job? If so, how much and how will that impact total job times?
Simply checking for file existence is not enough for this exercise.
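
One inexpensive partial check, offered as a minimal sketch rather than the complete validator discussed above: go one step beyond file existence by confirming that each file is non-empty and starts with a recognized netCDF signature. This assumes the RESTART tiles are netCDF files matched by `*.nc`; the function names `quick_check` and `find_suspect_files` are hypothetical and not part of the workflow. It can only catch missing, empty, or header-damaged files. It would not detect corruption deeper in the file, and it is not known whether it would have flagged the files in this case.

```python
from pathlib import Path

# Leading bytes of the netCDF on-disk formats.
NETCDF_CLASSIC_MAGIC = (b"CDF\x01", b"CDF\x02", b"CDF\x05")
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # netCDF-4 files are HDF5 files


def quick_check(path):
    """Return a short reason string if the file looks bad, else None."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    # Only the first 8 bytes are read, so the cost per file is tiny.
    with p.open("rb") as fh:
        head = fh.read(8)
    if head.startswith(NETCDF_CLASSIC_MAGIC) or head.startswith(HDF5_SIGNATURE):
        return None
    return "bad header"


def find_suspect_files(restart_dir, pattern="*.nc"):
    """Map file name -> reason for every matching file that fails quick_check."""
    suspects = {}
    for f in sorted(Path(restart_dir).glob(pattern)):
        reason = quick_check(f)
        if reason is not None:
            suspects[f.name] = reason
    return suspects
```

Because it reads only a few bytes per file, a check like this should add little to job time, but it leaves open the acceptance-criteria question for data that is structurally valid yet unreadable by the model.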

RussTreadon-NOAA · 2022-04-05T15:06:44Z

Thanks, @aerorahul , for replying to this issue. It is not clear if file corruption points to a machine (Hera) issue, a supporting library, or something in the forecast model.

I agree. As this case demonstrates, file existence is an insufficient validator. Perhaps UFS modelers have ideas for an efficient and computationally inexpensive validator.

Please feel free to close this issue if resolution lies outside g-w.

aerorahul · 2022-04-05T15:09:54Z

The issue is specific to a RESTART tile, but should be resolved with a general file validator.
Doing so will require considerable enabler activity.
Closing.

RussTreadon-NOAA added the bug label Apr 4, 2022

aerorahul closed this as completed Apr 5, 2022
