C192L127 forecast hung with corrupted RESTART tiles (Hera) #709

Closed
RussTreadon-NOAA opened this issue Apr 4, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@RussTreadon-NOAA
Contributor

Expected behavior
global_fv3gfs.x should successfully read warm start files from the RESTART directory.

Current behavior
The 2022031612 forecast for mem004 hung while attempting to read RESTART files from the previous cycle. The following excerpt is from /scratch1/NCEPDEV/stmp2/Haixia.Liu/ROTDIRS/v16_ctl_march/logs/2022031612/gdasefcs02.log.2:

Tracer cld_amt initialized with surface value of 0.100000E+31 and vertical multiplier of 1.000000
Warm starting, calling fv_io_restart
ptop & ks 0.9990000 39
NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 201168.
in fv_restart ncnst= 9
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 30012219.0 ON h4c22 CANCELLED AT 2022-03-30T23:50:15 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 30012219 ON h4c22 CANCELLED AT 2022-03-30T23:50:15 DUE TO TIME LIMIT ***

The system killed the job after it reached the specified 40-minute wall-clock limit. Each forecast normally takes about 5 minutes to run, and each efcs job in v16_ctl_march processes two enkf members, so 40 minutes is more than enough time for two forecasts.

Repeated submissions of efcs02 yielded the same mem004 hang behavior.

Machines affected
Hera

To Reproduce
Submit the gdasefcs02 job for 2022031612 using the original 2022031606 RESTART files. mem004 is the second member run by efcs02.

Context
Because the model hung while reading RESTART files from 2022031606, the efcs02 job for 2022031606 was rerun, which wrote new files to the RESTART directory. The 2022031612 efcs02 job was then submitted again and ran to completion.

Detailed Description
It is unknown why the forecast model could not read the original set of RESTART files from 2022031606.

Additional Information
None

Possible Implementation
Examine the forecast model and scripts for possible causes of RESTART tile file corruption. If any issues are identified, resolve them.

Note: @HaixiaLiu-NOAA is running the v16_ctl_march. The issue is opened on her behalf. Please direct inquiries to her.

RussTreadon-NOAA added the bug label Apr 4, 2022
@aerorahul
Contributor

@RussTreadon-NOAA @HaixiaLiu-NOAA
It appears that re-running the model produced valid files.
Is it possible that the model created corrupt files during the initial run?
Is this a result of a machine glitch?

Implementing a complete validator for restart files will require considerable development.
How does one detect a corrupt restart file?
What do the acceptance criteria for such a validator look like?
Does that add time to the job? If so, how much and how will that impact total job times?
Simply checking for file existence is not enough for this exercise.
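
To make the question concrete, here is a minimal sketch of a check that goes one step beyond file existence without reading the whole file. It assumes the RESTART tiles are NetCDF files, which this issue does not state, and the function name `quick_restart_check` is hypothetical. It only inspects the size and the first few bytes, so it would catch an empty or obviously mangled file but not corruption deeper in the data.

```python
import os

# Leading bytes of the on-disk formats a NetCDF file can use.
# Assumption: RESTART tiles are NetCDF; adjust if they are not.
NETCDF_MAGICS = (
    b"CDF\x01",            # classic format
    b"CDF\x02",            # 64-bit offset format
    b"CDF\x05",            # CDF-5 format
    b"\x89HDF\r\n\x1a\n",  # HDF5 container used by NetCDF-4 (at offset 0 in the usual case)
)


def quick_restart_check(path, min_bytes=1):
    """Return (ok, reason) after a cheap existence, size, and header check."""
    if not os.path.isfile(path):
        return False, "missing"
    size = os.path.getsize(path)
    if size < min_bytes:
        return False, f"too small ({size} bytes)"
    # Read only the first 8 bytes, so the cost does not grow with file size.
    with open(path, "rb") as f:
        head = f.read(8)
    if not head.startswith(NETCDF_MAGICS):
        return False, "unrecognized header"
    return True, "ok"
```

Because only a few bytes are read per file, the added job time should be small, but a file that passes this check could still be corrupt further in.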

@RussTreadon-NOAA
Contributor Author

Thanks, @aerorahul, for replying to this issue. It is not clear whether the file corruption points to a machine (Hera) issue, a supporting library, or something in the forecast model.

I agree. As this case demonstrates, file existence is an insufficient validator. Perhaps UFS modelers have ideas for an efficient and computationally inexpensive validator.

Please feel free to close this issue if resolution lies outside g-w.
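
One possible direction for the validator mentioned above, offered only as a sketch: record a checksum next to each restart file when it is written, then verify it before the next cycle reads the file. Everything here is hypothetical (the `.sha256` sidecar convention and the helper names are not part of the workflow or the model). Note the limits: this detects a file that changed or was truncated after the checksum was recorded, but not a file that was already bad when it was written, and verification reads the whole file, so the cost scales with file size.

```python
import hashlib


def sha256_of(path, chunk_bytes=1 << 20):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            h.update(chunk)
    return h.hexdigest()


def write_sidecar(path):
    """Record the file's digest in a hypothetical '<path>.sha256' sidecar."""
    digest = sha256_of(path)
    with open(path + ".sha256", "w") as f:
        f.write(digest + "\n")
    return digest


def verify_sidecar(path):
    """Return True only if the sidecar exists and matches the current file."""
    try:
        with open(path + ".sha256") as f:
            expected = f.read().strip()
    except FileNotFoundError:
        return False
    return sha256_of(path) == expected
```

In this sketch, `write_sidecar` would be called once after the forecast finishes writing a tile, and `verify_sidecar` before the next cycle attempts a warm start.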

@aerorahul
Contributor

The issue is specific to a RESTART tile, but should be resolved with a general file validator.
Doing so will require considerable enabler activity.
Closing.

kayeekayee pushed a commit to kayeekayee/global-workflow that referenced this issue May 30, 2024
* new global_nest_v1 suite

* switch to ugwpv1 in global_nest_v1 suite

* update suite_FV3_global_nest_v1.xml for scheme rename/rearrangement

* point to lisa/C3_updates

---------

Co-authored-by: Lisa Bengtsson <Lisa.Bengtsson@noaa.gov>