Jet time-out issue cases #1695

Closed
jkbk2004 opened this issue Apr 3, 2023 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@jkbk2004
Collaborator

jkbk2004 commented Apr 3, 2023

Description

Solution

  • Working or tuned configuration of these cases: either reduced forecast time or namelist option, etc.
  • Or proper conclusion if related to Jet system issue

Related to

@DavidHuber-NOAA
Collaborator

It may be worth noting that these same tests time out on S4, which has a similar architecture to xjet.

zach1221 moved this from Todo to In Progress in Backlog: platforms and RT on Apr 25, 2023
zach1221 self-assigned this on May 23, 2023
@zach1221
Collaborator

@jkbk2004 regional_atmaq and regional_atmaq_faster are still failing. However, regional_noquilt, hafs_regional_datm_cdeps, and regional_wofs are now passing on Jet.

@DavidHuber-NOAA
Collaborator

@jkbk2004 I'll give these a try on S4 as well.

@FernandoAndrade-NOAA
Collaborator

Running with the current develop, regional_noquilt, hafs_regional_datm_cdeps, and regional_wofs pass, while regional_atmaq and regional_atmaq_faster are still failing.

Test Directory: /lfs4/HFIP/h-nems/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_113246/

In regional_atmaq, I'm seeing exit code 137 in out and various forrtl error (78) SIGTERM messages in err:

srun: launch/slurm: _step_signal: Terminating StepId=59439343.0
0: slurmstepd: error: *** STEP 59439343.0 ON x111 CANCELLED AT 2023-10-17T22:15:38 ***
108: forrtl: error (78): process killed (SIGTERM)

In regional_atmaq_faster, I'm seeing a memory error:
72: forrtl: severe (41): insufficient virtual memory
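
For context on the 137 exit code reported above: by shell convention, a status above 128 usually means the process was ended by a signal, with the signal number being the status minus 128. That would make 137 signal 9 (SIGKILL), which is what an out-of-memory kill typically sends. A minimal Python sketch of that decoding, assuming a Linux host; the helper name `describe_exit_code` is hypothetical and not part of the regression test scripts:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    if code > 128:
        # Statuses above 128 conventionally encode 128 + signal number.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


# Exit status reported for regional_atmaq in this thread.
print(describe_exit_code(137))  # on Linux: killed by signal 9 (SIGKILL)
```

This only interprets the status; it does not show which component sent the signal.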

zach1221 added the bug (Something isn't working) label on Oct 24, 2023
@jkbk2004
Collaborator Author

@FernandoAndrade-NOAA what about adding the ulimit -s unlimited option to the Jet job card?

@FernandoAndrade-NOAA
Collaborator

FernandoAndrade-NOAA commented Oct 31, 2023

> @FernandoAndrade-NOAA what about ulimit -s unlimited option in jet job card?

With that option added, the same insufficient virtual memory error now also appears in the regional_atmaq err log, in addition to regional_atmaq_faster. Both tests are still failing.

@BrianCurtis-NOAA
Collaborator

BrianCurtis-NOAA commented Nov 1, 2023

@FernandoAndrade-NOAA can you try TPN=16 for those tests?
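
As background for this suggestion, and assuming TPN means tasks per node: lowering it spreads the same MPI tasks over more nodes, so each task gets a larger share of node memory. The arithmetic can be sketched as below. The task count (272, the smallest count consistent with the highest rank, 271, that appears in the logs in this thread) and the 64 GB node memory are illustrative assumptions, not values taken from the test configuration.

```python
import math


def layout(total_tasks: int, tpn: int, node_mem_gb: float) -> tuple[int, float]:
    """Return (nodes needed, memory available per task in GB)."""
    nodes = math.ceil(total_tasks / tpn)
    mem_per_task_gb = node_mem_gb / tpn
    return nodes, mem_per_task_gb


# Hypothetical inputs: 272 tasks, 64 GB of memory per node.
for tpn in (24, 18, 16):
    nodes, mem = layout(272, tpn, 64.0)
    print(f"TPN={tpn}: {nodes} nodes, {mem:.2f} GB per task")
```

Under these assumptions, going from 24 to 16 tasks per node raises the per-task share by half, at the cost of more nodes.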

@FernandoAndrade-NOAA
Collaborator

> @FernandoAndrade-NOAA can you try TPN=16 for those tests?

Sure thing. It was also suggested to try ulimit -l unlimited, so I'll add that as well. I believe @zach1221 has previously tried a different TPN on Jet. Zach, do you remember whether you used 16 or 18?
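
To confirm that ulimit settings from a job card actually reach the job, one option is to print the limits from inside it. The sketch below uses Python's standard `resource` module, assuming a Linux compute node with Python available. `RLIMIT_STACK` corresponds to `ulimit -s` and `RLIMIT_MEMLOCK` to `ulimit -l`; the helper names are hypothetical.

```python
import resource


def format_limit(value: int) -> str:
    """Render a raw rlimit value; the resource module reports bytes."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def current_limits() -> dict[str, tuple[str, str]]:
    """Return (soft, hard) limits for stack size and locked memory."""
    wanted = {
        "stack": resource.RLIMIT_STACK,      # ulimit -s
        "memlock": resource.RLIMIT_MEMLOCK,  # ulimit -l
    }
    report = {}
    for name, rlimit in wanted.items():
        soft, hard = resource.getrlimit(rlimit)
        report[name] = (format_limit(soft), format_limit(hard))
    return report


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

Note that `ulimit` prints these two limits in kilobytes, while `resource.getrlimit` returns bytes. The values shown are those inherited by the Python process from the shell that launched it.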

@FernandoAndrade-NOAA
Collaborator

Unfortunately, both are still failing after those adjustments. I am seeing slightly different error messages in err:
/lfs4/HFIP/h-nems/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_13549/

80: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=60446146.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
176: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=60446146.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: x278: task 80: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=60446146.0
0: slurmstepd: error: *** STEP 60446146.0 ON x172 CANCELLED AT 2023-11-01T16:47:00 ***
271: forrtl: error (78): process killed (SIGTERM)
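
When an err file is long, it can help to pull out only the out-of-memory lines and see which nodes and task ranks were hit. A minimal Python sketch, assuming the message formats match the lines quoted above; the function name `summarize_oom` and its return structure are hypothetical:

```python
import re

# Patterns modeled on the Slurm messages quoted in this comment.
OOM_EVENT = re.compile(r"Detected (\d+) oom-kill event\(s\) in StepId=(\d+\.\d+)")
OOM_TASK = re.compile(r"srun: error: (\S+): task (\d+): Out Of Memory")


def summarize_oom(lines):
    """Count oom-kill events and collect the affected steps, nodes, and tasks."""
    events = 0
    steps = set()
    tasks = []
    for line in lines:
        match = OOM_EVENT.search(line)
        if match:
            events += int(match.group(1))
            steps.add(match.group(2))
        match = OOM_TASK.search(line)
        if match:
            tasks.append((match.group(1), int(match.group(2))))
    return {"events": events, "steps": sorted(steps), "tasks": tasks}


sample = [
    "80: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=60446146.0. "
    "Some of your processes may have been killed by the cgroup out-of-memory handler.",
    "176: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=60446146.0. "
    "Some of your processes may have been killed by the cgroup out-of-memory handler.",
    "srun: error: x278: task 80: Out Of Memory",
    "271: forrtl: error (78): process killed (SIGTERM)",
]
print(summarize_oom(sample))
```

For a real run, the same function could be fed the lines of the err file, for example `summarize_oom(open("err"))`.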

@zach1221
Collaborator

zach1221 commented Nov 1, 2023

> > @FernandoAndrade-NOAA can you try TPN=16 for those tests?
>
> Sure thing, it was also suggested to try ulimit -l unlimited, so I'll add that as well. I believe @zach1221 has previously tried a different TPN for jet. Zach do you remember if you had used 16 or 18?

Hey @FernandoAndrade-NOAA. Yes, I tried 18 in the past with no luck.

@zach1221
Collaborator

zach1221 commented Mar 5, 2024

@jkbk2004 I re-tested these on Rocky8 and regional_noquilt, hafs_regional_datm_cdeps, and regional_wofs all passed. It doesn't look like regional_atmaq_faster is in rt.conf anymore, and regional_atmaq has been commented out, but I ran the atmaq compile suite successfully.

@zach1221
Collaborator

zach1221 commented Mar 5, 2024

@jkbk2004 per our conversation, closing this issue.

zach1221 closed this as completed on Mar 5, 2024
github-project-automation bot moved this from In Progress to Done in Backlog: platforms and RT on Mar 5, 2024