failsafe: re-submit workflow on main job walltime error #177

Open
bch0w opened this issue Aug 28, 2023 · 0 comments

bch0w commented Aug 28, 2023

TL;DR: It would be nice to have a failsafe where the main job re-submits itself to the System when it approaches walltime.

I have consistently run into an issue where my main SeisFlows job, which usually sits on a cluster node, hits walltime before the workflow has completed. This usually happens because run tasks (submitted to the cluster) get stuck in the queue, leaving the main job sitting and waiting until it hits walltime.

Walltime is usually hard-set by the cluster (e.g., 24 hr), so this cannot be remedied by increasing walltime. One solution is to run the main job directly on the login node, but at the moment the code is not optimized well enough for that: some large array manipulations take place in the main job, which cluster sysadmins may not like.

One possible alternative is to have the main job monitor its own walltime and, if it gets too close while still running through the workflow, submit a new main job in its place as a sort of 'failsafe'. The new job would need to receive information about the current position in the workflow and the jobs currently sitting in the queue. This would likely be closely tied to the checkpointing system (#144) that is in the works.
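
To make the idea concrete, here is a rough sketch of what the monitoring part might look like. Nothing here exists in SeisFlows: `WalltimeGuard`, `wait_for_tasks`, and the `tasks_done` / `resubmit` callbacks are hypothetical names, and the actual re-submission and state hand-off (workflow position, queued job IDs) are left to the `resubmit` callback, which would presumably lean on the checkpointing work.

```python
import time


class WalltimeGuard:
    """Hypothetical helper: track elapsed time and flag when walltime is near."""

    def __init__(self, walltime_s, buffer_s, clock=time.monotonic):
        if buffer_s >= walltime_s:
            raise ValueError("buffer_s must be smaller than walltime_s")
        self.walltime_s = walltime_s  # total walltime granted to the main job
        self.buffer_s = buffer_s      # safety margin reserved for re-submission
        self._clock = clock           # injectable so the logic can be tested
        self._start = clock()

    def remaining(self):
        """Seconds left before the main job hits walltime."""
        return self.walltime_s - (self._clock() - self._start)

    def should_resubmit(self):
        """True once the remaining time has dropped to the safety margin."""
        return self.remaining() <= self.buffer_s


def wait_for_tasks(tasks_done, guard, resubmit, poll_s=30.0, sleep=time.sleep):
    """Poll queued tasks; hand over to a new main job if walltime is near.

    `tasks_done` and `resubmit` are placeholder callables supplied by the
    caller. Returns "done" if the tasks finished, "resubmitted" otherwise.
    """
    while not tasks_done():
        if guard.should_resubmit():
            resubmit()  # e.g. save state, then submit a replacement main job
            return "resubmitted"
        sleep(poll_s)
    return "done"
```

With a 24 hr limit this could be set up as `WalltimeGuard(walltime_s=24 * 3600, buffer_s=600)`, so the main job stops waiting about ten minutes before the limit. The buffer would need to be long enough to write out state and get the replacement job into the queue.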
