TL;DR Would be nice to have a failsafe where the main job re-submits itself to the System if approaching walltime
I have consistently run into an issue where my main SeisFlows job, which usually sits on a cluster node, hits walltime before the workflow has completed. Usually this is the result of run tasks (submitted to the cluster) getting stuck in the queue, while the main job sits idle waiting for them until it hits walltime.
Walltime is usually hard-capped by the cluster (e.g., 24 hr), so this is not something that can be remedied by simply requesting more. One solution is to run the main job directly on the login node, but at the moment the code is not well optimized and some large array manipulations take place on the main job, which cluster sysadmins may not like.
One possible alternative is to have the main job monitor its own walltime; if it gets too close to the limit while still running through the workflow, it submits a new main job in its place, as a sort of 'failsafe'. The new job would need to receive the current position in the workflow and the jobs still sitting in the queue. This would likely be tied closely to the checkpointing system (#144) that is in the works.
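The self-monitoring idea could be sketched roughly as below. This is a minimal sketch, not actual SeisFlows code: the walltime/buffer values, the `checkpoint()` hook, and the `submit_main.sh` submit script are all hypothetical placeholders.

```python
import time

# Assumed values for illustration; a real implementation would read
# these from the system/cluster configuration.
WALLTIME_S = 24 * 3600   # hard cluster limit, e.g. 24 hr
BUFFER_S = 15 * 60       # resubmit if within 15 min of the limit


def near_walltime(start_time, walltime_s=WALLTIME_S, buffer_s=BUFFER_S):
    """Return True if the main job is close enough to walltime that it
    should checkpoint and hand off to a fresh main job."""
    elapsed = time.time() - start_time
    return elapsed >= walltime_s - buffer_s


def maybe_resubmit(start_time):
    """Called between workflow tasks. If near walltime, save state and
    resubmit the main job in place of the current one."""
    if not near_walltime(start_time):
        return False
    # Hypothetical checkpoint hook: would record the current workflow
    # position and the IDs of run tasks still queued (ties into #144).
    # checkpoint()
    # Hand off to a new main job, e.g. on SLURM:
    # subprocess.run(["sbatch", "submit_main.sh"], check=True)
    return True
```

The main loop would call `maybe_resubmit()` between tasks and exit cleanly when it returns True, letting the freshly queued job resume from the checkpoint.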