Task Manager Resiliency: Check the status of any tasks that were running via slurm. Resubmit any failed tasks. #675

Closed
aquan9 opened this issue Apr 10, 2023 · 4 comments

@aquan9
Collaborator

aquan9 commented Apr 10, 2023

Pieces broken up from #614

@pagrubel
Collaborator

Should this be optional or automatic?
Should there be a configuration option that limits resubmission to x attempts?

@jtronge
Collaborator

jtronge commented Jan 16, 2024

I wonder if checkpoint-restart would affect this? We might want to only resubmit tasks if the num_tries in beeflow:CheckpointRequirement allows it. I'm not sure whether the number of times a task has already been restarted is stored in a database; if it isn't, we would lose that count when the task manager fails.
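
A minimal sketch of the gating described above: resubmit a failed task only while its retry count is below a limit, and keep that count in a database so it survives a task manager restart. Everything here is hypothetical and not taken from BEE's code: the `retries` table, the helper names, and the set of failure states are assumptions for illustration, and `num_tries` stands in for the value read from beeflow:CheckpointRequirement.

```python
import sqlite3

# Assumption: job states treated as failures worth resubmitting.
FAILED_STATES = {"FAILED", "NODE_FAIL", "TIMEOUT"}


def open_retry_db(path=":memory:"):
    """Open (or create) the hypothetical retry-count store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retries ("
        "task_id TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    return conn


def retry_count(conn, task_id):
    """Return how many times the task has been resubmitted so far."""
    row = conn.execute(
        "SELECT count FROM retries WHERE task_id = ?", (task_id,)
    ).fetchone()
    return row[0] if row else 0


def should_resubmit(conn, task_id, state, num_tries):
    """Record and allow a resubmission only if the task failed and has tries left."""
    if state not in FAILED_STATES:
        return False
    used = retry_count(conn, task_id)
    if used >= num_tries:
        return False
    # Persist the new count before resubmitting so a crash cannot reset it.
    conn.execute(
        "INSERT OR REPLACE INTO retries (task_id, count) VALUES (?, ?)",
        (task_id, used + 1),
    )
    conn.commit()
    return True
```

Passing a file path instead of `":memory:"` to `open_retry_db` is what would let the count outlive the process; the in-memory default is only convenient for trying the logic out.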

jtronge self-assigned this Jan 30, 2024

@pagrubel
Collaborator

pagrubel commented Feb 5, 2024

I think this issue needs some discussion and clarification, maybe during a meeting.

@jtronge
Collaborator

jtronge commented May 22, 2024

Resolved with PR #827

jtronge closed this as completed May 22, 2024
3 participants