Pause running workflows when stopping beeflow with 'beeflow core stop' #830
I need to do some overnight testing on this branch.
- Pause running workflows when stopping beeflow with 'beeflow core stop'
- Add info about stopping beeflow to documentation
- Set workflow state to Running when 'beeflow resume <wf_id>' command is issued
- Display beeflow version when starting core
- Add job updates for tasks job to log
- Add final job status check with sacct command when state cannot be found otherwise
To test this for CLI (on darwin) you may want two screens both in the poetry env for this branch:
To test for slurmrestd:
Adding this thought here to implement later: the output from "sacct -j <job_id>" is in column form, where the job_id is a row heading and Job State is a column heading. Those headings should be used to find the value instead of its position, in case another column or row is added in the future.
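A minimal sketch of that idea, not code from this PR. It assumes sacct is run with `--parsable2`, which emits pipe-delimited rows with a header line, and that the relevant headings are `JobID` and `State`. The function name `job_state_from_sacct` and the sample text are hypothetical.

```python
import csv
import io


def job_state_from_sacct(output, job_id):
    """Return the State for job_id from `sacct --parsable2` text, or None."""
    # DictReader keys each row by the header line, so column order is irrelevant.
    reader = csv.DictReader(io.StringIO(output), delimiter='|')
    for row in reader:
        if row.get('JobID') == str(job_id):
            # A state may carry a suffix such as "CANCELLED by 1000";
            # keep only the first word.
            parts = (row.get('State') or '').split()
            return parts[0] if parts else None
    return None


# Hypothetical sample in the pipe-delimited layout described above.
sample = (
    "JobID|JobName|State|ExitCode\n"
    "1234|bee-task|COMPLETED|0:0\n"
    "1234.batch|batch|COMPLETED|0:0\n"
)
state = job_state_from_sacct(sample, 1234)
```

Looking the row up by `JobID` and the value by `State` means an extra column or an extra job-step row would not change the result.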
@jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.
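One possible way to handle that case, sketched under assumptions rather than taken from this branch: catch any failure of the sacct call and report `'UNKNOWN'`, so the caller can treat the job as finished. The helper name `query_state_or_unknown` and the injectable `run` parameter are hypothetical. The sacct flags used are `--noheader`, `--parsable2`, `--allocations` and `--format`.

```python
import subprocess


def query_state_or_unknown(job_id, run=subprocess.run):
    """Return the job's State from sacct, or 'UNKNOWN' if the query fails."""
    cmd = ['sacct', '-j', str(job_id), '--noheader', '--parsable2',
           '--allocations', '--format=State']
    try:
        result = run(cmd, capture_output=True, text=True, check=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        # Covers a missing sacct binary, a non-zero exit status, and a timeout.
        return 'UNKNOWN'
    words = result.stdout.split()
    return words[0] if words else 'UNKNOWN'


def _missing_sacct(*args, **kwargs):
    # Stand-in for subprocess.run on a machine without sacct installed.
    raise FileNotFoundError('sacct')


fallback = query_state_or_unknown(42, run=_missing_sacct)
```

Whether `'UNKNOWN'` actually takes the job out of the queue still depends on how the calling code treats that state.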
@jtronge I tested this on the prod system and it works. The only additional feature we may want to add at some point is for the user to be able to resume all paused workflows with one command.
I think this looks good. Since I made a couple of commits, should someone else approve as well?
         log.info(f'Resubmitting task {task.name}')
         db.job_queue.remove_by_id(id_)
         job_state = submit_task(db, worker, task)
         db.update_queue.push(task.workflow_id, task.id, job_state)
     else:
         db.update_queue.push(task.workflow_id, task.id, new_job_state)

-    if job_state in ('ZOMBIE', 'COMPLETED', 'CANCELLED', 'FAILED', 'TIMEOUT', 'TIMELIMIT'):
+    if job_state in ('UNKNOWN', 'COMPLETED', 'CANCELLED', 'FAILED', 'TIMEOUT', 'TIMELIMIT'):
Does this get rid of ZOMBIE altogether?
I think ZOMBIE was meant for Workflow states
Not for jobs
Oh sorry, just saw your question. I think this depends on what the calling code is doing. The TM background code could try to catch
Should we put WIP back on to do this? I'm not sure there is any way to test it.
This all looks ok to me.
This PR pauses any running or waiting workflows when stopping beeflow.
Tasks/jobs that are running or pending when beeflow is stopped will be updated, if possible, once beeflow starts up again.
Addresses #783
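The behaviour described above can be illustrated with a small sketch. It is not beeflow's implementation: the in-memory dict, the function names, and the state strings `'Waiting'` and `'Paused'` are assumptions made for illustration, and only the `'Running'` state on resume comes from the commit message.

```python
def pause_active_workflows(workflows):
    """Mark every running or waiting workflow as paused; return the ids changed."""
    paused = []
    for wf_id, state in workflows.items():
        if state in ('Running', 'Waiting'):
            # Only values are reassigned, so iterating over the dict stays safe.
            workflows[wf_id] = 'Paused'
            paused.append(wf_id)
    return paused


def resume_workflow(workflows, wf_id):
    """Set a paused workflow back to Running; return True if it was resumed."""
    if workflows.get(wf_id) == 'Paused':
        workflows[wf_id] = 'Running'
        return True
    return False


# Hypothetical workflow table: id -> state.
workflows = {'wf-a': 'Running', 'wf-b': 'Waiting', 'wf-c': 'Archived'}
paused = pause_active_workflows(workflows)
resumed = resume_workflow(workflows, 'wf-a')
```

Returning the list of paused ids would also give a natural starting point for resuming all paused workflows with one command.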