Pause running workflows when stopping beeflow with 'beeflow core stop' #830

Merged: pagrubel merged 13 commits into develop from issue783/beeflow-stop-running-wf on Jul 23, 2024


pagrubel
Collaborator

@pagrubel pagrubel commented Apr 29, 2024

This PR pauses any running or waiting workflows when stopping beeflow.
Tasks/jobs that are running or pending when beeflow is stopped will be updated, if possible, once beeflow starts up again.
Addresses #783

@pagrubel pagrubel added WIP Work in progress and removed WIP Work in progress labels Apr 29, 2024
@pagrubel pagrubel force-pushed the issue783/beeflow-stop-running-wf branch from f4157d0 to 583477c Compare April 29, 2024 19:40
@pagrubel pagrubel requested review from jtronge and rstyd April 29, 2024 23:38
@pagrubel pagrubel added the WIP (no-ci) Don't run any CI for this PR label Apr 30, 2024
@pagrubel
Collaborator Author

I need to do some overnight testing on this branch.

@pagrubel pagrubel added WIP Work in progress and removed WIP (no-ci) Don't run any CI for this PR labels May 22, 2024
   - Pause running workflows when stopping beeflow with 'beeflow core stop'
   - Add info about stopping beeflow to documentation
   - Set workflow state to Running when 'beeflow resume <wf_id>' command is issued
   - Display beeflow version when starting core
   - Add job updates for task jobs to the log
   - Add final job status check with sacct command when state cannot be found otherwise
@pagrubel pagrubel force-pushed the issue783/beeflow-stop-running-wf branch from 5edfa11 to 371c4fa Compare May 24, 2024 19:04
@pagrubel
Collaborator Author

pagrubel commented May 28, 2024

To test this with the CLI (on darwin), you may want two screens, both in the poetry env for this branch:

  • Make sure use_commands in the slurm section of bee.conf is True.
  • Start beeflow: beeflow core start
  • Submit a workflow. The clamr example works well; the checkpoint workflow is also good.
    Run watch -n 2 beeflow query <wf_id> to make sure the task is pending or running.
    It helps to do this in a separate screen and keep it up while you stop and restart beeflow.
  • Verify that the clamr step is running and get the job ID via squeue -u <username> or the task manager log.
  • Issue beeflow core stop while the clamr step is running.
  • Run watch -n 5 scontrol show job <job_id> and wait until it gives a "no job" type of error.
  • Start beeflow again with beeflow core start. The query screen will show that clamr has completed.
  • The workflow has been paused, so to finish it, run beeflow resume <wf_id>.

To test with slurmrestd:

  • Make sure use_commands in the slurm section of bee.conf is False.
  • Start beeflow: beeflow core start
  • Submit a workflow. The clamr example works well; the checkpoint workflow is also good.
    Run watch -n 2 beeflow query <wf_id> to make sure the task is pending or running.
    It helps to do this in a separate screen and keep it up while you stop and restart beeflow.
  • Verify that the clamr step is running and get the job ID via squeue -u <username> or the task manager log.
  • Issue beeflow core stop while the clamr step is running.
  • Run watch -n 5 squeue -u <userid> and wait until the clamr job is off the screen.
  • Start beeflow again with beeflow core start. The query screen will show that clamr has completed.
  • The workflow has been paused, so to finish it, run beeflow resume <wf_id>.
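
The "wait until the job is off the screen" step could also be scripted instead of watched by hand. The following is only a rough sketch and not part of this PR: the helper names (squeue_lists_job, wait_until_job_gone), the polling interval, and the poll limit are all assumptions made for illustration. It relies only on the standard squeue options -h (no header) and -j (filter by job ID).

```python
import subprocess
import time
from typing import Callable


def squeue_lists_job(job_id: str) -> bool:
    """Return True while squeue still lists the job (needs Slurm on PATH)."""
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id],  # -h: no header, -j: filter by job ID
        capture_output=True,
        text=True,
        check=False,
    )
    # Treat a non-zero exit or empty output as "no longer listed".
    return result.returncode == 0 and bool(result.stdout.strip())


def wait_until_job_gone(
    job_id: str,
    is_listed: Callable[[str], bool] = squeue_lists_job,
    interval: float = 5.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the job is no longer listed; return the number of polls made."""
    for polls in range(1, max_polls + 1):
        if not is_listed(job_id):
            return polls
        sleep(interval)
    raise TimeoutError(f"job {job_id} still listed after {max_polls} polls")
```

Calling wait_until_job_gone("<job_id>") right after beeflow core stop, and then running beeflow core start, would mirror the manual steps above. The is_listed and sleep parameters are injectable so the loop can be exercised without a Slurm installation.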

@pagrubel
Collaborator Author

pagrubel commented May 29, 2024

Adding this thought here to implement later: the output from "sacct -j <job_id>" is in column form, where the job_id is a row heading and Job State is a column heading. The lookup should use those headings instead of a fixed location, in case another column or row is added in the future.
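
As a possible starting point for that idea, here is a minimal sketch that finds the state by heading rather than by position. It is not the code in this PR. It assumes the default sacct table layout (a heading line, a ruler line of dashes, then fixed-width rows) and that the state column is headed "State"; the function name parse_sacct_state and the sample table are made up for illustration.

```python
import re
from typing import Optional


def parse_sacct_state(output: str, job_id: str) -> Optional[str]:
    """Return the State cell for job_id, located by headings, not by position."""
    lines = [line for line in output.splitlines() if line.strip()]
    if len(lines) < 3:
        return None
    header, ruler, rows = lines[0], lines[1], lines[2:]
    # Each run of dashes in the ruler line marks one fixed-width column.
    spans = [(m.start(), m.end()) for m in re.finditer(r"-+", ruler)]
    names = [header[start:end].strip() for start, end in spans]
    if "JobID" not in names or "State" not in names:
        return None
    id_start, id_end = spans[names.index("JobID")]
    st_start, st_end = spans[names.index("State")]
    for row in rows:
        if row[id_start:id_end].strip() == job_id:
            return row[st_start:st_end].strip() or None
    return None


# Made-up sample in the assumed layout; note the empty Partition cell,
# which would break a naive whitespace split.
WIDTHS = (12, 10, 10, 10, 8)


def _fmt(cells):
    """Left-align the first cell and right-align the rest to WIDTHS."""
    parts = [f"{cells[0]:<{WIDTHS[0]}}"]
    parts += [f"{cell:>{width}}" for cell, width in zip(cells[1:], WIDTHS[1:])]
    return " ".join(parts)


SAMPLE = "\n".join([
    _fmt(("JobID", "JobName", "Partition", "State", "ExitCode")),
    " ".join("-" * width for width in WIDTHS),
    _fmt(("12345", "clamr", "general", "TIMEOUT", "0:0")),
    _fmt(("12345.batch", "batch", "", "CANCELLED", "0:15")),
])

print(parse_sacct_state(SAMPLE, "12345"))
```

Slicing by the ruler's column spans keeps the lookup correct when a cell is empty or when columns are added or reordered, which splitting on whitespace and indexing would not.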

@jtronge jtronge force-pushed the issue783/beeflow-stop-running-wf branch from 2cb37d6 to fc9f165 Compare May 30, 2024 16:40
@pagrubel
Collaborator Author

pagrubel commented Jun 4, 2024

@jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.

@pagrubel
Collaborator Author

pagrubel commented Jun 6, 2024

@jtronge I tested this on the prod system and it works. The only additional feature we may want to add at some point is for the user to be able to resume all paused workflows with one command.

@pagrubel pagrubel removed the WIP Work in progress label Jun 6, 2024
Collaborator

@jtronge jtronge left a comment

I think this looks good. Since I made a couple of commits, should someone else approve as well?

log.info(f'Resubmitting task {task.name}')
db.job_queue.remove_by_id(id_)
job_state = submit_task(db, worker, task)
db.update_queue.push(task.workflow_id, task.id, job_state)
else:
db.update_queue.push(task.workflow_id, task.id, new_job_state)

- if job_state in ('ZOMBIE', 'COMPLETED', 'CANCELLED', 'FAILED', 'TIMEOUT', 'TIMELIMIT'):
+ if job_state in ('UNKNOWN', 'COMPLETED', 'CANCELLED', 'FAILED', 'TIMEOUT', 'TIMELIMIT'):
Collaborator

Does this get rid of ZOMBIE altogether?

Collaborator Author

I think ZOMBIE was meant for Workflow states

Collaborator Author

Not for jobs

@jtronge
Collaborator

jtronge commented Jun 6, 2024

> @jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.

Oh sorry, just saw your question. I think this depends on what the calling code is doing. The TM background code could try to catch WorkerError and remove the job in that case.
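
For illustration, a background update loop along those lines might look like the sketch below. This is not the task manager code in this PR: the WorkerError and JobQueue classes are hypothetical stand-ins, refresh_jobs, query_state, and push_update are made-up names, and reporting 'UNKNOWN' for a failed query is an assumption. Only the terminal-state tuple and remove_by_id are taken from the diff above.

```python
import logging

log = logging.getLogger(__name__)

# Terminal job states, as listed in the updated check in this PR.
TERMINAL_STATES = ('UNKNOWN', 'COMPLETED', 'CANCELLED', 'FAILED', 'TIMEOUT', 'TIMELIMIT')


class WorkerError(Exception):
    """Hypothetical stand-in for the error a worker raises when a query fails."""


class JobQueue:
    """Hypothetical in-memory stand-in for db.job_queue."""

    def __init__(self, jobs):
        self._jobs = dict(jobs)  # queue id -> scheduler job id

    def items(self):
        # Return a copy so callers can remove entries while iterating.
        return list(self._jobs.items())

    def remove_by_id(self, id_):
        self._jobs.pop(id_, None)


def refresh_jobs(job_queue, query_state, push_update):
    """Query each queued job; drop it when it is terminal or the query fails."""
    for id_, job_id in job_queue.items():
        try:
            state = query_state(job_id)
        except WorkerError as err:
            # Assumption: a failed query is reported as UNKNOWN so the job
            # leaves the queue instead of being retried forever.
            log.warning('Dropping job %s after failed state query: %s', job_id, err)
            state = 'UNKNOWN'
        push_update(id_, state)
        if state in TERMINAL_STATES:
            job_queue.remove_by_id(id_)
```

Because the query function is passed in, the failure path can be exercised with a fake that raises WorkerError, which may help with the question of how to test this without a real scheduler.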

@pagrubel
Collaborator Author

pagrubel commented Jun 6, 2024

> > @jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.
>
> Oh sorry, just saw your question. I think this depends on what the calling code is doing. The TM background code could try to catch WorkerError and remove the job in that case.

Should we put WIP back on to do this? I'm not sure there is any way to test it.

@pagrubel pagrubel added the WIP Work in progress label Jun 9, 2024
@pagrubel pagrubel removed the WIP Work in progress label Jul 8, 2024
@rstyd rstyd requested a review from jtronge July 16, 2024 17:15
Collaborator

@aquan9 aquan9 left a comment

This all looks ok to me.

@pagrubel pagrubel merged commit b4bf36a into develop Jul 23, 2024
24 checks passed
@pagrubel pagrubel deleted the issue783/beeflow-stop-running-wf branch July 23, 2024 16:16