Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS Batch job control #308

Merged
merged 7 commits into from
Sep 21, 2023
Merged

AWS Batch job control #308

merged 7 commits into from
Sep 21, 2023

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Sep 12, 2023

See commit messages for details, and also the new changelog entries.

This work is motivated by deeper integration for GitHub Actions workflows that launch builds as AWS Batch jobs.

Checklist

  • Checks pass

@tsibley tsibley requested a review from a team September 12, 2023 22:19
@tsibley tsibley force-pushed the trs/aws-batch/job-control branch from e3888cd to abf7768 Compare September 13, 2023 17:41
@tsibley
Copy link
Member Author

tsibley commented Sep 13, 2023

Repushed with some minor refinements re: the clarity of code around job.stopped vs. job.stop_sent and the UX around --cancel.

@joverlee521 joverlee521 self-requested a review September 20, 2023 00:08
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the new --cancel flag, that will be really useful for stopping trial builds 🌟

I'm a little unsure about the expected use of the --detach-on-interrupt option in the context of GitHub Action managed builds. This is definitely due to my lack of understanding of signals and how signals would propagate through GitHub Actions. (Still processing this GitHub community forum and this demo of GitHub actions signal handling.)

Can definitely test this out myself later, but wanted to leave these questions:

  • Would --detach-on-interrupt be used to chain the jobs to work around the 6hr job limit?
  • Does canceling a GitHub Action workflow also cancel the attached AWS Batch job?

nextstrain/cli/command/build.py Outdated Show resolved Hide resolved
@tsibley
Copy link
Member Author

tsibley commented Sep 21, 2023

  • Would --detach-on-interrupt be used to chain the jobs to work around the 6hr job limit?

Yep. See the relevant bit of the WIP (though I've set that down for now given other priorities).

  • Does canceling a GitHub Action workflow also cancel the attached AWS Batch job?

Yes, without --detach-on-interrupt given and with this PR's change to only require a single Ctrl-C/SIGINT when stdin is not a tty. Not currently.

In the WIP, I handle workflow run cancellation explicitly and indeed make sure it cancels the AWS Batch job.

I really like the new --cancel flag, that will be really useful for stopping trial builds 🌟

\o/

The signal.signal() repetition grated, particularly as I found myself
back in this code to add more such calls.
…t execeptions

This leads to a cleaner main loop and makes it much easier to add an
option for detaching on interrupt instead of stopping.

Diff best viewed with whitespace ignored and/or moved lines coloring, as
much of the diff is shifting code around.
… TTY

If stdin is not a TTY, then it's unlikely (though not impossible) to
receive SIGINT due to actual keyboard input (e.g. Ctrl-C) and more
likely to be programmatically signaled.  Confirmation is an unusual
hindrance in that case.
Its sometimes useful to treat Ctrl-C/SIGINT as non-terminating for the
remote job, for example in automation contexts (e.g. GitHub Actions)
where SIGINT may be sent as part of the automation service's job control
system or even merely as a molly-guard to prevent accidental
cancellation when attaching to a build to only observe it.
SIGHUP happens, for example, if the user closes a terminal (or the
terminal dies on its own) while our process is still running.  It can
happen in other cases too.  The remote job stays running whether we
handle SIGHUP or not, but it's nice to detach cleanly and be explicit
about our expectations.
This is friendlier to programmatic usage than attaching in the
background and sending a SIGINT.  It's also bound to be handy for direct
usage by folks.
@tsibley tsibley force-pushed the trs/aws-batch/job-control branch from abf7768 to d2375d1 Compare September 21, 2023 17:12
@tsibley
Copy link
Member Author

tsibley commented Sep 21, 2023

Repushed to add suggested clarification to option help and resolve conflicts in the changelog by rebasing onto latest master.

@tsibley
Copy link
Member Author

tsibley commented Sep 21, 2023

I'm a little unsure about the expected use of the --detach-on-interrupt option in the context of GitHub Action managed builds. This is definitely due to my lack of understanding of signals and how signals would propagate through GitHub Actions. (Still processing this GitHub community forum and this demo of GitHub actions signal handling.)

Yeah, it's kind of a mess. Signal use/propagation in GitHub Actions is lightly documented but not comprehensively.

We can do our own propagation if necessary, e.g. to signal the whole process group instead of just the leader. I did this in previous WIP prototypes.

  • Does canceling a GitHub Action workflow also cancel the attached AWS Batch job?

Yes, without --detach-on-interrupt given and with this PR's change to only require a single Ctrl-C/SIGINT when stdin is not a tty. Not currently.

Realizing this "yes" might still be modulated by signal propagation.

@tsibley tsibley merged commit d8ac7a4 into master Sep 21, 2023
@tsibley tsibley deleted the trs/aws-batch/job-control branch September 21, 2023 21:18
@tsibley
Copy link
Member Author

tsibley commented Sep 21, 2023

Yeah, it's kind of a mess.

Though to be fair to GitHub Actions, signalling cancellation in any system can be messy, esp. when there's support for graceful cancellation, and hard to simplify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants