Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add force-grace-period-seconds argument to step cancel command #3084

Merged
merged 1 commit into from
Nov 12, 2024

Conversation

mitchbne
Copy link
Member

@mitchbne mitchbne commented Nov 11, 2024

Description

We recently introduced a buildkite-agent step cancel subcommand. The command can accept a flag --force, which will immediately cancel unfinished jobs in Buildkite (i.e. mark jobs as 'canceled') without waiting for the agents running the jobs to finish. Any subsequent artifact uploads to Buildkite by these agent running this job will begin to fail.

Now this --force flag might feel a little counter-intuitive when you think about how agents have a cancel-grace-period argument - you could not use the --force flag, marking the jobs as 'canceling' and then rely on agents to SIGKILL their process once their cancel-grace-period has elapsed.

Imagine though that an agent isn't responding to Buildkite's request to cancel the job (and so we can't rely on cancel-grace-period within an agent). In this PR, we add an additional argument --force-grace-period to the subcommand to handle this. The value of force-grace-period will determine how long within Buildkite it will take for the job to move from a 'canceling' to a 'canceled' state (more information about job states can be read here). Because jobs are then in a 'canceled' state, lost agents can be reaped much faster within Buildkite.

It's worth noting:

  • Passing --force-grace-period-seconds without also supplying --force will not do anything. We rely on the --force flag to forecfully mark a job as 'canceled' within Buildkite.
  • The value of --force-grace-period-seconds must be greater than or equal to zero. Negative values just don't make sense.
  • The value of --force-grace-period-seconds is essentially overriding the value of cancel-grace-period for agents that are still connected to Buildkite. The default value of --force-grace-period-seconds if not supplied is the value of the current agent running the buildkite-agent step cancel command: if the agent running the buildkite-agent step cancel command has a cancel-grace-period value less than the agents having their jobs 'canceled', these agents will cleanup sooner than you might anticipate.

Context

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)

Copy link
Member

@blaknite blaknite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM! Meets the spec as we defined 🧡

@mitchbne mitchbne marked this pull request as ready for review November 11, 2024 04:58
clicommand/step_cancel.go Outdated Show resolved Hide resolved
Copy link
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #2250 we started on a path towards parsing duration parameters as time.Duration, which means people can write more convenient expressions like 1m or 17h, but in the meantime, we would be more specific with time units. Which is why Nepa put seconds in the name of signal-grace-period-seconds - for backwards compatibility cancel-grace-period had to be left.

To be consistent 🙃 this new flag (and env var) also ought to end with -seconds (_SECONDS).

Other than one other comment, this looks pretty good to me!

clicommand/step_cancel.go Outdated Show resolved Hide resolved
@mitchbne mitchbne changed the title Add force-grace-period argument to step cancel command Add force-grace-period-seconds argument to step cancel command Nov 11, 2024
Copy link
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅

@mitchbne mitchbne merged commit 2a013d1 into main Nov 12, 2024
1 check passed
@mitchbne mitchbne deleted the step-cancel-grace-period-arg branch November 12, 2024 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants