Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let artifact phase and post-command run in grace period #2899

Merged
merged 6 commits into from
Jul 29, 2024

Conversation

DrJosh9000
Copy link
Contributor

@DrJosh9000 DrJosh9000 commented Jul 25, 2024

Description

Artifact phase (the built-in artifact upload performed by the agent) and post-command hooks are now allowed to run in the signal grace period (a configurable length of time after job cancellation).

The signal grace period is also now enforced within the job executor (as well as outside of it when run normally as a subprocess of the agent).

Context

Artifacts uploaded this way can be relevant to user debugging job timeouts, so we should try a bit harder to upload them.

A similar argument applies to post-command hooks, which can be used for post-job cleanup.

Both of these (and usual job teardown) are still subject to the executor being SIGKILL-ed by the agent parent process. However, some use-cases run buildkite-agent bootstrap separately to the agent ("standalone bootstrap", which happens for example in agent-stack-k8s). So the executor needs to be able to kill itself in case there's nothing else to do so, hence graceCtx instead of nonCancelCtx.

The executor (a.k.a. bootstrap) is also normally passed the (computed) signal grace period as a flag/env var from the parent agent process, but the flag shares the same definition as agent start (with default -1). If there is no parent agent process, the -1 signal grace period will cause graceCtx to be cancelled immediately after the main context is cancelled. An existing integration test shows this. So, for standalone bootstrap, the same flag computation that happens in agent start to obtain a positive signalGracePeriod should be done in bootstrap as well.

Changes

  • Add withGracePeriod helper method, to create a new context that is cancelled some period of time after another context is cancelled.
  • Use withGracePeriod to replace the existing detached nonCancelCtx with a new graceCtx. The difference is that graceCtx will still be cancelled after the signal grace period.
  • Use graceCtx for tearDown, artifactPhase, and create a grace context for runPostCommandHooks.
  • Refactor the computation with cancel-grace-period and signal-grace-period-seconds from agent start into a function shared with bootstrap.

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)

@DrJosh9000 DrJosh9000 marked this pull request as ready for review July 29, 2024 01:11
@DrJosh9000 DrJosh9000 requested a review from a team July 29, 2024 01:12
@wolfeidau
Copy link
Contributor

wolfeidau commented Jul 29, 2024

@DrJosh9000 how does this effect --signal-grace-period-seconds, will it like go over now that artifact uploads are outside this period?

So this change resolves issues with this flag breaking uploads 🎉

@wolfeidau
Copy link
Contributor

I think we should also replace the context for the post command hook to avoid this error.

Screenshot 2024-07-29 at 12 05 38 PM

@DrJosh9000 DrJosh9000 changed the title Detach artifact phase context Detach artifact phase and post-command contexts Jul 29, 2024
@DrJosh9000 DrJosh9000 changed the title Detach artifact phase and post-command contexts Let artifact phase and post-command run in grace period Jul 29, 2024
Copy link
Contributor

@wolfeidau wolfeidau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work chasing down all these loose ends 🎉

Appreciate the extra internal documentation as well. ⭐

@DrJosh9000 DrJosh9000 merged commit adb7e8a into main Jul 29, 2024
1 check passed
@DrJosh9000 DrJosh9000 deleted the artifacts-and-timeouts branch July 29, 2024 05:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants