Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up pending pods for cancelled jobs #406

Merged
merged 1 commit into from
Oct 30, 2024

Conversation

DrJosh9000
Copy link
Contributor

@DrJosh9000 DrJosh9000 commented Oct 29, 2024

Why

Fixes #392.

How

imagePullBackOffWatcher is now podWatcher, and has a new responsibility: starting and stopping goroutines that poll Buildkite for job state. Each new goroutine should only run for a pod that is in the Pending phase: k8s has accepted the pod, but it isn't running yet. When it enters Running phase, then the agent within the pod can be responsible for handling cancellation.

Starting the cancel checker could be done from the scheduler after it has submitted the job to k8s, but this spreads responsibility between two somewhat distinct parts of the code. Putting it in the artist-formerly-known-as-imagePullBackOffWatcher centralises it, and also means it has a chance to find and clean up pods that were created before a controller restart.

The main downside to this approach is that the additional queries will eat into the GraphQL quota for the user. One possible solution might be the "job handover" idea (i.e. the stack controller acquires jobs like an agent, polls the Agent REST API for cancellation, then hands over the job to the agent in the pod when ready...)

@DrJosh9000 DrJosh9000 force-pushed the clean-up-pending-cancelled branch 4 times, most recently from 46ea37b to 6999b22 Compare October 29, 2024 05:28
@DrJosh9000 DrJosh9000 marked this pull request as ready for review October 29, 2024 05:29
@DrJosh9000 DrJosh9000 force-pushed the clean-up-pending-cancelled branch 4 times, most recently from cb9dc30 to 873f348 Compare October 29, 2024 06:28
Copy link
Contributor

@CerealBoy CerealBoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 question from me, otherwise looks great!

internal/controller/scheduler/pod_watcher.go Show resolved Hide resolved
@DrJosh9000 DrJosh9000 merged commit 637da43 into main Oct 30, 2024
1 check passed
@DrJosh9000 DrJosh9000 deleted the clean-up-pending-cancelled branch October 30, 2024 03:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Controller does not Delete Pending Jobs from K8S if they're cancelled on Buildkite — flooding the cluster
2 participants