Clean up pending pods for cancelled jobs #406
Merged
Why
Fixes #392.
How
`imagePullBackOffWatcher` is now `podWatcher`, and has a new responsibility: starting and stopping goroutines that poll Buildkite for job state. Each new goroutine should only run for a pod that is in the Pending phase: k8s has accepted the pod, but it isn't running yet. Once the pod enters the Running phase, the agent within the pod can be responsible for handling cancellation.
Starting the cancel checker could be done from the scheduler after it has submitted the job to k8s, but this spreads responsibility between two somewhat distinct parts of the code. Putting it in the artist-formerly-known-as-`imagePullBackOffWatcher` centralises it, and also means it has a chance to find and clean up pods that were created before a controller restart.
The main downside to this approach is that the additional queries will eat into the GraphQL quota for the user. One possible solution might be the "job handover" idea (i.e. the stack controller acquires jobs like an agent, polls the Agent REST API for cancellation, then hands over the job to the agent in the pod when ready...)