enable auto-scaling of agents #70

ChristianKuehnel · 2019-12-05T09:32:57Z

It would be nice to automatically scale up and down the number of agents based on the length of the build queue.

ChristianKuehnel · 2020-05-11T13:06:08Z

I looked into scaling the number of agents based on the load:

Kubernetes scaling in general:
- Kubernetes offers several ways to scale pods (and the agents in them) up and down. We can either configure autoscaling on a custom metric, or run a separate process that monitors the queue length and scales the agents accordingly.
- The agent process needs to react to SIGTERM as the signal to shut down. Ideally it finishes the current build and then exits.
- Kubernetes then waits for terminationGracePeriodSeconds before sending SIGKILL; this can be configured in the deployment.
- So we need to set terminationGracePeriodSeconds to at least the longest build time on a given machine (e.g. 1-2 hours).
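
To make the SIGTERM behavior from the list above concrete, here is a minimal Python sketch of an agent loop that finishes its current build before exiting. It is only an illustration, not how any real agent is implemented: `fetch_job`, `run_build`, and `install_sigterm_handler` are hypothetical names, and a real agent would wait for new jobs instead of exiting when the queue is empty.

```python
import signal
import threading

# Set when SIGTERM arrives; the loop only checks it between builds.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request without interrupting the current build."""
    stop_requested.set()


def install_sigterm_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_agent(fetch_job, run_build):
    """Run builds until a stop is requested or no job is available.

    fetch_job and run_build are hypothetical callables supplied by the
    caller. Returns the number of completed builds.
    """
    completed = 0
    while not stop_requested.is_set():
        job = fetch_job()
        if job is None:
            break
        # The handler only sets a flag, so this build runs to completion
        # even if SIGTERM arrives while it is in progress.
        run_build(job)
        completed += 1
    return completed
```

With a loop like this, the pod exits on its own as long as the longest build fits into terminationGracePeriodSeconds; otherwise Kubernetes sends SIGKILL mid-build.
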
For Jenkins swarm agents:
- They do not react to SIGTERM at all.
- So we would need to script this: create a preStop hook that 1) marks the agent as offline on the master so it does not get any new jobs and 2) runs killall java once the node is no longer building anything.
- I would avoid this effort and wait until we're on Buildkite, which makes things much easier.
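
For reference, the two preStop steps above could be sketched roughly as follows in Python. This is only an outline under assumptions: `mark_offline`, `is_building`, and `stop_agent` are hypothetical placeholders for whatever calls would talk to the Jenkins master and stop the agent process; none of them is an existing API.

```python
import time


def drain_agent(mark_offline, is_building, stop_agent,
                poll_seconds=30.0, sleep=time.sleep):
    """Take an agent out of rotation, wait until it is idle, then stop it.

    The three callables are hypothetical and supplied by the caller.
    Returns the number of polls that found the agent still building.
    """
    mark_offline()        # step 1: no new jobs are scheduled on this agent
    busy_polls = 0
    while is_building():  # step 2: wait for the running build to finish
        busy_polls += 1
        sleep(poll_seconds)
    stop_agent()          # e.g. the equivalent of "killall java"
    return busy_polls
```

The hook would still have to finish within terminationGracePeriodSeconds, since Kubernetes sends SIGKILL once that period is over.
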
Buildkite agents should work fine: on SIGTERM they stop accepting new jobs and exit once the current job finishes. So we just need to set a suitable terminationGracePeriodSeconds and we're done.
The Windows agents are currently running in Docker containers on normal Windows VMs. To scale them we would have to migrate them to Kubernetes first, see "See if we can use Kubernetes for Windows" #115.

ChristianKuehnel · 2020-06-03T15:38:10Z

Buildkite provides a tool for collecting agent and queue metrics that can be used for scaling: https://github.com/buildkite/buildkite-agent-metrics
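
As a rough sketch of how such metrics could drive scaling, the Python function below turns job counts into a target number of agents. Everything in it is an assumption for illustration: the function name, its parameters, and the default limits are hypothetical, and the job counts are assumed to come from the metrics collected by the tool above.

```python
import math


def desired_agents(scheduled_jobs, running_jobs,
                   min_agents=1, max_agents=10, jobs_per_agent=1):
    """Return how many agents to run for the current load.

    scheduled_jobs: jobs waiting in the queue (assumed input).
    running_jobs: jobs currently being built (assumed input).
    The result is clamped to the range [min_agents, max_agents].
    """
    if scheduled_jobs < 0 or running_jobs < 0:
        raise ValueError("job counts must not be negative")
    if jobs_per_agent < 1:
        raise ValueError("jobs_per_agent must be at least 1")
    # Count running jobs too, so the target never drops below the number
    # of agents that are still busy.
    demand = scheduled_jobs + running_jobs
    wanted = math.ceil(demand / jobs_per_agent)
    return max(min_agents, min(max_agents, wanted))
```

For example, `desired_agents(5, 3)` returns 8, and `desired_agents(50, 3)` is capped at the default maximum of 10. A separate process could poll the metrics periodically and set the deployment's replica count to this value.
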

ChristianKuehnel added the enhancement label Dec 20, 2019

ChristianKuehnel added the performance label Apr 17, 2020

GMNGeoffrey mentioned this issue Jul 28, 2021

Very long wait times for Windows agents #336

Closed
