
enable auto-scaling of agents #70

Open
ChristianKuehnel opened this issue Dec 5, 2019 · 2 comments
Labels: enhancement, performance

Comments

@ChristianKuehnel
Collaborator

It would be nice to automatically scale up and down the number of agents based on the length of the build queue.

@ChristianKuehnel added the enhancement label on Dec 20, 2019
@ChristianKuehnel
Collaborator Author

I looked into scaling the number of agents based on the load:

  • Kubernetes scaling in general:
    • Kubernetes offers several ways to scale pods (and the agents in them) up and down. We can either configure a custom metric for autoscaling or run a separate process that monitors the queue length and scales the agents.
    • The agent process needs to react to SIGTERM as the indication to shut down. Ideally it finishes the current build and then exits.
    • Kubernetes then waits for terminationGracePeriodSeconds before sending SIGKILL. This can be configured in the deployment.
    • So we need to set terminationGracePeriodSeconds to the longest build time on a given machine (e.g. 1-2 hours).
  • For Jenkins swarm agents:
    • They do not react to SIGTERM at all.
    • So we would need to script this: create a preStop hook that 1) marks the agent as offline on the master so it does not get any new jobs, and 2) runs killall java once the node is no longer building anything.
    • I would avoid the effort here and wait until we're on Buildkite, which makes things much easier.
  • Buildkite agents should work fine: on SIGTERM they stop accepting new jobs and exit once the current job finishes. So we just need to set a good terminationGracePeriodSeconds and we're done.
  • The Windows agents are currently running in Docker containers on normal Windows VMs. To scale them we would first have to migrate them to Kubernetes, see "See if we can use Kubernetes for Windows" (#115).
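
The shutdown behaviour described above (stop taking new jobs on SIGTERM, finish the current build, then exit) can be sketched roughly as follows. This is only an illustration of the pattern, not how the Jenkins or Buildkite agents are implemented; `GracefulWorker`, `request_stop` and `install_sigterm_handler` are hypothetical names, and the in-memory job list stands in for the real build queue.

```python
import signal
import threading


class GracefulWorker:
    """Hypothetical agent loop: after a stop request it takes no new jobs,
    but the job that is currently running is allowed to finish."""

    def __init__(self, jobs):
        self._jobs = list(jobs)          # stand-in for the real build queue
        self._stop = threading.Event()   # set once a shutdown was requested
        self.completed = []              # names of jobs that ran to completion

    def request_stop(self, signum=None, frame=None):
        # Usable as a signal handler: it only records the request and
        # never interrupts the job that is currently running.
        self._stop.set()

    def run(self):
        # The stop flag is checked only between jobs, never during one.
        while self._jobs and not self._stop.is_set():
            job = self._jobs.pop(0)
            job(self)
            self.completed.append(job.__name__)
        return self.completed


def install_sigterm_handler(worker):
    """Route SIGTERM (sent by Kubernetes on pod termination) to the worker.
    Must be called from the main thread."""
    signal.signal(signal.SIGTERM, worker.request_stop)
```

With this pattern the pod only survives long enough to finish the build if terminationGracePeriodSeconds is at least as long as the longest build; otherwise the SIGKILL arrives mid-job.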

@ChristianKuehnel
Collaborator Author

Buildkite offers an API to collect metrics that can be used for scaling: https://github.com/buildkite/buildkite-agent-metrics
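
As a rough sketch of how such metrics could drive scaling: given the number of waiting and running jobs, compute a target agent count and clamp it to a configured range. The function name, the parameter names, and the one-job-per-agent assumption are all hypothetical here and are not taken from buildkite-agent-metrics.

```python
def desired_agents(scheduled_jobs, running_jobs, min_agents=1, max_agents=20):
    """Return the target number of agents for the current load.

    Assumes each agent runs one job at a time, so the demand is the number
    of waiting jobs plus the number of jobs already running. The result is
    clamped to [min_agents, max_agents]; both limits are illustrative.
    """
    if min_agents > max_agents:
        raise ValueError("min_agents must not exceed max_agents")
    demand = scheduled_jobs + running_jobs
    return max(min_agents, min(max_agents, demand))
```

A separate process could poll the metrics periodically and apply the result to the agent deployment. Scaling down remains safe only if the agents shut down gracefully on SIGTERM, as discussed above.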
