Runners not scaling, and "Unable to query docker version" #1134
Did this happen after upgrading to 7.7 or did you set up a new runner from scratch?
Great question! I should have addressed this. Fresh install. I have 3 isolated installs running, each with the same behavior. We are migrating away from 6.5.2, which was configured as a single instance attached at the group level (7 groups, so 7 isolated runners). We've been using the Cattle Ops Terraform module for almost 2 years, but this is our first try at using it with autoscaling.
So the difference between the installations is the autoscaling, right? Everything else remained the same, especially the images used for the runner machine and the workers?
VPCs, IAM, and security groups stayed the same. Other than that, quite different. For our 6.5.x runners, we are using Method 3 from the README:
So we only have the Amazon Linux instance. For 7.7.0 we started with examples/runner-pre-registered/main.tf, and commented out all the network and security groups, since those are already defined. The issue we are seeing is that the runner manager will spin up as many Ubuntu worker instances as we specify in IdleCountMin, but no more than that.
As a test, I specified Ubuntu 18.04 in main.tf, and now the docker-machine errors are gone.
However, I still don't see auto-scaling. Just the number of idle runners. That eliminates my thought that it was docker-machine errors causing the lack of autoscaling. How does the runner manager determine when to spin up new workers? I attached a snippet of the logs, and we can see jobs in queue for 500 and 700 seconds, and gitlab-runner "Using existing docker-machine":
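For reference, a minimal sketch of the config.toml settings that appear to govern Docker Machine scale-up (illustrative values, not taken from this setup; the mapping of the module's max_jobs to limit is my assumption based on the variable description quoted further down):

    concurrent = 20            # global cap on jobs the manager runs at once

    [[runners]]
      limit    = 0             # per-runner job cap; 0 = unlimited (the module appears to expose this as max_jobs)
      executor = "docker+machine"

      [runners.machine]
        IdleCount = 1          # machines kept warm when there is no load
        IdleTime  = 600        # seconds an idle machine lives before removal
        MaxBuilds = 20         # builds per machine before it is recycled

Roughly, the manager keeps creating machines while queued jobs exceed what the existing machines can absorb, but never beyond the concurrent and limit caps.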
I made progress on this. The relevant setting is max_jobs in the module's variables.tf.
I had max_jobs set to 2, and what I saw was that only two runners and two jobs would be handled at a time for a pipeline. After changing this to zero, the runners autoscale: 12 jobs in parallel trigger creation of 12 runners. I am confused by the wording in the variables.tf file. This setting seems to limit the number of workers that are spun up when autoscaling is used.
This is what I wanted: to limit each runner to run 2 jobs in parallel. However, "View how this setting works with the Docker Machine executor (for autoscaling)." explains the behavior I have been seeing:
Perhaps the description of max_jobs can be clarified in the variables.tf file? Next I am trying to understand the difference between idle_count in these two settings: runner_worker_docker_machine_instance and runner_worker_docker_machine_autoscaling_options. It seems that idle_count in runner_worker_docker_machine_autoscaling_options is the one that controls how many idle runners there are, yes?
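If it helps, my understanding of how those two idle_count values land in the generated config.toml (an assumed mapping, not verified against the module source): the instance block feeds the top-level [runners.machine] settings, while the autoscaling options become [[runners.machine.autoscaling]] periods that override them on a schedule. Values below are illustrative:

    [runners.machine]
      IdleCount = 1                       # baseline number of idle workers
      IdleTime  = 600

      [[runners.machine.autoscaling]]     # overrides IdleCount/IdleTime while a period matches
        Periods   = ["* * 9-17 * * mon-fri *"]
        IdleCount = 4
        IdleTime  = 300
        Timezone  = "UTC"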
We're seeing the same error message in our setup.
Is it related to this issue? How does docker-machine install docker on the new nodes? Can we control which version it installs, since what we get (27.1.1 in our case) is incompatible?
@LTegtmeier Do you use your own AMIs for the Runner and the Workers? Can you try the default ones? I think docker-machine uses an SSH connection to the worker and installs all prerequisites before starting the pipeline.
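If controlling the installed engine version matters, docker-machine's provisioner accepts an --engine-install-url create flag, which gitlab-runner can pass through MachineOptions. A hedged sketch only: the pinned-script URL is illustrative, and I haven't checked how this module exposes MachineOptions.

    [runners.machine]
      MachineOptions = [
        "amazonec2-instance-type=m5.large",
        # point docker-machine at a pinned install script instead of the default get.docker.com (latest)
        "engine-install-url=https://releases.rancher.com/install-docker/24.0.sh",
      ]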
Looking again, the default changed since I last copied it into a parameter. I did that to make it easier to toggle between AMD and ARM. These runners are AMD, and the runner AMI uses the older default. We end up with the incompatible Docker version mentioned above.
That's what it looks like. If I start the AMI on a test EC2, Docker isn't installed at all. Docker Machine correctly installs it, but at a version that's not fully compatible with the docker-machine version.
We never solved this issue with Docker Machine or understood the root cause. We moved to the
@LTegtmeier I have fleeting enabled here. Could you please try with these AMIs? Runner:
The module configuration looks pretty straightforward (some of the options removed):

    runner_worker_docker_machine_fleet = {
      enable = true
    }
    runner_worker_docker_machine_instance = {
      name_prefix = "${var.runner_settings.runner_name}-${each.value.availability_zone}"
      types       = var.runner_settings.worker_instance_types
    }
    runner_worker = {
      ssm_access          = true
      request_concurrency = 1
      type                = "docker+machine"
    }
Describe the bug
version = "7.7.0"
I am not seeing new runners autoscale when jobs are queued up. And docker-machine ls has errors.

Also I see docker 26 on the runner, regardless of what version is specified in the Terraform and what shows in the config.toml.
It says minimum version 1.24, which is the same error docker-machine ls sees: "Minimum supported API version is 1.24".
I might be misunderstanding how this works, but I think the mismatch in Docker API version means it can't be detected that the runner is busy, so the new runners aren't spun up. Am I close?
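One way to confirm the suspected API mismatch from the runner manager (the machine name below is a placeholder; substitute a name shown by docker-machine ls):

    # list workers and the Docker version/error each one reports
    docker-machine ls

    # ask a worker's daemon which API version it actually serves
    docker-machine ssh runner-xxxxxxxx-1 "docker version --format '{{.Server.APIVersion}}'"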
To Reproduce
Terraform apply
Expected behavior
I have a test pipeline that spins up 12 jobs in parallel, and I only see 2 runners spun up with idle=1, 2 jobs per runner. Only 2 jobs are running and the others are queued.
I expect 5-7 runners to spin up to pick up the jobs.
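(With each worker limited to 2 concurrent jobs, 12 queued jobs / 2 jobs per worker = 6 workers, which is where the 5-7 estimate comes from.)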
Additional context
We can see the default Ubuntu 20.04 image installing docker 26 in the Ubuntu runner logs:
I tried specifying docker 20, 24, and removing the variable so the default 18 would install, but I don't see it happening.
docker_version = "public.ecr.aws/docker/library/docker:20"
Terraform values:
config.toml
Let me know what other info I should supply and what else I can try.