Runners not scaling, and "Unable to query docker version" #1134
Did this happen after upgrading to 7.7 or did you set up a new runner from scratch?
Great question! I should have addressed this. Fresh install. I have 3 isolated installs running, each with the same behavior. We are migrating away from 6.5.2, which was configured as a single instance attached at the group level (7 groups, so 7 isolated runners). We've been using the Cattle Ops Terraform module for almost 2 years, but this is our first try at using it with autoscaling.
So the difference between the installations is the autoscaling, right? Everything else remained the same, especially the images used for the runner machine and the workers?
VPCs, IAM, and security groups stayed the same. Other than that, quite different. For our 6.5.x runners, we are using Method 3 from the README:
So we only have the Amazon Linux instance. For 7.7.0 we started with examples/runner-pre-registered/main.tf, and commented out all the network and security groups, since those are already defined. The issue we are seeing is that the runner manager will spin up as many Ubuntu worker instances as we specify in IdleCountMin, but no more than that.
As a test, I specified Ubuntu 18.04 in main.tf, and now the docker-machine errors are gone.
However, I still don't see auto-scaling. Just the number of idle runners. That eliminates my thought that it was docker-machine errors causing the lack of autoscaling. How does the runner manager determine when to spin up new workers? I attached a snippet of the logs, and we can see jobs in queue for 500 and 700 seconds, and gitlab-runner "Using existing docker-machine":
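For reference, a minimal sketch of the config.toml settings that appear to govern Docker Machine scale-up (illustrative values, not taken from this setup; the mapping of the module's max_jobs to limit is my assumption based on the variable description quoted further down):

    concurrent = 20            # global cap on jobs the manager runs at once

    [[runners]]
      limit    = 0             # per-runner job cap; 0 = unlimited (the module appears to expose this as max_jobs)
      executor = "docker+machine"

      [runners.machine]
        IdleCount = 1          # machines kept warm when there is no load
        IdleTime  = 600        # seconds an idle machine lives before removal
        MaxBuilds = 20         # builds per machine before it is recycled

Roughly, the manager keeps creating machines while queued jobs exceed what the existing machines can absorb, but never beyond the concurrent and limit caps.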
I made progress on this. The relevant setting is max_jobs in the module's variables.tf.
I had max_jobs set to 2, and what I saw was that only two runners and two jobs would be handled at a time for a pipeline. After changing this to zero, the runners autoscale: 12 jobs in parallel trigger creation of 12 runners. I am confused by the wording in the variables.tf file. This setting seems to limit the number of workers that are spun up when autoscaling is used.
This is what I wanted: to limit each runner to run 2 jobs in parallel. However, "View how this setting works with the Docker Machine executor (for autoscaling)." explains the behavior I have been seeing:
Perhaps the description of max_jobs can be clarified in the variables.tf file? Next I am trying to understand the difference between idle_count in these two settings: runner_worker_docker_machine_instance and runner_worker_docker_machine_autoscaling_options. It seems that idle_count in runner_worker_docker_machine_autoscaling_options is the one that controls how many idle runners there are, yes?
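If it helps, my understanding of how those two idle_count values land in the generated config.toml (an assumed mapping, not verified against the module source): the instance block feeds the top-level [runners.machine] settings, while the autoscaling options become [[runners.machine.autoscaling]] periods that override them on a schedule. Values below are illustrative:

    [runners.machine]
      IdleCount = 1                       # baseline number of idle workers
      IdleTime  = 600

      [[runners.machine.autoscaling]]     # overrides IdleCount/IdleTime while a period matches
        Periods   = ["* * 9-17 * * mon-fri *"]
        IdleCount = 4
        IdleTime  = 300
        Timezone  = "UTC"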
We're seeing the same error message in our setup.
Is it related to this issue? How does docker-machine install docker on the new nodes? Can we control which version it installs, since what we get (27.1.1 in our case) is incompatible?
@LTegtmeier Do you use your own AMIs for the Runner and the Workers? Can you try the default ones? I think docker-machine uses an SSH connection to the worker and installs all prerequisites before starting the pipeline.
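If controlling the installed engine version matters, docker-machine's provisioner accepts an --engine-install-url create flag, which gitlab-runner can pass through MachineOptions. A hedged sketch only: the pinned-script URL is illustrative, and I haven't checked how this module exposes MachineOptions.

    [runners.machine]
      MachineOptions = [
        "amazonec2-instance-type=m5.large",
        # point docker-machine at a pinned install script instead of the default get.docker.com (latest)
        "engine-install-url=https://releases.rancher.com/install-docker/24.0.sh",
      ]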
Looking again, the default changed since I last copied it into a parameter. I did that to make it easier to toggle between AMD and ARM. These runners are AMD, and the runner AMI uses the older default. We end up with the incompatible Docker version mentioned above.
That's what it looks like. If I start the AMI on a test EC2, Docker isn't installed at all. Docker Machine correctly installs it, but at a version that's not fully compatible with the docker-machine version.
We never solved this issue with Docker Machine or understood the root cause. We moved to the
@LTegtmeier I have fleeting enabled here. Could you please try with these AMIs? Runner:
The module configuration looks pretty straightforward (some of the options removed):

    runner_worker_docker_machine_fleet = {
      enable = true
    }
    runner_worker_docker_machine_instance = {
      name_prefix = "${var.runner_settings.runner_name}-${each.value.availability_zone}"
      types       = var.runner_settings.worker_instance_types
    }
    runner_worker = {
      ssm_access          = true
      request_concurrency = 1
      type                = "docker+machine"
    }
Describe the bug
version = "7.7.0"
I am not seeing new runners autoscale when jobs are queued up. And docker-machine ls has errors.

Also I see docker 26 on the runner, regardless of what version is specified in the Terraform and what shows in the config.toml.
It says minimum version 1.24, which is the same error docker-machine ls sees: "Minimum supported API version is 1.24".
I might be misunderstanding how this works, but I think the mismatch in Docker API version means it can't be detected that the runner is busy, so the new runners aren't spun up. Am I close?
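One way to confirm the suspected API mismatch from the runner manager (the machine name below is a placeholder; substitute a name shown by docker-machine ls):

    # list workers and the Docker version/error each one reports
    docker-machine ls

    # ask a worker's daemon which API version it actually serves
    docker-machine ssh runner-xxxxxxxx-1 "docker version --format '{{.Server.APIVersion}}'"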
To Reproduce
Terraform apply
Expected behavior
I have a test pipeline that spins up 12 jobs in parallel, and I only see 2 runners spun up with idle=1, 2 jobs per runner. Only 2 jobs are running and the others are queued.
I expect 5-7 runners to spin up to pick up the jobs.
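(With each worker limited to 2 concurrent jobs, 12 queued jobs / 2 jobs per worker = 6 workers, which is where the 5-7 estimate comes from.)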
Additional context
We can see the default Ubuntu 20.04 image installing docker 26 in the Ubuntu runner logs:
I tried specifying docker 20, 24, and removing the variable so the default 18 would install, but I don't see it happening.
docker_version = "public.ecr.aws/docker/library/docker:20"
Terraform values:
config.toml
Let me know what other info I should supply and what else I can try.