
"No runner will be created, job is not queued." is causing cascading issues across my infrastructure #4262

Open
ay0o opened this issue Nov 15, 2024 · 14 comments

Comments

@ay0o

ay0o commented Nov 15, 2024

I have noticed that the scale-up lambda occasionally returns the No runner will be created, job is not queued. message; around 200 times in the past 6 hours, for example.

This is a problem because I'm using ephemeral runners: if no runner is created for a given job (A), that job keeps waiting until a runner with the same labels becomes available. But if such a runner does become available, it was actually created for another job (B), so either A or B will run and the other will not. Every runner that fails to be created therefore leaves another job stranded, and the problem keeps compounding.

It's happening in all my deployments of this module (37 right now) and in all runner configurations (2-8 per module), whether spot or on-demand. The common factor is that everything is ephemeral.

An example of a runner config is below:

    "default" = {
      matcherConfig = {
        labelMatchers = [["self-hosted", var.project, "default"]]
        exactMatch    = true
      }
      runner_config = {
        delay_webhook_event                  = 0
        enable_ephemeral_runners             = true
        enable_job_queued_check              = true
        minimum_running_time_in_minutes      = 5
        enable_organization_runners          = true
        enable_on_demand_failover_for_errors = ["InsufficientInstanceCapacity"]
        instance_allocation_strategy         = "price-capacity-optimized"
        instance_target_capacity_type        = "spot"
        instance_types                       = ["m6i.large", "m5.large"]
        runner_architecture                  = "x64"
        runner_as_root                       = true
        runner_extra_labels                  = [var.project, "default"]
        runner_group_name                    = var.project
        runner_os                            = "linux"
        runners_maximum_count                = 30
        userdata_post_install                = "docker login -u ${local.docker_hub_user} -p ${local.docker_hub_password}"
      }
    }
@npalm
Member

npalm commented Nov 15, 2024

We also run with only ephemeral runners. With our small pools we had problems like hanging jobs during low traffic. For that reason a job-retry feature was introduced. This will retry running a job. You could also consider disabling the job queued check; this will simply create a runner for each event.
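
As a rough illustration of the two suggestions above, here is a minimal sketch expressed as module inputs. The variable names are the ones already used in this thread; the retry values are only illustrative, and the exact nesting may differ between the single-runner and multi-runner setups (the multi-runner example earlier nests most of these under runner_config).

  # Sketch only: the two suggestions above as module inputs.
  # Values are illustrative; check your module version for exact names and nesting.
  enable_ephemeral_runners = true
  enable_job_queued_check  = false # do not check the job state, create a runner for every workflow_job.queued event

  job_retry = {
    enable           = true # retry jobs that are still waiting for a runner after the delay
    max              = 2    # illustrative
    delay_in_seconds = 30   # illustrative
  }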

One pattern that also causes problems is matrix jobs with max-parallel set: GitHub fires all the events at once, so runners get created, but by the time the jobs are ready to run the runners are gone. This is caused by GitHub marking jobs as queued when they are not actually queued. This issue has also been reported to GitHub.

Besides the points above, I have no other ideas. The module is open source and has a Slack channel; you could post the issue there as well. Maybe someone has an idea.

@ay0o
Author

ay0o commented Nov 18, 2024

It's definitely not related to max-parallel.

About disabling the queued check: the documentation and the ephemeral example say that the check must be set to true to use JIT. In fact, the documentation also says "By default JIT configuration is enabled for ephemeral runners [...]", so even setting it to true explicitly would be redundant.

But anyway, for testing's sake, I set enable_job_queued_check = false, and what I'm seeing is that the jobs do eventually finish, but it takes a long time to pick up a runner: rarely below 5 minutes, normally around 10 minutes, and on some occasions even beyond 20 minutes.

@ay0o
Author

ay0o commented Nov 18, 2024

By the way, @npalm, note that I'm using ephemeral runners only. As you have mentioned in other issues, you also use a pool. With a pool you will most likely not see this issue, because if a runner wasn't created for a job, the job will simply pick one from the pool.

I'm not interested in using a pool, for cost-saving reasons. An EC2 instance should exist only as long as it's running a job or the minimum running time (5 minutes) hasn't passed.

@npalm
Member

npalm commented Nov 19, 2024

Just to clarify: we had similar problems with our small runner groups (where we have no pools); I misused the word pool here. For those runner groups we have enabled the job retry check. This ensures that missed events are picked up. All our runners are ephemeral. Linux start time is about 45 seconds (own AMI). Job start time varies from 0 to about 4 minutes, on average below 2.

  • enable_job_queued_check - when true, each event leads to a runner; this is what you need when you have ephemeral runners and no pool
  • JIT is designed for ephemeral runners, so it is indeed enabled by default
  • To improve boot time we use our own AMIs

Finally, you can enable tracing on the lambdas so you can see how long it takes them to get an EC2 machine ready.
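
To make the custom-AMI and tracing points concrete, here is a hedged sketch. ami_filter and ami_owners are module inputs; the AMI name pattern and account ID below are placeholders, and the tracing_config shape is an assumption based on recent module versions, so verify it against the version you run.

  # Sketch: use a pre-baked AMI to cut runner boot time, and enable tracing on
  # the lambdas to see where startup time goes. Placeholder/assumed values.
  ami_filter = {
    name  = ["my-runner-ami-*"]   # hypothetical name pattern of your own AMI
    state = ["available"]
  }
  ami_owners = ["123456789012"]   # hypothetical AWS account that owns the AMI

  tracing_config = {
    mode = "Active"               # assumption: recent versions expose this object for X-Ray tracing
  }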

@ay0o
Author

ay0o commented Nov 19, 2024

enable_job_queued_check - when true, each event leads to a runner; this is what you need when you have ephemeral runners and no pool

This is the issue. When this is true, if I create 10 jobs, most likely 1-2 of them will get stuck looking for a runner, and if I check the logs of the scale-up lambda, I will see No runner will be created, job is not queued. even though the workflow_job.queued event was detected by the webhook.

If I set it to false, what I'm seeing is that jobs eventually finish, but starting (as in, a runner being created and assigned to the job) might take anywhere between 2 and 20+ minutes. I don't know what the logic behind this is, because I'm testing in an isolated environment precisely to make sure that no other jobs are creating runners with these labels and that runners created by these test jobs are not picked up by other jobs.

About the retry: it is documented as experimental and subject to change at any moment, so that's a big no for my case.

@ay0o
Author

ay0o commented Nov 19, 2024

enable_job_queued_check = var.enable_job_queued_check == null ? !var.enable_ephemeral_runners : var.enable_job_queued_check

Does this line make sense? As you said, if ephemeral runners are used, the job queued check should be true, but here you are saying that if ephemeral runners are used and the variable was left null, the check is set to false. Shouldn't it be the other way around?

enable_job_queued_check = var.enable_job_queued_check == null ? var.enable_ephemeral_runners : var.enable_job_queued_check
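
To make the question concrete, this is how the two expressions evaluate when the variable is left at null and enable_ephemeral_runners = true:

  # With var.enable_job_queued_check = null and var.enable_ephemeral_runners = true:
  #
  #   current expression:  null == null ? !true : ...   # => false (queued check disabled)
  #   proposed expression: null == null ? true  : ...   # => true  (queued check enabled)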

@npalm
Member

npalm commented Nov 19, 2024

If I remember correctly, the intent is that if the job queued check is not set and ephemeral runners are configured, the check should not be applied (unless explicitly set). This ensures that each event results in a new instance.

@ay0o
Author

ay0o commented Nov 20, 2024

When you say that disabling the check causes each event to create a new instance, what do you really mean? As far as I can see, it doesn't matter whether the check is enabled or not; the same number of instances is created (one per job).

@ay0o
Author

ay0o commented Nov 22, 2024

One way to easily reproduce this is to configure a runner like the one I mentioned above, ephemeral with no pool (i.e. every job needs to create a runner), and then create a workflow with 10 parallel jobs.

Upon running the workflow, it's very likely that 1-2 of the jobs will never run because their runner was not created.

I have also verified that it's not related to spot instances. Using spot might slightly delay the availability of a runner, because the spot request may fail and fall back to on-demand, but that's it.

@muzfuz

muzfuz commented Nov 22, 2024

I can confirm this happens with the following conditions:

  • Ephemeral
  • No pool
  • A matrix of 10 or so parallel jobs

I have written a simple test for our internal system to figure out how to work around this.

  SpinUpTwentyInstances:
    runs-on: [$self-hosted-ephemeral-instance-no-pool]
    strategy:
      matrix:
        job:
          [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    steps:
      - uses: actions/checkout@v4
      - name: Run a Command
        run: |
          echo "Running job in instance ${{ matrix.job }}"

However, in our case we have an interesting situation, because our configuration looks like this:

  enable_ephemeral_runners = true
  enable_job_queued_check  = false
  pool_config = [{
    size                = 14
    schedule_expression = "cron(* 8-20 ? * MON-FRI *)"
  }]

So, this means that:

  • During the "work hours" when the pool is available, running a matrix of jobs is absolutely no problem.
  • Outside "work hours", running a matrix of jobs fails.

The simple answer would be to always keep the pool up and running, but that seems wasteful.
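
For comparison, keeping the pool always up would just mean widening the schedule on the same pool_config shape shown above; a sketch:

  # Sketch: keep the pool topped up around the clock instead of only during
  # work hours. Simple, but keeps up to 14 idle runners warm 24/7.
  pool_config = [{
    size                = 14
    schedule_expression = "cron(* * ? * * *)"   # every minute, every day
  }]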

I am currently investigating a better solution, including experimenting with a retry config.

@muzfuz

muzfuz commented Nov 25, 2024

Quick update: adding a simple retry does effectively mitigate this.

  job_retry = {
    enable           = true
    max              = 3
    delay_in_seconds = 15
    delay_backoff    = 3
  }

I appreciate this is not a technically "correct" solution, but it appears that GitHub is the limiting factor here, so there is little we can do other than just try again.
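
If retry volume against the GitHub API is a concern (the module's own description of job_retry, quoted later in this thread, warns about the app's rate limit), the same block can be tuned toward fewer, slower retries; the values below are purely illustrative:

  # Sketch: fewer and slower retries to limit extra GitHub API calls.
  job_retry = {
    enable           = true
    max              = 2    # fewer attempts per job
    delay_in_seconds = 60   # wait longer before the first check
    delay_backoff    = 2
  }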

@ay0o
Author

ay0o commented Nov 25, 2024

@muzfuz why do you say this is on GitHub's side? As far as I can see, GitHub is sending the events as expected, so it's up to whatever receives those events (i.e. this module) to react to them.

Even in the webhook log group, you can see that the events reached this module and were dispatched to the scale-up lambda. It's in the scale-up where things go south.

Anyway, if job_retry seems to fix this, then the follow-up question is: @npalm, can we trust it? Because the description is clear:

"Experimental! Can be removed / changed without triggering a major release. Configure job retries. The configuration enables job retries (for ephemeral runners). After creating the instances a message will be published to a job retry queue. The job retry check lambda is checking after a delay if the job is queued. If not the message will be published again on the scale-up (build queue). Using this feature can impact the rate limit of the GitHub app."

@muzfuz

muzfuz commented Nov 25, 2024

FYI, from @npalm's comment above:

One pattern that also causes problems is matrix jobs with max-parallel set: GitHub fires all the events at once, so runners get created, but by the time the jobs are ready to run the runners are gone. This is caused by GitHub marking jobs as queued when they are not actually queued. This issue has also been reported to GitHub.

@ay0o
Author

ay0o commented Nov 25, 2024

I wouldn't be so sure about this for two reasons:

  • This is not related to parallel jobs, much less to the max-parallel configuration. You can just as easily reproduce it by firing multiple jobs from different repositories.
  • We have the minimum_running_time_in_minutes parameter precisely to prevent a runner from being terminated too early.

But I could be missing something, so I would appreciate it if you could link to that issue reported to GitHub.
