"No runner will be created, job is not queued." is causing cascading issues across my infrastructure #4262
We also run only ephemeral runners. With our small pools we had problems like hanging jobs during low traffic. For that reason a job-retry feature was introduced; it will retry running a job. You could maybe consider disabling the job queued check as well, which will just create a runner for each event. One pattern causing problems as well is matrix jobs where max parallel is set: GitHub fires all the events at once, hence runners get created, but when the jobs are ready to run the runners are gone. This is caused by the fact that GitHub is marking jobs as queued when they are not queued. This issue has been reported to GitHub as well. Besides the part above, I have no other ideas. The module is open source, and there is also a Slack channel. You can try posting the issue on the Slack channel as well. Maybe someone has an idea. |
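For reference, a minimal sketch of what combining those two suggestions might look like; the variable names are the ones quoted later in this thread, so verify them against the module version in use:

```hcl
# Sketch only: create a runner for every workflow_job event and retry jobs
# that were missed. Variable names as quoted in this thread; check the docs
# of the module version you run.
enable_ephemeral_runners = true
enable_job_queued_check  = false   # skip the "is the job still queued?" check
job_retry = {
  enable = true                    # experimental job-retry feature mentioned above
}
```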
It's definitely not related to
About disabling the queue check: in the documentation and the ephemeral example, it's mentioned that we must set the queue check to true to use JIT. In fact, it also says "By default JIT configuration is enabled for ephemeral runners [...]", so even setting it to true would just be redundant. But anyway, for testing's sake, I set |
by the way, @npalm notice that I'm using just ephemeral runners. As you have mentioned in other issues, you also use a pool. By using the pool, you will likely not see this issue, because if a runner wasn't created for a job, it will just pick one from the pool. I'm not interested in using a pool for cost-saving reasons: an EC2 instance should exist only as long as it's running a job or the minimum running time (5 minutes) hasn't passed. |
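A minimal sketch of the kind of setup being described, ephemeral with no pool; `minimum_running_time_in_minutes` is an assumption about which variable implements the 5-minute floor mentioned here:

```hcl
# Ephemeral, pool-less configuration as described above (sketch only).
# minimum_running_time_in_minutes is assumed to be the variable behind the
# "minimum running time (5 minutes)"; verify against the module docs.
enable_ephemeral_runners        = true
pool_config                     = []  # no warm pool: every job needs a freshly created runner
minimum_running_time_in_minutes = 5   # idle instances live at least this long before scale-down
```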
Just to clarify: we had similar problems for our small runner groups (where we have no pools); I misused the word pool here. For those runner groups we have enabled the job retry feature. This ensures that missed events will be picked up. All our runners are ephemeral. Linux start time is about 45 sec (own AMI). Job start time varies from 0 to about 4 minutes, on average below 2.
Finally, you can enable tracing on the lambdas so you can see the time it takes the lambdas to get an EC2 machine ready. |
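For illustration, enabling tracing might look roughly like the following; `tracing_config` and its shape are an assumption on my part, so check the variables exposed by the module version you are running:

```hcl
# Assumed variable for enabling AWS X-Ray tracing on the module's Lambdas
# (webhook, scale-up, scale-down); name and shape may differ per version.
tracing_config = {
  mode = "Active"
}
```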
This is the issue. When this is true, if I create 10 jobs, most likely 1-2 will get stuck looking for a runner, and if I check the logs of the scale-up, I will see the `No runner will be created, job is not queued.` message. If I set it to false, what I'm seeing is that jobs eventually finish, but the problem is that starting (as in, a runner was created and assigned to the job) might take between 2 and 20+ minutes. I don't know what the logic behind this is, because I'm testing in an isolated environment just to be sure that no other jobs are creating runners with these tags and that runners created by these test jobs are not picked up by another job. About the retry: it is said to be experimental and may change at any moment, so that's a big no for my case. |
does this line make sense? As you said, if ephemeral runners are used, the job queued check should be true, but here you are saying that if ephemeral runners are used and the variable was null, set it to false. Shouldn't it be the other way around?
|
If I remember correctly, the intent is: if the job queued check is not set and ephemeral is configured, the check should not be applied (unless explicitly set). This ensures each event results in a new instance. |
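In other words, the defaulting rule being described would look roughly like this; this is a paraphrase of the intent, not the module's actual source:

```hcl
# Paraphrase of the described default (not the module's actual code): an
# explicit setting always wins; when left null, ephemeral runners skip the
# job queued check so every event produces a new instance.
locals {
  job_queued_check_enabled = (
    var.enable_job_queued_check != null
    ? var.enable_job_queued_check
    : !var.enable_ephemeral_runners
  )
}
```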
when you say that disabling the check causes each event to create a new instance, what do you really mean? As far as I can see, it doesn't matter whether the check is enabled or not; the same number of instances is created (one per job). |
one way to easily reproduce this is configuring a runner like the one I mentioned above, ephemeral with no pool (i.e. every job needs to create a runner), and then creating a workflow with 10 parallel jobs. Upon running the workflow, it's very probable that 1-2 jobs will not run because the runner was not created. I have also verified that it's not related to spot instances. Using spot might delay the availability of a runner a little, because the spot request may fail and fall back to on-demand, but that's it. |
I can confirm this happens with the following conditions:
I have written a simple test for our internal system to figure out how to work around this.

```yaml
SpinUpTwentyInstances:
  runs-on: [$self-hosted-ephemeral-instance-no-pool]
  strategy:
    matrix:
      job:
        [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
  steps:
    - uses: actions/checkout@v4
    - name: Run a Command
      run: |
        echo "Running job in instance ${{ matrix.job }}"
```

However in our case we have an interesting situation because our configuration looks like this:

```hcl
enable_ephemeral_runners = true
enable_job_queued_check  = false
pool_config = [{
  size                = 14
  schedule_expression = "cron(* 8-20 ? * MON-FRI *)"
}]
```

So, this means that:
The simple answer would be to always keep the pool up and running, but that seems wasteful. I am currently investigating a better solution, including experimenting with a retry config. |
Quick update: adding a simple retry does effectively mitigate this.

```hcl
job_retry = {
  enable           = true
  max              = 3
  delay_in_seconds = 15
  delay_backoff    = 3
}
```

I appreciate this is not a technically "correct" solution, but it appears that GitHub is the limiting factor here, so there is little we can do other than just try again. |
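If `delay_backoff` acts as an exponential multiplier on `delay_in_seconds` (an assumption about how the module schedules retries, so verify against its docs), this configuration would check again roughly 15, 45, and 135 seconds after the original event, which should cover a runner that is slow to register.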
@muzfuz why do you say this is on GitHub's side? As far as I can see, GitHub is sending the events as expected, so it's up to whatever is receiving those events (i.e. this module) to react to them. Even in the webhook log group, you can see that the events have reached this module and were dispatched to the scale-up. It's in the scale-up where things go south. Anyway, if the
|
FYI, from @npalm's comment above
|
I wouldn't be so sure about this for two reasons:
But well, I could be missing something, so I would appreciate it if you could link to that issue reported to GitHub. |
I have been detecting that occasionally (as in, 200 times in the past 6 hours, for example) the scale-up is returning the `No runner will be created, job is not queued.` message. This is an issue because I'm using ephemeral runners, so if a runner is not created for a given job (`A`), it keeps waiting until a runner with the same labels is available. However, if a runner with such labels is available, it means that it was actually created by another job (`B`), so either `A` or `B` will run but the other will not. This means that for each runner that was not created, this issue gets exponentially worse.

It's happening in all my instances of this module (37 right now) and in all configurations (2-8 per module), be it spot or on-demand. The common factor is that everything is ephemeral.
An example of a runner config is below: