Runner.Listener failing to detect queued workflows/jobs for Enterprise runners #1059
Comments
The service is trying to find an available runner when the job is scheduled to run, based on certain rules. In your case, the additional job probably gets queued to a different level (repo or org) based on these rules. We are going to work on improving the auto-scale experience for the Actions runner. I am going to close this issue for now since it's not really scoped to the runner repo (the service is in charge of deciding which runner takes which job). Feel free to report this issue at https://github.community/c/code-to-cloud/github-actions/41
For those who end up here, here's my post in the GitHub community forum on the subject: https://github.community/t/bug-self-hosted-runners-at-the-enterprise-level-fail-to-detect-queued-jobs/176348
I replied in the forum. Appreciate you writing this up twice. TL;DR: yes, we would like to fix this (are fixing it, actually -- since we noticed it as well). It might take some time to roll out since it impacts job assignment, which is a pretty critical area of the product to not break.
Describe the bug
When configuring self-hosted runners at the Enterprise level, if a workflow or job is kicked off prior to a new worker coming online, that workflow and/or job remains indefinitely queued, even after runners successfully come online and are listening for jobs (and those runners work as expected if a workflow/job is kicked off when they're online).
To Reproduce
Steps to reproduce the behavior:

`actions-runner-controller` used in Kubernetes triggers scale-ups based on the percentage of workers that are 'busy' as per the GitHub API response, and autoscaling is working as expected.

Scenario: `/api/v3/enterprises/[name]/actions/runners` returns the correct number of available self-hosted runners at the enterprise level, but the 3 queued jobs never run.

See API Response
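For context, here is a minimal sketch of the busy-percentage calculation that this kind of percentage-based autoscaling relies on. It assumes the documented response shape of the enterprise runners endpoint (a `runners` array whose entries carry a boolean `busy` field); the sample payload and the function name are illustrative, not taken from actions-runner-controller's source:

```python
import json

# Illustrative sample of the /api/v3/enterprises/[name]/actions/runners
# response, trimmed to the fields relevant to a scaling decision.
sample = json.loads("""
{
  "total_count": 3,
  "runners": [
    {"id": 1, "name": "runner-a", "status": "online", "busy": true},
    {"id": 2, "name": "runner-b", "status": "online", "busy": true},
    {"id": 3, "name": "runner-c", "status": "online", "busy": false}
  ]
}
""")

def busy_percentage(resp: dict) -> float:
    """Fraction of registered runners currently marked busy."""
    runners = resp.get("runners", [])
    if not runners:
        return 0.0
    return sum(1 for r in runners if r["busy"]) / len(runners)

print(busy_percentage(sample))  # 2 of 3 runners busy
```

A scale-up fires when this fraction crosses a configured threshold; the bug reported here is that jobs queued *before* the resulting runners register are never dispatched to them.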
Expected behavior
After a scale up is triggered to spawn enough runners to satisfy those waiting in the queue, queued workflows (and jobs within workflows where 1 or more job(s) are already running) from prior to the scale-up will be dispatched to the now available runners.
Runner Version and Platform
Version of your runner?
2.278.0
OS of the machine running the runner? OSX/Windows/Linux/...
We are running actions as a service in Kubernetes using actions-runner-controller, which in turn creates runners with Linux x86_64, Ubuntu 20.04.2 LTS base images (as of writing).

What's not working?
In the `/runner/_diag` logs, there are some discrepancies between the logs on runners where jobs are picked up versus those that aren't. No `Worker_` logs are ever created on newly spawned runners that fail to detect queued jobs, so those jobs are not getting dispatched (unless a new workflow/job is triggered after they have already been registered).

There also appear to be some authentication errors being thrown for failed requests to a visualstudio.com endpoint. I'm not sure what that could be for; perhaps worth noting that the log storage integration we're using is AWS, as the only other reference to that error I've been able to find comes from issues filed in Azure DevOps repositories.

Authentication 401 Error in `Runner_` logs
The only other error in any of the logs that I've been able to find on the runner side comes from a permissions error when writing to `/proc/62/oom_score_adj`. The runner does not run as `root` but does have `sudo` privileges; there are no issues running workflows once they're detected, nor have we had any problems in terms of directory permissions for workspaces.

System.IO.IOException in `Runner_` logs
Starting fragment of `Worker_` logs from a runner that _does_ successfully pick up a job in the message queue
Job Log Output
If applicable, include the relevant part of the job / step log output here. All sensitive information should already be masked out, but please double-check before pasting here.
N/A, as those workflows that do run perform as expected.

Runner and Worker's Diagnostic Logs
If applicable, add relevant diagnostic log information. Logs are located in the runner's `_diag` folder. The runner logs are prefixed with `Runner_` and the worker logs are prefixed with `Worker_`. Each job run correlates to a worker log. All sensitive information should already be masked out, but please double-check before pasting here.

Posted above in the 'What's not working?' section
Please let me know if I can provide any more info on my end! Thanks 👍🏻