
Runner.Listener failing to detect queued workflows/jobs for Enterprise runners #1059

Closed
kathleenfrench opened this issue Apr 23, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@kathleenfrench

kathleenfrench commented Apr 23, 2021

Describe the bug

When self-hosted runners are configured at the Enterprise level, a workflow or job that is kicked off before a new runner comes online remains queued indefinitely, even after runners successfully come online and are listening for jobs (those same runners work as expected if a workflow/job is kicked off while they are already online).

To Reproduce
Steps to reproduce the behavior:

  1. Trigger more than one workflow, or a workflow with more than one job, when fewer runners are online than the total number of jobs.
  2. See that the only jobs that run are those immediately assigned to an already-online runner, even though additional runners come online within roughly 10 seconds of the original trigger.
    • Confirmed that when hitting the GitHub API directly, all runner pods that are up and listening for jobs after a scale-up is triggered are reflected in the response, and that the data shows the expected number of runners with "status": "online" and "busy": false that ought to be picking up the queued jobs.
    • The actions-runner-controller we use in Kubernetes triggers scale-ups based on the percentage of runners that are "busy" per that GitHub API response, and autoscaling works as expected.
    • If an entirely new workflow is triggered after these autoscaled runners have come up, the new jobs are detected by the runners (as expected), but the original jobs remain queued until the end user cancels the affected workflow(s).
    • Worth noting that the controller also handles scaling the runners down and de-registers them without issue: runners are removed as you would expect and do not show up as 'offline'.
  3. There is no error surfaced in the runner logs themselves:
Starting Runner listener with startup type: service
Started listener process
Started running service

√ Connected to GitHub

2021-04-23 01:45:54Z: Listening for Jobs

Scenario:

  • 1 workflow with 4 jobs is triggered when only 1 runner pod is up.
  • Only 1 job begins to run; the remaining 3 are queued.
  • Runners scale up to 4 pods whose logs indicate they are listening for jobs.
  • The response from /api/v3/enterprises/[name]/actions/runners returns the correct number of available self-hosted runners at the enterprise level, but the 3 queued jobs never run (see the API response and the query sketch below).
See API Response
{
    "total_count": 4,
    "runners": [
        {
            "id": 412,
            "name": "ghe-global-runner-deployment-v7jfr-4qvtd",
            "os": "linux",
            "status": "online",
            "busy": true,
            "labels": [
                {
                    "id": 1,
                    "name": "self-hosted",
                    "type": "read-only"
                },
                {
                    "id": 2,
                    "name": "Linux",
                    "type": "read-only"
                },
                {
                    "id": 3,
                    "name": "X64",
                    "type": "read-only"
                },
                {
                    "id": 4,
                    "name": "basic",
                    "type": "custom"
                }
            ]
        },
        {
            "id": 416,
            "name": "ghe-global-runner-deployment-v7jfr-cxwt4",
            "os": "linux",
            "status": "online",
            "busy": false,
            "labels": [
                {
                    "id": 1,
                    "name": "self-hosted",
                    "type": "read-only"
                },
                {
                    "id": 2,
                    "name": "Linux",
                    "type": "read-only"
                },
                {
                    "id": 3,
                    "name": "X64",
                    "type": "read-only"
                },
                {
                    "id": 4,
                    "name": "basic",
                    "type": "custom"
                }
            ]
        },
        {
            "id": 417,
            "name": "ghe-global-runner-deployment-v7jfr-j5svj",
            "os": "linux",
            "status": "online",
            "busy": false,
            "labels": [
                {
                    "id": 1,
                    "name": "self-hosted",
                    "type": "read-only"
                },
                {
                    "id": 2,
                    "name": "Linux",
                    "type": "read-only"
                },
                {
                    "id": 3,
                    "name": "X64",
                    "type": "read-only"
                },
                {
                    "id": 4,
                    "name": "basic",
                    "type": "custom"
                }
            ]
        },
        {
            "id": 418,
            "name": "ghe-global-runner-deployment-v7jfr-ks8g8",
            "os": "linux",
            "status": "online",
            "busy": false,
            "labels": [
                {
                    "id": 1,
                    "name": "self-hosted",
                    "type": "read-only"
                },
                {
                    "id": 2,
                    "name": "Linux",
                    "type": "read-only"
                },
                {
                    "id": 3,
                    "name": "X64",
                    "type": "read-only"
                },
                {
                    "id": 4,
                    "name": "basic",
                    "type": "custom"
                }
            ]
        }
    ]
}
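
For reference, here is a minimal sketch of that verification, assuming the Python requests library and a PAT with enough scope to list enterprise runners. GHES_HOST, ENTERPRISE and GHES_TOKEN are placeholder environment variables for this sketch, not anything the runner itself uses; it counts online and idle runners and the busy percentage the autoscaler keys off:

import os
import requests

# Placeholders for this sketch: GHES hostname, enterprise slug, and a PAT.
host = os.environ["GHES_HOST"]
enterprise = os.environ["ENTERPRISE"]
token = os.environ["GHES_TOKEN"]

resp = requests.get(
    f"https://{host}/api/v3/enterprises/{enterprise}/actions/runners",
    headers={"Authorization": f"token {token}",
             "Accept": "application/vnd.github.v3+json"},
)
resp.raise_for_status()
runners = resp.json()["runners"]

online = [r for r in runners if r["status"] == "online"]
idle = [r for r in online if not r["busy"]]
busy_pct = (len(online) - len(idle)) / max(len(online), 1)
print(f"online={len(online)} idle={len(idle)} busy={busy_pct:.0%}")
# Against the response above this prints online=4 idle=3 busy=25%, yet the
# 3 queued jobs are never dispatched to the idle runners.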

Expected behavior

After a scale-up spawns enough runners to satisfy the queue, workflows (and jobs within workflows where one or more jobs are already running) that were queued before the scale-up should be dispatched to the newly available runners.

Runner Version and Platform

Version of your runner? 2.278.0

OS of the machine running the runner? Linux

We run the runner as a service in Kubernetes using actions-runner-controller, which in turn creates runners from Linux x86_64, Ubuntu 20.04.2 LTS base images (as of writing).

What's not working?

In the /runner/_diag logs, there are some discrepancies between runners where jobs are picked up and those where they aren't. No Worker_ logs are ever created on newly spawned runners that fail to detect queued jobs, so those jobs are never dispatched to them (unless a new workflow/job is triggered after they have already been registered).
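
A quick way to confirm that observation on any given runner pod (a small sketch assuming the default /runner install path shown in the logs):

from pathlib import Path

# Each dispatched job produces a Worker_*.log in _diag; if none exist, the
# listener never handed a job to a worker process on this runner.
worker_logs = sorted(Path("/runner/_diag").glob("Worker_*.log"))
if worker_logs:
    print(f"{len(worker_logs)} worker log(s), latest: {worker_logs[-1].name}")
else:
    print("no Worker_ logs: this runner never received a job from the service")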

There also appear to be some authentication errors thrown for failed requests to a visualstudio.com endpoint. I'm not sure what that is for; perhaps worth noting that the log storage integration we use is AWS, as the only other references to this error I've been able to find come from issues filed in Azure DevOps repositories.

Authentication 401 Error in `Runner_` logs
[2021-04-23 01:29:53Z INFO MessageListener] Attempt to create session.
[2021-04-23 01:29:53Z INFO MessageListener] Connecting to the Runner Server...
[2021-04-23 01:29:53Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[2021-04-23 01:29:53Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-04-23 01:29:53Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[2021-04-23 01:29:53Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-04-23 01:29:53Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[2021-04-23 01:29:53Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-04-23 01:29:54Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-04-23 01:29:54Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-04-23 01:29:54Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-04-23 01:29:54Z INFO MessageListener] VssConnection created
[2021-04-23 01:29:54Z INFO Terminal] WRITE LINE:
[2021-04-23 01:29:54Z INFO Terminal] WRITE LINE:
[2021-04-23 01:29:54Z WARN GitHubActionsService] Authentication failed with status code 401.
Date: Fri, 23 Apr 2021 01:29:54 GMT
Server: Kestrel
Transfer-Encoding: chunked
WWW-Authenticate: Bearer
x-tfs-processid: c6cfb4d1-ddfd-486a-a15e-9cb06bc5a0c2
activityid: 8237d843-d988-4a0b-8128-d4aab9891dc1
x-tfs-session: d56b82fc-e099-4634-a998-1d1ec9a2ae77
x-vss-e2eid: af199aa5-8c5a-405d-9725-975b8402262a
x-vss-senderdeploymentid: [redacted out of caution]
x-tfs-serviceerror: The+user+%27System%3aPublicAccess%3baaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa%27+is+not+authorized+to+access+this+resource.
x-vss-s2stargetservice: 0000005A-0000-8888-8000-000000000000/visualstudio.com

[2021-04-23 01:29:55Z INFO RSAFileKeyManager] Loading RSA key parameters from file /runner/.credentials_rsaparams
[2021-04-23 01:29:55Z INFO RSAFileKeyManager] Loading RSA key parameters from file /runner/.credentials_rsaparams
[2021-04-23 01:29:55Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[2021-04-23 01:29:55Z INFO MessageListener] Session created.
[2021-04-23 01:29:55Z INFO Terminal] WRITE LINE: 2021-04-23 01:29:55Z: Listening for Jobs
[2021-04-23 01:29:55Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.

The only other error I've been able to find in the runner-side logs is a permissions error when writing to /proc/62/oom_score_adj (a small reproduction sketch follows the log below). The runner does not run as root, though it does have sudo privileges; there are no issues running workflows once they're detected, nor have we had any problems with directory permissions for workspaces.

System.IO.IOException in `Runner_` logs
[2021-04-23 01:29:50Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[2021-04-23 01:29:50Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-04-23 01:29:50Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[2021-04-23 01:29:50Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-04-23 01:29:50Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[2021-04-23 01:29:50Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-04-23 01:29:50Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-04-23 01:29:50Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-04-23 01:29:50Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-04-23 01:29:51Z INFO Terminal] WRITE LINE:
[2021-04-23 01:29:51Z INFO ConfigurationManager] Test Connection complete.
[2021-04-23 01:29:51Z INFO HostContext] Well known directory 'Bin': '/runner/bin'
[2021-04-23 01:29:51Z INFO HostContext] Well known directory 'Root': '/runner'
[2021-04-23 01:29:51Z INFO HostContext] Well known config file 'RSACredentials': '/runner/.credentials_rsaparams'
[2021-04-23 01:29:51Z INFO RSAFileKeyManager] Creating new RSA key using 2048-bit key length
[2021-04-23 01:29:51Z INFO RSAFileKeyManager] Successfully saved RSA key parameters to file /runner/.credentials_rsaparams
[2021-04-23 01:29:51Z INFO RSAFileKeyManager] Which: 'chmod'
[2021-04-23 01:29:51Z INFO RSAFileKeyManager] Location: '/usr/bin/chmod'
[2021-04-23 01:29:51Z INFO HostContext] Well known directory 'Bin': '/runner/bin'
[2021-04-23 01:29:51Z INFO HostContext] Well known directory 'Root': '/runner'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper] Starting process:
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   File name: '/usr/bin/chmod'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   Arguments: '600 /runner/.credentials_rsaparams'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   Working directory: '/runner'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   Require exit code zero: 'False'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: ''
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'False'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'False'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   Persist current code page: 'False'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper]   High priority process: 'False'
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper] Failed to update oom_score_adj for PID: 62.
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper] System.UnauthorizedAccessException: Access to the path '/proc/62/oom_score_adj' is denied.
 ---> System.IO.IOException: Permission denied
   --- End of inner exception stack trace ---
   at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
   at System.IO.FileStream.OpenHandle(FileMode mode, FileShare share, FileOptions options)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
   at System.IO.StreamWriter.ValidateArgsAndOpenPath(String path, Boolean append, Encoding encoding, Int32 bufferSize)
   at System.IO.StreamWriter..ctor(String path)
   at System.IO.File.WriteAllText(String path, String contents)
   at GitHub.Runner.Sdk.ProcessInvoker.WriteProcessOomScoreAdj(Int32 processId, Int32 oomScoreAdj)
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper] Process started with process id 62, waiting for process exit.
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-04-23 01:29:51Z INFO ProcessInvokerWrapper] Finished process 62 with exit code 0, and elapsed time 00:00:00.0151741.
[2021-04-23 01:29:51Z INFO RSAFileKeyManager] Successfully set permissions for RSA key parameters file /runner/.credentials_rsaparams
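
For context, the failing write above can be reproduced outside the runner. This is only an illustrative sketch, not the runner's actual code: lowering a process's oom_score_adj without CAP_SYS_RESOURCE is denied, and in a restricted container even other writes to that file can fail, which matches the logged (and ignored) exception:

import subprocess

child = subprocess.Popen(["sleep", "30"])
try:
    # The runner's ProcessInvoker writes this file for its worker process
    # (WriteProcessOomScoreAdj in the stack trace). -1000 is used here only
    # to force the failure even outside a container, since lowering the
    # score requires CAP_SYS_RESOURCE.
    with open(f"/proc/{child.pid}/oom_score_adj", "w") as f:
        f.write("-1000")
except PermissionError as exc:
    print(f"same failure class as the logged exception: {exc}")
finally:
    child.kill()
    child.wait()
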
Starting fragment of `Worker_` logs from a runner that _does_ successfully pick up a job in the message queue
$ cat Worker_20210422-232436-utc.log
[2021-04-22 23:24:36Z INFO HostContext] No proxy settings were found based on environmental variables (http_proxy/https_proxy/HTTP_PROXY/HTTPS_PROXY)
[2021-04-22 23:24:36Z INFO HostContext] Well known directory 'Bin': '/runner/bin'
[2021-04-22 23:24:36Z INFO HostContext] Well known directory 'Root': '/runner'
[2021-04-22 23:24:36Z INFO HostContext] Well known config file 'Credentials': '/runner/.credentials'
[2021-04-22 23:24:36Z INFO Worker] Version: 2.278.0
[2021-04-22 23:24:36Z INFO Worker] Commit: 62d926efce35d3ea16d7624a25aaa5b300737def
[2021-04-22 23:24:36Z INFO Worker] Culture:
[2021-04-22 23:24:36Z INFO Worker] UI Culture:
[2021-04-22 23:24:36Z INFO Worker] Waiting to receive the job message from the channel.
[2021-04-22 23:24:36Z INFO Worker] Message received.
[2021-04-22 23:24:36Z INFO Worker] Job message:'

--- omitted the rest, can include more if needed

Job Log Output

If applicable, include the relevant part of the job / step log output here. All sensitive information should already be masked out, but please double-check before pasting here.

N/A as those workflows that do run perform as expected.

Runner and Worker's Diagnostic Logs

If applicable, add relevant diagnostic log information. Logs are located in the runner's _diag folder. The runner logs are prefixed with Runner_ and the worker logs are prefixed with Worker_. Each job run correlates to a worker log. All sensitive information should already be masked out, but please double-check before pasting here.

Posted above in the 'What's not working?' section


Please let me know if I can provide any more info on my end! Thanks 👍🏻

@kathleenfrench kathleenfrench added the bug Something isn't working label Apr 23, 2021
@TingluoHuang
Member

The service tries to find an available runner at the time the job is scheduled to run, based on certain routing rules:

https://docs.github.com/en/actions/hosting-your-own-runners/using-self-hosted-runners-in-a-workflow#routing-precedence-for-self-hosted-runners

In your case, the additional jobs probably get queued at a different level (repo or org) based on these rules.
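
For anyone debugging this, a hedged sketch of how you might list the runners registered at each level to see where a job could route. OWNER, REPO, ORG, ENTERPRISE and the token are placeholders, and the repo/org endpoints used are the standard GitHub REST paths rather than anything specific to this issue:

import requests

HOST = "github.example.com"  # GHES hostname (placeholder)
HEADERS = {"Authorization": "token <TOKEN>",
           "Accept": "application/vnd.github.v3+json"}

LEVELS = {
    "repo": f"https://{HOST}/api/v3/repos/OWNER/REPO/actions/runners",
    "org": f"https://{HOST}/api/v3/orgs/ORG/actions/runners",
    "enterprise": f"https://{HOST}/api/v3/enterprises/ENTERPRISE/actions/runners",
}

# Per the linked doc, routing checks repo-level runners first, then org, then enterprise.
for level, url in LEVELS.items():
    data = requests.get(url, headers=HEADERS).json()
    names = [r["name"] for r in data.get("runners", [])]
    print(level, data.get("total_count", 0), names)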

We are going to work on improving the auto-scale experience for the actions runner.

I am going to close this issue for now since it's not really scoped to the runner repo (the service is in charge of deciding which runner takes which job).

Feel free to report this issue at https://github.community/c/code-to-cloud/github-actions/41

@kathleenfrench
Author

For those who end up here, here's my post in the GitHub Community forum on the subject: https://github.community/t/bug-self-hosted-runners-at-the-enterprise-level-fail-to-detect-queued-jobs/176348

@hross
Contributor

hross commented Apr 28, 2021

I replied in the forum. Appreciate you writing this up twice. TL;DR: yes, we would like to fix this (and are in fact fixing it, since we noticed it as well). It might take some time to roll out, since it impacts job assignment, which is a pretty critical area of the product not to break.
