Enforce timeout and notify when job is waiting for a runner to pick up the job #50926
-
Hi Rajendra! You can enforce a timeout for jobs that are waiting for a runner to pick up the job by configuring an idle timeout for the runner. When a runner has been idle for the specified amount of time, it will be automatically removed from the queue and the job will be marked as failed.

As for monitoring the queue of jobs waiting for runners, you can use the GitHub API to retrieve a list of pending jobs. The API endpoint for this is:

`GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs`

This will return a list of jobs for the specified run, including their current status (e.g. queued, in progress, completed). You can use this information to monitor the queue of jobs waiting for runners and take appropriate action if necessary.
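If it helps, here is a minimal sketch of querying that endpoint with the `gh` CLI to see which jobs of a run are still waiting for a runner; `OWNER`, `REPO`, and `RUN_ID` are placeholders you would substitute:

```bash
# List the jobs of a workflow run and keep only those still queued
# (i.e. waiting for a runner), printing their id and name
gh api repos/OWNER/REPO/actions/runs/RUN_ID/jobs \
  --jq '.jobs[] | select(.status == "queued") | {id, name}'
```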
-
Hi @radhikari-arch, I have been facing a similar problem and created a workaround which has been working fine for me so far. It might not be the optimal way, but this is what I did: create another workflow alongside deploy.yml, say:

```yaml
name: Cancel deploy on timeout

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  timeout:
    timeout-minutes: 2
    runs-on: ubuntu-latest
    permissions:
      actions: write
    steps:
      - name: Monitor deployment for timeout and cancel if crossed the threshold
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          sleep 1m
          gh -R {owner}/{repo} run list -w deploy.yml -s queued --json databaseId -q .[].databaseId | xargs gh -R {owner}/{repo} run cancel
```

How it works: the workflow is triggered on the same push that triggers deploy.yml, sleeps for a minute, then lists any runs of deploy.yml that are still queued (i.e. waiting for a runner) and cancels them. The `timeout-minutes: 2` on the job keeps the watchdog itself from hanging around.
You can probably use this simple workflow as a base and modify it as per your needs by applying checks on different statuses and varying timeouts per status. Hope it helps!
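As one variation on the run step above, a different threshold per status can be applied by looking at each queued run's age via the `createdAt` field; this sketch cancels only runs of deploy.yml that have been queued for more than 15 minutes (the threshold and `OWNER/REPO` are assumptions):

```bash
# Compute the cutoff timestamp (GNU date, as available on ubuntu-latest runners)
cutoff=$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)

# Cancel runs of deploy.yml that are still queued and older than the cutoff
gh -R OWNER/REPO run list -w deploy.yml -s queued --json databaseId,createdAt \
  --jq ".[] | select(.createdAt < \"$cutoff\") | .databaseId" |
  xargs -r -n1 gh -R OWNER/REPO run cancel
```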
-
Another missing feature that makes me wonder how Actions shipped, let alone has stayed in such a similar state for years. It's like observability of Actions is actually an anti-feature in Microsoft's eyes. Just like the ridiculous way failed-job notifications are sent and are completely unconfigurable, it's outlandish that there's no way to time out STALLED JOBS that can't run.
-
FYI I adapted @develop-at-github's suggestion into a scheduled cleanup job that runs hourly. The workflow kills any outstanding queued runs of the target workflow, not including runs that are already in progress (via the status filter on `gh run list`).

I do agree with @colemickens that this is functionality that should be built into GHA.
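For anyone who wants a starting point, a sketch of such an hourly cleanup workflow might look like the following; the file name, the hourly cron, and the targeted deploy.yml workflow are assumptions rather than the poster's actual setup:

```yaml
# e.g. .github/workflows/cancel-stale-queued-runs.yml (hypothetical name)
name: Cancel stale queued runs

on:
  schedule:
    - cron: '0 * * * *'   # once an hour
  workflow_dispatch:

jobs:
  cleanup:
    runs-on: ubuntu-latest
    permissions:
      actions: write
    steps:
      - name: Cancel queued runs of deploy.yml
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Runs that are still queued are the ones waiting for a runner
          gh -R ${{ github.repository }} run list -w deploy.yml -s queued \
            --json databaseId -q .[].databaseId |
            xargs -r -n1 gh -R ${{ github.repository }} run cancel
```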
-
🕒 Discussion Activity Reminder 🕒

This Discussion has been labeled as dormant by an automated system for having no activity in the last 60 days. Please consider one of the following actions:

1️⃣ Close as Out of Date: If the topic is no longer relevant, close the Discussion as outdated.

2️⃣ Provide More Information: Share additional details or context, or let the community know if you've found a solution on your own.

3️⃣ Mark a Reply as Answer: If your question has been answered by a reply, mark the most helpful reply as the solution.

Note: This dormant notification will only apply to Discussions with the dormant label.

Thank you for helping bring this Discussion to a resolution! 💬
-
This timeout probably shouldn't be a workflow yml parameter, but a repository setting. In a repository I am working on, we have a job that has been stuck for over 2 days, which we are even unable to cancel using the force-cancel API. And it is just waiting for a runner to come online, but the runners are online and other jobs are able to run on them. The job itself also just states "Waiting for a runner to pick up this job...".
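For reference, this is roughly how the cancel and force-cancel calls look with the `gh` CLI; `OWNER`, `REPO`, and `RUN_ID` are placeholders, and as noted above even the force-cancel may not help when a run is stuck waiting for a runner:

```bash
# Normal cancellation request for a workflow run
gh api -X POST repos/OWNER/REPO/actions/runs/RUN_ID/cancel

# Force-cancel, which bypasses conditions that block a normal cancel
gh api -X POST repos/OWNER/REPO/actions/runs/RUN_ID/force-cancel
```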
-
Actually, I wrote a job that gets the status of the self-hosted runner before the next job is executed on the expected runner.
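A sketch of that idea might look like this, assuming a repository-level self-hosted runner named `my-runner` and a token stored as `RUNNER_STATUS_TOKEN` with permission to list the repo's runners (all of these names are hypothetical):

```yaml
jobs:
  check-runner:
    runs-on: ubuntu-latest
    steps:
      - name: Fail fast if the self-hosted runner is offline
        env:
          GH_TOKEN: ${{ secrets.RUNNER_STATUS_TOKEN }}   # hypothetical secret
        run: |
          # List the repository's self-hosted runners and check the expected one
          status=$(gh api repos/${{ github.repository }}/actions/runners \
            --jq '.runners[] | select(.name == "my-runner") | .status')
          echo "my-runner status: $status"
          [ "$status" = "online" ]   # non-zero exit fails the job if offline

  deploy:
    needs: check-runner            # only starts if the runner was reported online
    runs-on: [self-hosted, dev]    # placeholder labels
    steps:
      - run: echo "runner is up, deploying..."
```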
-
Adding my two cents here. Below is a job definition I use to check a named runner's status. In my case, we only have two self-hosted runners, one for dev and one for prod. If you have more, maybe consider using a matrix to check that all the runners are up, or move this into a reusable action that allows an input param specifying the runner. This does require an org admin to grant a GitHub token with the correct fine-grained access control; I believe it's an org read workflow permission. You could add this to the front of all your workflows, then have any subsequent jobs in the workflow use `needs:` to depend on the check.
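A rough sketch of that kind of check at the org level, with a matrix over the two runner names, might look like the following; `ORG_NAME`, the runner names, and the `ORG_RUNNER_TOKEN` secret are placeholders, and the token needs read access to the org's self-hosted runners:

```yaml
jobs:
  check-runners:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        runner: [dev-runner, prod-runner]   # placeholder runner names
    steps:
      - name: Check that ${{ matrix.runner }} is online
        env:
          GH_TOKEN: ${{ secrets.ORG_RUNNER_TOKEN }}   # placeholder org-scoped token
        run: |
          # Org-level listing of self-hosted runners; fail if this one is not online
          status=$(gh api orgs/ORG_NAME/actions/runners \
            --jq '.runners[] | select(.name == "${{ matrix.runner }}") | .status')
          test "$status" = "online"
```

Downstream jobs would then declare `needs: check-runners` so they only start once the check has passed.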
-
Select Topic Area
Question
Body
We have situations where, for various reasons, the self-hosted runners run into issues and jobs wait for several hours for a runner, with the message "Waiting for a runner to pick up this job...". Is there a way we can enforce a timeout for this? The timeout on a job only works once the job has actually started to run, but I am looking for a solution where jobs that have been waiting for a runner for some time would time out and notify us.
And also, is there a way we can monitor the queue of jobs that are waiting for runners?