Checkout bricks a self-hosted runner and cannot recover #1148

kvanbere · 2023-01-30T22:49:21Z

Something went wrong, and all of our self-hosted runners checked out bad .git folders or somehow corrupted them. It happened on around 13 of our runners at the same time. I think it was a random occurrence, because after I logged in manually and deleted the repository folder, everything was fine again.

Here are our logs:

2023-01-30T02:56:34.9249114Z Waiting for a runner to pick up this job...
2023-01-30T04:54:24.3969588Z Job is about to start running on the runner: XXXXXXXXXXXXXXXXXXXXXXXX (organization)
2023-01-30T04:54:29.3070556Z Current runner version: '2.301.1'
2023-01-30T04:54:29.3077744Z Runner name: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
2023-01-30T04:54:29.3078128Z Runner group name: 'Default'
2023-01-30T04:54:29.3078642Z Machine name: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
2023-01-30T04:54:29.3080746Z ##[group]GITHUB_TOKEN Permissions
2023-01-30T04:54:29.3081343Z Actions: write
2023-01-30T04:54:29.3081520Z Checks: write
2023-01-30T04:54:29.3081693Z Contents: write
2023-01-30T04:54:29.3081906Z Deployments: write
2023-01-30T04:54:29.3082186Z Discussions: write
2023-01-30T04:54:29.3082429Z Issues: write
2023-01-30T04:54:29.3082608Z Metadata: read
2023-01-30T04:54:29.3082779Z Packages: write
2023-01-30T04:54:29.3082958Z Pages: write
2023-01-30T04:54:29.3083147Z PullRequests: write
2023-01-30T04:54:29.3083476Z RepositoryProjects: write
2023-01-30T04:54:29.3083696Z SecurityEvents: write
2023-01-30T04:54:29.3083888Z Statuses: write
2023-01-30T04:54:29.3084056Z ##[endgroup]
2023-01-30T04:54:29.3087171Z Secret source: Actions
2023-01-30T04:54:29.3087569Z Prepare workflow directory
2023-01-30T04:54:29.4388409Z Prepare all required actions
2023-01-30T04:54:29.4550014Z Getting action download info
2023-01-30T04:54:29.8524043Z Download action repository 'actions/checkout@v3' (SHA:ac593985615ec2ede58e132d2e21d2b1cbd6127c)
2023-01-30T04:54:30.9083915Z Complete job name: XXXXXXXXXXXXXXXXXXXXXXXX
2023-01-30T04:54:31.0985565Z ##[group]Run actions/checkout@v3
2023-01-30T04:54:31.0985877Z with:
2023-01-30T04:54:31.0986059Z   repository: XXXXXXXX/XXXXXXXX
2023-01-30T04:54:31.0986462Z   token: ***
2023-01-30T04:54:31.0986609Z   ssh-strict: true
2023-01-30T04:54:31.0986786Z   persist-credentials: true
2023-01-30T04:54:31.0986951Z   clean: true
2023-01-30T04:54:31.0987092Z   fetch-depth: 1
2023-01-30T04:54:31.0987234Z   lfs: false
2023-01-30T04:54:31.0987377Z   submodules: false
2023-01-30T04:54:31.0987547Z   set-safe-directory: true
2023-01-30T04:54:31.0987702Z env:
2023-01-30T04:54:31.0987887Z   TMP: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988151Z   TEMP: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988398Z   TMPDIR: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988665Z   MATLAB_PREFDIR: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.preferences
2023-01-30T04:54:31.0988870Z ##[endgroup]
2023-01-30T04:54:34.6968863Z Syncing repository: XXXXXXXX/XXXXXXXX
2023-01-30T04:54:34.6970512Z ##[group]Getting Git version info
2023-01-30T04:54:34.6970936Z Working directory is 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:34.6971402Z [command]"C:\Program Files\Git\cmd\git.exe" version
2023-01-30T04:54:34.7493487Z git version 2.36.1.windows.1
2023-01-30T04:54:34.7592122Z ##[endgroup]
2023-01-30T04:54:34.7607048Z Temporarily overriding HOME='C:\runner\e595c9b9\_work\_temp\bcafa367-f8cb-4d31-84b1-63d10aaaabed' before making global git config changes
2023-01-30T04:54:34.7607516Z Adding repository directory to the temporary git global config as a safe directory
2023-01-30T04:54:34.7608114Z [command]"C:\Program Files\Git\cmd\git.exe" config --global --add safe.directory C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX
2023-01-30T04:54:34.8483251Z [command]"C:\Program Files\Git\cmd\git.exe" config --local --get remote.origin.url
2023-01-30T04:54:34.8992096Z ##[error]fatal: --local can only be used inside a git repository
2023-01-30T04:54:34.9013542Z Deleting the contents of 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:35.0573716Z ##[error]EPERM: operation not permitted, unlink 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX\.git'
2023-01-30T04:54:35.4710729Z Post job cleanup.
2023-01-30T04:54:38.8875206Z Cleaning up orphan processes

In this case, checkout seems to bail out fatally: after the error "fatal: --local can only be used inside a git repository", the Actions run ends immediately with a failure and does not try to continue.

This effectively bricked the runner, because any job the bad runner picked up failed instantly. Worse, the bad runner would take every job in the queue and fail each one almost immediately, which made quite a mess of our job history.

Since the fix was simply to log in and delete the offending folder, it would be nice if checkout automatically deleted the folder and retried once.

It seems like it tried this:

2023-01-30T04:54:34.9013542Z Deleting the contents of 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:35.0573716Z ##[error]EPERM: operation not permitted, unlink 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX\.git'

I am not sure why that didn't work, since I was able to log in as the same user and rm the folder without any problem. In any case, all 13 runners failed to delete the folder automatically.

To reproduce, I would suggest:

Install a self-hosted runner on Windows Server 2022, running as a service under a non-admin service user (e.g. Bob)
Set up a workflow that checks out the repository
Manually corrupt the .git folder by adding extra random files to it (?)
Ensure git config --local --get remote.origin.url fails
Observe that subsequent jobs picked up by this runner fail instantly and that the runner does not recover

kvanbere · 2023-01-30T23:25:52Z

Depending on how this is addressed, it could also fix other issues, e.g. #933, since that submodule corruption issue is also fixed by just deleting the repo and letting the runner do a fresh clone (#988 (comment)).

For example, as a broad workaround, checkout could give up on reusing the existing git repository if any command fails, then delete the repository and check it out from scratch.

olzhas · 2023-03-01T20:58:18Z

Some time ago a fix for this was introduced in #964, but it doesn't seem to solve the issue. I might be wrong.

kvanbere · 2023-03-01T22:07:57Z

> Some time ago a fix for this was introduced in #964, but it doesn't seem to solve the issue. I might be wrong.

We are using checkout v3 and this still seems to be an issue.

tyteen4a03 · 2023-03-28T23:11:56Z

Hi, also running into this issue.

kvanbere · 2023-04-01T04:38:23Z

Does anyone have a workaround for this?

jbaryy708 · 2023-04-01T08:44:06Z

Hi, how do I fix this on the runners? Please send me your text.

Ajaydip · 2023-04-01T15:45:20Z

I have been using the following workaround while waiting for the fix:

- name: checkout
  id: checkout
  uses: actions/checkout@v3
  with:
    ref: ${{ inputs.ref }}
    submodules: "recursive"
    token: ${{ secrets.token }}

- name: cleanup runner workspace
  run: |
    echo "$GITHUB_WORKSPACE"
    rm -rf "$GITHUB_WORKSPACE"
    mkdir "$GITHUB_WORKSPACE"
  shell: bash
  if: ${{ failure() && steps.checkout.conclusion == 'failure' }}

This at least prevents the runner from being bricked if checkout fails, whether due to a corrupted .git folder or bad submodules.

kvanbere · 2023-04-03T05:56:20Z

Good workaround, thanks!

kvanbere · 2023-04-03T05:58:08Z

I just wanted to add that I ran into this one today:

Warning: Unable to clean or reset the repository. The repository will be recreated instead.
Deleting the contents of 'C:\runner\31f270db\_work\aaaa\bbbb'
Error: File was unable to be removed Error: EBUSY: resource busy or locked, rmdir 'C:\runner\31f270db\_work\aaaa\bbbb\work'

It then went ahead and gobbled up all the remaining jobs in the entire queue and failed them all with the same error.

Edit: The above seems to be unrelated to the issue in the first post. This time a stray cc1plus process had gotten stuck while holding a lock on a directory inside the git folder, which prevented git clean from running. I don't expect the checkout action to hunt down and kill processes, but I think I will fix this with a PowerShell script.
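
One possible shape for such a script, as a sketch only: a step placed before checkout that stops leftover compiler processes so they no longer hold locks in the workspace. The process name cc1plus comes from the comment above; the step name and the choice to stop processes by name are assumptions, and on a machine shared with other builds, stopping by name could hit unrelated work.

```yaml
steps:
  - name: Stop leftover cc1plus processes   # hypothetical step name
    shell: powershell
    run: |
      # With -ErrorAction SilentlyContinue, a missing process is not reported
      # as an error, so the step passes when nothing is left running.
      Get-Process -Name cc1plus -ErrorAction SilentlyContinue | Stop-Process -Force

  - uses: actions/checkout@v3
```

This only addresses locks held by that one process name; anything else holding a handle inside the workspace would still produce the EBUSY error.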

kvanbere · 2023-08-29T07:55:03Z

Happened again in a big way today :(

kvanbere · 2023-08-29T08:28:25Z

@Ajaydip I tried your workaround and it didn't work for me, it always skips the action?

  run-tests:
    name: xxxx
    runs-on: [self-hosted]
    timeout-minutes: 90
    strategy:
      fail-fast: false
      matrix:
        include: ${{fromJson(needs.scan-tests.outputs.matrix)}}
    steps:
      - uses: actions/checkout@v3
        id: checkout
        timeout-minutes: 10
        continue-on-error: true
      - name: Cleanup previously failed job
        run: |
          Remove-Item "${{env.GITHUB_WORKSPACE}}" -Force -Recurse -ErrorAction SilentlyContinue | Out-Null
          New-Item -ItemType Directory -Force -Path "${{env.GITHUB_WORKSPACE}}" | Out-Null
        if: ${{ steps.checkout.conclusion == 'failure' }}
      - uses: actions/checkout@v3
        if: ${{ steps.checkout.conclusion == 'failure' }}

I did modify it a little bit. I was hoping to be able to recover and run the rest of the pipeline unaffected, without having to put an if: ... on every step.

Edit:
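If you've done what I did above, you probably want to use outcome, not conclusion: with continue-on-error: true, a failed step still reports a conclusion of success, while outcome holds the result before continue-on-error is applied. See https://docs.github.com/en/actions/learn-github-actions/contexts#steps-context .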
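
As a rough sketch, the retry steps with outcome in place of conclusion might look like the following. It assumes a Windows self-hosted runner whose default shell is PowerShell, as in the snippet above. The step names are made up for illustration, and $env:GITHUB_WORKSPACE is substituted for ${{env.GITHUB_WORKSPACE}} because, per the GitHub docs, default variables set by the runner are not accessible through the env context.

```yaml
steps:
  - uses: actions/checkout@v3
    id: checkout
    timeout-minutes: 10
    # Let the job continue so the cleanup and retry steps below can run.
    continue-on-error: true

  - name: Clean up workspace after failed checkout   # hypothetical step name
    # outcome is the result before continue-on-error is applied;
    # conclusion would read 'success' here and the step would be skipped.
    if: ${{ steps.checkout.outcome == 'failure' }}
    run: |
      Remove-Item "$env:GITHUB_WORKSPACE" -Force -Recurse -ErrorAction SilentlyContinue | Out-Null
      New-Item -ItemType Directory -Force -Path "$env:GITHUB_WORKSPACE" | Out-Null

  - name: Retry checkout from scratch   # hypothetical step name
    uses: actions/checkout@v3
    if: ${{ steps.checkout.outcome == 'failure' }}
```

If the delete itself fails (for example with EPERM or EBUSY, as in the logs earlier in this thread), the retry will most likely fail as well; this pattern only helps when removing the folder succeeds.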

bryanjtc · 2023-11-09T05:44:45Z

Any update on this? Does anyone have a working workaround?

kvanbere · 2023-11-09T06:41:57Z

@bryanjtc the workaround above works OK, just note my Edit about using ‘outcome’ not ‘conclusion’ for testing whether to retry.

megamanics mentioned this issue on Mar 8, 2023, in "Fix: Checkout Issue in self hosted runner due to faulty submodule check-ins" #1196 (merged).
