Very frequent network timeouts on GitHub hosted macOS 11 runners #4896

Closed
1 of 7 tasks
smfeest opened this issue Jan 18, 2022 · 24 comments
Labels
bug report · investigate (Collect additional information, like space on disk, other tool incompatibilities etc.) · OS: macOS

Comments

@smfeest

smfeest commented Jan 18, 2022

Description

Over the last couple of weeks, all of our workflows that run on the macOS 11 runners have been experiencing very high failure rates due to network connectivity issues and timeouts. In fact, in the last few days pretty much every single workflow run has failed due to a network timeout, though not always on the same step.

Examples

Cache restore

Step definition:

- uses: actions/cache@v2
  name: CocoaPods cache
  timeout-minutes: 2
  id: pods-cache
  with:
    path: ios/Pods
    key: macos11-yarn-pods-${{ hashFiles('**/Podfile.lock') }}
    restore-keys: |
      macos11-yarn-pods-

Typical failure:

Run actions/cache@v2
  with:
    path: ios/Pods
    key: macos11-yarn-pods-6a163931136924b2c0b5784dd82824e30689f1e87d2e0dd2d31b986b1ee394c3
    restore-keys: macos11-yarn-pods-
  
  env:
    NO_FLIPPER: true
    PATH: /Users/runner/hostedtoolcache/node/14.18.2/x64/bin:/usr/local/opt/pipx_bin:/Users/runner/.cargo/bin:/usr/local/opt/curl/bin:/usr/local/bin:/usr/local/sbin:/Users/runner/bin:/Users/runner/.yarn/bin:/Users/runner/Library/Android/sdk/tools:/Users/runner/Library/Android/sdk/platform-tools:/Users/runner/Library/Android/sdk/ndk-bundle:/Library/Frameworks/Mono.framework/Versions/Current/Commands:/usr/bin:/bin:/usr/sbin:/sbin:/Users/runner/.dotnet/tools:/Users/runner/.ghcup/bin:/Users/runner/hostedtoolcache/stack/2.7.3/x64
Received 0 of 461859601 (0.0%), 0.0 MBs/sec
Received 0 of 461859601 (0.0%), 0.0 MBs/sec
Received 0 of 461859601 (0.0%), 0.0 MBs/sec
Received 33554432 of 461859601 (7.3%), 7.8 MBs/sec
Received 46137344 of 461859601 (10.0%), 8.6 MBs/sec
Received 79691776 of 461859601 (17.3%), 12.5 MBs/sec
Received 117440512 of 461859601 (25.4%), 15.8 MBs/sec
Received 130023424 of 461859601 (28.2%), 12.0 MBs/sec
Received 130023424 of 461859601 (28.2%), 10.9 MBs/sec
Received 130023424 of 461859601 (28.2%), 10.0 MBs/sec
Received 130023424 of 461859601 (28.2%), 9.1 MBs/sec
Received 134217728 of 461859601 (29.1%), 8.5 MBs/sec
Received 134217728 of 461859601 (29.1%), 8.0 MBs/sec
Received 134217728 of 461859601 (29.1%), 7.5 MBs/sec
Received 134217728 of 461859601 (29.1%), 7.0 MBs/sec
Received 134217728 of 461859601 (29.1%), 6.7 MBs/sec
Received 134217728 of 461859601 (29.1%), 6.3 MBs/sec
Received 134217728 of 461859601 (29.1%), 6.0 MBs/sec
Received 134217728 of 461859601 (29.1%), 5.4 MBs/sec
Received 134217728 of 461859601 (29.1%), 4.5 MBs/sec
Received 138412032 of 461859601 (30.0%), 2.6 MBs/sec
Error: The action has timed out.

Yarn install

Step definition:

- if: steps.yarn-cache.outputs.cache-hit != 'true'
  run: yarn install
  timeout-minutes: 4

Typical failure:

yarn install v1.22.17
[1/4] Resolving packages...
[2/4] Fetching packages...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
error An unexpected error occurred: "https://registry.yarnpkg.com/rxjs/-/rxjs-6.6.6.tgz: ESOCKETTIMEDOUT".
info If you think this is a bug, please open a bug report with the information provided in "/Users/runner/work/mobile/mobile/yarn-error.log".
info Visit https://yarnpkg.com/en/docs/cli/install for documentation about this command.
info There appears to be trouble with your network connection. Retrying...

Attempted workarounds

We have previously experimented with increasing the timeouts on these steps, but more often than not this just delayed the eventual failure (and increased wasted billable minutes). When things are working correctly, the affected steps tend to complete in 30 or 40 seconds, so our configured timeouts already allow for these operations to take 4 to 8 times longer than usual.

Other observations

  • We experienced similar issues with macOS 10.15 a few months ago, but migrating to the macOS 11 runners initially appeared to greatly reduce the frequency of these failures. However, the problem is now as bad as it ever was with macOS 10.15.
  • Anecdotally, these issues previously occurred more frequently in the afternoon UK time (from around 13:00 UTC).

Virtual environments affected

  • [ ] Ubuntu 18.04
  • [ ] Ubuntu 20.04
  • [ ] macOS 10.15
  • [x] macOS 11
  • [ ] Windows Server 2016
  • [ ] Windows Server 2019
  • [ ] Windows Server 2022

Image version and build link

Current runner version: '2.286.0'
Operating System
  macOS
  11.6.2
  20G314
Virtual Environment
  Environment: macos-11
  Version: 20220110.2
  Included Software: https://github.com/actions/virtual-environments/blob/macOS-11/20220110.2/images/macos/macos-11-Readme.md
  Image Release: https://github.com/actions/virtual-environments/releases/tag/macOS-11%2F20220110.2
Virtual Environment Provisioner
  1.0.0.0-main-20211214-1

All our builds are private unfortunately.

Is it regression?

No. The last successful build had the same image version.

Expected behavior

  • Network connectivity issues are less frequent (similar frequency to other virtual environments)
  • Network and I/O operations (e.g. yarn install) complete in a similar amount of time on each run

Actual behavior

  • Network connectivity errors and timeouts occur very frequently
  • Operations like yarn install that take 30 seconds on successful builds often don't complete within the timeout period of up to 4 minutes

Repro steps

  1. For a repo with a non-trivial set of npm dependencies (e.g. a new React Native project), define a workflow that runs on the macOS 11 hosted runners and includes the cache and yarn install steps. For example, something like this:
name: Reproduce macOS network issues
on:
  workflow_dispatch:
jobs:
  test_cache_and_yarn_install:
    runs-on: macos-11
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: actions/setup-node@v2
        with:
          node-version: '14'
      - uses: actions/cache@v2
        name: Yarn cache
        timeout-minutes: 2
        with:
          path: '**/node_modules'
          key: macos11-yarn-${{ hashFiles('**/yarn.lock') }}
          restore-keys: |
            macos11-yarn-
      - run: yarn install
        timeout-minutes: 4
  2. Run the workflow. Sometimes the cache restore and yarn install steps will each complete within 30-40 seconds, but at the moment, more often than not, they'll time out (after 2 and 4 minutes respectively).

nikolai-frolov added the investigate and OS: macOS labels and removed the needs triage label on Jan 18, 2022
@nikolai-frolov
Contributor

Hello @smfeest. Thanks for reporting the issue. We'll investigate and reply with our findings.

@smfeest
Author

smfeest commented Jan 18, 2022

One other thought is that we can't be 100% certain that the failures aren't caused by I/O rather than network issues, since all of the affected actions are effectively downloading a large-ish amount of data to disk (for example, I wonder if this could also be related to something like #3885).

In any case, please don't hesitate to let us know if there are any diagnostic steps we can add to our workflows to help with the investigation.
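
As an illustration of the kind of diagnostic step meant here, a minimal sketch follows. It is not something the maintainers asked for; the step name, the probed URL, the 256 MB test size, and the temp-file name are all assumptions chosen for the example. It records connection timings with curl and a rough sequential disk write with dd, so that a slow run can be tentatively attributed to the network or to disk I/O.

```yaml
# Hypothetical diagnostic step; place it before the cache and yarn install steps.
- name: Network and disk diagnostics
  timeout-minutes: 2
  run: |
    # DNS lookup, TCP connect, and total request time for the yarn registry.
    curl -sS -o /dev/null \
      -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n' \
      https://registry.yarnpkg.com/ || true
    # Rough sequential write test: 256 MB of zeros (BSD dd on macOS accepts bs=1m).
    time dd if=/dev/zero of="$RUNNER_TEMP/dd-test.bin" bs=1m count=256
    rm -f "$RUNNER_TEMP/dd-test.bin"
```

Comparing these numbers between a fast run and a slow run would give a first hint; the dd figure is only a coarse indicator and should not be read as a proper benchmark.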

@smfeest
Author

smfeest commented Jan 27, 2022

Just an update on this. We've since increased our timeouts on the affected steps still further, and now we're seeing that on runs where the cache / yarn install steps take a very long time to complete, so do the subsequent steps that are not network based.

I therefore think it's more likely than not that this is another manifestation of #3885.

The problem is also far more common in the afternoons and evenings (UK time) than the mornings. By 5pm UTC it's essentially impossible to get a non-trivial macOS-based run to successfully complete.
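
One way to put numbers on this time-of-day pattern would be a scheduled workflow that times the same step at a quiet hour and a busy hour. The sketch below is an assumption-laden example, not something from the thread: the workflow name, job name, the 03:00 UTC off-peak slot, and the 15-minute timeout are all hypothetical choices.

```yaml
name: Compare macOS timings by time of day
on:
  schedule:
    # Cron times are in UTC: 03:00 (assumed off-peak) and 17:00 (reported as busy).
    - cron: '0 3 * * *'
    - cron: '0 17 * * *'
jobs:
  timed_install:
    runs-on: macos-11
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: '14'
      # `time` prints the wall-clock duration so the two slots can be compared.
      - run: time yarn install
        timeout-minutes: 15
```

Scheduled workflows run against the default branch, so the file would need to be merged there before any data is collected.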

@nikolai-frolov
Contributor

Just an intermediate update: we are still investigating the root cause of the connection interruptions, and most likely it is related to the infrastructure hardware issues mentioned in #3885.

fvictorio added a commit to NomicFoundation/hardhat that referenced this issue Feb 4, 2022
We'll re-enable them when actions/runner-images#4896 is fixed
@nikolai-frolov
Contributor

@smfeest could you please share your feedback about network stability of late? Has the situation improved, or is the issue still occurring?

@fvictorio

@nikolai-frolov I was lurking on this issue because we had the same problem in our repo. I re-enabled our macos jobs in our CI and ran it a couple of times. It does seem to work better now.

@smfeest
Author

smfeest commented Feb 24, 2022

> @smfeest could you please share your feedback about network stability of late? Has the situation improved, or is the issue still occurring?

Hi @nikolai-frolov, the errors are still occurring but far less frequently than when I first raised this issue. I'd estimate we're experiencing these errors in about 10% of macOS workflow runs at present (compared to maybe 90%+ in mid-January), which is not perfect but still very encouraging! I should also caution, though, that we've been running these workflows much less frequently recently, and rarely in the afternoon, so we might not be comparing like with like.

@miketimofeev
Contributor

@smfeest @t0rr3sp3dr0 could you provide as many links to the runs as possible? It will help us distinguish particular environments that might be affected as the environment is assigned randomly at the start of the run.
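
To make such links easy to collect, a job could print its own run URL whenever it fails. The step below is a sketch under assumptions: the step name is hypothetical, and `ImageOS` and `ImageVersion` are environment variables that hosted runners are assumed to expose (the fallback prints "unknown" if they are not set).

```yaml
# Hypothetical step; add it at the end of the job so it runs after a failure.
- name: Print run details for bug reports
  if: failure()
  run: |
    # Direct link to this workflow run.
    echo "Run: $GITHUB_SERVER_URL/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID"
    # Image name and version as exposed by hosted runners (assumed to be set).
    echo "Image: ${ImageOS:-unknown} ${ImageVersion:-unknown}"
    # Machine name, which may help tell environments apart.
    echo "Runner: $RUNNER_NAME"
```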

@smfeest
Author

smfeest commented Mar 22, 2022

@smfeest
Author

smfeest commented Mar 28, 2022

@miketimofeev: Unfortunately, today has been a particularly bad day for workflows running on the macOS runners, with an almost 100% failure rate due to timeouts of one sort or another.

Examples of runs with timeouts on network steps:

A few workflow runs did manage to get past the network steps, only to time out in other ways (e.g. the Android emulator failing to boot or respond to requests in a reasonable amount of time, or the Xcode build taking much longer than usual to complete):

As far as I can tell, the only macOS-based workflow that didn't time out today was this one, which ran at approx 03:00 UTC: https://github.com/futurelearn/mobile/actions/runs/2050007713

@miketimofeev
Contributor

miketimofeev commented Mar 29, 2022

@nikita-nikolaev could you please analyze what went wrong with the macOS environments at that time?

@yermukhanbet

yermukhanbet commented Apr 25, 2022

Were there any updates on this issue?
Currently I am trying to run one workflow, and it includes installing fastlane. Usually it takes around 2-3 minutes, but it has now been stuck on that step for a good 25 minutes already.

[Image: Screen Shot 2022-04-25 at 6 36 54 PM]

@bombillazo

I'm having trouble with macOS runners as well, which is even worse given that these runners are billed at 10x the price of regular Linux runners 😞
[Image: Screen Shot 2022-05-03 at 12 43 46 AM]

@pianetarosso

pianetarosso commented May 27, 2022

Same issue today. I've had to restart all 7 of my failed jobs because of yarn network failures. It's not a good advertisement, having lost at least 50 minutes each, 3 times over...

@niteshbalusu11

It's getting really bad, all my jobs are failing.

@pwerry

pwerry commented Jun 5, 2022

Code signing times out - can't run a build in under an hour that would normally take about 10 minutes. There's definitely some overall resource degradation, but networking appears to be the major bottleneck

@niteshbalusu11

Here's how I'm getting around it for now: wrapping the yarn command that installs dependencies in a retry action.


      - name: Retry script
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 8
          max_attempts: 3
          retry_on: error
          command: yarn

@wenwuwu

wenwuwu commented Oct 3, 2022

> Code signing times out - can't run a build in under an hour that would normally take about 10 minutes. There's definitely some overall resource degradation, but networking appears to be the major bottleneck

+1 on code signing timing out. The macOS runners are billed at 10 times the per-minute rate of Linux runners, and they keep timing out. The personal plan's 3000 minutes usually run out quickly. This is ridiculous.

@clemp6r

clemp6r commented Oct 4, 2022

We contacted support, and they told us this is a yarn issue: yarnpkg/yarn#8242
Workaround: use yarn v2+ or npm.
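
For anyone staying on yarn v1 in the meantime, one mitigation that is not mentioned in this thread is yarn's `--network-timeout` flag, which raises the per-request timeout in milliseconds. The sketch below adapts the yarn install step from the issue description; the 300000 ms value and the longer step timeout are assumptions, and this can only help if the registry requests eventually succeed.

```yaml
- if: steps.yarn-cache.outputs.cache-hit != 'true'
  # --network-timeout is in milliseconds; 300000 (5 minutes) is an arbitrary example value.
  run: yarn install --network-timeout 300000
  # The step timeout is raised as well so the longer network timeout can take effect.
  timeout-minutes: 10
```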

@nazar-pc

nazar-pc commented Dec 1, 2022

We hit this multiple times a day on macOS 12 with various kinds of requests, even requests to GitHub itself (cloning repos, downloading releases), which is especially funny.

@mikhailkoliada
Contributor

macOS 11 is deprecated now. We highly recommend switching to macOS-13 or newer, because macOS 11 is unlikely to get a major overhaul.
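
Applied to the reproduction workflow from the issue description, this recommendation amounts to a one-line change. The fragment below is a sketch only; the job name is taken from that workflow, and other steps may need their own updates on a newer image.

```yaml
jobs:
  test_cache_and_yarn_install:
    # Was: macos-11. macos-13 is the label for the macOS 13 hosted image.
    runs-on: macos-13
```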
