Very frequent network timeouts on GitHub hosted macOS 11 runners #4896
Comments
Hello @smfeest. Thanks for reporting this issue. We'll investigate and reply with our findings.
One other thought is that we can't be 100% certain that the failures aren't caused by I/O instead of network issues, since all of the affected actions are effectively downloading a large-ish amount of data to disk (for example, I wonder if this could also be related to something like #3885). In any case, please don't hesitate to let us know if there are any diagnostic steps we can add to our workflows to help with the investigation.
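One diagnostic along these lines, offered only as a sketch and not as something the maintainers asked for, is to time a disk-only operation and a network-only operation in the same job, so that slow I/O and slow networking can be told apart. The helper name `elapsed_seconds`, the scratch file name, and the `TEST_URL` placeholder below are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# elapsed_seconds COMMAND [ARGS...]
# Runs COMMAND, discards its output, and prints the whole seconds of
# wall-clock time it took. The command's exit status is ignored because
# only the timing is of interest here.
elapsed_seconds() {
  local start end
  start=$(date +%s)
  "$@" >/dev/null 2>&1 || true
  end=$(date +%s)
  echo $((end - start))
}

# Hypothetical usage inside a workflow step:
#
#   # Disk only: write 256 MiB of zeros to a scratch file in the workspace.
#   echo "disk: $(elapsed_seconds dd if=/dev/zero of=scratch.bin bs=1048576 count=256)s"
#
#   # Network only: download to /dev/null so that disk writes are not involved.
#   # TEST_URL is a placeholder for any large file you are allowed to fetch.
#   echo "network: $(elapsed_seconds curl -sS -o /dev/null "$TEST_URL")s"
```

Writing zeros is only a rough indicator, since the filesystem may cache or optimise such writes, but a run where both numbers balloon together would point towards general resource degradation and away from a purely network problem.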
Just an update on this. We've since increased our timeouts on the affected steps still further, and now we're seeing that on runs where the cache / yarn install steps take a very long time to complete, so do the subsequent steps that are not network based. I therefore think it's more likely than not that this is another manifestation of #3885. The problem is also far more common in the afternoons and evenings (UK time) than the mornings. By 5pm UTC it's essentially impossible to get a non-trivial macOS-based run to successfully complete.
Just an intermediate update: we are still investigating the root cause of the connection interruptions, and most likely it is related to the infrastructure hardware issues mentioned in #3885.
We'll re-enable them when actions/runner-images#4896 is fixed
@smfeest could you please share your feedback on network stability lately? Has the situation improved, or is the issue still occurring?
@nikolai-frolov I was lurking on this issue because we had the same problem in our repo. I re-enabled our macOS jobs in our CI and ran it a couple of times. It does seem to work better now.
Hi @nikolai-frolov, the errors are still occurring but far less frequently than when I first raised this issue. I'd estimate we're experiencing these errors in about 10% of macOS workflow runs at present (compared to maybe 90%+ in mid-January), which is not perfect but still very encouraging! I should caution, though, that we've been running these workflows much less frequently recently, and rarely in the afternoon, so we might not be comparing like with like.
We are also being affected by these connectivity problems constantly:
@smfeest @t0rr3sp3dr0 could you provide as many links to the runs as possible? It will help us distinguish particular environments that might be affected, as the environment is assigned randomly at the start of the run.
Hi @miketimofeev, we've been running very few macOS workflows over the past couple of months, and those that we have run have rarely been affected by the original issue. However, here are some of our most recent examples of the original issue:
Cache restore timeouts:
Yarn install network errors:
Hi @miketimofeev, I would have more examples to give you but unfortunately our log retention is low and I can only distinguish network problems from other errors on recent runs. Here are the ones I have identified: https://github.com/inloco/mactions/runs/5557501155?check_suite_focus=true
Hi @miketimofeev, here's some more from today (all three are network errors running https://github.com/futurelearn/mobile/actions/runs/2041784316/attempts/1
@nikita-nikolaev could you please analyze what went wrong with the macOS environments at that time?
Same issue today. I've had to restart all 7 of my failed jobs because of yarn network failures. It's not good advertising, having lost at least 50 minutes each time, 3 times over...
It's getting really bad, all my jobs are failing.
Code signing times out - can't run a build in under an hour that would normally take about 10 minutes. There's definitely some overall resource degradation, but networking appears to be the major bottleneck.
Here's how I'm getting around it for now, wrapping the yarn install deps command in a retry action.
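The commenter's own snippet is not preserved here. As a minimal sketch of the same idea, assuming a plain shell step and not any particular retry action, the install command can be wrapped in a small retry loop. The function name `retry` and its argument order are hypothetical, chosen for this illustration.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1 status=0
  while true; do
    "$@" && return 0
    status=$?   # exit status of the failed command
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts (exit $status)" >&2
      return "$status"
    fi
    echo "retry: attempt $attempt failed (exit $status); sleeping ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical usage in a workflow step: up to 3 attempts, 10 seconds apart.
#   retry 3 10 yarn install
```

Retrying only hides transient failures. Each extra attempt still consumes billable minutes, so a low attempt count is probably the safer choice on macOS runners.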
+1 on code signing timing out. macOS runners are billed at 10 times the per-minute rate of Linux runners, and they keep timing out. The personal plan's 3000 minutes usually run out quickly. This is ridiculous.
We contacted support and they informed us this is a yarn issue: yarnpkg/yarn#8242
We hit this multiple times a day on macOS 12 with various kinds of requests, even requests to GitHub itself (cloning repos, downloading releases), which is especially funny.
macOS 11 is deprecated now. We highly recommend switching to macOS-13 or newer because macOS 11 is unlikely to get a major overhaul.
Description
Over the last couple of weeks, all of our workflows that run on the macOS 11 runners have been experiencing very high failure rates due to network connectivity issues and timeouts. In fact, in the last few days pretty much every single workflow run has failed due to a network timeout, though not always on the same step.
Examples
Cache restore
Step definition:
Typical failure:
Yarn install
Step definition:
Typical failure:
Attempted workarounds
We have previously experimented with increasing the timeouts on these steps, but more often than not this just delayed the eventual failure (and increased wasted billable minutes). When things are working correctly the affected steps tend to complete in 30 or 40 seconds, so our configured timeouts already allow for these operations to take 4 to 8 times longer than usual.
Other observations
Virtual environments affected
Image version and build link
All our builds are private unfortunately.
Is it regression?
No. Last successful build had same image version.
Expected behavior
[...] yarn install) complete in a similar amount of time on each run
Actual behavior
[...] yarn install that take 30 seconds on successful builds often don't complete within the timeout period of up to 4 minutes
Repro steps