
Flakiness downloading codecov-bash #1231

Closed
njsmith opened this issue Oct 4, 2019 · 3 comments · Fixed by #1300

Comments

@njsmith
Member

njsmith commented Oct 4, 2019

So this is baffling to me, and I know it has nothing to do with anything we did, but it's super frustrating: for some reason, running curl https://codecov.io/bash has recently become extremely flaky, it's making our CI runs fail all the time, and we've got to do something.

Observations:

The most common failure seems to be a timeout during connect. I've also seen this, which is super weird:

+curl --retry 5 -o codecov.sh https://codecov.io/bash
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:15 --:--:--     0
curl: (35) gnutls_handshake() failed: Error in the pull function.

According to the curl docs, curl --retry 5 should retry on timeouts, but... I can't tell whether it's actually doing that.
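
One quick check (hypothetical, I haven't actually put this in CI yet): run the download verbosely, since curl announces on stderr whenever --retry kicks in:

  # Hypothetical diagnostic run (not in ci.sh): -v makes the individual
  # connection attempts visible, and whenever --retry actually fires, curl
  # prints a "Will retry in N seconds" warning, so the log would show
  # whether any retries are happening at all.
  curl -v --retry 5 -o codecov.sh https://codecov.io/bash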

Codecov as a whole is kind of flaky, so it's tempting to blame this on them. But I'm not sure this is actually their fault... codecov.io appears to be hosted by Google (based on running whois on the IP address), so I'm guessing it's behind some Google CDN or something? (Or maybe it's just some random server running on Google Cloud with no CDN in front of it, who knows?)

The failures seem to all be from Travis, not Azure. And I'm pretty sure Travis is running on Google Cloud servers. Wild supposition: maybe Travis's traffic is considered "internal" to Google Cloud and doesn't hit the CDN, while Azure's traffic is "external" and does hit the CDN? I have no idea. I guess we could try running traceroute from both Travis and Azure, but that seems like it might be heading down an unproductive rabbit hole.

...oh wait, and actually it just failed on azure too, so never mind: https://dev.azure.com/python-trio/trio/_build/results?buildId=1031&view=logs&jobId=872bf439-86bb-5ce5-edcd-c35619d700a0

So... what the heck can we do about this? The obvious answer is to retry, but curl's --retry isn't working. Is that because there's something wrong with --retry, or is it because when we hit one of these failures, it's somehow "sticky"? Would putting our own retry loop around curl help?
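
Something like this, maybe? (Just a sketch, untested; whatever actually lands in ci.sh might need more care around partial downloads.)

  # Hypothetical manual retry loop around curl: make up to 5 attempts
  # ourselves, sleeping between failures, and only fail the build if every
  # attempt fails.
  failed=true
  for attempt in 1 2 3 4 5; do
      if curl --connect-timeout 30 -o codecov.sh https://codecov.io/bash; then
          failed=false
          break
      fi
      echo "codecov download failed (attempt $attempt); retrying in 10 seconds"
      sleep 10
  done
  if $failed; then
      echo "giving up on downloading codecov-bash"
      exit 1
  fi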

I guess we could also, like, stash a copy of codecov-bash somewhere more reliable, but that seems like it would create all kinds of operational annoyances.

@smurfix
Contributor

smurfix commented Oct 4, 2019

Replace curl with wget?

oremanj closed this as completed in 1a09cf9 on Oct 4, 2019
@njsmith
Member Author

njsmith commented Oct 4, 2019

So weirdly enough, it turns out this is actually an issue with curl --retry, and we were the first to notice and report it. Confirmed here: curl/curl#4461

Maybe wget would be a good option too, idk? At this point we have a workaround for the curl issue merged into master, so I'm inclined to wait until it breaks again before we touch anything.

@njsmith
Member Author

njsmith commented Oct 25, 2019

Apparently curl's --retry option is still not enough:

+curl --connect-timeout 5 --retry 5 -o codecov.sh https://codecov.io/bash
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:23 --:--:--     0
curl: (52) Empty reply from server

The command "./ci.sh" exited with 52.

We're not the only ones who noticed: pytest-dev/pytest#5951 (comment)

Maybe wget would be a better choice, like @smurfix suggested? I didn't know this, but the man page says:

  Wget has been designed for robustness over slow or unstable network
  connections; if a download fails due to a network problem, it will keep
  retrying until the whole file has been retrieved.  If the server
  supports regetting, it will instruct the server to continue the
  download from where it left off.
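
Concretely, the curl line could become something like this (flag choices are my reading of the wget man page, not something I've tested in our CI yet):

  # Hypothetical wget replacement for the curl download: --tries bounds the
  # number of attempts, --timeout caps each network operation, and
  # --retry-connrefused makes wget retry even when the connection is refused.
  wget --tries=5 --timeout=30 --retry-connrefused -O codecov.sh https://codecov.io/bash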

njsmith reopened this Oct 25, 2019
njsmith added a commit to njsmith/trio that referenced this issue Nov 9, 2019
In the hopes that it copes better with flaky servers/networks than
curl does.

The hope is that this will finally resolve python-triogh-1231.
njsmith added a commit to njsmith/trio that referenced this issue Nov 9, 2019
The hope is that this will finally resolve python-triogh-1231.
belm0 pushed a commit that referenced this issue Nov 10, 2019
The hope is that this will finally resolve gh-1231.