
Introduce macro for retry of downloads with additive backoff #1722

Closed
wants to merge 1 commit

Conversation

@das-sein (Contributor) commented Mar 20, 2019

Resolves: #1667

Previous behavior: download failures during any installation process were entirely unrecoverable and would result in a critical failure.

New behavior: downloads are retried with an additive backoff delay: after each failed attempt, the delay (initialized to zero) is increased by the duration of that failed request. Three attempts are made by default, after which the normal critical failure occurs if the download is still failing.
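The described strategy can be sketched roughly as follows; this is an illustrative stand-in, not the PR's actual macro, and the helper name and signature are assumptions:

```rust
use std::thread;
use std::time::{Duration, Instant};

// Sketch of the additive-backoff retry described above (illustrative only).
// The delay starts at zero and, after each failed attempt, grows by the
// duration that attempt took.
fn retry_with_additive_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = Duration::from_secs(0);
    let mut attempts = 0;
    loop {
        let start = Instant::now();
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                attempts += 1;
                if attempts >= max_attempts {
                    // After the final attempt, surface the error as the
                    // normal critical failure.
                    return Err(e);
                }
                // Additive backoff: add the failed attempt's duration.
                delay += start.elapsed();
                thread::sleep(delay);
            }
        }
    }
}
```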

@das-sein force-pushed the retry-failed-downloads branch from 5864385 to f73b8c2 on March 20, 2019 17:53
@kinnison (Contributor) left a comment

As a basic concept it's quite sane, but before I could approve it, it'd be good to know whether there are any classes of error return where we should abandon the retry cycle and short-circuit out immediately.

@kinnison (Contributor)
Also, I'm not certain it needs to be a macro, unless there are other sites you're interested in annotating with it?

@nrc (Member) left a comment

As @kinnison said, I don't think this should be a macro. It looks like it could be a function which takes a closure argument.

( $( $call:expr )+ ) => {
    {
        let mut ret;
        let mut retries = 3;
Member

Might be good to be able to configure this somewhere or at least for it to be a const somewhere obvious.

Contributor

I filed #1667 and I'd like this to be configurable. Our main use case is CI, and the number of retries that makes sense for Travis CI is probably different from the one for AppVeyor, Cirrus CI, Azure, etc.

Also, 3 retries weren't enough for Travis, and we are now using 5 retries: https://github.com/rust-lang/libc/blob/master/ci/build.sh#L31

The backoff time of our retries includes the time it takes for rustup to fail after a download fails, so it might be slightly different from what's done here (larger, I expect), which means we might need an even larger number of retries here.
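One minimal way to make the attempt count configurable, as asked for here, would be an environment override on top of a const default. This is a sketch under assumptions: the variable name `RUSTUP_MAX_RETRIES` is illustrative, not necessarily what rustup implements.

```rust
use std::env;

// Default number of download attempts, kept as a const in one obvious place.
const DEFAULT_MAX_RETRIES: u32 = 3;

// Illustrative sketch only: RUSTUP_MAX_RETRIES is an assumed variable name.
// Unset or unparsable values fall back to the default.
fn max_retries() -> u32 {
    env::var("RUSTUP_MAX_RETRIES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_MAX_RETRIES)
}
```

A CI environment that needs more attempts could then export the variable in its job configuration without any rustup change.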

@rbtcollins (Contributor)

I think the following errors should not retry:

  • failure to write the file (permissions, disk space)
  • 4xx-class HTTP errors: these should only be retried after appropriate client-side corrections (e.g. adding the right credentials, then retrying).

Generally I'd say it's better to be narrow about retries: retrying can only fix a problem where things are basically OK but interrupted, e.g. UDP/TCP connectivity failures, or process failures at the far end (which show up variously as name-lookup errors, or dropped or stalled connections). Retrying 'just because' is a great way to have very slow failures, rather than very robust code.
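The distinction drawn above could be expressed as a predicate over error classes. The enum and names here are illustrative assumptions, not rustup's real error types:

```rust
// Illustrative error classes for a download; rustup's actual error types
// are richer than this sketch.
enum DownloadError {
    FileWrite(std::io::Error), // permissions, disk space: retrying won't help
    HttpStatus(u16),           // 4xx needs a client-side fix before retrying
    NameLookup,                // transient infrastructure failure
    ConnectionDropped,         // interrupted but basically OK
    Timeout,                   // stalled connection
}

// Narrow retry policy: only retry errors where things were basically OK
// but got interrupted.
fn is_retryable(err: &DownloadError) -> bool {
    match err {
        DownloadError::NameLookup
        | DownloadError::ConnectionDropped
        | DownloadError::Timeout => true,
        DownloadError::HttpStatus(code) => *code >= 500, // 5xx may be transient
        DownloadError::FileWrite(_) => false,
    }
}
```

Such a predicate would let the retry loop short-circuit out immediately on the non-retryable classes instead of burning through all attempts.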

@gnzlbg (Contributor) commented May 23, 2019

Retrying 'just because' is a great way to have very slow failures, rather than very robust code.

I don't disagree, but when I use rustup I expect it to succeed and performance in the failure path is not really important for my use cases.

That is, I'd rather have slow failures with this PR than the spurious failures we have now. One can always improve the performance of slow failures later, as long as that doesn't re-introduce spurious failures.

@kinnison (Contributor)

@gnzlbg Could you give an indication of the types of failures you're having to retry around? E.g. are these things like premature connection closure, or things like caching proxies claiming things don't exist when they actually should?

@gnzlbg (Contributor) commented May 23, 2019

Not really; IIRC we just get a report that the download failed. I recall something about the network timing out; I don't know whether that means the download did not start or that it was interrupted, but when using curl directly we see both cases every now and then. We download our Android tools with curl, and I just bumped their retry count to 20 today because Travis CI was having a bad day and 10 apparently wasn't enough; that fixed the issue =/

@rbtcollins (Contributor)

These are the errors I see:

rustup toolchain install nightly
info: syncing channel updates for 'nightly-x86_64-pc-windows-msvc'
info: latest update on 2019-05-23, rust version 1.36.0-nightly (37ff5d388 2019-05-22)
info: downloading component 'rustc'
 41.1 MiB /  60.6 MiB ( 68 %)   1.9 MiB/s in  1m  1s ETA: 10s
error: component download failed for rustc-x86_64-pc-windows-msvc
info: caused by: could not download file from 'https://static.rust-lang.org/dist/2019-05-23/rustc-nightly-x86_64-pc-windows-msvc.tar.xz' to 'C:\Users\robertc\.rustup\downloads\ac379c4ba3b9faf074b7e9684c430d95ceb07053756f78b3c6f31d39d5d8934f.partial'
info: caused by: error reading from socket
info: caused by: timed out

Fibre 1 Gbps IPv6/v4 dual-stack connection; the artifacts usually take 2-3 seconds to download, so yar.

Successfully merging this pull request may close these issues.

Retry downloads
5 participants