
Add retries to NetDownload intrinsic. #16798

Merged 4 commits on Sep 13, 2022

Conversation

@danxmoran (Contributor) commented on Sep 8, 2022

Closes #6818

For now, hard-code a retry strategy of 10ms, 100ms, 1s, 10s.

[ci skip-build-wheels]
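For context, here is a minimal sketch (not part of the PR) of how that schedule falls out of tokio-retry's ExponentialBackoff; the take(4) count here is only to show the four delays listed above, and the real code also applies jitter:

```rust
use std::time::Duration;
use tokio_retry::strategy::ExponentialBackoff;

fn main() {
    // `from_millis(10)` yields 10^1, 10^2, 10^3, ... milliseconds, so four delays
    // give the 10ms / 100ms / 1s / 10s schedule described above (before jitter).
    let delays: Vec<Duration> = ExponentialBackoff::from_millis(10).take(4).collect();
    assert_eq!(
        delays,
        vec![
            Duration::from_millis(10),
            Duration::from_millis(100),
            Duration::from_secs(1),
            Duration::from_secs(10),
        ]
    );
    println!("retry delays: {delays:?}");
}
```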

@stuhood (Member) left a comment:


Thanks!

```rust
  })
};

// TODO: Allow the retry strategy to be configurable?
```
Member:

Fine to leave as a TODO I think, assuming the error is enriched.

src/rust/engine/src/downloads.rs (outdated, resolved)
```rust
// Retry transient failures with jittered exponential backoff: up to 3 retries with
// base delays of 10ms, 100ms, and 1000ms. The bool in the `(String, bool)` error
// marks whether the failure is retryable; `RetryIf` only retries when it is true.
let retry_strategy = ExponentialBackoff::from_millis(10).map(jitter).take(3);
let response = RetryIf::spawn(retry_strategy, try_download, |err: &(String, bool)| err.1)
  .await
  .map_err(|(err, _)| err)?;
```
Member:

As mentioned: I think it's totally fine to TODO making it configurable, but the error message should likely be enriched to give some information about the retries: i.e.:

Suggested change:

```diff
-    .map_err(|(err, _)| err)?;
+    .map_err(|(err, _)| format!("After {num_attempts} attempts: {err}"))?;
```

...so that the next person who comes along has a clear hint of where to go looking if they want to add configurability.

@danxmoran (Contributor Author) commented on Sep 8, 2022:

I agree this would be nice, but I'm not sure we have access to the true num_attempts (I think it's hidden away inside the RetryIf::spawn implementation) 🤔. Also, if we hit a 4xx error we won't retry at all.

Instead of modifying this error, what do you think of adding logging to the try_download branches, so we can log something like "hit 5xx error, retrying" or "hit 4xx error, not retrying"?
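A rough sketch of what that per-branch logging could look like (hypothetical: the real try_download body isn't shown in this thread, the (String, bool) error is assumed to be (message, is_retryable), and the status type is assumed to expose is_server_error/is_client_error):

```rust
use http::StatusCode; // assumed status type; any type with these predicates would do
use log::warn;

// Hypothetical helper mirroring the error shape used by the `RetryIf` condition:
// the bool marks whether the failure should be retried, and we log the decision.
fn classify_status(url: &str, status: StatusCode) -> Result<(), (String, bool)> {
    if status.is_server_error() {
        warn!("Failed to fetch {url}: HTTP {status}, retrying");
        Err((format!("Failed to fetch {url}: HTTP {status}"), true))
    } else if status.is_client_error() {
        warn!("Failed to fetch {url}: HTTP {status}, not retrying");
        Err((format!("Failed to fetch {url}: HTTP {status}"), false))
    } else {
        Ok(())
    }
}
```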

Contributor Author:

Looks like RetryIf::spawn only cares that the first arg implements IntoIterator with Duration items, so I could write a thin wrapper around ExponentialBackoff that counts/logs the retry attempts as part of next().
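Roughly this kind of wrapper, assuming tokio-retry's ExponentialBackoff (the names here are illustrative, not from the PR):

```rust
use std::time::Duration;

// Illustrative wrapper: `RetryIf::spawn` only needs an `IntoIterator` of
// `Duration`s, so any backoff iterator (e.g. tokio-retry's `ExponentialBackoff`,
// with or without jitter) can be wrapped to count and log each retry as its delay
// is requested via `next()`.
struct LoggingBackoff<I> {
    inner: I,
    attempt: usize,
}

impl<I> LoggingBackoff<I> {
    fn new(inner: I) -> Self {
        Self { inner, attempt: 0 }
    }
}

impl<I: Iterator<Item = Duration>> Iterator for LoggingBackoff<I> {
    type Item = Duration;

    fn next(&mut self) -> Option<Duration> {
        let delay = self.inner.next()?;
        self.attempt += 1;
        log::warn!("Download failed, scheduling retry {} after {:?}", self.attempt, delay);
        Some(delay)
    }
}
```

Usage would then be roughly RetryIf::spawn(LoggingBackoff::new(retry_strategy), try_download, ...) with the existing condition closure.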

Member:

> I agree this would be nice, but I'm not sure we have access to the true num_attempts (I think it's hidden away inside the RetryIf::spawn implementation) 🤔. Also, if we hit a 4xx error we won't retry at all.

I just meant extracting the constant "3" (or 4) from the code above, and then using it in two places.
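In other words, something along these lines (a sketch of the suggestion applied to the snippet above; MAX_RETRIES is a made-up name):

```rust
// Name the retry count once, then use it both for the backoff schedule and for
// the enriched error message.
const MAX_RETRIES: usize = 3;

let retry_strategy = ExponentialBackoff::from_millis(10).map(jitter).take(MAX_RETRIES);
let response = RetryIf::spawn(retry_strategy, try_download, |err: &(String, bool)| err.1)
    .await
    .map_err(|(err, _)| format!("After {} attempts: {err}", MAX_RETRIES + 1))?;
```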

Member:

Ooooh, I see what you meant here; I was missing the RetryIf aspect of this. Hm, it's a bit awkward to lose the conditional retry for client errors... between the two, not reporting the number of retries would be preferable? Sorry for the pivot.

Contributor Author:

No problem, updated

@danxmoran (Contributor Author) commented:

@stuhood comments addressed!

@stuhood added this to the 2.13.x milestone on Sep 9, 2022
@stuhood enabled auto-merge (squash) on September 12, 2022 17:07
@stuhood (Member) commented on Sep 12, 2022:

Interesting... two timeouts in a row for src/python/pants/backend/python/util_rules/pex_test.py:tests. Will see if I can reproduce.

EDIT: No repro... 120s the first time, and 64s the second. Bumping the timeout.

@stuhood (Member) commented on Sep 12, 2022:

Wow... it seems that the failure in pex_test.py is very reproducible. The fact that timeouts don't currently render any of the output from the timed out process is annoying: I'll take a look at whether we can fix that.

@stuhood (Member) commented on Sep 12, 2022:

> Wow... it seems that the failure in pex_test.py is very reproducible. The fact that timeouts don't currently render any of the output from the timed out process is annoying: I'll take a look at whether we can fix that.

Mm, @Eric-Arellano is seeing this in #16795 as well, so it seems like it might be a network issue independent of this PR.

Regardless: have opened #16841 to preserve stdio.

For now, hard-code a retry strategy of 10ms, 100ms, 1000ms.

This reverts commit beec08e.
@stuhood merged commit f72b8c0 into pantsbuild:main on Sep 13, 2022
@Eric-Arellano (Contributor) commented:

@stuhood this was targeted to 2.13, but should probably be only for 2.14 instead?

Eric-Arellano pushed a commit to Eric-Arellano/pants that referenced this pull request on Sep 15, 2022.
@Eric-Arellano modified the milestones: 2.13.x, 2.14.x on Sep 15, 2022
@stuhood (Member) commented on Sep 15, 2022:

> @stuhood this was targeted to 2.13, but should probably be only for 2.14 instead?

Possibly. I felt that it might qualify as a bugfix. @danxmoran: Will you want this in 2.13?

Eric-Arellano added a commit that referenced this pull request on Sep 15, 2022.
@danxmoran (Contributor Author) commented:

> Will you want this in 2.13?

It would be nice; we do hit flakiness in CI semi-regularly that this would address. That said, if it'd be a while until a 2.13.1 is released then it might not be worth the time. I'm currently working through getting us onto v2.14.

@danxmoran deleted the danxmoran/net-download-retry branch on September 15, 2022 16:53
@Eric-Arellano (Contributor) commented:

> That said, if it'd be a while until a 2.13.1 is released then it might not be worth the time

It doesn't need to be a long time, particularly if you can be a squeaky wheel and encourage us to do more frequent 2.13 releases :) I'll get out a 2.13 release today with it.

Eric-Arellano pushed a commit to Eric-Arellano/pants that referenced this pull request on Sep 15, 2022.
Eric-Arellano added a commit that referenced this pull request on Sep 15, 2022.
stuhood added a commit that referenced this pull request on Oct 20, 2022:
As reported in #17294, if an HTTP stream is interrupted after it has opened, the retry that was added in #16798 won't kick in.

This change moves the retry up a level to wrap the entire download attempt, and adds a test of recovering from "post-header" errors.

Fixes #17294.
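Roughly, that restructuring looks like this (a sketch, not the actual #17298 diff; attempt_download is a hypothetical helper standing in for the full fetch, including streaming the body to its destination):

```rust
// Sketch of the follow-up's shape: the retry now wraps the *entire* attempt
// (issuing the request, reading headers, and streaming the body), so an
// interruption mid-stream is also retried rather than only failures on the
// initial request.
let retry_strategy = ExponentialBackoff::from_millis(10).map(jitter).take(3);
let downloaded = RetryIf::spawn(
    retry_strategy,
    || attempt_download(&url), // hypothetical helper performing the whole download
    |err: &(String, bool)| err.1,
)
.await
.map_err(|(err, _)| err)?;
```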
stuhood added commits to stuhood/pants that referenced this pull request on Oct 21, 2022 (pantsbuild#17298).
stuhood added a commit that referenced this pull request on Oct 21, 2022 (cherry-pick of #17298) (#17302).
Successfully merging this pull request may close these issues:

Add retries to DownloadFile intrinsic