-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Add timeout/retry logic to docker cache download #13573
Add timeout/retry logic to docker cache download #13573
Conversation
ci/docker_cache.py
Outdated
@@ -131,14 +134,15 @@ def _login_dockerhub(): | |||
logging.error(p.stderr) | |||
|
|||
# Linear backoff | |||
time.sleep(1000 * DOCKERHUB_RETRY_SECONDS * (i + 1)) | |||
time.sleep(DOCKERHUB_RETRY_SECONDS * (i + 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use @Retry decorator, it's in util under CI I think. just so we don't reimplement retry logic everytime. Or maybe you can abstract this in another retry_with_timeout deco.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks for the suggestion :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, why not also above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to keep this PR specific to this method, but I guess it makes more sense to add it also, done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, LGTM
02b60f1
to
b25984a
Compare
@mxnet-label-bot update[CI, pr-awaiting-merge] |
@marcoabreu @KellenSunderland Can someone merge this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for working on this @jlcontreras
* Added timeout/retry (linear backoff) to docker cache download * Units changed, as time.sleep takes seconds as argument * Improved error handling * Using retry decorator * Added retry decorator to _login_dockerhub method * Fixed wrong import
* Added timeout/retry (linear backoff) to docker cache download * Units changed, as time.sleep takes seconds as argument * Improved error handling * Using retry decorator * Added retry decorator to _login_dockerhub method * Fixed wrong import
Dockerhub sometimes has unstable behavior, causing downloads to take much longer than they should. Average download time is below 2min, but in some rare cases it can take more than 1h (see http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/2033/pipeline/ for example)
This PR wraps the cache download call with timeout/retry logic as a fix