-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow download retrying if the computed and configured checksums differ. #4455
base: master
Are you sure you want to change the base?
Conversation
jsfakian
commented
Dec 3, 2024
- We have identified that downloading images under poor networking conditions from datastores might fail silently due to the SDK libraries we use for for Azure and AWS. This happens when we are installing new applications.
- Another observation we made is that when we manually try re-installing the applications, they eventually succeed in the installation.
- With this patch, we automate retrying the image download a fixed number of times (5 by default) if the computed versus the configure checksums differ.
dbcf812
to
c8cf95f
Compare
c8cf95f
to
213e610
Compare
if config.RefCount != 0 && (status.HasError() || status.State != types.DOWNLOADED) { | ||
// or status is not downloaded or retry then do install | ||
if config.RefCount != 0 && (status.HasError() || | ||
status.State != types.DOWNLOADED || config.DoRetry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The State != types.DOWNLOADED
condition looks suspicious to me. Isn't it too vague? Shouldn't it be at least status.State < types.DOWNLOADED && ...
or even status.State == DOWNLOADING && ...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was like that before. Should we change it?
213e610
to
9884a04
Compare
9884a04
to
9110498
Compare
9110498
to
619ad71
Compare
619ad71
to
9d1f5ae
Compare
// Retrying the download due to image verification failure | ||
// This happens when: RefCount and retryCount is non-zero, | ||
// and DownloaderStatus state is "downloaded" | ||
} else if config.RefCount != 0 && config.BlobDownloadRetryCount != 0 && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure that we can not match the if on line 402 when we intent to do a retry?
I think it would be safer to put this as the first if statement, and have line 402 in an else if statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reverted the "else if' as it was before.
Size: size, | ||
Target: locFilename, | ||
RefCount: refCount, | ||
BlobDownloadRetryCount: blobDownloadRetryCount, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I find the "blob" part of the field and variable names out of place. Nowhere in the downloader code do we refer to "blobs". That is a concept and name which belongs in the volumemgr. So please drop "blob" from these names.
return | ||
} | ||
if m.RefCount == 0 { | ||
log.Fatalf("retryDownload: Attempting to retry when "+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know that this can never happen?
If the verifier is running and detects that the sha is mismatched and at about the same time zedagent starts the process of deleting the content tree then couldn't the refcount drop to zero? A log.Warnf seems like a better choice.
status.Size = 0 | ||
status.Progress = 0 | ||
status.State = types.DOWNLOADING | ||
status.BlobDownloadRetryCount++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird. If you set it to config.BlobDownloadRetryCount I could understand it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I have the same understanding. If we take a counter approach, we only need to store the config value here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the BlobDownloadRetryCount to LastRetry. I like this approach better.
blob.HasVerifierRef = false | ||
blob.ClearErrorWithSource() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsehwere we clear HasVerifierRef immediately after the call to MaybeRemvoeVerifyImageConfig and we also clear blob.Path. I think grouping them that way makes it more consistent and correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the function MaybeRemoveVerifyImageConfig and added blob.HasVerifierRef = false in it. I did something similar for MaybeRemoveDownloadConfig.
- We have identified that downloading images under poor networking conditions from datastores might fail silently due to the SDK libraries we use for for Azure and AWS. This happens when we are installing new applications. - Another observation we made is that when we manually try re-installing the applications, they eventually succeed in the installation. - With this patch, we automate retrying the image download a fixed number of times (5 by default) if the computed versus the configure checksums differ. Signed-off-by: Ioannis Sfakianakis <jsfakas@gmail.com>
9d1f5ae
to
99af741
Compare