Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow download retrying if the computed and configured checksums differ. #4455

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jsfakian
Copy link
Contributor

@jsfakian jsfakian commented Dec 3, 2024

  • We have identified that downloading images under poor networking conditions from datastores might fail silently due to the SDK libraries we use for for Azure and AWS. This happens when we are installing new applications.
  • Another observation we made is that when we manually try re-installing the applications, they eventually succeed in the installation.
  • With this patch, we automate retrying the image download a fixed number of times (5 by default) if the computed versus the configure checksums differ.

@jsfakian jsfakian force-pushed the BUG-CI349-checksum-mismatch branch from c8cf95f to 213e610 Compare December 3, 2024 10:30
@github-actions github-actions bot requested a review from OhmSpectator December 3, 2024 10:31
if config.RefCount != 0 && (status.HasError() || status.State != types.DOWNLOADED) {
// or status is not downloaded or retry then do install
if config.RefCount != 0 && (status.HasError() ||
status.State != types.DOWNLOADED || config.DoRetry) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The State != types.DOWNLOADED condition looks suspicious to me. Isn't it too vague? Shouldn't it be at least status.State < types.DOWNLOADED && ... or even status.State == DOWNLOADING && ... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was like that before. Should we change it?

@jsfakian jsfakian force-pushed the BUG-CI349-checksum-mismatch branch from 213e610 to 9884a04 Compare December 10, 2024 15:39
@jsfakian jsfakian force-pushed the BUG-CI349-checksum-mismatch branch from 9884a04 to 9110498 Compare December 10, 2024 15:41
@jsfakian jsfakian force-pushed the BUG-CI349-checksum-mismatch branch from 9110498 to 619ad71 Compare December 13, 2024 15:49
@jsfakian jsfakian force-pushed the BUG-CI349-checksum-mismatch branch from 619ad71 to 9d1f5ae Compare December 16, 2024 11:24
// Retrying the download due to image verification failure
// This happens when: RefCount and retryCount is non-zero,
// and DownloaderStatus state is "downloaded"
} else if config.RefCount != 0 && config.BlobDownloadRetryCount != 0 &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that we can not match the if on line 402 when we intent to do a retry?
I think it would be safer to put this as the first if statement, and have line 402 in an else if statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted the "else if' as it was before.

Size: size,
Target: locFilename,
RefCount: refCount,
BlobDownloadRetryCount: blobDownloadRetryCount,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I find the "blob" part of the field and variable names out of place. Nowhere in the downloader code do we refer to "blobs". That is a concept and name which belongs in the volumemgr. So please drop "blob" from these names.

return
}
if m.RefCount == 0 {
log.Fatalf("retryDownload: Attempting to retry when "+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know that this can never happen?
If the verifier is running and detects that the sha is mismatched and at about the same time zedagent starts the process of deleting the content tree then couldn't the refcount drop to zero? A log.Warnf seems like a better choice.

status.Size = 0
status.Progress = 0
status.State = types.DOWNLOADING
status.BlobDownloadRetryCount++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. If you set it to config.BlobDownloadRetryCount I could understand it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I have the same understanding. If we take a counter approach, we only need to store the config value here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the BlobDownloadRetryCount to LastRetry. I like this approach better.

Comment on lines 240 to 239
blob.HasVerifierRef = false
blob.ClearErrorWithSource()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsehwere we clear HasVerifierRef immediately after the call to MaybeRemvoeVerifyImageConfig and we also clear blob.Path. I think grouping them that way makes it more consistent and correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the function MaybeRemoveVerifyImageConfig and added blob.HasVerifierRef = false in it. I did something similar for MaybeRemoveDownloadConfig.

- We have identified that downloading images under poor networking conditions
from datastores might fail silently due to the SDK libraries we use for
for Azure and AWS. This happens when we are installing new applications.
- Another observation we made is that when we manually try re-installing
the applications, they eventually succeed in the installation.
- With this patch, we automate retrying the image download a fixed number
of times (5 by default) if the computed versus the configure checksums differ.

Signed-off-by: Ioannis Sfakianakis <jsfakas@gmail.com>
@jsfakian jsfakian force-pushed the BUG-CI349-checksum-mismatch branch from 9d1f5ae to 99af741 Compare December 20, 2024 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants