Replace MD5 calculations with use of S3 hashes #728

Open
lladdy opened this issue Jan 14, 2024 · 6 comments

@lladdy
Contributor

lladdy commented Jan 14, 2024

No description provided.

@lladdy lladdy self-assigned this Jan 14, 2024
@lladdy lladdy moved this to 📋 Backlog in General Tasks Jan 14, 2024
@lladdy lladdy moved this from 📋 Backlog to 🏗 Todo in General Tasks Jan 21, 2024
@lladdy
Contributor Author

lladdy commented Jan 21, 2024

Double-check that the etags of our S3 files are always MD5 hashes.
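One way to approach this check is to look at the shape of each etag. The sketch below is an illustration only: `classify_etag` and its category names are hypothetical, not part of this project. It relies on the documented behaviour that multipart uploads produce etags ending in `-<part count>`, and it can only say that a value looks like an MD5 hex digest, since objects encrypted with SSE-KMS or SSE-C carry etags that are not MD5 digests of the content.

```python
import re

# 32 lowercase hex characters: the shape of a plain MD5 hex digest.
_MD5_LIKE_RE = re.compile(r"^[0-9a-f]{32}$")
# 32 hex characters, a dash, then the part count: the multipart shape.
_MULTIPART_RE = re.compile(r"^[0-9a-f]{32}-([1-9][0-9]*)$")


def classify_etag(etag):
    """Return ("md5_like", 1), ("multipart", part_count) or ("unknown", None).

    Hypothetical helper: it inspects only the format of the etag, so
    "md5_like" does not prove the value is the MD5 of the object content.
    """
    # Etags are usually returned wrapped in double quotes; drop them.
    value = etag.strip().strip('"').lower()
    if _MD5_LIKE_RE.match(value):
        return ("md5_like", 1)
    match = _MULTIPART_RE.match(value)
    if match:
        return ("multipart", int(match.group(1)))
    return ("unknown", None)
```

Any object classified as "multipart" or "unknown" would need separate handling before its etag could stand in for a locally computed MD5.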

@lladdy
Contributor Author

lladdy commented Jan 26, 2024

@lladdy
Contributor Author

lladdy commented Feb 4, 2024

Items of note:
Multipart chunk size depends on the upload client.
The chunk size for the tested bot files appears to be 8 mebibytes.
AWS users have reported that the chunk size is not necessarily consistent across chunks (though this is potentially not a concern for us).
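Because the chunk size is not recorded in the etag, it has to be guessed when recomputing a hash locally. A rough sketch of narrowing the guess: given a file size and the part count taken from the etag suffix, list which candidate chunk sizes would produce that many parts. The function name and the candidate list are assumptions made for illustration (8 MiB matches the tested bot files, 5 MiB is the S3 minimum part size, the rest are arbitrary), and the sketch assumes every part except the last has the same size.

```python
MIB = 1024 * 1024


def compatible_chunk_sizes(file_size, part_count, candidates_mib=(5, 8, 15, 16, 32, 64)):
    """Return the candidate chunk sizes (in bytes) that would split
    file_size bytes into exactly part_count parts.

    Hypothetical helper; assumes uniform parts apart from the last one.
    """
    matches = []
    for mib in candidates_mib:
        chunk = mib * MIB
        # Integer ceiling division: number of parts for this chunk size.
        parts = -(-file_size // chunk)
        if parts == part_count:
            matches.append(chunk)
    return matches
```

For example, a 20 MiB file with a `-3` etag suffix is compatible only with the 8 MiB candidate, while the same file with a `-2` suffix is compatible with both 15 MiB and 16 MiB, so the part count alone does not always pin down the chunk size.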

Reference Python implementation:

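import hashlib


def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    # Collect one MD5 hash object per chunk_size-sized chunk of the file.
    md5s = []

    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))

    # Empty file: return the MD5 of zero bytes.
    if len(md5s) < 1:
        return hashlib.md5().hexdigest()

    # Single chunk: return the plain MD5 of the content (unquoted).
    if len(md5s) == 1:
        return md5s[0].hexdigest()

    # Multiple chunks: MD5 of the concatenated binary chunk digests,
    # followed by "-<chunk count>", wrapped in double quotes.
    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))


md5 = calculate_s3_etag("/path/to/bot.zip")
print(md5)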
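Note that the reference function returns the multi-chunk value wrapped in double quotes but the single-chunk and empty-file values without them, while S3 responses normally include the quotes. A small sketch of a quote-insensitive comparison follows; `normalize_etag` and `etags_match` are hypothetical names, not existing project code.

```python
def normalize_etag(etag):
    """Strip surrounding whitespace and double quotes, and lowercase the hex."""
    return etag.strip().strip('"').lower()


def etags_match(local_etag, remote_etag):
    """Compare two etags while ignoring quoting and letter case."""
    return normalize_etag(local_etag) == normalize_etag(remote_etag)
```

With this, a locally computed `d41d8cd98f00b204e9800998ecf8427e` compares equal to a remote `"d41d8cd98f00b204e9800998ecf8427e"`.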

@lladdy
Contributor Author

lladdy commented Feb 11, 2024

I've deployed a new AC API version which uses S3 etags: #747

@lladdy
Contributor Author

lladdy commented Feb 12, 2024

After the ACs are updated, we can hopefully remove the manual hash calc code.

@lladdy
Contributor Author

lladdy commented Mar 10, 2024

Dan might have time to update the ACs in the week starting 17 March.
