Optimize S3 re-upload #2289
Open
bemoody wants to merge 4 commits into dev

Conversation

@bemoody (Collaborator) commented Sep 12, 2024

The "Create AWS bucket and send files" button currently causes every file in the project to be uploaded. Instead, it's better to check whether the destination files already exist, and only upload the files that are necessary.

  • If the background task gets interrupted and restarted (e.g., because we deployed a new version to the server) then it can avoid repeating a lot of the work that was done previously.

  • With care, we should be able to upload a large database to S3 from a secondary server (faster than uploading from the live server); after that's finished, click the "send files" button and it will simply check that everything was transferred correctly.

There are a bunch of caveats here; this only partially addresses the issues I raised in #1903 and #2103, but this is a modest change that should significantly speed up the ongoing process of releasing data on S3.

Limitations to note:

  • We still do not ever delete files from S3. If a file is removed from the published project, "send files to AWS" will not remove that file from the S3 bucket. If the deletion was intentional, it'll have to be removed from S3 (and GCS) manually. This is a long-standing issue with both GCS and S3.

  • Listing files per subdirectory (using Delimiter) isn't as efficient as listing all the files at once; it's a compromise to avoid downloading a potentially huge amount of data from S3 in one go and holding it all in memory (see the sketch after this list). This could be improved by using sorted_tree_files instead of os.walk.

  • And of course this only addresses S3, not GCS; it would be better to do this in a way that's portable across cloud providers.
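
A rough sketch of the per-subdirectory listing mentioned above (the helper name follows the commit messages further down; the returned fields are my assumption, not necessarily what this PR implements):

```python
import boto3


def list_s3_subdir_object_info(bucket, prefix):
    """List the objects directly under one subdirectory-style prefix.

    Using Delimiter='/' restricts each listing to a single directory
    level, so memory use stays bounded even for very large buckets, at
    the cost of one listing pass per subdirectory.
    """
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    info = {}
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix,
                                   Delimiter='/'):
        for obj in page.get('Contents', []):
            info[obj['Key']] = {'size': obj['Size'],
                                'etag': obj['ETag'].strip('"')}
    return info
```

Called once per directory while walking the local tree, this keeps only one directory's worth of object metadata in memory at a time.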

@bemoody (Collaborator, Author) commented Sep 12, 2024

Draft because I want to wait for the current batch of S3 uploads (~120 GB) to finish, so I can test with a small project.

@bemoody marked this pull request as ready for review September 16, 2024 21:59
Benjamin Moody added 3 commits September 17, 2024 11:33
get_s3_object_info() retrieves the metadata of a single existing
object in an S3 bucket.  list_s3_subdir_object_info() retrieves
metadata of all objects in a subdirectory-style prefix.

S3ObjectInfo contains the metadata that is needed in order to check
whether two S3 objects match, or to check whether an S3 object matches
a local file.
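
A minimal sketch of that lookup, assuming size and ETag are the fields used for comparison (the real S3ObjectInfo may carry different metadata):

```python
import dataclasses

import boto3
from botocore.exceptions import ClientError


@dataclasses.dataclass
class S3ObjectInfo:
    """Metadata used to decide whether an object matches a local file."""
    size: int
    etag: str


def get_s3_object_info(bucket, key):
    """Return the S3ObjectInfo for an existing object, or None if absent."""
    s3 = boto3.client('s3')
    try:
        response = s3.head_object(Bucket=bucket, Key=key)
    except ClientError as error:
        if error.response['Error']['Code'] == '404':
            return None
        raise
    return S3ObjectInfo(size=response['ContentLength'],
                        etag=response['ETag'].strip('"'))
```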

S3 has a complicated way of calculating the ETag.  If the object is
uploaded in a single request then the ETag is the MD5 hash of the
content, but if it's a multi-part upload then the ETag is a hash of
the hashes of the parts.

boto3, by default, splits files into 8 MiB chunks (and for files of
exactly 8 MiB, it does a "multi-part" upload with exactly one part);
but there is no guarantee that other clients will do the same.  It's
infeasible to verify the ETag in the general case without knowing the
algorithm used by the uploading client.  However, for our purposes, it
should be okay to assume that the file was uploaded by boto3 using the
default settings.
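
The PR refers to this as project.cloud.s3.s3_file_etag; the following is only a sketch of the algorithm described above, assuming boto3's default 8 MiB multipart threshold and chunk size, not the PR's actual implementation:

```python
import hashlib
import os

CHUNK_SIZE = 8 * 1024 * 1024  # boto3's default multipart chunk size


def s3_file_etag(path, chunk_size=CHUNK_SIZE):
    """Predict the ETag S3 reports for a file uploaded by boto3 with
    default transfer settings."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        if size < chunk_size:
            # Single-request upload: ETag is simply the MD5 of the content.
            return hashlib.md5(f.read()).hexdigest()

        # Multi-part upload (used for files of 8 MiB and larger, even if
        # there is only one part): ETag is the MD5 of the concatenated
        # binary MD5 digests of the parts, plus a "-<part count>" suffix.
        part_digests = []
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
        combined = hashlib.md5(b''.join(part_digests)).hexdigest()
        return '{}-{}'.format(combined, len(part_digests))
```
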
When trying to upload a project to S3, some files may already exist
and don't need to be re-uploaded.  For example, the background task
may be interrupted and retried, and in that case we don't want to
waste time and bandwidth uploading the same files again.
upload_project_to_S3() now checks which files already exist in the
destination bucket, and avoids re-uploading those that haven't
changed.
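
Tying these pieces together, the skip decision could look roughly like this (needs_upload is a hypothetical helper building on the S3ObjectInfo and s3_file_etag sketches above):

```python
import os


def needs_upload(local_path, existing_info):
    """Decide whether a local file must be uploaded, given the
    S3ObjectInfo for the existing object (or None if there is none)."""
    if existing_info is None:
        return True    # nothing in the bucket yet
    if existing_info.size != os.path.getsize(local_path):
        return True    # size mismatch: definitely changed
    # Same size: fall back to comparing the expected boto3-style ETag.
    return existing_info.etag != s3_file_etag(local_path)
```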

We want to test that this is the case, which we can do by adding
custom "object tagging": when an existing object is replaced by
re-uploading, any tags on the existing object are erased.

We also want to test that this works correctly for objects of 8 MiB
and larger, because the ETag calculation is not trivial (see
project.cloud.s3.s3_file_etag).
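
A hypothetical outline of that tagging trick (the PR's actual tests may be structured differently):

```python
import boto3


def check_object_not_replaced(bucket, key, rerun_upload):
    """Tag an object, re-run the upload, and verify the tag survived.

    A skipped file keeps its tags; a fresh PutObject would replace the
    object and erase them.
    """
    s3 = boto3.client('s3')
    s3.put_object_tagging(
        Bucket=bucket, Key=key,
        Tagging={'TagSet': [{'Key': 'marker', 'Value': 'before-rerun'}]})

    rerun_upload()  # e.g. run the "send files to AWS" task again

    tags = s3.get_object_tagging(Bucket=bucket, Key=key)['TagSet']
    assert {'Key': 'marker', 'Value': 'before-rerun'} in tags
```
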
@bemoody (Collaborator, Author) commented Sep 25, 2024

@Chrystinne can you please take a look at this?

A good way to test it is:

  • create a demo AWS bucket
  • upload a project such as demoecg to the bucket
  • manually add another file to the static/demoecg/10.5.24 directory
  • upload the project to AWS a second time

You should see that the new file is added to the bucket, but the modification times of the existing files don't change.
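
One way to check this is to snapshot LastModified for each object before and after the second upload; the bucket name and prefix below are placeholders:

```python
import boto3


def snapshot_last_modified(bucket, prefix):
    """Return {key: LastModified} for every object under the prefix."""
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    return {obj['Key']: obj['LastModified']
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for obj in page.get('Contents', [])}


before = snapshot_last_modified('my-demo-bucket', 'demoecg/10.5.24/')
# ... trigger "send files to AWS" a second time ...
after = snapshot_last_modified('my-demo-bucket', 'demoecg/10.5.24/')

new_keys = sorted(set(after) - set(before))
touched = [k for k in before if after.get(k) != before[k]]
print('new objects:', new_keys)
print('re-uploaded objects (should be empty):', touched)
```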

@tompollard requested a review from Chrystinne October 22, 2024 02:49
@Chrystinne (Contributor)

@bemoody I'm running the tests by following these steps:

  1. I created a new credentialed-access project named creden-20.
  2. Published it and submitted it to AWS.

However, when managing the new project in the admin console, I see the following:

AWS
The files have been sent to AWS. The bucket name is: creden-20.

This is what we should see instead:

AWS
The files have been sent to AWS. The bucket name is: controlled-data-dev-server.

Also, the bucket named creden-20 is not being created on AWS S3, either inside controlled-data-dev-server (as it should be) or anywhere else.
