Optimize S3 re-upload #2289
base: dev
Conversation
Draft because I want to wait for the current batch of S3 uploads (~120 GB) to finish, so I can test with a small project.
get_s3_object_info() retrieves the metadata of a single existing object in an S3 bucket. list_s3_subdir_object_info() retrieves metadata for all objects under a subdirectory-style prefix. S3ObjectInfo contains the metadata needed to check whether two S3 objects match, or whether an S3 object matches a local file.

S3 has a complicated way of calculating the ETag. If the object is uploaded in a single request, the ETag is the MD5 hash of the content; if it's a multi-part upload, the ETag is a hash of the hashes of the parts. boto3, by default, splits files into 8 MiB chunks (and for files of exactly 8 MiB, it does a "multi-part" upload with exactly one part), but there is no guarantee that other clients will do the same. It's infeasible to verify the ETag in the general case without knowing the algorithm used by the uploading client. However, for our purposes, it should be okay to assume that the file was uploaded by boto3 using the default settings.
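To make that ETag rule concrete, here is a minimal sketch of how the expected ETag for a local file could be computed under the boto3-default assumption. expected_s3_etag is a hypothetical name for illustration only and is not necessarily the same logic as the project's own helper:

```python
import hashlib
import os

# boto3's default multipart threshold / chunk size (8 MiB).
PART_SIZE = 8 * 1024 * 1024


def expected_s3_etag(path, part_size=PART_SIZE):
    """ETag that S3 would report for this file, assuming it was uploaded
    by boto3 with the default transfer settings (sketch only)."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        if size < part_size:
            # Single-request upload: the ETag is the plain MD5 of the content.
            return hashlib.md5(f.read()).hexdigest()
        # Multipart upload (even when there is exactly one 8 MiB part):
        # the ETag is the MD5 of the concatenated per-part MD5 digests,
        # followed by "-" and the number of parts.
        digests = []
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).digest())
        combined = hashlib.md5(b''.join(digests)).hexdigest()
        return '{}-{}'.format(combined, len(digests))
```

A client that uploaded with a different part size would produce a different ETag for the same content, which is why this check is only valid under the "uploaded by boto3 with default settings" assumption.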
When trying to upload a project to S3, some files may already exist and don't need to be re-uploaded. For example, the background task may be interrupted and retried, and in that case we don't want to waste time and bandwidth uploading the same files again.
upload_project_to_S3() should now check which files already exist in the destination bucket and avoid re-uploading those that haven't changed. We can test that this is the case by adding custom "object tagging": when an existing object is replaced by re-uploading, any tags on the existing object are erased. We also want to test that this works correctly for objects of 8 MiB and larger, because the ETag calculation is not trivial (see project.cloud.s3.s3_file_etag).
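To illustrate the skip-if-unchanged idea, here is a rough sketch against plain boto3 (not the actual upload_project_to_S3() code). The bucket and prefix names are placeholders, and expected_s3_etag is the hypothetical helper sketched above:

```python
import os
import boto3


def upload_dir_skipping_unchanged(local_dir, bucket, prefix):
    """Upload a directory tree to S3, skipping files whose size and ETag
    already match the existing object (illustration only).
    `prefix` is assumed to end with '/'."""
    s3 = boto3.client('s3')

    # Collect size and ETag for objects that already exist under the prefix.
    existing = {}
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            existing[obj['Key']] = (obj['Size'], obj['ETag'].strip('"'))

    for root, _, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = prefix + os.path.relpath(path, local_dir).replace(os.sep, '/')
            info = existing.get(key)
            if info is not None and info == (os.path.getsize(path),
                                             expected_s3_etag(path)):
                continue  # unchanged: no need to re-upload
            s3.upload_file(path, bucket, key)
```

The expensive ETag computation only happens for files that already have a matching key in the bucket, so a fresh upload pays essentially no extra cost.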
Force-pushed from 02dee05 to 675a409.
@Chrystinne can you please take a look at this? A good way to test it is:
You should see that the new file is added to the bucket, but the modification times of the existing files don't change.
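One lightweight way to compare modification times before and after re-running the task is to list each object's LastModified timestamp with boto3 (a sketch only; the bucket and prefix names below are placeholders):

```python
import boto3

# Print LastModified for every object under the project prefix.
# Run this before and after "send files to AWS" and diff the output.
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-test-bucket', Prefix='my-project/'):
    for obj in page.get('Contents', []):
        print(obj['LastModified'].isoformat(), obj['Key'])
```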
@bemoody I'm running the tests by following these steps:
[AWS console screenshot] However, this is what we should see instead: [AWS console screenshot] Also, the bucket named creden-20 is not being created on AWS S3, either inside controlled-data-dev-server as it should be or elsewhere.
The "Create AWS bucket and send files" button currently causes every file in the project to be uploaded. Instead, it's better to check whether the destination files already exist, and only upload the files that are necessary.
If the background task gets interrupted and restarted (e.g., because we deployed a new version to the server) then it can avoid repeating a lot of the work that was done previously.
With care, we should be able to upload a large database to S3 from a secondary server (faster than uploading from the live server); after that's finished, click the "send files" button and it will simply check that everything was transferred correctly.
There are a bunch of caveats here; this only partially addresses the issues I raised in #1903 and #2103, but this is a modest change that should significantly speed up the ongoing process of releasing data on S3.
Limitations to note:
- We still do not ever delete files from S3. If a file is removed from the published project, "send files to AWS" will not remove that file from the S3 bucket. If the deletion was intentional, the file will have to be removed from S3 (and GCS) manually. This is a long-standing issue with both GCS and S3.
- Listing files per subdirectory (using Delimiter) isn't as efficient as listing all the files at once; see the sketch below. This is a compromise to avoid downloading a potentially huge amount of data from S3 all at once and keeping it in memory. However, this could be improved by using sorted_tree_files instead of os.walk.
- And of course this only addresses S3, not GCS; it would be better to do this in a way that's portable across cloud providers.
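For reference, per-subdirectory listing with Delimiter might look roughly like the following sketch (list_s3_subdir is a hypothetical name, distinct from the project's list_s3_subdir_object_info(), which has its own return type):

```python
import boto3


def list_s3_subdir(bucket, prefix):
    """List the objects and the immediate sub-"directories" under a single
    prefix, one level at a time, rather than the whole bucket at once."""
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    objects, subdirs = {}, []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix,
                                   Delimiter='/'):
        for obj in page.get('Contents', []):
            objects[obj['Key']] = (obj['Size'], obj['ETag'].strip('"'))
        for cp in page.get('CommonPrefixes', []):
            subdirs.append(cp['Prefix'])
    return objects, subdirs
```

Walking the tree one prefix at a time keeps memory usage bounded to a single directory's worth of metadata, at the cost of one listing request per directory instead of one sweep over the whole bucket.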