Optimize S3 re-upload #2289
base: dev
Conversation
Draft because I want to wait for the current batch of S3 uploads (~120 GB) to finish, so I can test with a small project.
get_s3_object_info() retrieves the metadata of a single existing object in an S3 bucket. list_s3_subdir_object_info() retrieves metadata for all objects under a subdirectory-style prefix. S3ObjectInfo contains the metadata needed to check whether two S3 objects match, or whether an S3 object matches a local file.

S3 has a complicated way of calculating the ETag. If the object is uploaded in a single request, the ETag is the MD5 hash of the content; if it's a multi-part upload, the ETag is a hash of the hashes of the parts. boto3, by default, splits files into 8 MiB chunks (and for files of exactly 8 MiB, it does a "multi-part" upload with exactly one part), but there is no guarantee that other clients will do the same. It's infeasible to verify the ETag in the general case without knowing the algorithm used by the uploading client. However, for our purposes, it should be okay to assume that the file was uploaded by boto3 using the default settings.
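To make that ETag rule concrete, here is a minimal sketch of how the expected ETag for a local file could be computed under the boto3-default assumption. expected_s3_etag is a hypothetical name for illustration only and is not necessarily the same logic as the project's own helper:

```python
import hashlib
import os

# boto3's default multipart threshold / chunk size (8 MiB).
PART_SIZE = 8 * 1024 * 1024


def expected_s3_etag(path, part_size=PART_SIZE):
    """ETag that S3 would report for this file, assuming it was uploaded
    by boto3 with the default transfer settings (sketch only)."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        if size < part_size:
            # Single-request upload: the ETag is the plain MD5 of the content.
            return hashlib.md5(f.read()).hexdigest()
        # Multipart upload (even when there is exactly one 8 MiB part):
        # the ETag is the MD5 of the concatenated per-part MD5 digests,
        # followed by "-" and the number of parts.
        digests = []
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).digest())
        combined = hashlib.md5(b''.join(digests)).hexdigest()
        return '{}-{}'.format(combined, len(digests))
```

A client that uploaded with a different part size would produce a different ETag for the same content, which is why this check is only valid under the "uploaded by boto3 with default settings" assumption.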
When trying to upload a project to S3, some files may already exist and don't need to be re-uploaded. For example, the background task may be interrupted and retried, and in that case we don't want to waste time and bandwidth uploading the same files again.
upload_project_to_S3() should now check which files already exist in the destination bucket and avoid re-uploading those that haven't changed. We can test that this is the case by adding custom "object tagging": when an existing object is replaced by re-uploading, any tags on the existing object are erased. We also want to test that this works correctly for objects of 8 MiB and larger, because the ETag calculation is not trivial (see project.cloud.s3.s3_file_etag).
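To illustrate the skip-if-unchanged idea, here is a rough sketch against plain boto3 (not the actual upload_project_to_S3() code). The bucket and prefix names are placeholders, and expected_s3_etag is the hypothetical helper sketched above:

```python
import os
import boto3


def upload_dir_skipping_unchanged(local_dir, bucket, prefix):
    """Upload a directory tree to S3, skipping files whose size and ETag
    already match the existing object (illustration only).
    `prefix` is assumed to end with '/'."""
    s3 = boto3.client('s3')

    # Collect size and ETag for objects that already exist under the prefix.
    existing = {}
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            existing[obj['Key']] = (obj['Size'], obj['ETag'].strip('"'))

    for root, _, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = prefix + os.path.relpath(path, local_dir).replace(os.sep, '/')
            info = existing.get(key)
            if info is not None and info == (os.path.getsize(path),
                                             expected_s3_etag(path)):
                continue  # unchanged: no need to re-upload
            s3.upload_file(path, bucket, key)
```

The expensive ETag computation only happens for files that already have a matching key in the bucket, so a fresh upload pays essentially no extra cost.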
Force-pushed from 02dee05 to 675a409.
@Chrystinne can you please take a look at this? A good way to test it is:
You should see that the new file is added to the bucket, but the modification times of the existing files don't change.
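One lightweight way to compare modification times before and after re-running the task is to list each object's LastModified timestamp with boto3 (a sketch only; the bucket and prefix names below are placeholders):

```python
import boto3

# Print LastModified for every object under the project prefix.
# Run this before and after "send files to AWS" and diff the output.
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-test-bucket', Prefix='my-project/'):
    for obj in page.get('Contents', []):
        print(obj['LastModified'].isoformat(), obj['Key'])
```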
@bemoody I'm running the tests by following these steps:
[AWS console screenshot] However, this is what we should see instead: [AWS console screenshot] Also, the bucket named creden-20 is not being created on AWS S3, either inside controlled-data-dev-server as it should be or elsewhere.
The "Create AWS bucket and send files" button currently causes every file in the project to be uploaded. Instead, it's better to check whether the destination files already exist, and only upload the files that are necessary.
If the background task gets interrupted and restarted (e.g., because we deployed a new version to the server) then it can avoid repeating a lot of the work that was done previously.
With care, we should be able to upload a large database to S3 from a secondary server (faster than uploading from the live server); after that's finished, click the "send files" button and it will simply check that everything was transferred correctly.
There are a bunch of caveats here; this only partially addresses the issues I raised in #1903 and #2103, but this is a modest change that should significantly speed up the ongoing process of releasing data on S3.
Limitations to note:
- We still do not ever delete files from S3. If a file is removed from the published project, "send files to AWS" will not remove that file from the S3 bucket. If the deletion was intentional, the file will have to be removed from S3 (and GCS) manually. This is a long-standing issue with both GCS and S3.
- Listing files per subdirectory (using Delimiter) isn't as efficient as listing all the files at once; see the sketch below. This is a compromise to avoid downloading a potentially huge amount of data from S3 all at once and keeping it in memory. However, this could be improved by using sorted_tree_files instead of os.walk.
- And of course this only addresses S3, not GCS; it would be better to do this in a way that's portable across cloud providers.
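For reference, per-subdirectory listing with Delimiter might look roughly like the following sketch (list_s3_subdir is a hypothetical name, distinct from the project's list_s3_subdir_object_info(), which has its own return type):

```python
import boto3


def list_s3_subdir(bucket, prefix):
    """List the objects and the immediate sub-"directories" under a single
    prefix, one level at a time, rather than the whole bucket at once."""
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    objects, subdirs = {}, []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix,
                                   Delimiter='/'):
        for obj in page.get('Contents', []):
            objects[obj['Key']] = (obj['Size'], obj['ETag'].strip('"'))
        for cp in page.get('CommonPrefixes', []):
            subdirs.append(cp['Prefix'])
    return objects, subdirs
```

Walking the tree one prefix at a time keeps memory usage bounded to a single directory's worth of metadata, at the cost of one listing request per directory instead of one sweep over the whole bucket.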