-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
s3 transfer manager and performance improvements #109
Conversation
This replaces the naive multipart upload implementation with the new S3 transfer manager. Unfortunately, in order for the transfer manager to do multipart uploads, it must use the CRT client, but the CRT client does not currently allow you to specify path-style access, which is a requirement for some non-AWS S3 implementations. Hopefully they add support soon.
MultipartS3AsyncClient currently throws an unsupported operation exception if you attempt to download a file with it. However, it's needed if you want to use the transer manager with a non-CRT client. So, we annoyingly have to fish the delegate out instead.
1. Fixes a bug that can result in a file being incorrectly deleted from the staging directory. The bug is triggered by adding two files with identical content, and then adding the first file a second time, all with the same commit. 2. Adds a test for concurrently writing to a version.
Update the docs to describe how to configure the s3 client to work with 3rd party s3 implementations.
@pwinckles -- The documentation for this project is great! Thanks for including the instructions for using the CRT client with 3rd party S3 implementations. I noticed that the files still list the University of Wisconsin in the MIT License comments. Can the license comments on new/updated files be updated to list the OCFL organization instead? |
// var s3Client = S3AsyncClient.builder() | ||
// .region(Region.US_EAST_2) | ||
// .httpClientBuilder(NettyNioAsyncHttpClient.builder() | ||
// .connectionAcquisitionTimeout(Duration.ofSeconds(60)) | ||
// .writeTimeout(Duration.ofSeconds(0)) | ||
// .readTimeout(Duration.ofSeconds(0)) | ||
// .maxConcurrency(100)) | ||
// .build(); | ||
// var transferManager = S3TransferManager.builder() | ||
// .s3Client(MultipartS3AsyncClient.create( | ||
// s3Client, MultipartConfiguration.builder().build())) | ||
// .build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pwinckles -- Do you want to save this commented-out code or can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class I mostly use for experimentation, so, yeah, I had planned on keeping that code.
That's a tricky question. The project wasn't relicensed when it transferred over. I don't think keeping the same header is technically wrong. @awoods do you have an opinion on this matter? If you do want something else in the header, what would you like? |
@pwinckles : The OCFL community and associate Github organization has offered hosting and coordination of the 'ocfl-java' project. However, there is no legal entity related to the OCFL that could act as the license holder of the software. I am inclined to leave the licensing of 'ocfl-java' as it currently stands. |
@pwinckles : I agree, it's probably best to keep the MIT license as-is. Can you modify the Copyright statement for UW to |
Sure, I can do that in a separate PR |
@BeckyAndresen Can you confirm that these changes worked for you? |
@pwinckles -- These changes look good to me. I tested uploading, validating, and retrieving content from a test OCFL repository using the default transfer manager. Those operations were all successful when I built the CRT client with Thanks again for figuring out that not all S3 implementations support the object integrity checks and for working on these improvements! |
Great! Thank you for trying it! |
This is a big PR that rolls together a bunch of changes a I had been working on that got gummed up with the s3 transfer manager compatibility debacle. Included here:
Bug Fixes
the staged file to be erroneously deleted.
Changes
S3AsyncClient
S3 client now must be used with ocfl-java-aws, and the sync version is no longer supported.to upload and download files from S3.
OcflObjectUpdater
was updated to be thread safe, enabling concurrently writing files to it. This may speed upwriting a large number of files to an object.
Refer to the sections I added to the usage guide for more details about the S3 transfer manager and concurrent object updates. Note that special care is required to use the transfer manager with a third party S3 implementation. Also, concurrent object writes are not necessarily faster if you're using traditional threads, though this may not hold if you use virtual threads.
I decided to include details about how to use the Netty client, because I am a bit dubious of the CRT client in its current state. I was seeing some shockingly bad performance from it in the integration tests, but that could be circumstantial. Use the client that works best for you.I just some additional comparison of testing the CRT client vs the Netty client against AWS, and the CRT client does seem to have better performance after all, and it's easier to configure. I'm still leaving in the docs for the Netty client in case someone wants to use it for some reason.
I did a quick performance test from an ec2 host writing to s3. The test wrote ocfl objects containing 100 256kb files, 10 1mb files, and 1 100mb file. Here's the summary of the times to write the objects in milliseconds.