Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3 transfer manager and performance improvements #109

Merged
merged 22 commits into from
Apr 9, 2024

Conversation

pwinckles
Copy link
Collaborator

@pwinckles pwinckles commented Mar 3, 2024

This is a big PR that rolls together a bunch of changes a I had been working on that got gummed up with the s3 transfer manager compatibility debacle. Included here:

Bug Fixes

  • Deleting an object in S3 that contains more than 1,000 files now works.
  • Writing to files with identical content and writing the first file a second time to the same version no longer causes
    the staged file to be erroneously deleted.

Changes

  • Breaking: A S3AsyncClient S3 client now must be used with ocfl-java-aws, and the sync version is no longer supported.
  • ocfl-java-aws now uses the S3 Transfer Manager
    to upload and download files from S3.
  • ocfl-java-aws now concurrently uploads files when writing an object to S3. This should improve object write performance.
  • The OcflObjectUpdater was updated to be thread safe, enabling concurrently writing files to it. This may speed up
    writing a large number of files to an object.

Refer to the sections I added to the usage guide for more details about the S3 transfer manager and concurrent object updates. Note that special care is required to use the transfer manager with a third party S3 implementation. Also, concurrent object writes are not necessarily faster if you're using traditional threads, though this may not hold if you use virtual threads.

I decided to include details about how to use the Netty client, because I am a bit dubious of the CRT client in its current state. I was seeing some shockingly bad performance from it in the integration tests, but that could be circumstantial. Use the client that works best for you.

I just some additional comparison of testing the CRT client vs the Netty client against AWS, and the CRT client does seem to have better performance after all, and it's easier to configure. I'm still leaving in the docs for the Netty client in case someone wants to use it for some reason.

I did a quick performance test from an ec2 host writing to s3. The test wrote ocfl objects containing 100 256kb files, 10 1mb files, and 1 100mb file. Here's the summary of the times to write the objects in milliseconds.

ocfl-java version Min Mean Max
2.0.1 12,104 14,673 19,126
This PR 4,047 9,347 13,572

pwinckles added 14 commits March 1, 2024 20:08
This replaces the naive multipart upload implementation with the new
S3 transfer manager. Unfortunately, in order for the transfer manager
to do multipart uploads, it must use the CRT client, but the CRT
client does not currently allow you to specify path-style access,
which is a requirement for some non-AWS S3 implementations. Hopefully
they add support soon.
MultipartS3AsyncClient currently throws an unsupported operation
exception if you attempt to download a file with it. However, it's
needed if you want to use the transer manager with a non-CRT client.
So, we annoyingly have to fish the delegate out instead.
1. Fixes a bug that can result in a file being incorrectly deleted
from the staging directory. The bug is triggered by adding two files
with identical content, and then adding the first file a second time,
all with the same commit.
2. Adds a test for concurrently writing to a version.
Update the docs to describe how to configure the s3 client to work
with 3rd party s3 implementations.
@pwinckles pwinckles marked this pull request as ready for review March 21, 2024 03:24
@BeckyAndresen
Copy link

@pwinckles -- The documentation for this project is great! Thanks for including the instructions for using the CRT client with 3rd party S3 implementations.

I noticed that the files still list the University of Wisconsin in the MIT License comments. Can the license comments on new/updated files be updated to list the OCFL organization instead?

Comment on lines +386 to +397
// var s3Client = S3AsyncClient.builder()
// .region(Region.US_EAST_2)
// .httpClientBuilder(NettyNioAsyncHttpClient.builder()
// .connectionAcquisitionTimeout(Duration.ofSeconds(60))
// .writeTimeout(Duration.ofSeconds(0))
// .readTimeout(Duration.ofSeconds(0))
// .maxConcurrency(100))
// .build();
// var transferManager = S3TransferManager.builder()
// .s3Client(MultipartS3AsyncClient.create(
// s3Client, MultipartConfiguration.builder().build()))
// .build();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwinckles -- Do you want to save this commented-out code or can it be removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class I mostly use for experimentation, so, yeah, I had planned on keeping that code.

@pwinckles
Copy link
Collaborator Author

pwinckles commented Apr 8, 2024

I noticed that the files still list the University of Wisconsin in the MIT License comments. Can the license comments on new/updated files be updated to list the OCFL organization instead?

That's a tricky question. The project wasn't relicensed when it transferred over. I don't think keeping the same header is technically wrong.

@awoods do you have an opinion on this matter? If you do want something else in the header, what would you like?

@awoods
Copy link
Member

awoods commented Apr 8, 2024

@pwinckles : The OCFL community and associate Github organization has offered hosting and coordination of the 'ocfl-java' project. However, there is no legal entity related to the OCFL that could act as the license holder of the software. I am inclined to leave the licensing of 'ocfl-java' as it currently stands.

@sprater
Copy link
Contributor

sprater commented Apr 9, 2024

@pwinckles : I agree, it's probably best to keep the MIT license as-is. Can you modify the Copyright statement for UW to
Copyright (c) 2019-2021 University of Wisconsin Board of Regents on all our copyright declarations? If you want to add your own copyright for changes you made after 2021, that's fine, though not necessary, as I understand it.

@pwinckles
Copy link
Collaborator Author

@pwinckles : I agree, it's probably best to keep the MIT license as-is. Can you modify the Copyright statement for UW to Copyright (c) 2019-2021 University of Wisconsin Board of Regents on all our copyright declarations? If you want to add your own copyright for changes you made after 2021, that's fine, though not necessary, as I understand it.

Sure, I can do that in a separate PR

@pwinckles
Copy link
Collaborator Author

@BeckyAndresen Can you confirm that these changes worked for you?

@BeckyAndresen
Copy link

@pwinckles -- These changes look good to me. I tested uploading, validating, and retrieving content from a test OCFL repository using the default transfer manager. Those operations were all successful when I built the CRT client with .checksumValidationEnabled(false) as you specified in the usage guide. I did not manually test all of the changes in this PR, but I see that you updated and added a lot of tests.

Thanks again for figuring out that not all S3 implementations support the object integrity checks and for working on these improvements!

@pwinckles
Copy link
Collaborator Author

@pwinckles -- These changes look good to me. I tested uploading, validating, and retrieving content from a test OCFL repository using the default transfer manager. Those operations were all successful when I built the CRT client with .checksumValidationEnabled(false) as you specified in the usage guide. I did not manually test all of the changes in this PR, but I see that you updated and added a lot of tests.

Thanks again for figuring out that not all S3 implementations support the object integrity checks and for working on these improvements!

Great! Thank you for trying it!

@pwinckles pwinckles merged commit b6e1cd3 into OCFL:main Apr 9, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants