allow use of S3 single part uploads #400
Conversation
Hi, sorry it has taken me so long to get around to this. I think the motivation behind your PR is good: single-part uploads make sense under some conditions. The execution of the idea could be a little better, though.

First, we should leave it up to the user to decide how small is too small for a multipart upload, instead of hardcoding a threshold ourselves. The call to smart_open would then look like this instead:

with open(..., 'w', transport_params={'multipart_upload': False}) as fout:
    ...

On the implementation side, I see you've gone with conditionals inside the existing class. This is too difficult to maintain, as it complicates the code unnecessarily. I suggest you implement a separate class to take care of single-part uploading, entirely separate from the existing class. This class would be extremely simple, because you'd load the whole single part into memory and wouldn't need much of the logic of the existing class.

The above approach would kill two birds with one stone: the user decides when a single-part upload is appropriate, and the existing multipart code stays untouched.

Let me know if you're up to it.
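A minimal sketch of what such a stand-alone single-part writer could look like, assuming boto3 as the S3 client (the class name and layout below are illustrative, not smart_open's actual implementation):

```python
import io

import boto3


class SinglePartWriter(io.BufferedIOBase):
    """Buffers all written bytes in memory, then uploads them to S3
    with a single PUT when the writer is closed."""

    def __init__(self, bucket, key):
        self._client = boto3.client('s3')
        self._bucket = bucket
        self._key = key
        self._buf = io.BytesIO()

    def writable(self):
        return True

    def write(self, b):
        # Everything stays in memory until close() -- the whole point
        # of skipping the multipart machinery.
        return self._buf.write(b)

    def close(self):
        if not self.closed:
            # One put_object call replaces the multipart
            # create/upload/complete sequence used for large objects.
            self._client.put_object(
                Bucket=self._bucket,
                Key=self._key,
                Body=self._buf.getvalue(),
            )
            self._buf.close()
        super().close()
```

Because the whole object is held in memory, such a writer is only suitable for payloads the caller knows to be small.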
Hi, thanks for looking at the PR. I will redesign the whole approach and use a separate class as you proposed, no problem. The current version was the simplest solution for the use case we had, and I was not sure whether you would agree to have the feature at all, so I just wanted to get your feedback and guidance first :) I am up for this and hope to have the updated PR ready soon.
(Branch force-pushed from 17659e5 to 4214657, then from 4214657 to 34feb3f.)
I split the single-part upload into its own class and hope I covered it fully with tests. I am awaiting your comments, and I hope this is how you were thinking it should be implemented.
Looks good. Left you some minor comments.
Please fix flake8 warnings in your code @adrpar
(Branch force-pushed from 78b5a5e to 741b047.)
Rebased and addressed your comments @mpenkov.
OK, I reviewed your changes and left you some comments. Please have a look and let me know if you have questions.
Thank you for the effort @adrpar. I'm merging this.
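For reference, current smart_open releases expose this feature through transport_params, roughly as follows (the bucket and key names are placeholders, and exact behavior may vary across versions):

```python
# Opting into a single-part upload via smart_open's documented
# transport_params for S3.
from smart_open import open

with open('s3://my-bucket/small-file.txt', 'w',
          transport_params={'multipart_upload': False}) as fout:
    fout.write('hello world')
```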
Original pull request description:

I had encountered a use case where I needed to upload many different files into S3 and wanted to use smart_open. The files were a large set of small files plus some sporadic large ones.

Writing all the files to my S3 bucket proved slow: smart_open is meant for large files and therefore uses multipart uploads, which resulted in significant overhead for the small ones.

I am a huge fan of the smart_open interface and its ease of use, and I thought about how such a use case could be supported in smart_open.

I fixed my issue by introducing an optional parameter hinting at the size of the input stream (often this is known) and then added the possibility for the S3 writer to use a direct upload where better suited (a sketch of this idea appears after this description).

This PR is far from done (there is no handling of what happens if the input stream proves larger than claimed); I just wanted to propose such a feature and start a discussion. Further, there are many different ways smart_open could be guided to use a direct upload, and maybe this is not wanted at all...

Therefore, thanks for having a look at this; let's start a discussion, and I am happy to adjust my implementation :)
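For illustration only, the size-hint idea from this original proposal might have been invoked along these lines (the parameter name stream_size_hint is made up here; the PR was ultimately redesigned around a separate single-part writer class instead):

```python
# Hypothetical sketch -- 'stream_size_hint' is not a real smart_open
# parameter; it stands in for the size hint the PR originally proposed.
from smart_open import open

data = b'a small payload whose size is known up front'

# Hinting at the stream size up front would let the S3 writer choose a
# single PUT over a multipart upload for small objects.
with open('s3://my-bucket/tiny.bin', 'wb',
          transport_params={'stream_size_hint': len(data)}) as fout:
    fout.write(data)
```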