
allow use of S3 single part uploads #400

Merged (7 commits) Mar 15, 2020

Conversation

@adrpar (Contributor) commented Dec 8, 2019

I encountered a use case where I needed to upload many different files to S3 and wanted to use smart_open. The files were a large set of small files, plus some sporadic large ones.

Writing all the files to my S3 bucket proved slow: smart_open is designed for large files and therefore uses multipart upload, which added significant overhead per file.

I am a huge fan of the smart_open interface and its ease of use, and thought about how such a use case could be supported in smart_open.

I fixed my issue by introducing an optional parameter hinting at the size of the input stream (often this is known) and then added the possibility for the S3 writer to use a direct (single-part) upload when better suited.

This PR is far from done (there is no handling of what happens if the input stream proves larger than claimed); I mainly want to propose the feature and start a discussion. There are also many different ways smart_open could be guided to use direct upload. Maybe this is not even wanted...

So, thanks for having a look, let's start a discussion, and I am happy to adjust my implementation :)

@mpenkov (Collaborator) commented Jan 9, 2020

Hi, sorry it has taken me so long to get around to this.

I think the motivation behind your PR is good. Single-part uploads make sense under some conditions.

I think the execution of the idea could be a little better. First, we should leave it up to the user to decide how small is too small for a multipart upload, instead of hardcoding a threshold ourselves. So a caller who wants a single-part upload would write something like this instead:

with open(..., 'w', transport_params={'multipart_upload': False}) as fout:
    ...

On the implementation side, I see you've gone with conditionals inside the existing class. This is a bit too difficult to maintain, as it complicates the code unnecessarily.

I suggest you implement a separate class to take care of single-part uploading. This can be entirely separate from the existing class, and would be extremely simple: because you'd be loading the whole single part into memory, you wouldn't need much of the logic of the existing class.
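A minimal sketch of what such a class could look like. Note this is illustrative only: the class name, constructor signature, and internals are assumptions, not smart_open's actual code. The `client` argument is assumed to be a boto3 S3 client (or anything exposing a compatible `put_object` method), which also makes the sketch easy to exercise with a fake client.

```python
import io


class SinglepartWriter(io.RawIOBase):
    """Buffer the whole object in memory, then upload it with a single
    PutObject call on close (instead of multipart create/upload/complete).

    Hypothetical sketch; names and signature are not smart_open's real ones.
    """

    def __init__(self, bucket, key, client):
        self._bucket = bucket
        self._key = key
        self._client = client
        self._buf = io.BytesIO()  # the entire object lives here until close

    def writable(self):
        return True

    def write(self, b):
        # Just accumulate bytes; no part bookkeeping is needed.
        return self._buf.write(b)

    def close(self):
        if not self.closed:
            # One round trip, versus three or more for a multipart upload.
            self._client.put_object(
                Bucket=self._bucket,
                Key=self._key,
                Body=self._buf.getvalue(),
            )
        super().close()
```

Because the client is injected, the class can be tested without touching S3 by passing a stub object that records `put_object` calls.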

The above approach would kill two birds with one stone:

  1. Keep the existing, multi-part class as-is.
  2. Satisfy your use case.

Let me know if you're up to it.

@adrpar (Contributor, Author) commented Jan 10, 2020

Hi, thanks for looking at the PR.

I will redesign the whole approach and use a separate class as you proposed, no problem. The current version was the simplest solution to the use case we had, and I was not sure whether you would agree to having the feature, so I wanted to get your feedback and guidance first :)

I would then have the multipart_upload transport parameter default to True, and if it is set to False, use the single-part upload class.
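The dispatch described above could look roughly like this. All names here are placeholders for illustration, not smart_open's actual classes or functions:

```python
class MultipartWriter:
    """Placeholder standing in for the existing multipart writer."""


class SinglepartWriter:
    """Placeholder standing in for the proposed single-part writer."""


def open_writer(bucket, key, multipart_upload=True):
    # Defaulting to True preserves the current behaviour; passing
    # multipart_upload=False opts into the simpler single-part path.
    cls = MultipartWriter if multipart_upload else SinglepartWriter
    return cls()
```

Keeping the selection in one place like this leaves the existing multipart class untouched, as requested in the review.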

I am up for this and hope to have the updated PR ready soon.

@adrpar force-pushed the use_direct_upload_for_small branch from 17659e5 to 4214657 on January 20, 2020 15:20
@adrpar changed the title from "hint for stream size and if small, use direct upload to S3" to "allow use of S3 single part uploads" on Jan 20, 2020
@adrpar force-pushed the use_direct_upload_for_small branch from 4214657 to 34feb3f on January 20, 2020 15:28
@adrpar (Contributor, Author) commented Jan 20, 2020

I split the single-part upload into its own class and hope I covered it fully with tests. I await your comments, and hope that's how you were thinking this should be implemented.

@mpenkov (Collaborator) left a review

Looks good. Left you some minor comments.

smart_open/s3.py: 5 review threads (outdated, resolved)
@mpenkov (Collaborator) commented Jan 23, 2020

Please fix flake8 warnings in your code @adrpar

@adrpar force-pushed the use_direct_upload_for_small branch from 78b5a5e to 741b047 on February 2, 2020 19:43
@adrpar (Contributor, Author) commented Feb 2, 2020

Rebased and addressed your comments @mpenkov.

@mpenkov (Collaborator) left a review

OK, I reviewed your changes and left you some comments. Please have a look and let me know if you have questions.

smart_open/s3.py: 3 review threads (resolved; 2 outdated)
@mpenkov added this to the 1.10.0 milestone on Mar 8, 2020
@mpenkov (Collaborator) commented Mar 15, 2020

Thank you for the effort @adrpar . I'm merging this.

@mpenkov merged commit 4a39fde into piskvorky:master on Mar 15, 2020