Optimize reading from S3 #322

Merged: mpenkov merged 3 commits into piskvorky:master from the optimize branch on May 29, 2019

Conversation

mpenkov (Collaborator) commented May 28, 2019

Fixes #317. Looks like content_length wasn't being cached in boto3.

With the updated code:

import sys

import boto3
import smart_open


def old_code(bucket, key, connection_params):
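    """Baseline: read the first 100 bytes of the object directly with boto3."""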
    s3 = boto3.resource('s3', **connection_params)
    obj = s3.Object(bucket, key)
    streamed_body = obj.get(Range='bytes=0-')['Body']
    return streamed_body.read(100)


def new_code(bucket, key, connection_params):
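    """Read the first 100 bytes via smart_open, passing a boto3 session through transport_params."""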
    f = smart_open.open(
        f's3://{bucket}/{key}', mode='rb',
        transport_params={'session': boto3.Session(**connection_params)},
    )
    return f.read(100)


def main():
    function_name, bucket, key = sys.argv[1:4]
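    # Look up old_code or new_code by name, as given on the command line.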
    function = globals()[function_name]

    connection_params = {}
    data = function(bucket, key, connection_params)
    print(len(data))


if __name__ == '__main__':
    main()
(smart_open) misha@cabron:~/git/smart_open$ time python bug.py old_code bucket key
100

real    0m1.050s
user    0m0.274s
sys     0m0.041s
(smart_open) misha@cabron:~/git/smart_open$ time python bug.py new_code bucket key
100

real    0m1.275s
user    0m0.354s
sys     0m0.029s

We're now around 1.3 times slower than using boto3 directly. I suspect this is a start-up cost only, so it will amortize as we read larger files.
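
If we want to sanity-check that amortization claim, a rough benchmark along these lines would do. This is a sketch only, not part of this PR; bucket, key and the 100 MB read size are placeholders.

import time

import boto3
import smart_open


def time_large_read(bucket, key, nbytes=100 * 1024 * 1024):
    # Read a large slice of the object directly through boto3.
    start = time.perf_counter()
    body = boto3.resource('s3').Object(bucket, key).get()['Body']
    body.read(nbytes)
    boto3_seconds = time.perf_counter() - start

    # Read the same slice through smart_open.
    start = time.perf_counter()
    with smart_open.open(f's3://{bucket}/{key}', mode='rb') as fin:
        fin.read(nbytes)
    smart_open_seconds = time.perf_counter() - start

    print(f'boto3: {boto3_seconds:.2f}s, smart_open: {smart_open_seconds:.2f}s')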

mpenkov requested a review from piskvorky on May 28, 2019 at 12:59
We raise ValueError for backwards compatibility; this should really be IOError in the future.
piskvorky (Owner) left a comment

Thanks, so the issue was due to multiple remote HEAD calls for content_length, right?

IMO 30% overhead is acceptable (plus there seems to be some ahead-of-time buffering in smart_open, which may be related too?).
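
For illustration only, the pattern being discussed looks roughly like the sketch below. This is not smart_open's actual code; the class and the bucket/key handling are made up. The idea is to pay for the metadata round trip once and reuse the cached value, instead of issuing a fresh HEAD request every time the length is needed.

import boto3


class CachedLengthExample:
    """Sketch: fetch the object's content length once and reuse it."""

    def __init__(self, bucket, key):
        self._client = boto3.client('s3')
        self._bucket = bucket
        self._key = key
        self._content_length = None

    @property
    def content_length(self):
        if self._content_length is None:
            # head_object is a single metadata round trip to S3.
            response = self._client.head_object(Bucket=self._bucket, Key=self._key)
            self._content_length = response['ContentLength']
        return self._content_length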

@@ -386,6 +386,8 @@ def test_nonexisting_bucket(self):
             fout.write(expected)
 
     def test_read_nonexisting_key(self):
+        create_bucket_and_key()
piskvorky (Owner):

What was happening before this change? Testing for a missing bucket?

mpenkov (Collaborator, Author):

Nonexisting key in an existing bucket.

The test started failing once we moved the content_length calls around. It's probably tripping over some moto oddity.
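
For context, here is a hedged sketch of what a moto-backed test along these lines can look like. It is not the repository's actual helper or test; the names are made up, and the ValueError expectation follows the backwards-compatibility note above.

import boto3
import moto
import pytest

import smart_open


def create_bucket_and_key(bucket='mybucket', key='mykey'):
    # Create the bucket and return a handle to a key that is never written.
    s3 = boto3.resource('s3')
    s3.create_bucket(Bucket=bucket)
    return s3.Object(bucket, key)


@moto.mock_s3
def test_read_nonexisting_key():
    create_bucket_and_key()
    # The key does not exist in the (existing) bucket, so reading should fail.
    with pytest.raises(ValueError):
        with smart_open.open('s3://mybucket/no-such-key', mode='rb') as fin:
            fin.read()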

mpenkov merged commit 635f539 into piskvorky:master on May 29, 2019
mpenkov deleted the optimize branch on May 29, 2019 at 14:18
piskvorky (Owner) commented May 30, 2019

@mpenkov our GitHub repo badge is now red, "Build failing":

[Screenshot: repo badge showing "Build failing", 2019-05-30 14:40]

Traceback (most recent call last):
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/tests/test_s3.py", line 385, in test_read_nonexisting_key
    create_bucket_and_key()
NameError: global name 'create_bucket_and_key' is not defined

Some issue with the modified test?

mpenkov (Collaborator, Author) commented May 31, 2019

It may be a merge artifact. I'll have a look.

mpenkov mentioned this pull request on May 31, 2019
mpenkov added a commit that referenced this pull request May 31, 2019
Two separate PRs touched the same code. [1] moved the bucket handling
so that it happens at module load time. Meanwhile, [2] fixed a broken
test by relying on functionality that was no longer there once [1] got
merged.

[1] #318
[2] #322

Successfully merging this pull request may close these issues: Opening an S3 URI significantly slower than boto3 (#317)