Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues #152, #153 and #154 #155

Merged
merged 13 commits into from
Dec 6, 2017
Merged

Fix issues #152, #153 and #154 #155

merged 13 commits into from
Dec 6, 2017

Conversation

mpenkov
Copy link
Collaborator

@mpenkov mpenkov commented Dec 2, 2017

We didn't have a readline in our S3 reader, and that was slowing us down.

Here is a breakdown of the contents:

Looks like moto doesn't work with boto anymore.
Works fine with boto3.
The existing BufferedInputBase didn't override the .readline() method,
forcing the superclass implementation to use .read() to read one byte at
a time. This slowed reading down significantly.

Also increased the buffer size to 256kB, this is consistent with
s3transfer.

http://boto3.readthedocs.io/en/latest/_modules/boto3/s3/transfer.html
smart_open/s3.py Outdated
@@ -28,6 +28,10 @@
MODES = (READ, READ_BINARY, WRITE, WRITE_BINARY)
"""Allowed I/O modes for working with S3."""

BINARY_NEWLINE = b'\n'
TEXT_NEWLINE = b'\n'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are binary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. We're not using TEXT_NEWLINE right now, so I removed it.

@@ -9,7 +9,7 @@
else:
import unittest

import boto
import boto3
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a big change; does it belong here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky It's not really that big a change. This is test code, and it writes a mock object to a mock S3 bucket. The code for doing this with boto and boto3 is slightly different, but the end result is the same (the tests still pass without changing the code). If you look at the remainder of the changes in test_s3.py, you'll see the tests aren't strongly coupled to either boto or boto3.

The real benefit to using boto3 in tests is that it matches the implementation: our S3 implementation uses boto3 under the covers. Also, using boto3 is future-proof: newer versions of moto mock boto and boto3 separately. That is, objects mocked via boto are not visible to boto3 and vice versa. This means that if we upgrade to the most recent moto version (unlike the year-old version currently used in the test environment), our boto-based tests will break.

This change solves that problem before it happens.

When writing, check if the bucket exists and raise a ValueError if it
doesn't.
This is no longer necessary since we dropped support for Python 2.6.
The GzipFile from Py2.7 and above is already a context manager, so the
closing is not required.

https://docs.python.org/2/library/gzip.html
@mpenkov mpenkov changed the title Fix issue #152 Fix issue #152, #153 and #154 Dec 2, 2017
@mpenkov mpenkov changed the title Fix issue #152, #153 and #154 Fix issues #152, #153 and #154 Dec 2, 2017
If we don't specify it, smart_open will use the system encoding.  This
may be ascii on some systems (e.g. Py2.7), which will fail the test
because it contains non-ascii characters.
Renamed RawReader to SeekableRawReader.
Renamed BufferedInputBase to SeekableBufferedInputBase.
Introduced new, non-seekable RawReader and BufferedInputBase.
Seeking functionality was strictly necessary while we were supporting
Py2.6, because the gzip reader required it back then.  The gzip reader
from 2.7 onwards does not require seeking, so neither do we.

Seeking is convenient, but appears to be slower, so disabling it for now
is the right thing to do.
It appears Py2.7 gzip still requires seeking.  Py3 gzip does not.
We're still supporting Py2.7, so we need seeking to work if we continue
to use the existing gzip reader.
@menshikh-iv
Copy link
Contributor

Great fixes, thanks a lot @mpenkov 🥇

@menshikh-iv menshikh-iv merged commit 170e295 into piskvorky:master Dec 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't raise an exception on failure to CreateBucket Unable to iterate over gzipped object on S3
3 participants