Fix issues #152, #153 and #154 #155

mpenkov · 2017-12-02T07:14:49Z

We didn't have a readline in our S3 reader, and that was slowing us down.

Here is a breakdown of the contents:

448192e, fa7c183, f7a8c56: address "Reading S3 files becomes slow after 1.5.4" (#152) by adding a .readline method. This does not solve the problem entirely: the regression still exists, but it is significantly smaller than before.
6f8f7b8: responding to feedback
d453af2 and 084f43e: avoid creating buckets if they do not exist (resolves "Don't raise an exception on failure to CreateBucket", #154)
8017985, a3f9fe1, 2188b2b: added handling for the errors keyword argument
6c22ed8, 3e93fb1, 14588a5: refactored the readers so that seek functionality can be disabled. Disabling seek improves speed under Py3, but breaks under Py2.7 because the gzip reader needs seek under that Python version. The last two commits re-enable seek, but keep the refactoring, in case we want to do more with it later.
2c79505: fix "Unable to iterate over gzipped object on S3" (#153)


The existing BufferedInputBase didn't override the .readline() method, forcing the superclass implementation to use .read() to read one byte at a time. This slowed reading down significantly. Also increased the buffer size to 256kB, which is consistent with s3transfer: http://boto3.readthedocs.io/en/latest/_modules/boto3/s3/transfer.html
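To illustrate why a dedicated .readline() helps, here is a minimal sketch of a reader that fills a large buffer and slices lines out of it, instead of calling .read() once per byte. It is not the code from this PR: the class name BufferedLineReader, the raw_reads counter and the in-memory io.BytesIO source are all hypothetical stand-ins for the real S3 reader.

```python
import io

BINARY_NEWLINE = b'\n'
DEFAULT_BUFFER_SIZE = 256 * 1024  # 256kB, the buffer size mentioned above


class BufferedLineReader(object):
    """Hypothetical reader that serves lines from an in-memory buffer."""

    def __init__(self, raw, buffer_size=DEFAULT_BUFFER_SIZE):
        self._raw = raw  # any object with a .read(size) method
        self._buffer_size = buffer_size
        self._buffer = b''
        self._eof = False
        self.raw_reads = 0  # counts calls to raw.read, for illustration only

    def _fill_buffer(self):
        # Fetch one large chunk rather than a single byte.
        chunk = self._raw.read(self._buffer_size)
        self.raw_reads += 1
        if chunk:
            self._buffer += chunk
        else:
            self._eof = True

    def readline(self):
        """Return the next line, including the trailing newline if present."""
        while BINARY_NEWLINE not in self._buffer and not self._eof:
            self._fill_buffer()
        newline_index = self._buffer.find(BINARY_NEWLINE)
        if newline_index == -1:
            # No newline left: return whatever remains (empty at EOF).
            line, self._buffer = self._buffer, b''
        else:
            end = newline_index + len(BINARY_NEWLINE)
            line, self._buffer = self._buffer[:end], self._buffer[end:]
        return line


reader = BufferedLineReader(io.BytesIO(b'first\nsecond\nthird'))
lines = [reader.readline() for _ in range(4)]
```

In this sketch the three lines are served with only two calls to the underlying .read(): one that fetches the data and one that detects the end of the stream.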

piskvorky · 2017-12-02T09:18:06Z

smart_open/s3.py

@@ -28,6 +28,10 @@
 MODES = (READ, READ_BINARY, WRITE, WRITE_BINARY)
 """Allowed I/O modes for working with S3."""

+BINARY_NEWLINE = b'\n'
+TEXT_NEWLINE = b'\n'


Both are binary?

You're right. We're not using TEXT_NEWLINE right now, so I removed it.

piskvorky · 2017-12-02T09:19:18Z

smart_open/tests/test_s3.py

@@ -9,7 +9,7 @@
 else:
    import unittest

-import boto
+import boto3


This seems to be a big change; does it belong here?

@piskvorky It's not really that big a change. This is test code, and it writes a mock object to a mock S3 bucket. The code for doing this with boto and boto3 is slightly different, but the end result is the same (the tests still pass without changing the code). If you look at the remainder of the changes in test_s3.py, you'll see the tests aren't strongly coupled to either boto or boto3.

The real benefit to using boto3 in tests is that it matches the implementation: our S3 implementation uses boto3 under the covers. Also, using boto3 is future-proof: newer versions of moto mock boto and boto3 separately. That is, objects mocked via boto are not visible to boto3 and vice versa. This means that if we upgrade to the most recent moto version (unlike the year-old version currently used in the test environment), our boto-based tests will break.

This change solves that problem before it happens.


Renamed RawReader to SeekableRawReader and BufferedInputBase to SeekableBufferedInputBase, then introduced new, non-seekable RawReader and BufferedInputBase classes. Seeking functionality was strictly necessary while we were supporting Py2.6, because the gzip reader required it back then. The gzip reader from 2.7 onwards does not require seeking, so neither do we. Seeking is convenient, but appears to be slower, so disabling it for now is the right thing to do.
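The split between seekable and non-seekable readers could look roughly like the sketch below. Only the class names RawReader and SeekableRawReader come from this PR; the bodies are assumptions, and they read from an in-memory payload rather than from S3.

```python
import io


class RawReader(io.RawIOBase):
    """Forward-only reader over an in-memory payload (illustrative only)."""

    def __init__(self, payload):
        super(RawReader, self).__init__()
        self._payload = payload
        self._position = 0

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        # Copy as many bytes as fit into the caller's buffer.
        chunk = self._payload[self._position:self._position + len(b)]
        b[:len(chunk)] = chunk
        self._position += len(chunk)
        return len(chunk)


class SeekableRawReader(RawReader):
    """Same reader with seek and tell layered on top."""

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_position = offset
        elif whence == io.SEEK_CUR:
            new_position = self._position + offset
        elif whence == io.SEEK_END:
            new_position = len(self._payload) + offset
        else:
            raise ValueError('invalid whence: %r' % whence)
        # Clamp to the valid range of the payload.
        self._position = max(0, min(new_position, len(self._payload)))
        return self._position

    def tell(self):
        return self._position


forward_only = RawReader(b'0123456789')
head = forward_only.read(4)
try:
    forward_only.seek(0)
    seek_failed = False
except io.UnsupportedOperation:
    # io.IOBase.seek raises this when a subclass does not override it.
    seek_failed = True

seekable = SeekableRawReader(b'0123456789')
seekable.read(4)
seekable.seek(2)
middle = seekable.read(3)
```

Keeping seek in a subclass means a caller that needs it, such as the Py2.7 gzip reader, can ask for the seekable variant, while other callers keep the simpler forward-only reader.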


menshikh-iv · 2017-12-06T09:10:30Z

Great fixes, thanks a lot @mpenkov 🥇

mpenkov added 3 commits December 2, 2017 12:37

use boto3 instead of boto in s3 tests

448192e

Looks like moto doesn't work with boto anymore. Works fine with boto3.

rewrite try..except as an if-else, it is faster that way

f7a8c56

piskvorky reviewed Dec 2, 2017


mpenkov added 4 commits December 3, 2017 00:47

get rid of unused and incorrect TEXT_NEWLINE

6f8f7b8

Resolve Issue #154: don't create buckets if they don't exist

d453af2

When writing, check if the bucket exists and raise a ValueError if it doesn't.
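A minimal sketch of such a check is shown below. It assumes an object that behaves like boto3.resource('s3'), that is, one exposing buckets.all() whose items have a .name attribute. The function name check_bucket_exists and the Fake* stand-ins are hypothetical, and exist only so the example runs without AWS credentials.

```python
from collections import namedtuple


def check_bucket_exists(s3_resource, bucket_name):
    """Raise ValueError unless bucket_name is among the existing buckets.

    s3_resource is assumed to behave like boto3.resource('s3').
    """
    existing = set(bucket.name for bucket in s3_resource.buckets.all())
    if bucket_name not in existing:
        raise ValueError('the bucket %r does not exist' % bucket_name)


# Stand-ins for the boto3 resource, for illustration only.
FakeBucket = namedtuple('FakeBucket', ['name'])


class FakeBucketCollection(object):
    def __init__(self, names):
        self._names = names

    def all(self):
        return [FakeBucket(name) for name in self._names]


class FakeS3Resource(object):
    def __init__(self, names):
        self.buckets = FakeBucketCollection(names)


s3 = FakeS3Resource(['existing-bucket'])
check_bucket_exists(s3, 'existing-bucket')  # passes silently

try:
    check_bucket_exists(s3, 'missing-bucket')
    raised = False
except ValueError:
    raised = True
```

Raising early keeps the writer from creating a bucket as a side effect of a typo in the bucket name.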

fixup for d453af2: create the correct bucket in top-level unit tests

084f43e

Resolve Issue #153: don't wrap GzipFile in contextlib.closing

2c79505

This is no longer necessary since we dropped support for Python 2.6. The GzipFile from Py2.7 and above is already a context manager, so the closing is not required. https://docs.python.org/2/library/gzip.html
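As a quick illustration using only the standard library, GzipFile can be used directly in a with statement, with no contextlib.closing wrapper. The in-memory io.BytesIO buffer here stands in for the S3 object and is an assumption of this sketch.

```python
import gzip
import io

payload = b'line one\nline two\n'

# Compress into an in-memory buffer; GzipFile is itself a context manager.
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode='wb') as fout:
    fout.write(payload)

# Rewind and iterate over the decompressed lines.
compressed.seek(0)
with gzip.GzipFile(fileobj=compressed, mode='rb') as fin:
    lines = [line for line in fin]
```

Leaving the with block closes the GzipFile, but not the underlying fileobj, which the caller still owns.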

mpenkov changed the title ~~Fix issue #152~~ Fix issue #152, #153 and #154 Dec 2, 2017

mpenkov changed the title ~~Fix issue #152, #153 and #154~~ Fix issues #152, #153 and #154 Dec 2, 2017

mpenkov added 6 commits December 3, 2017 16:59

Support errors keyword

8017985

add some integration tests, focusing on S3 only for now

a3f9fe1

Specify utf-8 encoding explicitly in tests

2188b2b

If we don't specify it, smart_open will use the system encoding. This may be ascii on some systems (e.g. Py2.7), which will fail the test because it contains non-ascii characters.
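The failure mode can be reproduced with the standard library alone. The sketch below uses io.TextIOWrapper as a stand-in for smart_open's text layer (an assumption, not the library's actual code path) and also shows the effect of an errors argument like the one added in this PR.

```python
import io

# Non-ASCII sample text, encoded as utf-8 bytes.
text = u'\u0432\u043e\u0439\u043d\u0430 \u0438 \u043c\u0438\u0440'
raw_bytes = text.encode('utf-8')

# With an explicit utf-8 encoding, the text round-trips.
decoded = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding='utf-8').read()

# With ascii, which some systems use as the default, decoding fails.
try:
    io.TextIOWrapper(io.BytesIO(raw_bytes), encoding='ascii').read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
replaced = io.TextIOWrapper(
    io.BytesIO(raw_bytes), encoding='ascii', errors='replace'
).read()
```

Pinning the encoding in the tests removes the dependency on whatever default the test machine happens to use.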

Re-enable seeking for S3

3e93fb1

It appears Py2.7 gzip still requires seeking. Py3 gzip does not. We're still supporting Py2.7, so we need seeking to work if we continue to use the existing gzip reader.

fixup for 3e93fb1: point unit tests at seekable S3 object

14588a5

menshikh-iv merged commit 170e295 into piskvorky:master Dec 6, 2017

menshikh-iv mentioned this pull request Dec 6, 2017

Reading S3 files becomes slow after 1.5.4 #152

Closed
