Unable to iterate over gzipped object on S3 #153
@mpenkov is it a new 1.5.4 bug, or was it there in earlier versions? I remember gzip from S3 worked in the past.
@piskvorky I rewrote the S3 subsystem as part of Issue #91. The S3 issues emerging in 1.5.4 are a consequence of that recent rewrite.
OK, we need to improve the testing then. @menshikh-iv can we do a bug fix release ASAP? This is critical (more so than the performance regression).
menshikh-iv pushed a commit that referenced this issue on Dec 6, 2017:
* use boto3 instead of boto in s3 tests
  Looks like moto doesn't work with boto anymore. Works fine with boto3.
* Override .readline() in s3.BufferedInputBase, increase buffer size
  The existing BufferedInputBase didn't override the .readline() method, forcing the superclass implementation to use .read() to read one byte at a time. This slowed reading down significantly. Also increased the buffer size to 256kB, this is consistent with s3transfer. http://boto3.readthedocs.io/en/latest/_modules/boto3/s3/transfer.html
* rewrite try..except as an if-else, it is faster that way
* get rid of unused and incorrect TEXT_NEWLINE
* Resolve Issue #154: don't create buckets if they don't exist
  When writing, check if the bucket exists and raise a ValueError if it doesn't.
* fixup for d453af2: create the correct bucket in top-level unit tests
* Resolve Issue #153: don't wrap GzipFile in contextlib.closing
  This is no longer necessary since we dropped support for Python 2.6. The GzipFile from Py2.7 and above is already a context manager, so the closing is not required. https://docs.python.org/2/library/gzip.html
* Support errors keyword
* add some integration tests, focusing on S3 only for now
* Specify utf-8 encoding explicitly in tests
  If we don't specify it, smart_open will use the system encoding. This may be ascii on some systems (e.g. Py2.7), which will fail the test because it contains non-ascii characters.
* Refactored S3 subsystem to disable seeking
  Renamed RawReader to SeekableRawReader. Renamed BufferedInputBase to SeekableBufferedInputBase. Introduced new, non-seekable RawReader and BufferedInputBase. Seeking functionality was strictly necessary while we were supporting Py2.6, because the gzip reader required it back then. The gzip reader from 2.7 onwards does not require seeking, so neither do we. Seeking is convenient, but appears to be slower, so disabling it for now is the right thing to do.
* Re-enable seeking for S3
  It appears Py2.7 gzip still requires seeking. Py3 gzip does not. We're still supporting Py2.7, so we need seeking to work if we continue to use the existing gzip reader.
* fixup for 3e93fb1: point unit tests at seekable S3 object
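For readers following the "don't wrap GzipFile in contextlib.closing" item, here is a minimal sketch of the behaviour it relies on: `gzip.GzipFile` can be used directly in a `with` statement and iterated line by line. This is not smart_open's code. An in-memory `io.BytesIO` stands in for the S3 object, and the helper names `make_gzipped_stream` and `iterate_gzipped` are hypothetical, chosen only for illustration.

```python
import gzip
import io


def make_gzipped_stream(lines):
    """Return an in-memory binary stream of gzip-compressed lines.

    Hypothetical helper: the BytesIO object is only a stand-in for the
    body of an S3 object, so the sketch runs without network access.
    """
    buf = io.BytesIO()
    # GzipFile is itself a context manager, so no contextlib.closing is needed.
    with gzip.GzipFile(fileobj=buf, mode="wb") as fout:
        for line in lines:
            fout.write(line)
    buf.seek(0)  # rewind so the reader starts at the gzip header
    return buf


def iterate_gzipped(stream):
    """Yield decompressed lines (as bytes) from a binary stream."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as fin:
        for line in fin:
            yield line


raw = make_gzipped_stream([b"first\n", b"second\n"])
lines = list(iterate_gzipped(raw))
```

The sketch says nothing about whether the underlying stream must be seekable. As the last items of the commit message note, that depends on the Python version's gzip reader.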
Encountered while reproducing issue #152.
Given reproduce.py:
This reproduces the bug: