
Unable to iterate over gzipped object on S3 #153

Closed
mpenkov opened this issue Dec 1, 2017 · 3 comments · Fixed by #155

mpenkov commented Dec 1, 2017

Encountered while reproducing issue #152.

Given reproduce.py:

import sys

from tqdm import tqdm
from smart_open import smart_open
for i, _ in enumerate(tqdm(smart_open(sys.argv[1], 'rb'))):
    if i > 1e5:
        break

This reproduces the bug:

(smartopen)sergeyich:issue152 misha$ git rev-parse --short HEAD
d10166c
(smartopen)sergeyich:issue152 misha$ time python reproduce.py s3://commoncrawl/crawl-002/2010/09/25/0/1285411480200_0.arc.gz
0it [00:00, ?it/s]Traceback (most recent call last):
  File "reproduce.py", line 5, in <module>
    for i, _ in enumerate(tqdm(smart_open(sys.argv[1], 'rb'))):
  File "/Users/misha/envs/smartopen/lib/python3.6/site-packages/tqdm/_tqdm.py", line 953, in __iter__
    for obj in iterable:
TypeError: 'closing' object is not iterable


real    0m3.131s
user    0m0.906s
sys     0m0.151s
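For context, the failure is reproducible without S3 at all: `contextlib.closing` only provides `__enter__`/`__exit__` and does not forward `__iter__`, so iterating over the wrapper raises a `TypeError` even though the wrapped `GzipFile` is perfectly iterable. A minimal standalone sketch:

```python
# Standalone illustration of the failure mode (no S3 involved):
# contextlib.closing wraps an object for use in a `with` statement,
# but the wrapper itself is not iterable.
import contextlib
import gzip
import io

# Build a small gzipped payload in memory.
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode='wb') as gz:
    gz.write(b'line 1\nline 2\n')
compressed.seek(0)

wrapped = contextlib.closing(gzip.GzipFile(fileobj=compressed, mode='rb'))
try:
    for line in wrapped:  # raises TypeError, exactly as in the traceback above
        pass
except TypeError as e:
    print(e)  # 'closing' object is not iterable
```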
@piskvorky piskvorky added the bug label Dec 2, 2017
piskvorky (Owner) commented:
@mpenkov is it a new 1.5.4 bug, or was it there in earlier versions? I remember gzip from S3 worked in the past.


mpenkov commented Dec 2, 2017

@piskvorky I rewrote the S3 subsystem as part of issue #91.

The S3 issues emerging in 1.5.4 are a consequence of that recent rewrite.


piskvorky commented Dec 3, 2017

OK, we need to improve the testing then.

@menshikh-iv can we do a bug fix release ASAP? This is critical (more so than the performance regression).

menshikh-iv pushed a commit that referenced this issue Dec 6, 2017

* use boto3 instead of boto in s3 tests

Looks like moto doesn't work with boto anymore.
Works fine with boto3.

* Override .readline() in s3.BufferedInputBase, increase buffer size

The existing BufferedInputBase didn't override the .readline() method,
forcing the superclass implementation to use .read() to read one byte at
a time. This slowed reading down significantly.

Also increased the buffer size to 256 kB; this is consistent with
s3transfer.

http://boto3.readthedocs.io/en/latest/_modules/boto3/s3/transfer.html
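A hedged sketch of the idea behind this change (the class and helper names here are illustrative, not smart_open's actual API): buffer large chunks from the raw source and slice complete lines off the front, instead of letting `io.BufferedIOBase` fall back to one-byte `read()` calls:

```python
import io

DEFAULT_BUFFER_SIZE = 256 * 1024  # 256 kB, matching the s3transfer size mentioned above

class ChunkReader(io.BufferedIOBase):
    """Illustrative reader that overrides readline() to avoid byte-at-a-time reads."""

    def __init__(self, raw, buffer_size=DEFAULT_BUFFER_SIZE):
        self._raw = raw
        self._buffer = b''
        self._buffer_size = buffer_size

    def _fill(self):
        # Fetch one large chunk from the raw source; returns bytes fetched (0 at EOF).
        chunk = self._raw.read(self._buffer_size)
        self._buffer += chunk
        return len(chunk)

    def readline(self, limit=-1):
        # Keep fetching chunks until the buffer holds a newline or EOF is hit,
        # then return everything up to and including that newline.
        while b'\n' not in self._buffer:
            if not self._fill():
                break
        pos = self._buffer.find(b'\n')
        end = len(self._buffer) if pos < 0 else pos + 1
        line, self._buffer = self._buffer[:end], self._buffer[end:]
        return line

reader = ChunkReader(io.BytesIO(b'alpha\nbeta\ngamma'))
print(reader.readline())  # b'alpha\n'
```

With the superclass fallback, each `readline()` would have cost one `read(1)` call per byte; here it costs at most a handful of large reads.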

* rewrite try..except as an if-else; it is faster that way

* get rid of unused and incorrect TEXT_NEWLINE

* Resolve Issue #154: don't create buckets if they don't exist

When writing, check if the bucket exists and raise a ValueError if it
doesn't.

* fixup for d453af2: create the correct bucket in top-level unit tests

* Resolve Issue #153: don't wrap GzipFile in contextlib.closing

This is no longer necessary since we dropped support for Python 2.6.
The GzipFile from Py2.7 and above is already a context manager, so the
closing is not required.

https://docs.python.org/2/library/gzip.html
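To illustrate with plain stdlib (no smart_open involved): `GzipFile` has been a context manager since Python 2.7, and unlike the `closing` wrapper it also supports iteration directly:

```python
import gzip
import io

# Write a small gzipped payload into memory.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'one\ntwo\n')
buf.seek(0)

# GzipFile is its own context manager, so no contextlib.closing is needed,
# and iterating over it line by line works as expected.
with gzip.GzipFile(fileobj=buf, mode='rb') as gz:
    lines = list(gz)
print(lines)  # [b'one\n', b'two\n']
```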

* Support errors keyword

* add some integration tests, focusing on S3 only for now

* Specify utf-8 encoding explicitly in tests

If we don't specify it, smart_open will use the system encoding.  This
may be ASCII on some systems (e.g. Py2.7), and the test will then fail
because the test data contains non-ASCII characters.
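A small stdlib illustration of the point (the sample string is made up): decoding with an explicit `encoding='utf-8'` succeeds regardless of what the locale default happens to be, whereas an ASCII default would raise a `UnicodeDecodeError` on the same bytes:

```python
import io

data = 'héllo wörld'.encode('utf-8')

# Explicit encoding: decodes correctly no matter the system locale.
text = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8').read()
print(text)  # héllo wörld

# An ASCII decoder (what some systems default to) fails on the same bytes.
try:
    data.decode('ascii')
except UnicodeDecodeError as e:
    print('ascii decode failed:', e.reason)
```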

* Refactored S3 subsystem to disable seeking

Renamed RawReader to SeekableRawReader.
Renamed BufferedInputBase to SeekableBufferedInputBase.
Introduced new, non-seekable RawReader and BufferedInputBase.
Seeking functionality was strictly necessary while we were supporting
Py2.6, because the gzip reader required it back then.  The gzip reader
from 2.7 onwards does not require seeking, so neither do we.

Seeking is convenient, but appears to be slower, so disabling it for now
is the right thing to do.

* Re-enable seeking for S3

It appears Py2.7 gzip still requires seeking.  Py3 gzip does not.
We're still supporting Py2.7, so we need seeking to work if we continue
to use the existing gzip reader.
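A quick stdlib check of the Py3 half of that claim (the `ForwardOnly` wrapper below is a contrived stand-in for a non-seekable network stream, not anything from smart_open): on Python 3, `gzip.GzipFile` can decompress using forward reads only:

```python
import gzip
import io

# Prepare a gzipped payload in memory.
payload = io.BytesIO()
with gzip.GzipFile(fileobj=payload, mode='wb') as gz:
    gz.write(b'hello\n')

class ForwardOnly(io.RawIOBase):
    """Wraps a BytesIO but refuses to seek, like a streaming network body."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        return self._inner.readinto(b)

    def seekable(self):
        return False

# Reading the gzip stream never needs to seek on Python 3.
stream = ForwardOnly(payload.getvalue())
with gzip.GzipFile(fileobj=stream, mode='rb') as gz:
    data = gz.read()
print(data)  # b'hello\n'
```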

* fixup for 3e93fb1: point unit tests at seekable S3 object