
Unable to iterate over gzipped object on S3 #153

Closed
mpenkov opened this issue Dec 1, 2017 · 3 comments · Fixed by #155

mpenkov commented Dec 1, 2017

Encountered while reproducing issue #152.

Given reproduce.py:

import sys

from tqdm import tqdm
from smart_open import smart_open
for i, _ in enumerate(tqdm(smart_open(sys.argv[1], 'rb'))):
    if i > 1e5:
        break

This reproduces the bug:

(smartopen)sergeyich:issue152 misha$ git rev-parse --short HEAD
d10166c
(smartopen)sergeyich:issue152 misha$ time python reproduce.py s3://commoncrawl/crawl-002/2010/09/25/0/1285411480200_0.arc.gz
0it [00:00, ?it/s]Traceback (most recent call last):
  File "reproduce.py", line 5, in <module>
    for i, _ in enumerate(tqdm(smart_open(sys.argv[1], 'rb'))):
  File "/Users/misha/envs/smartopen/lib/python3.6/site-packages/tqdm/_tqdm.py", line 953, in __iter__
    for obj in iterable:
TypeError: 'closing' object is not iterable


real    0m3.131s
user    0m0.906s
sys     0m0.151s
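For context, the failure is reproducible without S3 at all: `contextlib.closing` only provides `__enter__`/`__exit__` and does not forward `__iter__`, so iterating over the wrapper raises a `TypeError` even though the wrapped `GzipFile` is perfectly iterable. A minimal standalone sketch:

```python
# Standalone illustration of the failure mode (no S3 involved):
# contextlib.closing wraps an object for use in a `with` statement,
# but the wrapper itself is not iterable.
import contextlib
import gzip
import io

# Build a small gzipped payload in memory.
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode='wb') as gz:
    gz.write(b'line 1\nline 2\n')
compressed.seek(0)

wrapped = contextlib.closing(gzip.GzipFile(fileobj=compressed, mode='rb'))
try:
    for line in wrapped:  # raises TypeError, exactly as in the traceback above
        pass
except TypeError as e:
    print(e)  # 'closing' object is not iterable
```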
@piskvorky piskvorky added the bug label Dec 2, 2017
piskvorky (Owner) commented:
@mpenkov is it a new 1.5.4 bug, or was it there in earlier versions? I remember gzip from S3 worked in the past.


mpenkov commented Dec 2, 2017

@piskvorky I rewrote the S3 subsystem as part of issue #91.

The S3 issues emerging in 1.5.4 are a consequence of that recent rewrite.


piskvorky commented Dec 3, 2017

OK, we need to improve the testing then.

@menshikh-iv can we do a bug fix release ASAP? This is critical (more so than the performance regression).

menshikh-iv pushed a commit that referenced this issue Dec 6, 2017

* use boto3 instead of boto in s3 tests

Looks like moto doesn't work with boto anymore.
Works fine with boto3.

* Override .readline() in s3.BufferedInputBase, increase buffer size

The existing BufferedInputBase didn't override the .readline() method,
forcing the superclass implementation to use .read() to read one byte at
a time. This slowed reading down significantly.

Also increased the buffer size to 256 kB; this is consistent with
s3transfer.

http://boto3.readthedocs.io/en/latest/_modules/boto3/s3/transfer.html
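A hedged sketch of the idea behind this change (the class and helper names here are illustrative, not smart_open's actual API): buffer large chunks from the raw source and slice complete lines off the front, instead of letting `io.BufferedIOBase` fall back to one-byte `read()` calls:

```python
import io

DEFAULT_BUFFER_SIZE = 256 * 1024  # 256 kB, matching the s3transfer size mentioned above

class ChunkReader(io.BufferedIOBase):
    """Illustrative reader that overrides readline() to avoid byte-at-a-time reads."""

    def __init__(self, raw, buffer_size=DEFAULT_BUFFER_SIZE):
        self._raw = raw
        self._buffer = b''
        self._buffer_size = buffer_size

    def _fill(self):
        # Fetch one large chunk from the raw source; returns bytes fetched (0 at EOF).
        chunk = self._raw.read(self._buffer_size)
        self._buffer += chunk
        return len(chunk)

    def readline(self, limit=-1):
        # Keep fetching chunks until the buffer holds a newline or EOF is hit,
        # then return everything up to and including that newline.
        while b'\n' not in self._buffer:
            if not self._fill():
                break
        pos = self._buffer.find(b'\n')
        end = len(self._buffer) if pos < 0 else pos + 1
        line, self._buffer = self._buffer[:end], self._buffer[end:]
        return line

reader = ChunkReader(io.BytesIO(b'alpha\nbeta\ngamma'))
print(reader.readline())  # b'alpha\n'
```

With the superclass fallback, each `readline()` would have cost one `read(1)` call per byte; here it costs at most a handful of large reads.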

* rewrite try..except as an if-else; it is faster that way

* get rid of unused and incorrect TEXT_NEWLINE

* Resolve Issue #154: don't create buckets if they don't exist

When writing, check if the bucket exists and raise a ValueError if it
doesn't.

* fixup for d453af2: create the correct bucket in top-level unit tests

* Resolve Issue #153: don't wrap GzipFile in contextlib.closing

This is no longer necessary since we dropped support for Python 2.6.
The GzipFile from Py2.7 and above is already a context manager, so the
closing is not required.

https://docs.python.org/2/library/gzip.html
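To illustrate with plain stdlib (no smart_open involved): `GzipFile` has been a context manager since Python 2.7, and unlike the `closing` wrapper it also supports iteration directly:

```python
import gzip
import io

# Write a small gzipped payload into memory.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'one\ntwo\n')
buf.seek(0)

# GzipFile is its own context manager, so no contextlib.closing is needed,
# and iterating over it line by line works as expected.
with gzip.GzipFile(fileobj=buf, mode='rb') as gz:
    lines = list(gz)
print(lines)  # [b'one\n', b'two\n']
```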

* Support errors keyword

* add some integration tests, focusing on S3 only for now

* Specify utf-8 encoding explicitly in tests

If we don't specify it, smart_open will use the system encoding.  This
may be ASCII on some systems (e.g. Py2.7), and the test will then fail
because the test data contains non-ASCII characters.
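A small stdlib illustration of the point (the sample string is made up): decoding with an explicit `encoding='utf-8'` succeeds regardless of what the locale default happens to be, whereas an ASCII default would raise a `UnicodeDecodeError` on the same bytes:

```python
import io

data = 'héllo wörld'.encode('utf-8')

# Explicit encoding: decodes correctly no matter the system locale.
text = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8').read()
print(text)  # héllo wörld

# An ASCII decoder (what some systems default to) fails on the same bytes.
try:
    data.decode('ascii')
except UnicodeDecodeError as e:
    print('ascii decode failed:', e.reason)
```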

* Refactored S3 subsystem to disable seeking

Renamed RawReader to SeekableRawReader.
Renamed BufferedInputBase to SeekableBufferedInputBase.
Introduced new, non-seekable RawReader and BufferedInputBase.
Seeking functionality was strictly necessary while we were supporting
Py2.6, because the gzip reader required it back then.  The gzip reader
from 2.7 onwards does not require seeking, so neither do we.

Seeking is convenient, but appears to be slower, so disabling it for now
is the right thing to do.

* Re-enable seeking for S3

It appears Py2.7 gzip still requires seeking.  Py3 gzip does not.
We're still supporting Py2.7, so we need seeking to work if we continue
to use the existing gzip reader.
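A quick stdlib check of the Py3 half of that claim (the `ForwardOnly` wrapper below is a contrived stand-in for a non-seekable network stream, not anything from smart_open): on Python 3, `gzip.GzipFile` can decompress using forward reads only:

```python
import gzip
import io

# Prepare a gzipped payload in memory.
payload = io.BytesIO()
with gzip.GzipFile(fileobj=payload, mode='wb') as gz:
    gz.write(b'hello\n')

class ForwardOnly(io.RawIOBase):
    """Wraps a BytesIO but refuses to seek, like a streaming network body."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        return self._inner.readinto(b)

    def seekable(self):
        return False

# Reading the gzip stream never needs to seek on Python 3.
stream = ForwardOnly(payload.getvalue())
with gzip.GzipFile(fileobj=stream, mode='rb') as gz:
    data = gz.read()
print(data)  # b'hello\n'
```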

* fixup for 3e93fb1: point unit tests at seekable S3 object