Refactoring smart_open to share compression and encoding functionality #185
Conversation
All of our transport methods (file, S3, HDFS, WebHDFS) now read and write bytes. A shared compression layer sits on top of that and performs compression and decompression transparently. A shared encoding layer sits on top of that and performs encoding and decoding transparently. The benefit of this approach is that it decouples the actual I/O, compression, and encoding into independent layers: the I/O layers only have to worry about binary I/O; compression all happens in one place, so adding new codecs is simple; and encoding also happens in one place, with the same benefits (see the sketch below).

Other things I did:

- ripped out S3 text I/O; we do not need this anymore
- rewrote HDFS as a separate io.IOBase-based module
- split out the http module
- rewrote the WebHDFS subsystem based on io.IOBase
- got rid of some unused imports
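To make the layering concrete, here is a minimal sketch (open_layered and its parameters are invented for this illustration and are not smart_open's actual API; only gzip.GzipFile and io.TextIOWrapper are real):

import gzip
import io

def open_layered(binary_stream, compression=None, encoding=None):
    # Transport layer: binary_stream is whatever file/S3/HDFS/WebHDFS
    # hands back; it only needs to produce bytes.
    stream = binary_stream
    if compression == 'gzip':
        # Compression layer: decompresses the byte stream transparently.
        stream = gzip.GzipFile(fileobj=stream)
    if encoding is not None:
        # Encoding layer: decodes bytes into text transparently.
        stream = io.TextIOWrapper(stream, encoding=encoding)
    return stream

# The caller never needs to know which transport produced the bytes.
with open('example.txt.gz', 'rb') as raw:  # placeholder local file
    print(open_layered(raw, compression='gzip', encoding='utf-8').read())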
This needs a separate implementation of seek to work under Py2. gzip with hdfs/http wasn't supported at all prior to the refactoring, so it is a separate issue.
By default, the system encoding is used when opening a file. The tests expect the encoding to be UTF-8, so if the system encoding happens to be anything else, the tests will fail. Some Py2 installations use ascii as the default encoding.
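For example (an illustration only, not the project's actual test code), you can inspect the platform default and opt out of it by passing the encoding explicitly:

import io
import locale

# The codec used for implicit text decoding; on some Py2 installations
# this is 'ascii' or another non-UTF-8 locale codec.
print(locale.getpreferredencoding())

# An explicit encoding decouples the read from the system default
# ('example.txt' is a placeholder path).
with io.open('example.txt', encoding='utf-8') as fin:
    text = fin.read()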
LGTM from me, but please check that the changes for HDFS/HTTP work as expected (at least manually). Later we need to resolve #151.
integration-tests/test_http.py (outdated):

self.assertTrue(text.startswith('В начале июля, в чрезвычайно'.encode('utf-8')))
self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'.encode('utf-8')))

@unittest.skipIf(six.PY2, 'gzip support does not work on Py2')
Is this true? I'm pretty sure I've been opening .gz files with smart_open in Python 2.
Yes, it's true. The master branch of smart_open currently has limited support for gzip: it works for local files and S3 only, regardless of which Python version you have installed. To the best of my understanding, on-the-fly gzip decompression never worked for HTTP, WebHDFS and HDFS. You can confirm this by running these same integration tests against master. You'll get an error similar to the following:
======================================================================
ERROR: test_read_gzip_text (__main__.ReadTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "integration-tests/test_http_copy.py", line 47, in test_read_gzip_text
text = fin.read()
File "/Users/misha/envs/smartopen2/lib/python2.7/codecs.py", line 486, in read
newdata = self.stream.read()
File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 261, in read
self._read(readsize)
File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 295, in _read
pos = self.fileobj.tell() # Save current position
UnsupportedOperation: seek
Basically, Py2.7 gzip expects a .seek() operation to be implemented on the file object. Until someone explicitly implements seeking for HTTP, we won't be able to use Py2.7 gzip.
@menshikh-iv Can you please double-check and correct me if I'm wrong?
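To illustrate the constraint, here is a toy reproduction (the NonSeekable class is invented here to mimic a non-seekable HTTP response body; it is not part of smart_open):

import gzip
import io

# Build a small gzip payload in memory so the example is self-contained.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as fout:
    fout.write(b'hello world')
payload = buf.getvalue()

class NonSeekable(io.RawIOBase):
    # Mimics an HTTP response body: readable, but not seekable.
    def __init__(self, data):
        self._data = io.BytesIO(data)
    def readable(self):
        return True
    def readinto(self, b):
        chunk = self._data.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)
    def seekable(self):
        return False

# Py2.7's gzip calls fileobj.tell() (which requires seek support) while
# reading, so this line raises io.UnsupportedOperation there; Py3's gzip
# handles non-seekable inputs and prints b'hello world'.
print(gzip.GzipFile(fileobj=NonSeekable(payload)).read())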
There's some code here (https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/smart_open_lib.py#L756) to address the seek issue, but it doesn't seem to be helping, because the integration test above is failing.
@mpenkov I checked, it works fine with py2 and smart_open==1.5.7, using this code:
import subprocess
import time

import smart_open

# Serve the current directory over HTTP (SimpleHTTPServer is the Py2
# module name).
port = 8008
command = ['python', '-m', 'SimpleHTTPServer', str(port)]
s = subprocess.Popen(command)
time.sleep(1)  # give the server a moment to start

# Stream a gzipped file over HTTP, decoding it as UTF-8 on the fly.
url = 'http://localhost:%d/smart_open/tests/test_data/crlf_at_1k_boundary.warc.gz' % port
with smart_open.smart_open(url, encoding='utf-8') as fin:
    text = fin.read()
print(text)
s.terminate()
@piskvorky @menshikh-iv Thanks for checking! I can confirm your code works. I will investigate and fix.
I had a closer look at why gzip was working in Py2 despite the lack of seek. Unfortunately, it works at the expense of streaming functionality: this line reads the entire file into memory before gzip-decompressing it. We could reimplement the same thing in the refactored branch, but is it worth it? We would be silently surrendering the benefit of streaming, which could cause out-of-memory situations on the user's side if the file is sufficiently large.
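For reference, the workaround amounts to something like the sketch below (the helper name is made up; the point is only the trade-off): buffer the whole byte stream in memory so that gzip's seek()/tell() calls succeed.

import gzip
import io

def gzip_from_nonseekable(stream):
    # Slurping the entire compressed stream into a seekable in-memory
    # buffer satisfies Py2.7 gzip's seek()/tell() calls, but the whole
    # file now sits in RAM: streaming is gone, and a sufficiently large
    # file will exhaust memory.
    return gzip.GzipFile(fileobj=io.BytesIO(stream.read()))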
@piskvorky @menshikh-iv What do you think is the best way to proceed?
No -- a lack of streaming is definitely a bug. Can you open an issue for it?
Thanks for investigating @mpenkov! It's a pleasure to work with such knowledgeable and dedicated people.
@piskvorky Thank you! I've opened #189.
Sorry for misleading you earlier; my first investigation overlooked this buffering detail.