
Refactoring smart_open to share compression and encoding functionality #185

Merged: 14 commits, Apr 15, 2018

Conversation

@mpenkov (Collaborator) commented on Apr 7, 2018:

All of our transport methods (file, S3, HDFS, WebHDFS) now read and
write bytes. A shared compression layer sits on top of that and
performs compression and decompression transparently. A shared encoding
layer sits on top of that and performs encoding and decoding
transparently.

The benefit of this approach is that it decouples actual I/O,
compression and encoding into independent layers. The I/O layers now
have to worry only about binary I/O. The compression all happens in one
place, so adding new codecs is simple. Finally, encoding also happens
in one place, with the same benefits.
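
To make the layering concrete, here is a minimal sketch of the idea
using only the standard library (illustrative only, not smart_open's
actual internals):

import codecs
import gzip
import io

# Layer 1: transport. Anything that yields raw bytes -- a BytesIO here,
# but the file/S3/HDFS/WebHDFS readers play exactly the same role.
payload = io.BytesIO()
with gzip.GzipFile(fileobj=payload, mode='wb') as fout:
    fout.write(u'hello world'.encode('utf-8'))
binary_stream = io.BytesIO(payload.getvalue())

# Layer 2: compression. Wraps the byte stream, decompressing on the fly.
decompressed = gzip.GzipFile(fileobj=binary_stream, mode='rb')

# Layer 3: encoding. Decodes the decompressed bytes into unicode text.
text_stream = codecs.getreader('utf-8')(decompressed)

print(text_stream.read())  # u'hello world'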

Other things I did:

  • ripped out S3 text I/O; we no longer need it
  • rewrote HDFS as a separate io.IOBase-based module (see the sketch after this list)
  • split out the http module
  • rewrote the WebHDFS subsystem based on io.IOBase
  • got rid of some unused imports
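
For reference, a rough sketch of the io.IOBase pattern those rewrites
follow; the class below is a toy I made up for illustration, not the
actual HDFS/WebHDFS code:

import io

class ToyRawReader(io.RawIOBase):
    """A made-up, minimal transport reader built on io.RawIOBase.

    Deriving from the io hierarchy is what lets the standard library
    stack buffering, decompression and text decoding on top of it.
    """

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

reader = io.BufferedReader(ToyRawReader(b'raw bytes from some transport'))
print(reader.read())  # b'raw bytes from some transport'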

@mpenkov requested a review from @menshikh-iv on April 7, 2018.
mpenkov added 3 commits on April 7, 2018:

- This needs a separate implementation of seek to work under Py2.
- gzip with hdfs/http wasn't supported at all prior to the refactoring, so it is a separate issue.
- By default, the system encoding is used when opening a file. The tests expect the encoding to be UTF-8, so if the system encoding happens to be anything else, the tests will fail. Some Py2 installations use ascii as the default encoding (see the illustration below).
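
The third commit message is easy to verify directly. A small
illustration (whether smart_open consults exactly this locale call is
my assumption):

import locale

# Many systems report 'UTF-8' here, but some Py2 installations report
# 'ascii', and Windows boxes often report an encoding like 'cp1252'.
# That is why the tests request UTF-8 explicitly, e.g.:
#
#     smart_open.smart_open(url, encoding='utf-8')
#
print(locale.getpreferredencoding())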
@menshikh-iv (Contributor) left a comment:

LGTM,

but please check that the changes for HDFS/HTTP work as expected (at least manually); later we need to resolve #151

self.assertTrue(text.startswith('В начале июля, в чрезвычайно'.encode('utf-8')))
self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'.encode('utf-8')))

@unittest.skipIf(six.PY2, 'gzip support does not work on Py2')
@piskvorky (Owner) commented:

Is this true? I'm pretty sure I've been opening .gz files with smart_open in Python 2.

@mpenkov (Collaborator, Author) commented on Apr 14, 2018:

Yes, it's true. The master branch of smart_open currently has limited support for gzip: it works for local files and S3 only, regardless of which Python version you have installed. To the best of my understanding, on-the-fly gzip decompression never worked for HTTP, WebHDFS and HDFS. You can confirm this by running these same integration tests against master. You'll get an error similar to the following:

======================================================================
ERROR: test_read_gzip_text (__main__.ReadTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "integration-tests/test_http_copy.py", line 47, in test_read_gzip_text
    text = fin.read()
  File "/Users/misha/envs/smartopen2/lib/python2.7/codecs.py", line 486, in read
    newdata = self.stream.read()
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 295, in _read
    pos = self.fileobj.tell()   # Save current position
UnsupportedOperation: seek

Basically, Py2.7 gzip expects a .seek() operation to be implemented on the file object. Until someone explicitly implements seeking for HTTP, we won't be able to use Py2.7 gzip.
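
The failure is reproducible without any network involved. A minimal
sketch; the NonSeekableReader class is a stand-in I made up for an
HTTP response object:

import gzip
import io

class NonSeekableReader(io.RawIOBase):
    """Stand-in for an HTTP response: readable, but refuses to seek."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

    def seekable(self):
        return False

payload = io.BytesIO()
with gzip.GzipFile(fileobj=payload, mode='wb') as fout:
    fout.write(b'hello world')

fin = gzip.GzipFile(fileobj=NonSeekableReader(payload.getvalue()), mode='rb')
fin.read()  # Py2.7: io.UnsupportedOperation: seek, because gzip calls tell().
            # Py3's gzip buffers internally and reads this stream fine.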

@menshikh-iv Can you please double-check and correct me if I'm wrong?

@mpenkov (Collaborator, Author) commented:

There's some code here (https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/smart_open_lib.py#L756) to address the seek issue, but it doesn't seem to help: the integration test above still fails.

@menshikh-iv (Contributor) commented on Apr 14, 2018:

@mpenkov I checked; it works fine with Py2 and smart_open==1.5.7, using this code:

import subprocess
import smart_open
import time

port = 8008

# Serve the current directory over HTTP (Py2's built-in SimpleHTTPServer).
command = ['python', '-m', 'SimpleHTTPServer', str(port)]
s = subprocess.Popen(command)
time.sleep(1)  # give the server a moment to start

url = 'http://localhost:%d/smart_open/tests/test_data/crlf_at_1k_boundary.warc.gz' % port
with smart_open.smart_open(url, encoding='utf-8') as fin:
    text = fin.read()

print(text)
s.terminate()

@mpenkov (Collaborator, Author) commented:

@piskvorky @menshikh-iv Thanks for checking! I can confirm your code works. I will investigate and fix.

@mpenkov (Collaborator, Author) commented:

I had a closer look at why gzip was working in Py2 despite the lack of seek. Unfortunately, it seems to work at the expense of streaming: this line reads the entire file into memory before gzip-decompressing. We could reimplement the same thing in the refactored branch, but is it worth it? We would be surrendering the benefit of streaming without the user knowing it, which could cause out-of-memory situations on the user's side if the file is sufficiently large.
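
For context, the buffering workaround under discussion amounts to
roughly the following sketch (illustrative; not the actual code on
master):

import gzip
import io

try:
    from urllib.request import urlopen  # Py3
except ImportError:
    from urllib2 import urlopen  # Py2

# Reusing the test server URL from the snippet above.
url = 'http://localhost:8008/smart_open/tests/test_data/crlf_at_1k_boundary.warc.gz'

response = urlopen(url)                 # non-seekable stream
buffered = io.BytesIO(response.read())  # the ENTIRE payload lands in memory
fin = gzip.GzipFile(fileobj=buffered, mode='rb')  # seek()/tell() now work
print(fin.read(100))  # but memory use now grows with the size of the file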

@piskvorky @menshikh-iv What do you think is the best way to proceed?

@piskvorky (Owner) commented on Apr 14, 2018:

No -- a lack of streaming is definitely a bug. Can you open an issue for it?

Thanks for investigating @mpenkov! It's a pleasure to work with such knowledgeable and dedicated people.

@mpenkov (Collaborator, Author) commented:

@piskvorky Thank you! I've opened #189.

Sorry for misleading you earlier; my first investigation overlooked this buffering detail.
