-
-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create separate compression-specific layer to enable writing gzipped files #131
Conversation
Gzip functionality broken in 2.6 because gzip tries to seek around the file, which is not possible.
AttributeError: 'module' object has no attribute 'test_smart_open'
Epic! @menshikh-iv let's review ASAP. We should probably clean up and merge the smaller PRs first, since this will be a major update and release. @mpenkov what kind of backward incompatibilities do you expect here? |
@piskvorky I don't really understand the s3u stuff that was added in the meanwhile. Since that's currently effectively disabled in this branch, there is a chance of an incompatibility there. Other than that, I don't expect these changes to break anything - they still pass pretty much the same tests and work the same way. |
# Conflicts: # requirements-test.txt # smart_open/smart_open_lib.py # smart_open/tests/test_gzipstreamfile.py # smart_open/tests/test_smart_open.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
smart_open/s3.py
Outdated
# | ||
# https://github.com/matthewwithanm/django-imagekit/commit/0d5bfe37517955cca2284735769fe1dffb38ed37 | ||
# | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remove (we dropped 2.6 support)
smart_open/s3.py
Outdated
import six | ||
|
||
|
||
_LOGGER = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use logger
(everywhere)
|
||
return self._read_from_buffer(size) | ||
|
||
def read1(self, size=-1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's part of the interface for BufferedIOBase.
"""This is the same as read().""" | ||
return self.read(size=size) | ||
|
||
def readinto(self, b): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very strange, why we need this method with mutable b
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. We try to behave like a BufferedIOBase, which defines this method.
Some application downstream of smart_open may depend on this functionality.
return len(data) | ||
|
||
def terminate(self): | ||
"""Do nothing.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add pass
@menshikh-iv I made changes in response to your review. I left two items as they are because they are required to satisfy interface contracts. Let me know if I need to do anything else. Thanks! |
Thanks for clarification about |
This PR addresses issue #91. There are two major parts to this PR.
First, I rewrote all the lower-level S3 stuff using boto3 in a separate module, smart_open/s3.py. he reader and writer inherit from the io classes in the standard library, meaning that you can pretty much treat them like local file streams and pass them to codecs like gzip, bz2, etc.
Second, I integrated this module into smart_open_lib.py. This part required a rebase: I started working on this over a year ago, and much has happened since then. The part I couldn't handle well during the rebase was: there was a new
s3u
mode added, with code that depends on the old boto library. I couldn't port this part to boto3 - this functionality may be discontinued. The code is isolated in thesetup_unsecured_mode
function and is currently commented out.This takes care of #91 for S3 only. In the future, we should apply the same approach to other protocols like HDFS, more specifically: make sure readers and writers inherit from stdlib io classes.