Refactor compression code to expand URL support #14576
Conversation
Can you add tests for the new features you added (the compressed sources from URLs)?
```diff
@@ -276,7 +276,7 @@ def file_path_to_url(path):

 # ZipFile is not a context manager for <= 2.6
 # must be tuple index here since 2.6 doesn't use namedtuple for version_info
-if sys.version_info[1] <= 6:
+if compat.PY2 and sys.version_info[1] <= 6:
```
We don't support Python 2.6 anymore, so I think this check can be removed.
👍 removed in 28f19f1.
```python
f = _get_handle(f, 'r', encoding=self.encoding,
                compression=self.compression,
                memory_map=self.memory_map)
self.handles.append(f)
```
This line (adding each `f` to `handles`) will give problems, as only the handles we opened ourselves should be added, not handles passed by the user (previously those were only added inside the if statements).
You can see the effect in the failure of the `test_file_handles` test.
@jorisvandenbossche I attempted a fix in 2f43c39, but it's not very elegant -- logic is repeated in `_get_handle` and `PythonParser.__init__`.

What do you think is the best solution? In what situation do we not end up opening a handle ourselves? Is it a problem that in PY3 we pass `path_or_buf` through `io.TextIOWrapper` before adding it to `handles`?
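For reference, a minimal illustration of the PY3 wrapping being discussed, using a toy in-memory buffer rather than pandas' actual code path:

```python
import io

# A binary buffer standing in for path_or_buf in Python 3.
raw = io.BytesIO(b'a,b\n1,2\n')

# Wrapping with TextIOWrapper gives the parser a text-mode handle.
# Closing the wrapper also closes the underlying buffer, which is why
# tracking the wrapper in `handles` matters.
wrapped = io.TextIOWrapper(raw, encoding='utf-8')
assert wrapped.read() == 'a,b\n1,2\n'
```

Note that closing the wrapper closes the caller's buffer too, so whether the wrapper belongs in `handles` depends on who should own the underlying object.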
```diff
@@ -285,53 +285,84 @@ def ZipFile(*args, **kwargs):
 ZipFile = zipfile.ZipFile

-def _get_handle(path, mode, encoding=None, compression=None, memory_map=False):
-    """Gets file handle for given path and mode.
+def _get_handle(source, mode, encoding=None, compression=None,
```
Call this `path_or_buf`, in line with spellings elsewhere.
Addressed in 673bde0.
@jorisvandenbossche, do you want me to add xz/zip/bz2 compressed versions of …
Are we also going to have to do something to …
@dhimmel yes, that should be more general.
Create compressed versions of the salary dataset for testing pandas-dev#14576.

Rename `salary.table.csv` to `salaries.tsv` because the dataset is tab rather than comma delimited. Remove the word "table" because it's implied by the extension. Rename `salary.table.gz` to `salaries.tsv.gz`, since compressed files should append to, not strip, the original extension.

Created the new files by running the following commands:

```sh
cd pandas/io/tests/parser/data
bzip2 --keep salaries.tsv
xz --keep salaries.tsv
zip salaries.tsv.zip salaries.tsv
```
```python
url = self.base_url + '.zip'
url_table = read_table(url, compression="zip", engine="python")
tm.assert_frame_equal(url_table, self.local_table)

def test_compressed_urls(self):
```
@jorisvandenbossche, to try to reduce the repetition of the testing code, I made this commit (28373f1) which uses nose test generators. However, the nose docs state:

> Please note that method generators are not supported in unittest.TestCase subclasses.

As a result, the tests yielded by `test_compressed_urls` aren't actually running. I'm attempting to run the test using:

```sh
nosetests --verbosity=3 pandas/io/tests/parser/test_network.py:TestCompressedUrl.test_compressed_urls
```

Do you know the best way to proceed? I'm new to nose and pretty new to testing, so any advice would be appreciated.
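For illustration, a minimal sketch of the nose test-generator pattern in a plain class (not a `unittest.TestCase` subclass, where nose ignores method generators); the class and check are hypothetical stand-ins:

```python
class CheckCompressedUrl(object):
    # Plain class: nose collects method generators here, unlike in
    # unittest.TestCase subclasses.

    compressions = [None, 'gzip', 'bz2', 'xz', 'zip']

    def check(self, compression):
        # Placeholder assertion standing in for a real read_table call.
        assert compression in (None, 'gzip', 'bz2', 'xz', 'zip')

    def test_compressed_urls(self):
        # nose runs each yielded (callable, arg) pair as a separate test.
        for compression in self.compressions:
            yield self.check, compression
```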
We run yielded tests in `computation` and `io/pickle`. It will be much easier when we switch to pytest.
See those for some examples.
Cool, I see the test generator examples in `io/tests/test_pickle.py`, `computation/tests/test_compat.py`, and `computation/tests/test_eval.py`. Those examples don't have a `setUp` method, nor do they use the `pandas.util.testing.network` decorator. So I did my best in 272669b. It seems that one side effect is that the failure messages are less informative?
Current coverage is 85.29% (diff: 88.40%)

```
@@            master    #14576   diff @@
==========================================
  Files          144       144
  Lines        50980     50943     -37
  Methods          0         0
  Messages         0         0
  Branches         0         0
==========================================
- Hits         43470     43453     -17
+ Misses        7510      7490     -20
  Partials         0         0
```
Force-pushed from 272669b to 4a5d3e7.
```python
from io import TextIOWrapper
return TextIOWrapper(path_or_buf, encoding=encoding)

elif compression:
```
Can this be `infer` here? (IIRC we have an inference function based on the file extension.) Is this guaranteed to be a valid compression? (It's fine, though I think it's easy to handle here anyhow.)
Currently, this function (`_get_handle`) cannot consume `compression='infer'`. However, it would be an easy addition:

```python
if compression == 'infer':
    assert is_path
    compression = infer_compression(path_or_buf)
```

One issue I'm having is that `_get_handle` was undocumented before this PR, so it's a bit hard for me to know what to implement. @jreback or @jorisvandenbossche: can you comment on the tentative docstring and fill in the missing details, so I know exactly what to implement?
@dhimmel What is still missing in your current docstring?

Adding `compression='infer'` here only makes sense if it would be used, of course. Where does the inference of the compression happen now? Because maybe it is already inferred before it is passed to this function?
> Where does the inference of the compression happen now?

@jorisvandenbossche, inference currently happens in `pandas/io/parsers.py`. I don't think `compression == 'infer'` will ever make it this far with the current design. @jreback does this answer your question?
@dhimmel that's fine. I just want to validate somewhere that `compression in [None, ...]`, or raise a `ValueError`.
```python
def test_url_gz_infer(self):
    url = 'https://s3.amazonaws.com/pandas-test/salary.table.gz'
    url_table = read_table(url, compression="infer", engine="python")

class TestCompressedUrl:
```
Inherit from `object`. Ideally, if you can use some of this code in the parser routines that would be great (it may not be completely possible, but I want to consolidate some of that duplication).
Regarding … For example, …
I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.
```python
f = lzma.LZMAFile(path, mode)
f = lzma.LZMAFile(path_or_buf, mode)

# Unrecognized Compression
```
Ah, OK, we are doing the validation -- never mind then. (I assume we have some tests that hit this path.)
> assume we have some tests that hit this path

Actually, I don't think there were any tests for invalid compression (regardless of URL or not), so I added one in f2ce8f8.
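A minimal sketch of the validation the reviewers ask for, raising `ValueError` on an unrecognized compression; the function name and value set are illustrative assumptions:

```python
# Valid compression arguments discussed in this PR, plus 'infer'.
_VALID_COMPRESSIONS = {None, 'infer', 'gzip', 'bz2', 'zip', 'xz'}

def validate_compression(compression):
    """Return the compression argument, or raise ValueError if invalid."""
    if compression not in _VALID_COMPRESSIONS:
        raise ValueError(
            'Unrecognized compression: {!r}; valid values are {}'.format(
                compression,
                sorted(c for c in _VALID_COMPRESSIONS if c)))
    return compression
```

Validating once, early, means every downstream code path (URL or local) can assume the compression is one of the known values.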
```python
# in Python 3, convert BytesIO or fileobjects passed with an encoding
elif compat.PY3 and isinstance(f, compat.BytesIO):
    from io import TextIOWrapper

add_handle = (
```
Maybe this could be a utility function in `io/common` (unless it's not used anywhere else).
I think `add_handle` is only used here -- I created it for this PR, so nowhere else depends on it.

I'm still a little unsure about the purpose of `self.handles` -- should it only contain file handles? Because right now we're adding handles that aren't files (for example, in the case of `compat.PY3 and isinstance(f, compat.BytesIO)`).
The `self.handles` is to allow things to be closed if WE (meaning pandas) opened them, e.g. when you wrap with `TextIOWrapper` we NEED to close them. Otherwise we try to preserve open file handles (that were passed to us).
Another way to do this would be to handle this inside `_get_handle` and return from that function the handles that should be added. IMO that would be more maintainable, as now this boolean expression here depends on the content of the `_get_handle` function, so it's better to do it there.
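A rough sketch of that alternative, with a `_get_handle`-like function returning only the handles it opened itself; names and the simplified signature are hypothetical, not the PR's actual code:

```python
def get_handle(path_or_buf, mode='r', encoding=None):
    """Return (handle, handles_to_close).

    Only handles opened inside this function are returned in
    handles_to_close; handles passed in by the caller are not
    ours to close, so they are excluded.
    """
    handles = []
    if isinstance(path_or_buf, str):
        f = open(path_or_buf, mode, encoding=encoding)
        handles.append(f)  # we opened it, so we must close it later
    else:
        f = path_or_buf    # caller's open handle: preserve it
    return f, handles
```

The caller would then extend `self.handles` with the returned list, keeping the ownership decision next to the code that actually opens files.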
```python
'gzip': '.gz',
'bz2': '.bz2',
'zip': '.zip',
'xz': '.xz',
```
Is there a test for invalid compression specified?
Did you want a specific test for invalid compression specified for a URL? As of the latest commit (bff0f3e), compression inference has been consolidated and invalid compression raises an error before compression is attempted.
```python
content_encoding = req.headers.get('Content-Encoding', None)
if content_encoding == 'gzip':
    # Override compression based on Content-Encoding header
    compression = 'gzip'
```
> I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.

@jreback, `Content-Encoding` is not a versatile way to infer compression -- `gzip` is the only compression encoding that pandas and the Content-Encoding specification both support. I've modified it so inference on a URL uses the extension, just as a path would. However, if `Content-Encoding` is `gzip`, then compression is overridden. I don't expect this situation to arise very often. Does this make sense, or should we just totally ignore `Content-Encoding`?
Yeah, I agree with your assessment. Since this was here, we can leave it. But I agree the filename inference is better / more standard.
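The extension-first, header-override behavior described above can be sketched as a small pure function; the name and exact mapping are assumptions for illustration:

```python
def resolve_url_compression(url, headers):
    """Infer compression from the URL's extension, but let a gzip
    Content-Encoding header override it (gzip being the only encoding
    both the HTTP header and pandas' readers share)."""
    compression = None
    for ext, name in (('.gz', 'gzip'), ('.bz2', 'bz2'),
                      ('.zip', 'zip'), ('.xz', 'xz')):
        if url.lower().endswith(ext):
            compression = name
            break
    if headers.get('Content-Encoding') == 'gzip':
        compression = 'gzip'  # header wins over (or supplies) inference
    return compression
```

Keeping this logic pure (URL string and headers dict in, compression name out) also makes it easy to test without any network access.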
I'm getting failing builds in AppVeyor in Python 2 due to:

```python
def import_lzma():
    """ import the backported lzma library
    or raise ImportError if not available """
    from backports import lzma
    return lzma
```

Any idea why?

Update: I think I resolved this in af3c2bc.
Assuming the tests pass for 640dc1b, I don't think there are any more unaddressed comments.
@jreback rebased, with AppVeyor now passing.
@jorisvandenbossche: I created #14874 and #14875.
@jorisvandenbossche do you want me to squash all of my commits? If so, what should go in the commit message? Should I add the issues we're closing to the commit message?
@dhimmel you can squash, or it will get squashed on the merge. In this case, if you'd leave the original commit and squash all of our subsequent ones into a single (or possibly multiple) logically oriented commits, that would be great. We could in this case cherry-pick them (rather than completely squashing) to preserve authorship (as I think when we squash ALL, then authorship is the last one). If we only have 2 commits, a merge is easy.
I'm going to squash all of my commits. Grouping them will be pretty burdensome, and the intermediate commits were often not functional. I created a branch on …

Just to list things here: 640dc1b Simplify TextIOWrapper usage

36 commits total in this PR. Wow!
Part of pandas-dev#14576. Closes pandas-dev#12688. Closes pandas-dev#14570. Opens pandas-dev#14874.
Great, we're now at 2 commits. I think merge makes sense, because it is nice to see the diff of both commits combined. Alone, the changesets will be incomplete. Let me know if there's anything more for me to do.
@jreback You can use the GitHub interface as well to "rebase and merge" instead of squashing, which will just apply the commits in this PR to master. And so the two commits now in the PR are perfect!
@jorisvandenbossche right, but this is the exception to the rule. I will have a look in a bit. @dhimmel going to merge, and we can follow up on comments / issues in another PR.
Thanks @dhimmel, very nice! In your xref PR, please add a whatsnew entry that covers all of the issues that were closed (for 0.20.0).
@dhimmel thanks a lot!
Add what's new corresponding to pandas-dev#14576.
…th C engine (GH14874)

Follow up on #14576, which refactored compression code to expand URL support. Fixes up some small remaining issues and adds a what's new entry.

- [x] Closes #14874

Author: Daniel Himmelstein <daniel.himmelstein@gmail.com>

Closes #14880 from dhimmel/whats-new and squashes the following commits:

e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring
closes pandas-dev#14576 closes pandas-dev#12688 closes pandas-dev#14570 xref pandas-dev#14874
Work in progress. Picks up where #13340 left off.
closes #12688
closes #13340
closes #14570