Add Zip file functionality. Fixes #11413 #12103

Closed
lababidi wants to merge 11 commits

lababidi

closes #11413

This PR leverages Python's ZipFile functionality to automatically unzip files read into DataFrames using read_csv().


result = self.read_csv(open(path, 'rb'), compression='zip')
tm.assert_frame_equal(result, expected)

Contributor

need a test for multiple files in a zip (e.g. assert that the ValueError is raised)
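
A minimal sketch of what such a test could look like, using only the standard library. The names `make_zip` and `read_only_member` are hypothetical stand-ins for illustration, not part of this PR or of pandas; in the real test suite the call under test would be `read_csv(..., compression='zip')`.

```python
import io
import zipfile


def make_zip(members):
    """Build an in-memory zip archive from a {name: text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in members.items():
            zf.writestr(name, text)
    buf.seek(0)
    return buf


def read_only_member(buf):
    """Hypothetical stand-in for the reader under test."""
    with zipfile.ZipFile(buf) as zf:
        names = zf.namelist()
        if len(names) != 1:
            raise ValueError("expected one file, found %d" % len(names))
        return zf.read(names[0]).decode("utf-8")


def test_zip_with_multiple_files_raises():
    # Two members in one archive: the reader should refuse to guess.
    buf = make_zip({"a.csv": "x,y\n1,2\n", "b.csv": "x,y\n3,4\n"})
    try:
        read_only_member(buf)
    except ValueError:
        pass
    else:
        raise AssertionError("ValueError was not raised")
```

Building the archive in memory keeps the test free of fixture files on disk.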

@jreback added the Enhancement and IO Data labels on Jan 21, 2016
@jreback
Contributor

jreback commented Jan 21, 2016

as an aside, would you be interested in solving the issues brought up in #11666 and partially addressed in #11677 (not merged)?

@lababidi
Author

@jreback I would be happy to continue this to the other read_* functions. I like the idea of refactoring out the compression determination and the decompression steps, so that it can be used in all the read_* functions. I won't use the PR #11677 because it does not include tests or zip functionality. It also only focuses on pickles.
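
One way the refactoring described above could be shaped, as a rough sketch only: a helper that determines the compression and a second helper that opens the decompressed stream, so any read_* function could call them. The names `infer_compression` and `open_decompressed` and the extension table are assumptions made for illustration; they are not taken from this PR or from the pandas code base.

```python
import bz2
import gzip
import io
import os
import zipfile

# Hypothetical mapping from file extension to compression method.
_EXTENSIONS = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip"}


def infer_compression(path, compression="infer"):
    """Return the compression method, inferring it from the extension if asked."""
    if compression != "infer":
        return compression
    ext = os.path.splitext(path)[1].lower()
    return _EXTENSIONS.get(ext)  # None means "not compressed"


def open_decompressed(path, compression="infer"):
    """Return a binary file-like object that yields decompressed bytes."""
    method = infer_compression(path, compression)
    if method is None:
        return open(path, "rb")
    if method == "gzip":
        return gzip.open(path, "rb")
    if method == "bz2":
        return bz2.BZ2File(path, "rb")
    if method == "zip":
        with zipfile.ZipFile(path) as zf:
            names = zf.namelist()
            if len(names) != 1:
                raise ValueError(
                    "expected one file in zip archive, found %d" % len(names)
                )
            # Read eagerly so the buffer outlives the archive handle.
            return io.BytesIO(zf.read(names[0]))
    raise ValueError("unrecognized compression type: %r" % (method,))
```

A caller would then write `with open_decompressed(path) as f:` and hand `f` to whichever parser it needs, without knowing which compression was used.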

@lababidi
Author

@jreback Could you merge this in and I'll take care of the other requests?

@@ -61,9 +61,9 @@ class ParserWarning(Warning):
 dtype : Type name or dict of column -> type, default None
     Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
     (Unsupported with engine='python')
-compression : {'gzip', 'bz2', 'infer', None}, default 'infer'
+compression : {'gzip', 'bz2', 'zip', 'infer', None}, default 'infer'
Contributor

can you add an explanation that a .zip archive may contain only a single file

In the description for the parser, a warning/comment is made that a zip file may only contain one file that needs to be read in. If more than one file is compressed into the ZIP file, a ValueError is thrown.
source = zip_file.open(file_name)

elif len(zip_names) > 1:
    raise ValueError('Multiple files found in compressed '
Contributor

maybe just do else here (e.g. you can have 0 files in an archive?)
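
The suggestion above could look roughly like the following sketch: one branch for the single-file case and a plain `else` that rejects everything else, so an empty archive is caught along with a multi-file one. The function name `open_single_zip_member` and the error wording are hypothetical, and the member is read into memory here rather than returned via `zip_file.open(...)` as in the diff, which is a simplification for the example.

```python
import io
import zipfile


def open_single_zip_member(path_or_buffer):
    """Return the sole member of a zip archive as an in-memory binary buffer."""
    with zipfile.ZipFile(path_or_buffer) as zip_file:
        zip_names = zip_file.namelist()
        if len(zip_names) == 1:
            # Read eagerly so the result stays valid after the archive closes.
            return io.BytesIO(zip_file.read(zip_names[0]))
        else:
            # Covers both the empty archive and the multi-file archive.
            raise ValueError(
                "expected exactly one file in zip archive, found %d: %r"
                % (len(zip_names), zip_names)
            )
```

With `elif len(zip_names) > 1`, an archive with zero members would fall through without setting a source; the bare `else` closes that gap.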

Mahmoud Lababidi added 3 commits January 26, 2016 17:54
…est_gzip, test_bz2, test_zip. Add tests for python and c engines.
Conflicts:
	pandas/io/tests/test_parsers.py
@lababidi
Author

@jreback could you please merge this before you make any more changes to the tests?

@jreback
Contributor

jreback commented Jan 29, 2016

you need to rebase then I'll review

Mahmoud Lababidi and others added 5 commits January 29, 2016 08:53
In the description for the parser, a warning/comment is made that a zip file may only contain one file that needs to be read in. If more than one file is compressed into the ZIP file, a ValueError is thrown.
…est_gzip, test_bz2, test_zip. Add tests for python and c engines.
@lababidi
Author

@jreback Rebased. Thanks.

@jreback
Contributor

jreback commented Jan 29, 2016

you need to force push. you should only have your commits. see here

@lababidi
Author

@jreback I'm redoing this Pull Request. I will close this one and open a new PR.

@lababidi closed this on Jan 29, 2016
@jreback
Contributor

jreback commented Jan 29, 2016

ok, in the future, just push to the same one

@stoffprof

Can the same functionality be added to read_fwf() and other read_* methods?

@jreback
Contributor

jreback commented May 4, 2016

#11666 is related
#12688 is where this could be done, i.e. the code is spread out a bit. you're welcome to take a crack at it @Itzybitzy

@jreback
Contributor

jreback commented May 4, 2016

IOW, the compression interfaces need to be pulled out a bit from the parser code

@jreback added this to the 0.18.1 milestone on May 4, 2016
@stoffprof

Sorry @jreback, I just don't have the skills to do that (yet). Is there something good for a beginner to work on? Documentation maybe? (Not even sure where the right place to ask this is.)

@jreback
Contributor

jreback commented May 5, 2016

http://pandas.pydata.org/pandas-docs/stable/contributing.html

select the label "Difficulty Novice" and you will see lots of issues

Labels
Enhancement, IO Data
Development

Successfully merging this pull request may close these issues.

Why isn't zip compression included for read_csv?
3 participants