
ENH: add gzip/bz2 compression to read_pickle() (and perhaps other read_*() methods) #11666

Closed
gfairchild opened this issue Nov 20, 2015 · 14 comments
@gfairchild

Right now, read_csv() has a compression option, which allows the user to pass in a gzipped or bz2-compressed CSV file directly into Pandas to be read. It would be great if read_pickle() supported the same option. Pickles actually compress surprisingly well; I have a 567M Pandas pickle (resulting from DataFrame.to_pickle()) that packs down to 45M with pigz --best. An order of magnitude difference in size is pretty significant. This makes storing static pickles long-term as gzipped archives a very attractive option. Workflow would be made easier if Pandas could natively handle my dataframe.pickle.gz files in the same way it does compressed CSV files.

More generally, a compression option should probably be allowed for most read_* methods. Many of the read_* methods involve formats that compress very well.
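Before native support landed, the workaround was to go through Python's stdlib compression modules by hand. A minimal sketch of that round-trip, using a plain dict as a stand-in for a DataFrame so it needs only the standard library:

```python
import bz2
import gzip
import os
import pickle
import tempfile

# Highly repetitive data which, like many DataFrames, compresses well.
data = {"values": list(range(10_000)), "labels": ["row"] * 10_000}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "frame.pkl")
    gz_path = os.path.join(tmp, "frame.pkl.gz")
    bz2_path = os.path.join(tmp, "frame.pkl.bz2")

    # Write the same pickle uncompressed, gzipped, and bz2-compressed.
    with open(raw_path, "wb") as f:
        pickle.dump(data, f)
    with gzip.open(gz_path, "wb") as f:
        pickle.dump(data, f)
    with bz2.open(bz2_path, "wb") as f:
        pickle.dump(data, f)

    # Reading back only requires opening with the matching module.
    with gzip.open(gz_path, "rb") as f:
        restored = pickle.load(f)

    sizes = {p: os.path.getsize(p) for p in (raw_path, gz_path, bz2_path)}
```

This is the manual boilerplate that a `compression` keyword on `read_pickle()`/`to_pickle()` would absorb.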

@gfairchild changed the title from "ENH: add gzip/bz2 compression to read_pickle()" to "ENH: add gzip/bz2 compression to read_pickle() (and perhaps all other read_*() methods)" on Nov 20, 2015
@gfairchild changed the title from "ENH: add gzip/bz2 compression to read_pickle() (and perhaps all other read_*() methods)" to "ENH: add gzip/bz2 compression to read_pickle() (and perhaps other read_*() methods)" on Nov 20, 2015
@jreback
Contributor

jreback commented Nov 20, 2015

yeah, this wouldn't be hard for gzip/bz2

@jreback added the Enhancement, IO, Difficulty Novice, and Compat labels on Nov 20, 2015
@jreback jreback added this to the Next Major Release milestone Nov 20, 2015
@jreback
Contributor

jreback commented Nov 20, 2015

xref #5924

@maxnoe

maxnoe commented Feb 8, 2016

Yes please. Especially for read_json!

@goldenbull
Contributor

I like xz/lzma2 format for pickle format 😄
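The xz/lzma route is also available directly from the standard library; a minimal sketch of compressing a pickle in memory with `lzma`:

```python
import lzma
import pickle

payload = {"series": list(range(50_000))}

# xz/lzma trades slower compression for typically smaller output than gzip/bz2.
blob = lzma.compress(pickle.dumps(payload), preset=6)
restored = pickle.loads(lzma.decompress(blob))
```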

@jreback
Contributor

jreback commented May 24, 2016

@goldenbull pull-requests are welcome! (this is not very difficult, more of a bit of code reorg to share the compression code)
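The reorganization described here could look something like the following hypothetical helper: one function that maps a compression name (or an inferred file suffix) onto the right opener, which every `read_*` method could then share. This is a sketch of the idea, not pandas' actual internal implementation:

```python
import bz2
import gzip
import lzma

# Map a compression name to the stdlib function that opens such a file.
_OPENERS = {"gzip": gzip.open, "bz2": bz2.open, "xz": lzma.open, None: open}
# Map a file suffix to a compression name, for compression="infer".
_SUFFIXES = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz"}

def get_handle(path, mode="rb", compression="infer"):
    """Return a file object, transparently (de)compressing by name or suffix."""
    if compression == "infer":
        compression = next(
            (name for suffix, name in _SUFFIXES.items() if path.endswith(suffix)),
            None,  # no known suffix: treat the file as uncompressed
        )
    return _OPENERS[compression](path, mode)
```

With a shared helper like this, adding compression to another reader is mostly a matter of routing its file access through one code path.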

@gfairchild
Author

Fantastic! @jreback, are there plans by chance to implement compression support across all read_* methods since most of the formats compress well?

@jreback
Contributor

jreback commented Mar 9, 2017

@gfairchild what do you think is useful? keeping in mind read_csv already does this, read_hdf has in-line compression. read_sql doesn't support files. read_feather doesn't support it yet.

maybe read_excel, I suppose, though I wonder if that's common (you can't generally have really large Excel files, so does this really matter?).
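For reference, the behavior `read_csv` already provides amounts to transparently layering a decompressor under the parser. A stdlib-only sketch of that idea, using the `csv` module in place of pandas' parser:

```python
import csv
import gzip
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "table.csv.gz")

# Write a small gzipped CSV in text mode.
with gzip.open(path, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["a", "b"], ["1", "2"], ["3", "4"]])

# Reading back: the parser never knows the bytes were compressed.
with gzip.open(path, "rt", newline="") as f:
    rows = list(csv.reader(f))
```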

@jreback
Contributor

jreback commented Mar 9, 2017

ahh I see, read_json seems a great candidate here.

@gfairchild do you want to open a new issue (xref this one and the PR) for read_json (and read_excel if you think it's useful)?

@gfairchild
Author

After considering this more closely, read_json is the only remaining format where this would clearly be useful; I've definitely stored large static gzipped JSON files before.
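A stdlib sketch of that gzipped-JSON workflow, i.e. what a compression option on `read_json` would automate:

```python
import gzip
import json
import os
import tempfile

records = [{"id": i, "name": f"item-{i}"} for i in range(100)]

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "records.json.gz")

# Dump the records as gzipped JSON in text mode.
with gzip.open(path, "wt") as f:
    json.dump(records, f)

# Read them back the same way.
with gzip.open(path, "rt") as f:
    loaded = json.load(f)
```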

@gfairchild
Author

Since this issue does say "perhaps other read_* methods", could we just re-open this issue? If that's not kosher, I'm happy to create another issue for this in a few hours once I get out of some meetings.

@jreback
Contributor

jreback commented Mar 9, 2017

yeah let's just create a new issue

@gfairchild
Author

Not a problem. I'll do that in a few hours.

@gfairchild
Author

Sorry for the delay, but I just created the issue: #15644

@goldenbull
Contributor

@jreback I think read_excel is not necessary because .xlsx is already a zip file by design.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#11666

Author: goldenbull <goldenbull@gmail.com>
Author: Chen Jinniu <goldenbull@users.noreply.github.com>

Closes pandas-dev#13317 from goldenbull/pickle_io_compression and squashes the following commits:

e9c5fd2 [goldenbull] docs update
d50e430 [goldenbull] update docs. re-write all tests to avoid round-trip read/write comparison.
86afd25 [goldenbull] change test to new pytest parameterized style
945e7bb [goldenbull] Merge remote-tracking branch 'origin/master' into pickle_io_compression
ccbeaa9 [goldenbull] move pickle compression tests into a new class
9a07250 [goldenbull] Remove prepared compressed data. _get_handle will take care of compressed I/O
1cb810b [goldenbull] add zip decompression support. refactor using lambda.
b8c4175 [goldenbull] add compressed pickle data file to io/tests
6df6611 [goldenbull] pickle compression code update
81d55a0 [Chen Jinniu] Merge branch 'master' into pickle_io_compression
025a0cd [goldenbull] add compression support for pickle
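With the PR above merged, `to_pickle()`/`read_pickle()` gained a `compression` keyword (pandas 0.20+), where `'infer'` picks the codec from the file suffix. A minimal usage sketch, assuming a pandas version with this feature is installed:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "frame.pkl.gz")

# compression='infer' (the default) selects gzip from the .gz suffix.
df.to_pickle(path, compression="infer")
roundtrip = pd.read_pickle(path)
```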