DataFrames & Zarr #31

Open
tbenst opened this issue Apr 6, 2020 · 1 comment
tbenst commented Apr 6, 2020

Hi, I recently learned about Zarr and am very interested in it, as it seems to solve some issues I have with HDF5.

Increasingly, DataFrames, as popularized by R and now widely used in Python (pandas) and Julia, are a critical structure in data science. I understand that Xarray has a Zarr backend option, but it's not clear to me whether this would support interoperability with other languages the way, say, Parquet does.

I'm curious what the current state of affairs is for Zarr & DataFrames, and what the plans are for the future.

Thank you for the hard work!

@alimanfoo
Member

Hi @tbenst, thanks for asking, it's an interesting question.

I think the short answer is that Parquet provides a good solution for dataframe storage and has good library support and community momentum, so it is currently the best option for dataframe storage for use with distributed & parallel computing.

That said, back in 2016 (how time flies!) @jreback did some work exploring zarr for dataframe storage. The PR, with lots of relevant discussion in the comment thread, is here: zarr-developers/zarr-python#84

FWIW I think zarr could be used for columnar dataframe storage, and there are some interesting differences with parquet that haven't been fully explored yet. If someone in the community is interested in working in that direction, we'd be interested in any thoughts or experience, particularly if they might influence any choices we might make in designing the v3 core protocol spec.

But the main focus of the core development team is on N-dimensional arrays, so we are unlikely to have the capacity to do development work in that direction ourselves.

Just my 2c, I'm not a dataframes expert so very interested to hear other views.
