Quilt: Reproducible Data Dependencies for Python #162

Closed
rabernat opened this issue Mar 15, 2018 · 10 comments

@rabernat
Member

I just stumbled across this post on the Jupyter blog:
https://blog.jupyter.org/reproducible-data-dependencies-for-python-guest-post-d0f68293a99

The Quilt project seems to be aimed at solving many of the data discovery problems we have been discussing:

> Quilt hides network, files, and storage behind a data package abstraction so that anyone can create durable, reproducible data dependencies for notebooks.

https://quiltdata.com/

It's a commercial product, but they have open sourced the building blocks.

> By default Quilt stores data in the registry at quiltdata.com. Alternatively, you can host your own registry by running the open source containers, then using quilt config to point clients to your private registry.

They seem focused on tabular-style data, but it's nevertheless probably worth looking into.

@shoyer

shoyer commented Mar 15, 2018

I've chatted a bit with one of the Quilt founders.

I suspect they would be happy to add netCDF/xarray support to the open source client if there's demand for it (especially if they get a pull request!).

@akarve

akarve commented May 18, 2018

I'm a developer on Quilt. Happy to collaborate.

Unstructured and semistructured data are also supported by Quilt (e.g. large image corpuses, geojson, etc.).

In next week's release of the quilt module we'll provide a callback that lets users supply their own data readers (a.k.a. "deserializers"), so that packages containing xarray-format data (which can already be stored in Quilt) can be read with custom libraries:

import some.geo.reader as func

from quilt.data.pangeo import ourdata

x = ourdata.foo.bar(asa=func)

We're also working on community-powered specs for reproducible data. If you'd like to be included in the discussion let me know.

@jgerardsimcock

@akarve Can we already build netcdf-based packages in Quilt? If so, can you point to an example or documentation on how to do so?

@akarve

akarve commented Jul 18, 2018

In theory, yes. You can put any bits in and then use the asa= callback to call a custom deserializer. We're going to make deserialization even easier in the future. It seems people are already using Quilt + netcdf.

In practice, if you provide me with an example of data-roundtrip that you'd like to accomplish with Quilt using netcdf data, then I can try that for you and see how it might be improved.

@akarve

akarve commented Jul 18, 2018

To give you a concrete example: supposing netcdf() is a deserializer, you could do something like this:

from quilt.data.foo import bar
bar.baz(asa=netcdf())
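
For readers unfamiliar with the pattern, here is a rough, self-contained sketch of how an `asa=`-style callback can hand deserialization over to user code. This is not Quilt's implementation: `ToyNode`, `read_geojson`, and `demo` are hypothetical names, and the assumption that the callback receives a single file path is ours (the real Quilt callback signature may differ). GeoJSON stands in for netCDF only so the example runs with the standard library.

```python
import json
import os
import tempfile


class ToyNode:
    """Hypothetical stand-in for a package node; NOT Quilt's real class."""

    def __init__(self, path):
        self._path = path

    def __call__(self, asa=None):
        # Without a callback, return the raw bytes stored for this node.
        if asa is None:
            with open(self._path, "rb") as f:
                return f.read()
        # With a callback, delegate deserialization to the caller's function.
        # Assumption: the callback takes one file path and returns an object.
        return asa(self._path)


def read_geojson(path):
    """Example deserializer: parse a GeoJSON file into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def demo():
    """Write a small GeoJSON file, then read it raw and via the callback."""
    feature = {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [0.0, 1.0]},
        "properties": {},
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "point.geojson")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(feature, f)
        node = ToyNode(path)
        raw = node()                      # raw bytes, no deserializer
        parsed = node(asa=read_geojson)   # caller-supplied deserializer
    return raw, parsed
```

A netCDF deserializer would presumably follow the same shape, with the callback opening the file through a library such as xarray instead of `json`.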

@shoyer

shoyer commented Jul 20, 2018

> In practice, if you provide me with an example of data-roundtrip that you'd like to accomplish with Quilt using netcdf data, then I can try that for you and see how it might be improved.

We would typically do this with xarray, e.g.,

import xarray

ds = xarray.open_dataset(path)  # netCDF file -> xarray.Dataset
ds.to_netcdf(path)  # xarray.Dataset -> netCDF file

@akarve

akarve commented Jul 20, 2018

Thanks. I can confirm that it is possible to round-trip xarray Datasets to and from a Quilt package. Here's a notebook.

It takes more lines of code than I'd like to complete this round trip, but future versions of Quilt will get this close to two lines of code.

@martindurant
Contributor

Intake is now officially being circulated: https://www.anaconda.com/blog/developer-blog/intake-taking-the-pain-out-of-data-access/
Intake-xarray almost supports streaming from an Intake server :)

@stale

stale bot commented Oct 1, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 1, 2018
@stale

stale bot commented Oct 8, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

@stale stale bot closed this as completed Oct 8, 2018