Quilt: Reproducible Data Dependencies for Python #162

rabernat · 2018-03-15T14:42:53Z

I just stumbled across this post on the Jupyter blog:
https://blog.jupyter.org/reproducible-data-dependencies-for-python-guest-post-d0f68293a99

The quilt project seems to be aimed at solving many of the problems related to data discovery we have been discussing:

"Quilt hides network, files, and storage behind a data package abstraction so that anyone can create durable, reproducible data dependencies for notebooks."

https://quiltdata.com/

It's a commercial product, but they have open sourced the building blocks.

By default Quilt stores data in the registry at quiltdata.com. Alternatively, you can host your own registry by running the open source containers, then using quilt config to point clients to your private registry.

They seem focused on tabular-style data, but it's nevertheless probably worth looking into.

shoyer · 2018-03-15T17:09:57Z

I've chatted a bit with one of the Quilt founders.

I suspect they would be happy to add netCDF/xarray support to the open source client if there's demand for it (especially if they get a pull request!).

akarve · 2018-05-18T21:02:26Z

I'm a developer on Quilt. Happy to collaborate.

Unstructured and semistructured data are also supported by Quilt (e.g. large image corpuses, geojson, etc.).

In next week's quilt module we'll provide a callback that allows users to provide their own data readers (a.k.a. "deserializers"), so that packages with the xarray format (which can already be stored in Quilt) can be read out with custom libraries:

import some.geo.reader as func  # placeholder for any custom reader
from quilt.data.pangeo import ourdata

x = ourdata.foo.bar(asa=func)  # deserialize the node with the custom reader

We're also working on community-powered specs for reproducible data. If you'd like to be included in the discussion let me know.

jgerardsimcock · 2018-05-25T20:54:11Z

@akarve Can we already build netcdf-based packages in Quilt? If so, can you point to an example or documentation on how to do so?

akarve · 2018-07-18T19:53:57Z

In theory, yes. You can put any bits in and then use the asa= callback to call a custom deserializer. We're going to make deserialization even easier in the future. It seems people are already using Quilt + netcdf.

In practice, if you provide me with an example of data-roundtrip that you'd like to accomplish with Quilt using netcdf data, then I can try that for you and see how it might be improved.

akarve · 2018-07-18T19:58:55Z

To give you a concrete example: if netcdf() is a deserializer, you could do something like this:

from quilt.data.foo import bar
bar.baz(asa=netcdf())
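
For readers unfamiliar with the pattern, here is a minimal sketch of what a deserializer factory such as netcdf() might look like. It assumes the asa= callback is invoked with the package node and a list of local file paths; that signature and the helper name make_deserializer are assumptions for illustration, not confirmed Quilt API. xarray is imported lazily, so the sketch loads without it installed.

```python
from typing import Any, Callable, List


def make_deserializer(open_file: Callable[[str], Any]) -> Callable[[Any, List[str]], Any]:
    """Build a callback that opens every stored file path with `open_file`.

    ASSUMPTION: the callback is called as callback(node, paths), where
    `paths` lists the local files backing the package node.
    """
    def _deserialize(node: Any, paths: List[str]) -> Any:
        loaded = [open_file(p) for p in paths]
        # Return a single object for single-file nodes, a list otherwise.
        return loaded[0] if len(loaded) == 1 else loaded

    return _deserialize


def netcdf() -> Callable[[Any, List[str]], Any]:
    """Hypothetical netCDF deserializer backed by xarray.open_dataset."""
    import xarray  # imported lazily; only needed when netcdf() is called

    return make_deserializer(xarray.open_dataset)
```

Under those assumptions, bar.baz(asa=netcdf()) would hand back an xarray.Dataset for a node that stores a single netCDF file.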

shoyer · 2018-07-20T00:42:47Z

"In practice, if you provide me with an example of data-roundtrip that you'd like to accomplish with Quilt using netcdf data, then I can try that for you and see how it might be improved."

We would typically do this with xarray, e.g.,

import xarray

ds = xarray.open_dataset(path)  # netCDF file -> xarray.Dataset
ds.to_netcdf(path)  # xarray.Dataset -> netCDF file

akarve · 2018-07-20T02:55:48Z

Thanks. I can confirm that it is possible to round-trip xarray Datasets to and from a Quilt package. Here's a notebook.

It takes more lines of code than I'd like to complete this round trip, but future versions of Quilt will get this close to two lines of code.

martindurant · 2018-08-02T20:27:01Z

Intake is now officially being circulated: https://www.anaconda.com/blog/developer-blog/intake-taking-the-pain-out-of-data-access/
Intake-xarray almost supports streaming from an Intake server :)

stale · 2018-10-01T20:46:21Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2018-10-08T20:52:28Z

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

rabernat mentioned this issue Mar 28, 2018

we need a data catalog #39

Closed

jacobtomlinson added the data access label Apr 26, 2018

stale bot added the stale label Oct 1, 2018

stale bot closed this as completed Oct 8, 2018
