
Example of using intake in a notebook #231

Merged
rabernat merged 7 commits into pangeo-data:master from the intake branch on May 17, 2018

Conversation

martindurant
Contributor

Ref: #39

Here is an example of how this would work. I have only edited the xarray-data notebook. Essentially, all of the options that were in the original open function calls go into arguments in the catalogue file. We use the same conventions for URLs as dask main, and (read-only) mappers are created automatically. This latter step is probably something that should eventually be made generic for all file-systems.
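
Concretely, the edited notebook cell goes from an explicit gcsfs/open_zarr call to something like the following sketch (the bucket path and entry name mirror the catalogue used in this PR; treat them as illustrative):

# Sketch of the intake-based cell. The catalogue URL uses dask's URL
# conventions ('gcs://...'), and the read-only GCS mapper is created
# automatically behind the scenes.
import intake

cat = intake.Catalog('gcs://mdtemp/pangeo.yaml')  # catalogue file on GCS
ds = cat.sea_surface.to_dask()                    # xarray Dataset backed by dask arrays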

@martindurant changed the title from "Add example catalog and change one notebook" to "Example of using intake in a notebook" on May 1, 2018
@jhamman
Member

jhamman commented May 3, 2018

@martindurant - thanks for putting this together. A few thoughts for you and @rabernat and @mrocklin:

  1. I think the abstraction away from the gcsfs / open_zarr piece is really nice here. Do we, however, run the risk of sweeping some important details under the rug? The process of connecting to these data sources is quite new, and it may be instructive for some users to see both the intake and gcsfs/brute-force methods side by side (a rough sketch of the direct route follows this list).
  2. There are edits to the docker files here. This will require redeploying the cluster for these changes to take effect. Do we have any hesitations about adding intake as a dependency? (I don't.)
  3. How do we make sure everyone gets access to this YAML file? Can we host this somewhere on the web? I guess right now we're copying this over through the docker image but it may be better (long term) to have it stored elsewhere.
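
For comparison (a sketch only, with a placeholder bucket path and assuming anonymous read access), the direct gcsfs/open_zarr route that intake abstracts away looks roughly like this:

# Rough sketch of the 'brute-force' route, for side-by-side comparison
# with the intake version above. 'some-bucket/some-dataset' is a placeholder.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token='anon')            # anonymous, read-only access
mapper = gcsfs.mapping.GCSMap('some-bucket/some-dataset', gcs=fs, check=False)
ds = xr.open_zarr(mapper)                         # xarray Dataset backed by dask arrays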

@mrocklin
Member

mrocklin commented May 3, 2018

There are edits to the docker files here. This will require redeploying the cluster for these changes to take effect. Do we have any hesitations about adding intake as a dependency? (I don't.)

Short term presumably people can pip install intake in their local environments to try things out?

@mrocklin
Member

mrocklin commented May 3, 2018

I agree that this does look pretty nice :)

@mrocklin
Member

mrocklin commented May 3, 2018

I guess right now we're copying this over through the docker image

This is typically how we control things that we control. It's not a bad medium-term solution.

but it may be better (long term) to have it stored elsewhere

Agreed. Can intake point to a file on the web?

@martindurant
Contributor Author

Agreed. Can intake point to a file on the web?

You mean a remote cat file?
Not at the moment. There is a convention that http(s) means connecting to an intake server, but I see no reason why we shouldn't name our own protocol for doing that (intake://). cc @seibert

@mrocklin
Member

mrocklin commented May 3, 2018

Additionally, we might want to control that for the user as an environment variable

export INTAKE_CATALOG=/path/to/default/catalog

This way they just have to do something like the following:

import intake
catalog = intake.Catalog()
catalog.my_dataset

Honestly, it wouldn't be unwelcome to skip the middle line

import intake  # looks for environment variable, auto-populates global catalog
intake.my_dataset

That might be too much though...

@martindurant
Contributor Author

We do have the concept of a built-in catalogue and of conda packages that populate it, so you can conda-install and have the entries available under intake.cat; these packages may include data or download their own data or point to remote data. That would be another way to distribute a pangeo cat.
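
For illustration, a conda-installed data package would then surface like this (sketch only; 'my_pangeo_dataset' is a placeholder entry name, not something that exists today):

# Assumes a data package has been conda-installed that registers entries
# in intake's builtin catalogue, intake.cat.
import intake

print(list(intake.cat))                      # entry names contributed by installed packages
ds = intake.cat.my_pangeo_dataset.to_dask()  # lazily load a (hypothetical) entry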

@martindurant
Contributor Author

Of course, intake is just reaching release now, and we are working on the documentation to make these points clear.

@mrocklin
Member

mrocklin commented May 3, 2018 via email

@martindurant
Contributor Author

martindurant commented May 6, 2018

intake/intake#87
Work to allow reading cat files from remote sources.

@jacobtomlinson
Member

@niallrobinson

This shows that the data source listing can be remote too, using the
`master` version of Intake.

Now the Intake parts are given as comments in the notebooks, so that
users can try it without forcing a dependency on Intake for anyone.

Notes
- intake is not included in the installation, so we need either instructions
  or to add it to the environment
- the cat file is hosted on my bucket, but publicly accessible. If the idea
  seems like a good one, it would presumably move to pangeo's bucket.
"### Alternate method to load the data using Intake\n",
"### Where the remote YAML spec points to the data on GCS\n",
"# import intake\n",
"# ds = intake.Catalog('gcs://mdtemp/pangeo.yaml').sea_surface.to_dask()\n",
Member

Thanks for doing this.

Thoughts on using a file in this github repository instead? Perhaps in pangeo/gce/catalog.yaml? I suspect that more people within the community are familiar with editing github repositories than modifying files on GCS.

Member

Agreed. I would prefer to have the catalog live in a repo. That way we will have a clear process for updating it.

If you like, I can make a new repo for that purpose right now under the pangeo-data org.

Contributor Author

That would work fine as well. I don't know where the right place for this would be, so happy to use whatever is convenient.

@mrocklin
Member

mrocklin commented May 10, 2018 via email

@martindurant
Contributor Author

For the time being, the catalog file is checked in here too, so it can just be a local file in the notebook or point to the remote github location, which would then update whenever the repo was updated.

@mrocklin
Member

mrocklin commented May 10, 2018 via email

@martindurant
Contributor Author

Hah, it turns out that githubusercontent supplies incorrect HTTP header information: the Content-Length is short, and this truncates the catalogue text. I'll put a fix into dask, since that information should not be required in a case like this, where we read the whole file start-to-end.

@martindurant
Contributor Author

With dask/dask#3496 merged, I can now point to the repo (note that this is not yet released in dask).
Of course, using the master-branch location, it won't actually be usable until this PR is merged. I'll update the notebooks nevertheless. To read the catalogue and use it, you would need intake installed, and to actually load anything, intake-xarray. Both are on the intake conda channel. The only additional dependency is requests, which we probably already depend on.
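
Spelled out, trying it would look roughly like this (a sketch only; the install lines are given as comments, and the URL matches the diff below):

# Assumes intake and intake-xarray are installed, e.g.
#   conda install -c intake intake intake-xarray
# and dask >= 0.17.5, which contains the HTTP fix from dask/dask#3496.
import intake

url = ('https://raw.githubusercontent.com/pangeo-data/pangeo/'
       'master/gce/notebook/examples/catalog.yaml')
cat = intake.Catalog(url)       # catalogue fetched over HTTP (via requests)
ds = cat.sea_surface.to_dask()  # lazily load the dataset as xarray + dask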

@rabernat
Member

rabernat commented May 16, 2018

Are we ready to merge this? Has someone tried building the docker image with these changes and confirmed it works?

@martindurant
Contributor Author

At the moment, there are no instructions to install intake/intake-xarray - I am not sure where they would go. Also, the URL is still directly to GCS, not to github as mrocklin suggested, but of course the catalogue isn't in the master branch yet, so that would not work until after merging.

@mrocklin
Member

My understanding is that this would work if we do the following:

  1. Install intake in docker image
  2. Update dask to master in docker image
  3. Change links to point to the github web address

@rabernat
Member

Thanks for clearing that up @mrocklin. Let's add those changes to this PR and then merge it so we can give intake a try!

@rabernat
Member

FYI, I hope to rebuild the docker images asap (ideally later today) in order to get the new xarray release. It would be great to merge this PR and get intake into that image.

@martindurant
Contributor Author

martindurant commented May 17, 2018

@rabernat , so I should push to change the cat location to where the file will end up on github?

@rabernat
Member

Yes. I think the three steps outlined by @mrocklin are what is needed here. That includes updating the cat location.

Once this is live, I imagine that we will update the cat file to include all the datasets we have uploaded so far (via future PRs).

Note: requires dask/dask#3496 (v0.17.5)
in order for HTTP reads not to truncate.
@@ -60,7 +60,7 @@
"### Alternate method to load the data using Intake\n",
"### Where the remote YAML spec points to the data on GCS\n",
"# import intake\n",
"# ds = intake.Catalog('gcs://mdtemp/pangeo.yaml').sea_surface.to_dask()\n",
"# ds = intake.Catalog('https://raw.githubusercontent.com/pangeo-data/pangeo/master/gce/notebook/examples/catalog.yaml').sea_surface.to_dask()\n",
Member

I recommend that we place this here instead:

https://raw.githubusercontent.com/pangeo-data/pangeo/master/gce/catalog.yaml

Martin Durant added 3 commits May 17, 2018 11:50
(was already added in notebook dockerfile)
(dask 0.17.5 now released)
@rabernat
Member

Since I see dask and bokeh being tweaked, is there any chance this will fix #236 (which is still live on the cluster)?

@martindurant
Contributor Author

I'm afraid I don't know the status of that error.

@rabernat
Member

Ok I'm going to merge this and then start building new docker images.

@rabernat merged commit 71055c4 into pangeo-data:master on May 17, 2018
@martindurant deleted the intake branch on May 17, 2018 at 17:41
@rabernat
Member

Oops, I missed this before merging:

Bokeh 0.12.16 is out. If I update to that, will it cause any conflicts with dask? (Our current version is 0.12.15dev1.)

@martindurant
Contributor Author

https://travis-ci.org/dask/distributed/jobs/379395126#L651 seems to have passed tests with that bokeh. Not quite the same as running in production.

@martindurant
Contributor Author

Latest bokeh also works for me locally

@rabernat mentioned this pull request on May 20, 2018