
Example of using intake in a notebook #231

Merged
rabernat merged 7 commits into pangeo-data:master from the intake branch on May 17, 2018

Conversation

martindurant
Contributor

Ref: #39

Here is an example of how this would work. I have only edited the xarray-data notebook. Essentially, all of the options that were in the original open function calls go into arguments in the catalogue file. We use the same conventions for URLs as dask main, and (read-only) mappers are created automatically. This latter step is probably something that should eventually be made generic for all file-systems.
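
Concretely, the edited notebook cell goes from an explicit gcsfs/open_zarr call to something like the following sketch (the bucket path and entry name mirror the catalogue used in this PR; treat them as illustrative):

# Sketch of the intake-based cell. The catalogue URL uses dask's URL
# conventions ('gcs://...'), and the read-only GCS mapper is created
# automatically behind the scenes.
import intake

cat = intake.Catalog('gcs://mdtemp/pangeo.yaml')  # catalogue file on GCS
ds = cat.sea_surface.to_dask()                    # xarray Dataset backed by dask arrays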

@martindurant changed the title from "Add example catalog and change one notebook" to "Example of using intake in a notebook" on May 1, 2018
@jhamman
Member

jhamman commented May 3, 2018

@martindurant - thanks for putting this together. A few thoughts for you and @rabernat and @mrocklin:

  1. I think the abstraction away from the gcsfs / open_zarr piece is really nice here. Do we, however, run the risk of sweeping some important details under the rug? The process of connecting to these data sources is quite new, and it may be instructive for some users to see both the intake and gcsfs/brute-force methods side by side (a rough sketch of the direct route follows this list).
  2. There are edits to the docker files here. This will require redeploying the cluster for these changes to take effect. Do we have any hesitations about adding intake as a dependency? (I don't.)
  3. How do we make sure everyone gets access to this YAML file? Can we host this somewhere on the web? I guess right now we're copying this over through the docker image but it may be better (long term) to have it stored elsewhere.
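
For comparison (a sketch only, with a placeholder bucket path and assuming anonymous read access), the direct gcsfs/open_zarr route that intake abstracts away looks roughly like this:

# Rough sketch of the 'brute-force' route, for side-by-side comparison
# with the intake version above. 'some-bucket/some-dataset' is a placeholder.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token='anon')            # anonymous, read-only access
mapper = gcsfs.mapping.GCSMap('some-bucket/some-dataset', gcs=fs, check=False)
ds = xr.open_zarr(mapper)                         # xarray Dataset backed by dask arrays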

@mrocklin
Member

mrocklin commented May 3, 2018

There are edits to the docker files here. This will require redeploying the cluster for these changes to take effect. Do we have any hesitations about adding intake as a dependency? (I don't.)

Short term presumably people can pip install intake in their local environments to try things out?

@mrocklin
Member

mrocklin commented May 3, 2018

I agree that this does look pretty nice :)

@mrocklin
Member

mrocklin commented May 3, 2018

I guess right now we're copying this over through the docker image

This is typically how we control things that we control. It's not a bad medium-term solution.

but it may be better (long term) to have it stored elsewhere

Agreed. Can intake point to a file on the web?

@martindurant
Contributor Author

Agreed. Can intake point to a file on the web?

You mean a remote cat file?
Not at the moment. There is a convention that http(s) means connecting to an intake server, but I see no reason why we shouldn't name our own protocol for doing that (intake://). cc @seibert

@mrocklin
Member

mrocklin commented May 3, 2018

Additionally, we might want to control that for the user as an environment variable

export INTAKE_CATALOG=/path/to/default/catalog

This way they just have to do something like the following:

import intake
catalog = intake.Catalog()
catalog.my_dataset

Honestly, it wouldn't be unwelcome to skip the middle line

import intake  # looks for environment variable, auto-populates global catalog
intake.my_dataset

That might be too much though...

@martindurant
Contributor Author

We do have the concept of a built-in catalogue and of conda packages that populate it, so you can conda-install and have the entries available under intake.cat; these packages may include data or download their own data or point to remote data. That would be another way to distribute a pangeo cat.
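
For illustration, a conda-installed data package would then surface like this (sketch only; 'my_pangeo_dataset' is a placeholder entry name, not something that exists today):

# Assumes a data package has been conda-installed that registers entries
# in intake's builtin catalogue, intake.cat.
import intake

print(list(intake.cat))                      # entry names contributed by installed packages
ds = intake.cat.my_pangeo_dataset.to_dask()  # lazily load a (hypothetical) entry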

@martindurant
Contributor Author

Of course, intake is just reaching release now, and we are working on the documentation to make these points clear.

@mrocklin
Member

mrocklin commented May 3, 2018 via email

@martindurant
Contributor Author

martindurant commented May 6, 2018

intake/intake#87
Work to allow reading cat files from remote sources.

@jacobtomlinson
Member

@niallrobinson

This shows that the data source listing can be remote too, using the
`master` version of Intake.

Now the Intake parts are given as comments in the notebooks, so that
users can try it without forcing a dependency on Intake for anyone.

Notes
- intake is not included in the installation, so we need either instructions
  or to add it to the environment
- the cat file is hosted on my bucket, but publicly accessible. If the idea
  seems like a good one, it would presumably move to pangeo's bucket.
"### Alternate method to load the data using Intake\n",
"### Where the remote YAML spec points to the data on GCS\n",
"# import intake\n",
"# ds = intake.Catalog('gcs://mdtemp/pangeo.yaml').sea_surface.to_dask()\n",
Member

Thanks for doing this.

Thoughts on using a file in this github repository instead? Perhaps in pangeo/gce/catalog.yaml? I suspect that more people within the community are familiar with editing github repositories than modifying files on GCS.

Member

Agreed. I would prefer to have the catalog live in a repo. That way we will have a clear process for updating it.

If you like, I can make a new repo for that purpose right now under the pangeo-data org.

Contributor Author

That would work fine as well. I don't know where the right place for this would be, so happy to use whatever is convenient.

@mrocklin
Member

mrocklin commented May 10, 2018 via email

@martindurant
Contributor Author

For the time being, the catalog file is checked in here too, so it can just be a local file in the notebook or point to the remote github location, which would then update whenever the repo was updated.

@mrocklin
Member

mrocklin commented May 10, 2018 via email

@martindurant
Contributor Author

Hah, it turns out that githubusercontent supplies incorrect HTTP header information: the Content-Length is short, and this truncates the catalogue text. I'll put a fix into dask, since that information should not be required in a case like this, where we read the whole file start-to-end.

@martindurant
Contributor Author

With dask/dask#3496 merged, I can now point to the repo (note that this is not yet released in dask).
Of course, using the master-branch location, it won't actually be usable until this PR is merged. I'll update the notebooks nevertheless. To read the catalogue and use it, you would need intake installed, and to actually load anything, intake-xarray. Both are on the intake conda channel. The only additional dependency is requests, which we probably already depend on.
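
Spelled out, trying it would look roughly like this (a sketch only; the install lines are given as comments, and the URL matches the diff below):

# Assumes intake and intake-xarray are installed, e.g.
#   conda install -c intake intake intake-xarray
# and dask >= 0.17.5, which contains the HTTP fix from dask/dask#3496.
import intake

url = ('https://raw.githubusercontent.com/pangeo-data/pangeo/'
       'master/gce/notebook/examples/catalog.yaml')
cat = intake.Catalog(url)       # catalogue fetched over HTTP (via requests)
ds = cat.sea_surface.to_dask()  # lazily load the dataset as xarray + dask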

@rabernat
Member

rabernat commented May 16, 2018

Are we ready to merge this? Has someone tried building the docker image with these changes and confirmed it works?

@martindurant
Contributor Author

At the moment, there are no instructions to install intake/intake-xarray - I am not sure where they would go. Also, the URL is still directly to GCS, not to github as mrocklin suggested, but of course the catalogue isn't in the master branch yet, so that would not work until after merging.

@mrocklin
Member

My understanding is that this would work if we do the following:

  1. Install intake in docker image
  2. Update dask to master in docker image
  3. Change links to point to the github web address

@rabernat
Member

Thanks for clearing that up @mrocklin. Let's add those changes to this PR and then merge it so we can give intake a try!

@rabernat
Member

FYI, I hope to rebuild the docker images asap (ideally later today) in order to get the new xarray release. It would be great to merge this PR and get intake into that image.

@martindurant
Contributor Author

martindurant commented May 17, 2018

@rabernat , so I should push to change the cat location to where the file will end up on github?

@rabernat
Member

Yes. I think the three steps outlined by @mrocklin are what is needed here. That includes updating the cat location.

Once this is live, I imagine that we will update the cat file to include all the datasets we have uploaded so far (via future PRs).

Note: requires dask/dask#3496 (v0.17.5)
in order for HTTP reads not to truncate.
@@ -60,7 +60,7 @@
"### Alternate method to load the data using Intake\n",
"### Where the remote YAML spec points to the data on GCS\n",
"# import intake\n",
"# ds = intake.Catalog('gcs://mdtemp/pangeo.yaml').sea_surface.to_dask()\n",
"# ds = intake.Catalog('https://raw.githubusercontent.com/pangeo-data/pangeo/master/gce/notebook/examples/catalog.yaml').sea_surface.to_dask()\n",
Member

I recommend that we place this here instead:

https://raw.githubusercontent.com/pangeo-data/pangeo/master/gce/catalog.yaml

Martin Durant added 3 commits May 17, 2018 11:50
(was already added in notebook dockerfile)
(dask 0.17.5 now released)
@rabernat
Member

Since I see dask and bokeh being tweaked, is there any chance this will fix #236 (which is still live on the cluster)?

@martindurant
Contributor Author

I'm afraid I don't know the status of that error.

@rabernat
Member

Ok I'm going to merge this and then start building new docker images.

@rabernat merged commit 71055c4 into pangeo-data:master on May 17, 2018
@martindurant deleted the intake branch on May 17, 2018 at 17:41
@rabernat
Member

Oops, I missed this before merging:

Bokeh 0.12.16 is out. If I update to that, will it cause any conflicts with dask? (Our current version is 0.12.15dev1.)

@martindurant
Contributor Author

https://travis-ci.org/dask/distributed/jobs/379395126#L651 seems to have passed tests with that bokeh. Not quite the same as running in production.

@martindurant
Contributor Author

Latest bokeh also works for me locally

@rabernat mentioned this pull request on May 20, 2018