Example of using intake in a notebook #231
@martindurant - thanks for putting this together. A few thoughts for you and @rabernat and @mrocklin:
|
Short term presumably people can |
I agree that this does look pretty nice :) |
This is typically how we control things that we control. It's not a bad medium-term solution.
Agreed. Can intake point to a file on the web? |
You mean a remote cat file? |
Additionally, we might want to control that for the user as an environment variable
This way they just have to do something like the following:

    import intake
    catalog = intake.Catalog()
    catalog.my_dataset

Honestly, it wouldn't be unwelcome to skip the middle line:

    import intake  # looks for environment variable, auto-populates global catalog
    intake.my_dataset

That might be too much though... |
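A minimal sketch of the environment-variable idea, assuming a variable named `PANGEO_CATALOG` and a helper called `resolve_catalog_location`. Both names are hypothetical and chosen only for this illustration. The Intake call itself is left as a comment.

```python
import os

# Hypothetical variable name, chosen only for this sketch.
CATALOG_ENV_VAR = "PANGEO_CATALOG"


def resolve_catalog_location(default, environ=None):
    """Return the catalog location from the environment, else `default`."""
    environ = os.environ if environ is None else environ
    # Treat an unset or blank variable the same way: fall back to the default.
    location = environ.get(CATALOG_ENV_VAR, "").strip()
    return location or default


location = resolve_catalog_location("catalog.yaml", environ={})
# With Intake installed, the location would then be handed over, e.g.:
# catalog = intake.Catalog(location)
```

Passing `environ` explicitly keeps the lookup easy to test without touching the real process environment.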
We do have the concept of a built-in catalogue and of conda packages that populate it, so you can conda-install and have the entries available under |
Of course, intake is just reaching release now, and we are working on the documentation to make these points clear. |
Can we populate the built-in catalog from the remote web resource? My
guess is that it will be easier for us to publish a yaml file on some
github repository than it will be for us to publish conda packages
|
intake/intake#87 |
This shows that the data source listing can be remote too, using the `master` version of Intake. The Intake parts are now given as comments in the notebooks, so that users can try it without forcing an Intake dependency on anyone.
Notes
- intake is not included in the installation, so we need either instructions or to add it to the environment
- the cat file is hosted on my bucket, but is publicly accessible. If the idea seems like a good one, it would presumably move to pangeo's bucket.
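One way to keep Intake optional at run time, rather than commented out, is to check whether it is importable before choosing a code path. This is only a sketch under that assumption. The helper names `have_module` and `pick_loader` are made up for illustration and are not part of Intake or this PR.

```python
import importlib.util


def have_module(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def pick_loader():
    """Choose the Intake path only when Intake is actually installed."""
    return "intake" if have_module("intake") else "direct"


# A notebook cell could branch on this value and fall back to the
# original, direct loading code when Intake is missing.
loader = pick_loader()
```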
"### Alternate method to load the data using Intake\n",
"### Where the remote YAML spec points to the data on GCS\n",
"# import intake\n",
"# ds = intake.Catalog('gcs://mdtemp/pangeo.yaml').sea_surface.to_dask()\n",
Thanks for doing this.
Thoughts on using a file in this github repository instead? Perhaps in pangeo/gce/catalog.yaml? I suspect that more people within the community are familiar with editing github repositories than modifying files on GCS.
Agreed. I would prefer to have the catalog live in a repo. That way we will have a clear process for updating it.
If you like, I can make a new repo for that purpose right now under the pangeo-data org.
That would work fine as well. I don't know where the right place for this would be, so happy to use whatever is convenient.
I suggest that we keep it in this repo for a while and then spin out in the
future after things solidify
|
For the time being, the catalog file is checked in here too, so it can just be a local file in the notebook or point to the remote github location, which would then update whenever the repo was updated. |
I recommend that we check it in here, but point to the web address in the
notebook
https://raw.githubusercontent.com/pangeo-data/pangeo/master/gce/catalog.yaml
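For illustration, the raw address follows a fixed pattern of organisation, repository, branch and path, which is why a notebook pointing at it picks up whatever is currently on that branch. The helper name `raw_github_url` is invented for this sketch.

```python
def raw_github_url(org, repo, branch, path):
    """Build the raw-content URL for a file tracked in a GitHub repository."""
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        org, repo, branch, path.lstrip("/")
    )


url = raw_github_url("pangeo-data", "pangeo", "master", "gce/catalog.yaml")
# With Intake installed, the notebook line would then read something like:
# ds = intake.Catalog(url).sea_surface.to_dask()
```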
|
Hah, it turns out that githubusercontent supplies incorrect HTTP header information: the Content-Length is short, and this truncates the catalogue text. I'll put a fix into dask, since that information should not be required in a case like this, where we are reading the whole file from start to end. |
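The difference between trusting the advertised length and reading to end-of-stream can be sketched with an in-memory stream. This is not the actual dask change, only an illustration of the failure mode. The function names and the simulated short length are assumptions.

```python
import io


def read_trusting_header(stream, content_length):
    """Read exactly the advertised number of bytes (truncates if it is short)."""
    return stream.read(content_length)


def read_to_eof(stream, chunk_size=8192):
    """Ignore the advertised length and read chunks until the stream is exhausted."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        chunks.append(chunk)
    return b"".join(chunks)


body = b"catalogue text " * 10
advertised = len(body) - 20  # simulate a Content-Length that is too short

truncated = read_trusting_header(io.BytesIO(body), advertised)
complete = read_to_eof(io.BytesIO(body), chunk_size=16)
```

Here `truncated` is missing the last 20 bytes, while `complete` matches the full body regardless of what the header claimed.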
With that dask/3496 merged, I can now point to the repo (note that this is not yet released in dask). |
Are we ready to merge this? Has someone tried building the docker image with these changes and confirmed it works? |
At the moment, there are no instructions to install intake/intake-xarray - I am not sure where they would go. Also, the URL is still directly to GCS, not to github as mrocklin suggested, but of course the catalogue isn't in the master branch yet, so that would not work until after merging. |
My understanding is that this would work if we do the following:
|
Thanks for clearing that up @mrocklin. Let's add those changes to this PR and then merge it so we can give intake a try! |
FYI, I hope to rebuild the docker images asap (ideally later today) in order to get the new xarray release. It would be great to merge this PR and get intake into that image. |
@rabernat , so I should push to change the cat location to where the file will end up on github? |
Yes. I think the three steps outlined by @mrocklin are what is needed here. That includes updating the cat location. Once this is live, I imagine that we will update the cat file to include all the datasets we have uploaded so far (via future PRs). |
Note, requires dask dask/dask#3496 (v0.17.5) in order for HTTP read not to truncate.
@@ -60,7 +60,7 @@
 "### Alternate method to load the data using Intake\n",
 "### Where the remote YAML spec points to the data on GCS\n",
 "# import intake\n",
-"# ds = intake.Catalog('gcs://mdtemp/pangeo.yaml').sea_surface.to_dask()\n",
+"# ds = intake.Catalog('https://raw.githubusercontent.com/pangeo-data/pangeo/master/gce/notebook/examples/catalog.yaml').sea_surface.to_dask()\n",
I recommend that we place this here instead:
https://raw.githubusercontent.com/pangeo-data/pangeo/master/gce/catalog.yaml
(was already added in notebook dockerfile)
(dask 0.17.5 now released)
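Since the HTTP fix only arrives with dask 0.17.5, a notebook could guard the Intake cell with a version check. The sketch below is a simplified comparison written for this thread, not a full version parser. It ignores pre-release ordering, and the helper names are hypothetical.

```python
import re


def version_tuple(version):
    """Turn '0.17.5' or '0.12.15dev1' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(installed, required):
    """Return True if `installed` is the same as or newer than `required`."""
    return version_tuple(installed) >= version_tuple(required)


# With dask installed, the check would look something like:
# import dask
# ok = at_least(dask.__version__, "0.17.5")
ok = at_least("0.17.5", "0.17.5")
```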
Since I see dask and bokeh being tweaked, is there any chance this will fix #236 (which is still live on the cluster)? |
I'm afraid I don't know the status of that error. |
Ok I'm going to merge this and then start building new docker images. |
Oops, I missed this before merging: Bokeh 0.12.16 is out. If I update to that, will it cause any conflicts with dask? (Our current version is 0.12.15dev1.) |
https://travis-ci.org/dask/distributed/jobs/379395126#L651 seems to have passed tests with that bokeh. Not quite the same as running in production. |
Latest bokeh also works for me locally |
Ref: #39
Here is an example of how this would work. I have only edited the xarray-data notebook. Essentially, all of the options that were in the original open function calls go into arguments in the catalogue file. We use the same conventions for URLs as dask main, and (read-only) mappers are created automatically. This latter step is probably something that should eventually be made generic for all file-systems.
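The idea of moving call options into the catalogue can be sketched with plain dictionaries. This is not Intake's implementation or its catalogue schema. The keys `driver` and `args`, the `fake_open` stand-in, and the bucket path are all assumptions made for illustration.

```python
def open_from_entry(entry, openers):
    """Look up the opener named by the entry and call it with the stored arguments."""
    opener = openers[entry["driver"]]
    return opener(**entry["args"])


def fake_open(urlpath, **options):
    """Stand-in for a real open function; it just records what it was given."""
    return {"urlpath": urlpath, "options": options}


# Options that would otherwise be written inline in the notebook call
# live in the entry, so the notebook only needs the entry's name.
entry = {
    "driver": "example",
    "args": {"urlpath": "gcs://bucket/dataset", "chunks": {"time": 1}},
}

result = open_from_entry(entry, {"example": fake_open})
```

The notebook code stays the same when the storage location or options change, because only the entry is edited.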