Some steps to explore supporting a "default environment" #1474
Comments
would love some thoughts from y'all 👍 |
It's great and necessary to be able to support arbitrary environments. But I believe that the scipy ecosystem would be well served by standardizing around a smaller set of common environments, maintained and updated via CI, and released on roughly a monthly frequency. These could then be reused in cloud-based hubs, binders, and local environments. This is what we are haltingly moving towards in Pangeo. See for example,
|
Just a note that there's prior art here from the R community too: https://www.rocker-project.org/ runs "community images for the R community" that many sub-communities then build off of. In fact, the holepunch project basically replicates the functionality we're describing here (though in that case, specifically for the R community). Also, re: standardizing on a set of images, I agree w/ that - though I don't think the Binder team wants to be the ones in charge of that curation, just from a maintenance and organizational perspective |
I took a look at the jupyter-stacks, and they seem to have some pretty nice images across a few languages already: https://github.com/jupyter/docker-stacks I wonder what are the steps to make those images "binder ready". Maybe @minrk or @parente have ideas? |
Hi @choldgraf - apologies for not doing a search beforehand, so maybe this is already in a separate issue or has come up in the past. Is it possible for BinderHub to bypass the linked image registry and pull directly from DockerHub? I think this would be really useful for "default environments" for a couple reasons. For example, we want to use
Of course, there is an issue with using "latest" images for reproducibility. But, it is very convenient for specifying the most up-to-date image without constantly updating explicit tags. |
@scottyhq I think it would be better to start a forum thread for your question about images that inherit from I don't think this thread is a good place to discuss if/which/if not/where the data science community should maintain a unified image. That is such a big question with a lot of trade-offs, and even if we decided "yes let's do it" I'd vote for mybinder.org to wait on the order of 6 months before adopting it for the "default env" use case. This is because we want to follow, not lead. I think it would be a good topic to do some research into via a forum thread. What would be in such an image, who maintains it, has someone tried this before, is it already happening, etc.
https://mybinder.org/v2/gh/jupyter/docker-stacks/master?filepath=README.ipynb makes me think the answer is yes. This uses a My vote would be to use one of the docker stacks images. Ideally one with some "data science" stuff already installed, not one of the base ones. Concretely I'd vote for the |
Some background on docker-stacks: maintaining those stacks has proven to be a massive maintenance challenge and we haven't been able to keep up, in no small part because we have to frequently make arbitrary decisions about "does X package belong in Y stack?" and "should we support X use case?" Plus, the layering and inheritance of those as a family of images instead of independent stacks, on top of the growing variety of contexts they support (setting uid at runtime, user install permissions or not, etc.), makes them super complicated and huge. This has proven unsustainable, and we have been pushing for most docker-stacks users to switch to repo2docker instead, since it's vastly simpler and easier to maintain and control. At the very least, I think we should be switching the docker-stacks maintenance to repo2docker builds of specific A warning though: if we start maintaining our own "binder" stacks where each stack is a single environment.yml, we will give ourselves the exact same "who gets to decide what's in a stack?" problem that plagues docker-stacks. |
I believe maintaining the docker-stacks got significantly better once we outlined the scope of the project and how the community could contribute (e.g., recipes and how to contribute them, community stacks and a place to list them, selection criteria for new features). That scoping happened relatively late in the 5 year project lifespan and so there are a significant number of packages, startup scripts, docker args, etc. that we now maintain in the name of stability in a wide variety of environments (e.g., local use, k8s, jupyterhub). I believe the images in the docker-stacks project are sustainable as-defined for the use cases they currently support, but see any extension of that scope as untenable (e.g., new images, new container runtime environment support, new startup hooks, new permission models). To that end, I agree with @minrk on three points:
cc: @romainx who has been helping maintain the stacks in the past 9 months and may have other insights to share |
Note - we've now got Binder support in the nbgitpuller.link page: https://discourse.jupyter.org/t/how-to-reduce-mybinder-org-repository-startup-time/4956/16 For example, here's an nbgitpuller form link with the repository already filled out:
|
In case you weren't aware, there's currently a problem with nbgitpuller.link: jupyterhub/nbgitpuller#130 |
Thanks for the heads up, should be fixed by jupyterhub/nbgitpuller#134 |
Nice, with all that, a rough plan could be:
I see three main strategies to implementing pooling for Binder repos:
Option two seems like the best balance, but it has its challenges. * Something we'll have to figure out is idle-culling. If we are spawning servers that JupyterHub is aware of and leaving them idle, we'll need to make sure the culler doesn't shut them down while they are waiting in the pool, but does shut them down when they become idle 'for real'. |
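To make the idle-culling concern above concrete, here is a minimal sketch of the decision a pool-aware culler would have to make. It is not JupyterHub's or any existing culler's API: `PooledServer`, `should_cull`, and the `claimed_at` field are hypothetical names introduced only for illustration, and the 600-second timeout is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PooledServer:
    """Hypothetical record for a pre-spawned server (not a JupyterHub type)."""
    name: str
    last_activity: float                 # seconds, e.g. from time.time()
    claimed_at: Optional[float] = None   # None while the server waits in the pool


def should_cull(server: PooledServer, now: float, idle_timeout: float = 600.0) -> bool:
    """Return True only for servers that are idle 'for real'."""
    if server.claimed_at is None:
        # Still waiting in the pool: being idle is expected, so never cull here.
        return False
    # Once claimed, measure idleness from the later of the claim and the last activity.
    last_seen = max(server.last_activity, server.claimed_at)
    return now - last_seen > idle_timeout


servers = [
    PooledServer("pool-1", last_activity=0.0),                          # never claimed
    PooledServer("user-a", last_activity=9_900.0, claimed_at=9_000.0),  # recently active
    PooledServer("user-b", last_activity=9_100.0, claimed_at=9_000.0),  # idle for 900 s
]
now = 10_000.0
to_cull = [s.name for s in servers if should_cull(s, now)]
print(to_cull)
```

The point of the sketch is only that the culler needs some marker (here `claimed_at`) to tell "waiting in the pool" apart from "handed to a user and abandoned"; how that marker would be stored is left open.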
Thanks a lot for the "experience report" on docker stacks. I agree we should make our own with a very clear scope (no binder-stacks, just one binder-stack :D). I like the strategy you proposed Min. One tweak/variation, an option 2.5 maybe. What about providing a simple webservice that takes a URI like The motivation for it is:
If this sees a lot of use people would get a lot of speed improvements just based on the fact that they are using a shared image already. Then in a second step we could add the pooling from (2) later as an optimisation that makes launch times even faster. But we wouldn't have to solve the hard problem of "stateless but not pooling" first, we can solve it later. |
I just put together a little prototype to see how this feels with our current docker stacks. I made this repo: that simply pulls the https://github.com/binder-examples/jupyter-stacks-datascience Now we can create partially-filled nbgitpuller links where users only need to add their content repo and then their mybinder.org link is ready:
re: @minrk's comments, I think those all sound great. Definitely +1 on using a "regular" repository that is compatible with Binder instead of a Dockerfile. I'd also recommend being fairly strict about "we will not add your special library just because you want it", because we don't want to create a whole new maintenance chain for us and replicate the challenges that docker-stacks have already had. |
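As a rough sketch of how such partially-filled links are put together: the environment repository goes into the usual mybinder.org path, and the content repository is passed to nbgitpuller's `git-pull` handler through Binder's `urlpath` parameter. The helper name `binder_nbgitpuller_link`, the `master` defaults, and the content repository URL below are assumptions for illustration; the environment repository is the prototype mentioned above, and it needs nbgitpuller installed for the link to work.

```python
from urllib.parse import quote, urlencode


def binder_nbgitpuller_link(env_repo, content_repo_url, env_ref="master",
                            content_branch="master", open_path=None):
    """Build a mybinder.org URL that launches env_repo and pulls content into it."""
    # Query string that nbgitpuller's git-pull handler receives inside the server.
    params = {"repo": content_repo_url, "branch": content_branch}
    if open_path is not None:
        # Optional page to open after the pull finishes.
        params["urlpath"] = open_path
    git_pull = "git-pull?" + urlencode(params)
    # The whole git-pull path is itself the value of Binder's urlpath parameter,
    # so it is percent-encoded a second time.
    return (
        f"https://mybinder.org/v2/gh/{env_repo}/{env_ref}"
        f"?urlpath={quote(git_pull, safe='')}"
    )


link = binder_nbgitpuller_link(
    "binder-examples/jupyter-stacks-datascience",
    "https://github.com/example-user/example-content",  # hypothetical content repo
)
print(link)
```

Users would then only have to swap in their own content repository; the environment part of the link stays fixed, which is what lets many content repos share one already-built image.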
👍 Marking the default env as "might change without notice, no contributions or issues please. If you need control over your environment please ship your own." is the way to go. We might open it up at a later point or devise a regular schedule for compatibility/maintenance/etc., but I'd start with something which is clearly marked as "beta" (in the original sense, not the Google sense). I think naming of the "product" or feature can help set expectations. So for example I'd not call it "default env" and instead maybe something like "scratchpad". Something like "default env" makes me think it is "recommended" or "where you should start" or "what you should use if you don't know better". I'd want my packages in the "default env" because it is what is used by default, etc. A bit like "native support for X" is (somehow) better than "X is not natively supported". A scratchpad sounds more like a temporary thing that you might use to try something out and then bin. It is temporary and for trying stuff. As input for pondering what could/should be installed this is what |
I'm also +1 on justifying the things that go into the environment by simply referring to some other popular environment (Kaggle and Colab seem like the obvious ones here). I like |
I thought I would chime in since my PR was linked to this issue and share why and how we combine jupyter/docker-stacks and repo2docker.
We decided to fuse the best of both projects and build our
The advantage of this solution is that we have two images available, one compatible with BinderHub out of the box with |
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/variable-startup-times-with-a-rstudio-based-binder-example/9172/4 |
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/use-published-docker-image-for-binder/10333/3 |
In a recent conversation on gitter @betatim and I brainstormed some ideas about how we could support "default environments" better. E.g., an environment that users don't have control over, but that they can use in combination with their files.
Here are a few steps that we could take - sharing them here in case others have thoughts/comments, and so we don't forget :-)
Create a repository that we think has an environment that covers 90% of "I just want it to work, and fast" use-cases
docker-stacks image, or something another org maintains like the kaggle image.
nbgitpuller functionality more cleanly, recommending that people use this repository as a default
nbgitpuller docs (or the Binder docs?) semi-automatically and advertise it / document it (see Add a mybinder.org tab to the link builder page nbgitpuller#125 for reference)
nbgitpuller pattern we documented in 2, decide if it's a good idea