Investigate methods of making R builds faster #412
From talking to R users it seems that binary packages for linux aren't a thing with CRAN?!? Pointers and contributions would be welcome as it is pretty frustrating to see all these packages being built from source over and over again. https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/install.packages mentions binary packages (search for/scroll down to the "binary packages" heading) but also says they don't exist for linux. This seems to be the reason repo2docker compiles stuff from scratch. Not sure how we fix that or why there isn't an equivalent to the manylinux wheels for Python packages. Overall I think more R expertise would be welcome ;) |
Here is a relevant post: http://dirk.eddelbuettel.com/blog/2017/12/13/ |
Thanks. We will need to update our instructions to R users to install APT packages instead of R packages. And potentially check where the packages get installed to and if that will require any more gymnastics with the environment variables for the search paths of R packages. |
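A rough sketch of what that guidance could look like, as a Dockerfile fragment (the package name is illustrative; CRAN package "foo" typically ships in Ubuntu as "r-cran-foo"):

```dockerfile
# Hypothetical fragment -- install a pre-built binary R package from apt
# instead of compiling it from CRAN sources.
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-cran-ggplot2 && \
    rm -rf /var/lib/apt/lists/*

# apt-installed R packages land in the system library (typically
# /usr/lib/R/site-library); verify that R's search path picks them up:
RUN Rscript -e 'print(.libPaths()); library(ggplot2)'
```

If `.libPaths()` does not include the system library, the R_LIBS_SITE environment variable would need adjusting — the "gymnastics" mentioned above.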
@betatim +1 to that - I think we should add an example (maybe |
https://github.com/betatim/dockerfile-r/tree/use-apt-instead (diff) seems to work but shiny is now broken. I assume there are some dependencies needed that are part of the tidyverse or something? |
I've also dealt with slow R-focused image builds. Using distro packages helps, but not everything is packaged. When building from source, you need to make sure that the dependencies aren't automatically built from source as well, even if they are already installed as distro packages. Unrelated: I also rely on the read-only repositories in the "cran" org to help pin the version for reproducibility. It's often easier than finding the upstream repo. |
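As a sketch of the kind of setup that helps here (the option values and the package name are assumptions, not a tested recipe):

```r
# Make sure R sees the distro-installed packages, so install.packages()
# does not rebuild dependencies that apt already provides:
print(.libPaths())  # the system library (e.g. /usr/lib/R/site-library) should be listed

# Parallelise compilation of whatever still has to be built from source:
options(Ncpus = parallel::detectCores())
install.packages("sf")  # "sf" is an illustrative package name
```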
Another thing to check/update in the docs: the Ubuntu package versions will not correspond to the MRAN date that the build pins. Chris, can we update the top comment with a to-do list of these points/things to check? All these caveats make me wonder if we should leave using Ubuntu packages to get faster installs as an "expert thing you can do if you know you can do it". For most repositories there should be a lot more launches of already built images than builds. With that in mind, and a goal of reproducibility, it might be better to go for slow but simpler? |
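For context, the MRAN pinning in question amounts to something like the snippet below (the snapshot date is illustrative); anything installed through apt bypasses this pin entirely, so those versions will not match:

```r
# Pin CRAN to a fixed MRAN daily snapshot so source installs are
# reproducible (the date is a made-up example):
options(repos = c(CRAN = "https://mran.microsoft.com/snapshot/2018-12-01"))
install.packages("dplyr")  # resolves to the version available on that date
```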
The disconnect with MRAN is an issue. The big challenge, though, is that the tidyverse is a very popular package and most Binder use cases will require it. I agree that it would be critical to make the speed versus reproducibility tradeoff clear. Slow builds are totally fine, but I was having issues with build failures on a real-world use case (6-8 medium to large(ish) packages) |
@betatim I think that if the result is a build failure due to simply trying to build the tidyverse + something small-ish, then we need to change something to make this a more tenable option for people. Another, more complicated option would be to go back to using a Rocker image with some of the bigger packages pre-installed (or we could just recommend that people do this) |
Does someone have a link to a repository/configuration that fails to build? That would make it easier to investigate why exactly it fails. If it is the amount of RAM required, we should move the discussion to mybinder.org-deploy and, with our mybinder hats on, discuss increasing the RAM. Building things like the tidyverse straight from source is slow. I'd be hesitant with using rocker, as we will end up with two different base images, which will make it tricky to mix build packs :-/ |
actually @SylvainCorlay made a good point that we could also recommend that people install with |
I'm not really up to speed with Python package management, but I don't believe this is a problem that is in any way unique to R. I think it would be immensely difficult for CRAN to provide pre-built binaries for all linux distributions for all packages. Even within a single linux distribution like Debian, the maintainers don't manage to provide pre-built binaries for all CRAN packages, and those that are provided can often lag behind the CRAN versions (which, after all, are updated continuously, unlike the Bioconductor R packages). More importantly, I think that mixing and matching pre-built binaries can lead to really undesirable situations (e.g. http://vincebuffalo.org/notes/2017/08/28/notes-on-anaconda.html). It is really hard to know which versions of which external libraries different binaries are built against and how they will be compatible, etc. I would never recommend R users install from conda-forge because of these issues. Docker gives us a really nice way of creating a consistent and transparent build process with control over our libraries and versions by pre-building things from source, and then we just pull down the Docker image as the 'ultimate' pre-built binary. In this way, it's really easy to know which versions of jags, libgdal, and other critical libraries are being linked, which compiler versions are used, and so forth. I think it's strongly preferable to define the environment you want in your base Docker images rather than try to pull in binaries built under more opaque settings out of your control. Of course that means taking responsibility for maintaining that stack, which may not be ideal -- that's the purpose of having the rocker images in the first place, which already do exactly this. I do recognize @betatim's point about mixing build packs, but at least this way you can do so in a way that is more transparent, rather than nesting virtualization inside virtualization.
The rocker images are built on vanilla debian images -- the versioned stack builds on the stable debian release with no additional apt repositories, so you have a very predictable and standard set of libraries to build against. Okay, many apologies for the rant; you probably know these details much better than I do. I've just seen a lot of thorny install and runtime issues, both with python and R, that arise from chimeric combinations of different pre-built binaries and virtualized environments. |
Thanks for the input @cboettig - I'm trying to synthesize the main ideas from your post, are you suggesting that R users with repo2docker/Binder would be better off using a Dockerfile and sourcing their image from the Rocker images? The biggest challenge with this is that repo2docker's goal is to let users "start from scratch" and only add the packages they want in an explicit fashion. Moreover we are intentionally treating anything that requires knowledge of Docker as "advanced" and probably not suitable for most users. It sounds like our potential solutions are:
Am I missing something? |
I'd challenge the assumption in 1) that some dependencies don't build because of limits on mybinder.org. I asked a while ago for examples and so far no one has pointed us to one so I'd change it to "Tell users to compile packages during the build (as we do now), which takes up extra time". I think it is still unclear why there aren't more binary packages for R on linux given that for Python packages (which require a compiler during install) there are binary packages for linux (https://github.com/pypa/manylinux) and conda provides binaries that work on most linux systems. Working on this seems like something that would be very useful to the wider R community. |
There are binaries available from the c2d4u PPA. I asked Michael Rutter a while back if he had considered making daily snapshots and he said it "would be a great service, I just don't have time to create such a thing at this point." If CRAN were to mirror this PPA I wonder if MRAN would snapshot it along with everything else. |
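Enabling that PPA would look roughly like the Dockerfile fragment below (the PPA and package names are assumptions based on what existed at the time, and this is not something repo2docker supports out of the box):

```dockerfile
# Hypothetical fragment -- add Michael Rutter's c2d4u PPA of pre-built
# CRAN binaries on an Ubuntu base image:
RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:marutter/c2d4u && \
    apt-get update && \
    apt-get install -y r-cran-tidyverse
```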
I agree with @betatim that it's not obvious to me that there is an issue here in the first place with just letting packages install from source. Does my-binder cache the builds? I think it does at least when I send it a Dockerfile (e.g. I see it build my image slowly the first time, but next time I click the button it's fast, so no problem). That said, a good test case to create a slow build might be:
(I don't follow Tim's point about there not being lots of binaries for linux -- there are binaries available for almost all packages for most of the popular distros. It's just that linux binaries in R are usually distributed through the distro's standard package managers, while I think Tim is referring to python binaries that are platform-agnostic and distributed through some other means. It is probably just my ignorance here, but I don't see why that would be preferable; it sounds like it could be awfully difficult to get working in all cases. My 2c is that the current setup that relies on building from source is fine. I think using other alternatives just creates more trouble than it is worth -- folks may use the approach when it's not necessary, and I have never liked the idea of inventing yet another custom solution such as an |
My original comments about super long and/or memory-limiting builds in R were just from @karthik telling me his attempts at building a tidyverse repo weren't working...maybe that isn't reproducible though? (as a friendly aside, while I enjoy the back-and-forth about the merits of linux binaries etc for the R community, can we keep conversation in this thread to helping us decide what to recommend for R users in repo2docker? :-) ) |
@choldgraf I can give you a reproducible example next week. Still takes a very long time to build (as you witnessed once). |
Yes, for each commit that is launched we build it once and if the build succeeds the next time you click the link you get a built image and don't incur build time delay. Building the tidyverse from master on mybinder.org takes so long that I close my browser (the build continues) and come back later. My point about binaries was that I didn't quite understand why |
ah, I see your point. To me it's always been the first part of this that is more surprising and more unusual. My experience with just about every other language has been that not providing platform-specific binaries is the default. I think R core decided relatively early on that many Mac and Windows users might not have the C & fortran compilers and dynamic system libraries installed or know how to install them, so it has long provided binaries for those platforms to make R easier to adopt (and I suspect this contributed greatly to adoption in R). For the Linux users, they probably thought "hey, you use Linux, you know how to do these things (or ask your sys admin)." As you know, pre-building static binaries for packages that link external libraries can be a huge challenge -- no more dynamic linking, everything has to be packaged with the binary which also has different implications for software licenses etc. There are often packages that get stuck a few versions behind in the Mac and/or Windows builds for reasons, sometimes for years. A small volunteer team keeps these binaries going; sometimes with some pretty clever/convoluted config files. Until the advent of stuff like Ubuntu snaps or Docker, I was unaware of anything else that tried to tackle distribution/platform agnostic binaries. |
I'm totally fine with taking a long time, esp since it will be cached for future runs. My issue was that a couple of real world examples (not just tidyverse) were never building. |
@cboettig for Python, the recommendation is to not install anything from apt, since those packages go out of date very quickly. Distributing binaries that work well is a PITA, but the Python community has actually done a really awesome job with https://github.com/pypa/manylinux, and most installs from PyPI bring in binaries on linux too now. This is how we sidestep this problem with requirements.txt, for example. @choldgraf @karthik could either of you share a reproducible example? We wanna make sure R users can use binder in a first-class way, and will try to fix this. |
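To illustrate the manylinux mechanism: a wheel encodes its compatibility in the filename, so pip can pick a matching pre-built binary without compiling anything. A small illustrative parser (the filename is a made-up example, and real wheel names can also carry an optional build tag, which this sketch ignores):

```python
def parse_wheel_tags(filename: str) -> dict:
    """Split a wheel filename into its PEP 425 compatibility tags.

    Simplified: assumes exactly name-version-python-abi-platform,
    ignoring the optional build-tag segment real wheels may have.
    """
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

# A manylinux platform tag means the wheel runs on most glibc-based
# distros, so pip never has to invoke a compiler at install time:
tags = parse_wheel_tags("numpy-1.15.4-cp36-cp36m-manylinux1_x86_64.whl")
print(tags["platform"])  # manylinux1_x86_64
```

There is no equivalent of this platform-agnostic binary format on CRAN for linux, which is the asymmetry being discussed in this thread.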
@yuvipanda Thanks! This does a good job of showing the differences in approaches between communities. Of course you know this, but because I think it could confuse other readers here: I think it is very misleading to say "not install anything from apt, since those go out of date very quickly". apt is as up-to-date as the repos you choose. The manylinux approach is very interesting, definitely a different way to go from providing a PPA (and somewhat more general). It looks to me like the compromise there is that everything is being compiled with the ancient compilers of CentOS 5 in order to be compatible. Some R users don't like that the rocker/versioned stack is building the latest versions on debian:stable compilers, which are modern by comparison -- which is another reason we have the separate debian:testing based stack (and of course you can just add the above-mentioned PPAs to most recent ubuntu systems). I only mention this to underscore the point I think we can all agree on: providing pre-built binaries is a fundamentally hard problem with inevitable trade-offs. I'm all for ameliorating the situation for the user, but I'm perpetually wary that we will get ourselves in trouble with overly general statements or by trivializing these challenges. |
I'm with you all the way, @cboettig - especially on 'binaries are hard'! To qualify my statement about apt, I'll say 'do not install python packages with apt'. Unlike the R community, which seems to provide up-to-date builds of many libraries on apt in a timely fashion, python library upgrades there are a lot more scattered & unpredictable. I think @karthik and @choldgraf's original issue is that builds from install.R are too slow & time out. Recommending people install R packages from apt whenever possible is one solution. However, builds should never really time out, so that's a bug in mybinder.org / binderhub that we'd love to get to the bottom of - completely separate from preferred methods of installing packages. |
Yup, 💯% on the same page. Re slow builds, I've just mocked up: https://github.com/cboettig/r with an |
Thanks for this @yuvipanda and @cboettig I'll come up with a couple of examples and update here tomorrow. |
@yuvipanda note that I get something that looks like an error, but isn't one, if I sleep my machine and re-open it. The Binder web page favicon changes to an error icon, and the log shows: "Error: Failed to connect to event stream". Refresh the page and all looks well again (the log is still chugging away compiling source code for R packages... about 53 minutes in now), but a user could mistake this for a build error. |
I have played around a bit with source vs. apt installation in the context of #457. Though it doesn't contribute much beyond confirming that "pre-building binaries is hard", it shows the additional challenge of mixing binary R packages installed by repo2docker with source packages installed by a user, and how that interacts with the Docker build caching. I did not get it to work; it might be a stupid error on my side: https://gist.github.com/nuest/8beca3b75bba97f107a314798879a2fc |
My example mentioned above built successfully (albeit without |
@cboettig would you say that the build you triggered is a "typical" workflow for an R user? Not "the whole stack they ever use" but "a reasonable stack to reproduce an analysis"? If so, then we should really find a way to make people not wait an hour for that to build :-/ |
That is exactly what I was trying to do at Numfocus. I have a few ideas I'd like to discuss (and demo). Are you and @yuvipanda around for a chat next week at BIDS? |
@choldgraf "typical" will vary widely of course, but it's realistic or even small for large spatial analysis.
|
Currently, adding PPAs is not possible with repo2docker. Pre-installing some packages via a different mechanism that can then be re-installed depending on which command exactly users use feels like a road to lots of maintenance burden and user confusion :-/ I think when considering this we should measure it against telling users to use (pre-made) Dockerfiles. I am with Carl that having to wait the first time isn't a big deal on a hosted platform like mybinder.org. It becomes annoying when you use |
I am making a binder-examples repo that uses the rocker base images here: https://github.com/binder-examples/rocker This should be much faster than the install.R method for this specific use case. I've also opened rocker-org/binder#30 to add better shiny support to the rocker/binder image. |
I just tried the @cboettig example (https://mybinder.org/v2/gh/cboettig/r/master?urlpath=rstudio), but it looks like the Nevertheless, having never used docker, I found it very easy to copy this small Dockerfile (linking to rocker/binder) to my repository: https://github.com/rocker-org/binder/blob/master/binder/Dockerfile. Now |
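For reference, the kind of minimal Dockerfile being described is roughly the following (the image tag and paths are assumptions -- check the rocker-org/binder repository for the current version):

```dockerfile
# Hypothetical minimal binder/Dockerfile on top of the rocker binder image:
FROM rocker/binder:3.5.2

# Copy the repository contents into the notebook user's home directory
COPY --chown=rstudio:rstudio . ${HOME}

# Install any extra R packages on top of the pre-built rocker stack
RUN if [ -f install.R ]; then R --quiet -f install.R; fi
```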
@LennertSchepers that's a good point. Like you, I also use the rocker-org/binder Dockerfiles as a base image for most of my binder projects, since the spatial system libraries are all pre-installed there. |
Also, on a related note: it would probably be preferable to list dependencies in a DESCRIPTION file and use |
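That would look something like the sketch below: a DESCRIPTION file at the repository root (the package name and field values are illustrative), with the build then installing the listed dependencies via a helper such as remotes::install_deps(".") (the exact function to recommend is an assumption here).

```
Package: myanalysis
Title: A Reproducible Analysis
Version: 0.0.1
Imports:
    dplyr,
    sf
```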
|
Ref: #716 for instant rebuilds |
I recently spoke with @karthik, who mentioned that our R builds (with install_packages) seem to be going really slowly. There could be a couple of problems, which I'll list here. It's possible to install some R packages in Ubuntu much faster by installing binaries. We could recommend this in the documentation for specifying R packages and such...
relevant blog post: http://dirk.eddelbuettel.com/blog/2017/12/13/
old points:
1. mybinder.org may not have enough RAM, which is causing the build to be really slow for certain packages (like the tidyverse). Apparently many R packages have intermediate steps during install that use multiple gigs of RAM.
2. We aren't using some binary packages even though they are available. repo2docker seems to be building everything from source, even though for some packages there are binaries out there. We could investigate to see if this is an option!