Docker Buildkit caching? #875

Open
jameshowison opened this issue Apr 16, 2020 · 9 comments

Comments

@jameshowison

Proposed change

It would be wonderful to use the Docker BuildKit caching capabilities. These enable incremental addition of packages, so one can add a single package to a list like requirements.txt. That invalidates the standard Docker layer cache, but the layer is quickly rebuilt because previously downloaded and compiled packages are reused from the build cache, so the change does not trigger an entire rebuild. The actual building happens in a special Docker container (which retains the caches).

https://docs.docker.com/develop/develop-images/build_enhancements/
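As a minimal sketch of the kind of cache being proposed, the Python helper below emits a Dockerfile that uses a BuildKit cache mount for pip. The function name, the base image, and the file paths are illustrative assumptions and not something repo2docker generates today; only the `RUN --mount=type=cache` syntax comes from BuildKit itself.

```python
def dockerfile_with_pip_cache(base_image="python:3.11-slim"):
    """Return Dockerfile text that installs requirements with a BuildKit cache mount.

    The base image and paths are hypothetical placeholders for illustration.
    """
    return "\n".join([
        # Opt in to the Dockerfile frontend that understands RUN --mount.
        "# syntax=docker/dockerfile:1",
        f"FROM {base_image}",
        "COPY requirements.txt /tmp/requirements.txt",
        # The cache mount keeps pip's download/wheel cache between builds,
        # even when this layer itself has to be rebuilt.
        "RUN --mount=type=cache,target=/root/.cache/pip \\",
        "    pip install -r /tmp/requirements.txt",
        "",
    ])


if __name__ == "__main__":
    print(dockerfile_with_pip_cache())
```

With a Dockerfile like this, adding one line to requirements.txt still invalidates the layer, but pip should find most wheels already in the mounted cache.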

This question includes links to examples for Python building, but points out that it doesn't work for R building (which apparently is going to need something like renv to work).

https://stackoverflow.com/questions/59253392/using-docker-buildkit-caching-with-r-packages

Alternative options

I don't know enough to know if repo2docker is already doing something awesome to reuse compilation here. Perhaps a Docker layer per package build? I wonder if that would cause issues, though.

Who would use this feature?

Anyone adding a package would benefit from much quicker rebuilds. Should also help with builds in a place like mybinder.

How much effort will adding it take?

I haven't yet looked at the repo2docker build code, so I don't know. Mea culpa. The biggest issue is that this requires a fairly recent Docker and changes the build process so that builds happen in a Docker container rather than on the host.

Who can do this work?

I could help test, but have not yet dived into the repo2docker code.

@betatim
Member

betatim commented Apr 17, 2020

Thanks for creating the issue and answering all our questions. I think this is a cool idea, anything to make builds faster is :)

Currently we try to be smart about the order of the statements in the generated Dockerfile. For example, we attempt to detect whether a requirements.txt refers to contents of the repository or not. If not, we run pip install -r requirements.txt before copying over the rest of the repository. This means that fixing a typo in a README can reuse that layer. #749 and #854 go along those lines as well. BuildKit looks like a new avenue we can take to speed up builds.
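To illustrate the ordering idea, here is a rough sketch of how one might guess whether a requirements file depends on files in the repository. It is not repo2docker's actual detection code; the function name and the heuristics (nested requirement files, editable installs, path-like entries) are assumptions made for this example.

```python
def refers_to_local_files(requirements_lines):
    """Guess whether a requirements file needs the repository contents.

    Hypothetical heuristic for illustration only.
    """
    for raw in requirements_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Nested requirement or constraint files live in the repository.
        if line.startswith(("-r", "--requirement", "-c", "--constraint")):
            return True
        # Strip an editable flag so the install target can be inspected.
        for flag in ("--editable", "-e"):
            if line.startswith(flag):
                line = line[len(flag):].lstrip(" =")
                break
        # Relative paths, absolute paths and file: URLs point at local content.
        if line.startswith((".", "/", "file:")):
            return True
    return False


if __name__ == "__main__":
    print(refers_to_local_files(["numpy", "-e ."]))
    print(refers_to_local_files(["numpy==1.26"]))
```

When a check like this returns False, the install step can run before the repository is copied in, so unrelated edits keep the cached layer.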

I read https://blog.mobyproject.org/introducing-buildkit-17e056cc5317 for a quick introduction to what it is and then looked at the repository. With Docker v19.x BuildKit is included "by default", but you still need to enable it.
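One documented way to enable BuildKit for a single build is the DOCKER_BUILDKIT=1 environment variable. The sketch below only prepares the command and environment; the helper name and the example tag are assumptions, and the commented-out call shows where a real build would run.

```python
import os


def buildkit_build(context_dir, tag):
    """Return (argv, env) for a `docker build` with BuildKit switched on."""
    # Copy the current environment and opt in to BuildKit for this call only.
    env = dict(os.environ, DOCKER_BUILDKIT="1")
    cmd = ["docker", "build", "--tag", tag, context_dir]
    return cmd, env


if __name__ == "__main__":
    cmd, env = buildkit_build(".", "demo:latest")
    print(" ".join(cmd))
    # To actually run the build (requires Docker to be installed):
    # import subprocess
    # subprocess.run(cmd, env=env, check=True)
```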

We use docker-py to talk to Docker. docker-py does not yet support BuildKit. This issue has some details on what is hard about adding support and where to help make it happen.

@jameshowison
Author

Not sure what to do about the docker-py and buildx issues, but I did manage to get renv working with docker buildx (after stumbling around for quite a while :)

https://github.com/howisonlab/test_repo_buildx_renv

I tried to keep things as close to REES as possible by using an unchanged install.R file :)

@jameshowison
Author

The ffsync issue in docker/docker-py#2230 (comment) does seem like a long-term blocker. And the main driver for it was docker-compose, but the advice is to use the CLI for that? So that thread reads to me as if docker-py has pretty much stopped development? Buildx bake seems like it's an option?

I guess whether to depend on buildx depends on the long-term road map for Docker, although I'm pretty sure that buildx is part of that, if not becoming the default build setup in the fairly near future. I have no special knowledge though!

@minrk
Member

minrk commented Apr 29, 2021

I just came back to this idea after having a super positive experience with buildx build caches for a large, repeated conda install with small changes between rebuilds. I think it would be a massive win for one of the major sources of build time on Binder.

It seems docker-py is not likely to support BuildKit any time soon, so we should look into building with the CLI. The main hurdle, I think, is that we currently construct a tarball for the build context in-memory, whereas buildx needs an extracted on-disk directory (which it will then re-serialize and re-send).

One way this might work is to take what we already have, and:

  1. unarchive what we have in a tempdir
  2. run buildx build in the tempdir

This is probably the shortest route to "it works". There would be quite a few duplicate files (the whole repo, for one), and we might lose some of the ownership info we encode, but it ought to work. It could also make debugging a lot easier than it is now, as there would be a directory on disk where one could edit files and debug by running docker build . directly. A further optimization would be to keep the tempdir, but skip the tarball and use hardlinks.

A second option would be to skip the tarfile, build locally, and put all our staged-in files in a special .repo2docker-build directory. That might be harder, though, as our build files would be inside the repo itself, complicating things.
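The first route (unarchive into a tempdir, then run buildx there) could look roughly like the sketch below. All function names are hypothetical and this is not repo2docker code; the demo tarball stands in for the in-memory build context, and the buildx invocation is only assembled, not executed.

```python
import io
import os
import tarfile
import tempfile


def extract_context(tar_bytes, dest):
    """Unpack an in-memory build-context tarball into an on-disk directory."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction where supported; note this filter ignores
            # user/group info stored in the archive.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest


def buildx_command(context_dir, tag):
    """Assemble a `docker buildx build` call; --load puts the image in the local store."""
    return ["docker", "buildx", "build", "--tag", tag, "--load", context_dir]


def make_demo_tarball():
    """Create a tiny tarball containing only a Dockerfile (stand-in context)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"FROM busybox\n"
        info = tarfile.TarInfo("Dockerfile")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        extract_context(make_demo_tarball(), tmp)
        print(os.listdir(tmp))
        print(" ".join(buildx_command(tmp, "demo:latest")))
        # A real build would pass the command to subprocess.run(..., check=True).
```

Keeping the extracted directory around instead of using a TemporaryDirectory would give the on-disk context for manual debugging that the comment mentions.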

@manics
Member

manics commented Apr 29, 2021

> One way this might work is to take what we already have, and:
>
>   1. unarchive what we have in a tempdir
>   2. run buildx build in the tempdir

That's pretty much what I'm doing in https://github.com/manics/repo2docker-podman/
https://github.com/manics/repo2docker-podman/blob/225265a8c09733250eb1a21efae4c12c7bf35e57/repo2podman/podman.py#L281-L288
and https://github.com/manics/repo2shellscript

Would this be a good use case for #848 (both my above projects rely on it), and putting buildx in a new engine?

@ryanlovett
Collaborator

I was exploring using multi-stage docker builds with buildkit and was impressed by the concurrency performance. Is invoking the CLI from repo2docker (via a new interface) still thought to be the preferred method to support it? The docker-py/buildkit issue linked above has not seen much progress.

@manics
Member

manics commented Jun 29, 2023

#848 was merged a while ago and has been included in several releases, so I think it's the best way to develop or experiment with a new container engine. We can discuss later whether it should be merged into the core repo2docker or kept as an optional addon.

I've no idea what the best way to implement BuildKit is. If you go down the CLI exec route, that's what I did with podman https://github.com/manics/repo2podman/blob/main/repo2podman/podman.py so you might be able to do a search and replace to get started?

@ryanlovett
Collaborator

Great, thanks @manics !

@ryanlovett
Collaborator

@manics I was able to get repo2docker to use Docker's buildx by telling repo2podman to use docker as the CLI! I ran:

repo2docker --engine podman --PodmanEngine.podman_executable=docker

This required patching the json output format in repo2podman a little, and telling docker to default to buildx. For the former, I'll create an issue in repo2podman. It's just kind of funny using the repo2podman plugin to get repo2docker to actually run docker.
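The reason this trick can work is that the engine shells out to a configurable executable, and docker and podman share much of their build CLI. A simplified sketch of that idea follows; it is not repo2podman's actual code, and the function names are made up for illustration.

```python
import shutil


def engine_build_command(executable, context_dir, tag):
    """Build an argv list for `<executable> build`; docker and podman both accept --tag."""
    return [executable, "build", "--tag", tag, context_dir]


def available(executable):
    """Report whether the chosen CLI is on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    for exe in ("podman", "docker"):
        print(exe, available(exe), engine_build_command(exe, ".", "demo:latest"))
```

Differences between the two CLIs, such as the JSON output format mentioned above, are where a shared wrapper like this needs per-engine handling.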
