Shrink Docker Image #47

Merged
jcrist merged 6 commits into dask:master from shrink-images-slightly on May 29, 2019

Conversation

jcrist
Member

@jcrist jcrist commented Apr 29, 2019

This does a few cleanups after a conda install to reduce image size. In order of importance:

  • Remove *.pyc files
  • Remove static libraries
  • Remove package cache
  • Remove *.js.map files (more important with jupyter-lab extensions)

Overall this drops the image size from 834 MB to 616 MB.

So far I've only applied this to the base image; I'd expect the savings to be higher on the notebook image.
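The four cleanups listed above can be sketched as a small shell helper. This is an illustrative sketch, not the exact code from the PR: `shrink_conda_prefix` is a hypothetical name, and it spells out `\( -type f -o -type l \)` instead of the `-type f,l` shorthand used in the diff, on the assumption that the shorthand is a GNU find extension that may not be available everywhere.

```shell
# Sketch: strip build-time leftovers from a conda prefix.
# `shrink_conda_prefix` is a hypothetical helper name, not part of this PR.
shrink_conda_prefix() {
    prefix=$1
    # Static libraries, bytecode caches, and JavaScript source maps.
    for pattern in '*.a' '*.pyc' '*.js.map'; do
        find "$prefix" \( -type f -o -type l \) -name "$pattern" -delete
    done
    # Package cache (see the review discussion about hard links).
    rm -rf "$prefix/pkgs"
}
```

In a Dockerfile this would presumably run in the same `RUN` step as the conda install (for example `shrink_conda_prefix /opt/conda`), since files deleted in a later layer still occupy space in the earlier one.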

@mrocklin
Member

Ooh, nice. cc @quasiben who may have experience here

&& find /opt/conda/ -type f,l -name '*.a' -delete \
&& find /opt/conda/ -type f,l -name '*.pyc' -delete \
&& find /opt/conda/ -type f,l -name '*.js.map' -delete \
&& rm -rf /opt/conda/pkgs
Member

Am a little wary of this last rm line. The tarballs should have already been cleaned by conda clean -t. The rest of this content here should be unpacked tarballs, which are actively in use. I don't think we should remove those.

Member Author

@jcrist jcrist Apr 30, 2019

They're hardlinked, so the actual files are still there. I was surprised that this reduced the image size - I wonder if when docker compresses a layer it doesn't catch that hardlinks are identical and thus duplicates the data (the savings here are non-negligible). The savings might also be all the duplicate files left after prefix-rewriting (files that need prefixes rewritten are effectively copies). Conda is robust to a missing package cache, so I don't see this being a problem at runtime (I also tested installing/upgrading packages at runtime and things seemed to work).

Member

@jakirkham jakirkham Apr 30, 2019

Yep, they are hard-linked. So things may work ok.

Am also surprised to hear this decreased image size. I wonder to what extent the union filesystem used impacts this and to what extent Docker itself does. FWIW in my search to answer these questions, I came across this old PR ( moby/moby#16960 ).

Member

There has been some discussion about this in Jupyter docker-stacks. In particular, this comment seems relevant.
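The hard-link point in this thread can be checked with a small experiment. The sketch below uses throwaway paths under a temporary directory (all names are made up for illustration) and shows only that deleting one hard-linked path leaves the data reachable through the other. It says nothing about how Docker or the union filesystem accounts for hard links when building layers, which is the open question above.

```shell
# Sketch: removing a "package cache" copy does not remove a hard-linked file.
workdir=$(mktemp -d)
mkdir -p "$workdir/pkgs/demo" "$workdir/lib"

# The "cache" copy, and an "installed" path hard-linked to it.
printf 'payload\n' > "$workdir/pkgs/demo/libdemo.so"
ln "$workdir/pkgs/demo/libdemo.so" "$workdir/lib/libdemo.so"

# The second field of `ls -l` is the link count.
links_before=$(ls -l "$workdir/lib/libdemo.so" | awk '{ print $2 }')

# Drop the cache, as `rm -rf /opt/conda/pkgs` does in the diff.
rm -rf "$workdir/pkgs"

links_after=$(ls -l "$workdir/lib/libdemo.so" | awk '{ print $2 }')
content=$(cat "$workdir/lib/libdemo.so")
echo "links before: $links_before, after: $links_after, content: $content"

rm -rf "$workdir"
```

On a filesystem that supports hard links, the link count should go from 2 to 1 while the content stays readable.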

&& conda clean -tipsy
&& conda update conda -y \
&& conda clean -tipsy \
&& find /opt/conda/ -type f,l -name '*.a' -delete \
Member

It would be great to get a list of the top 10(?) large static libraries. We may decide that the conda packages of these should split out the static libraries so they don't wind up getting installed.

Member Author

openblas is the biggest one by far.

Member

Yep, that one I would have guessed. Am curious about the rest as there may be some things we don't know about.
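One way to produce the requested list of the largest static libraries is sketched below. `largest_files` is a hypothetical helper, not something from this PR. It reports apparent file sizes in bytes via `wc -c`, which may differ from on-disk usage, and it only looks at regular files so that symlinks are not counted twice.

```shell
# Sketch: list the N largest files matching a name pattern under a prefix.
# Output is "<bytes><TAB><path>", largest first.
largest_files() {
    prefix=$1
    pattern=$2
    count=${3:-10}
    find "$prefix" -type f -name "$pattern" -exec sh -c '
        for f in "$@"; do
            printf "%s\t%s\n" "$(wc -c < "$f" | tr -d " ")" "$f"
        done
    ' sh {} + | sort -rn | head -n "$count"
}
```

For the question in this thread, the call would be something like `largest_files /opt/conda '*.a' 10`, run before the `find ... -delete` step.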

&& conda update conda -y \
&& conda clean -tipsy \
&& find /opt/conda/ -type f,l -name '*.a' -delete \
&& find /opt/conda/ -type f,l -name '*.pyc' -delete \
Member

These will probably be regenerated on first import. My guess is the actual files are smallish ~1KB. That said, would be curious to know if that matches with your experience or not. For instance how much space do all of the pyc files take up?
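The question of how much space the `*.pyc` files take up can be answered with a short measurement. The helper below is a sketch with a hypothetical name (`total_bytes`). It sums apparent sizes in bytes, so a file that is hard-linked at several paths is counted once per path.

```shell
# Sketch: total apparent size, in bytes, of files matching a name pattern.
total_bytes() {
    prefix=$1
    pattern=$2
    find "$prefix" -type f -name "$pattern" -exec sh -c '
        for f in "$@"; do wc -c < "$f"; done
    ' sh {} + | awk '{ total += $1 } END { print total + 0 }'
}
```

Running `total_bytes /opt/conda '*.pyc'` before the cleanup step would give a number to compare against the roughly 1 KB per file guess above.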

&& conda clean -tipsy \
&& find /opt/conda/ -type f,l -name '*.a' -delete \
&& find /opt/conda/ -type f,l -name '*.pyc' -delete \
&& find /opt/conda/ -type f,l -name '*.js.map' -delete \
Member

This seems like a good idea as I'm guessing people are not debugging JavaScript code in the container. Would encourage you to raise an issue in conda-forge about doing this in all cases. There may be some people that want these for debugging, in which case we can split them out. However if no one wants these files, we could start just removing them.

@jakirkham
Member

Thanks for working on this Jim. Gave this a somewhat detailed review. Not sure if that is what you were looking for.

It was useful to see that things like .js.map files are pain points, which I wasn't aware of. Would be great if we can raise some of these to conda-forge so we can address them.

As to Ben's point, we could use tini instead of dumb-init. It is used by both conda-forge and Jupyter's docker-stacks. The binary is a little more than half the size of dumb-init.

There are probably other files we could strip if we want to get more aggressive about cleaning. For example, pkgconfig and cmake directories, header files, pip cache (if used), etc.
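Before stripping anything further, it may help to measure how much the extra candidates actually cost. The sketch below only reports sizes and deletes nothing. The relative paths (`lib/pkgconfig`, `lib/cmake`, `include`) are assumptions about a typical conda prefix layout, and the pip cache is left out because its location depends on configuration. `report_strip_candidates` is a hypothetical name.

```shell
# Sketch: report the size (in KiB, via du -sk) of directories that could be
# stripped more aggressively. Nothing is deleted.
report_strip_candidates() {
    prefix=$1
    # Assumed locations of pkg-config files, CMake config files, and headers.
    for rel in lib/pkgconfig lib/cmake include; do
        if [ -d "$prefix/$rel" ]; then
            du -sk "$prefix/$rel"
        fi
    done
}
```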

We could also look at more barebones base images like busybox if the image size itself is too large.

As I probably don't know enough about the motivations for making the image smaller, I'm not sure what the right balance between minimalism and functionality is here, but those are some thoughts. Hope that is helpful.

@mrocklin
Member

mrocklin commented Apr 30, 2019 via email

@mrocklin
Member

Checking in. What's the status here?

Also, @jhamman you may be interested in these changes for Pangeo.

@mrocklin
Member

People here may want to take a look at #49

@jhamman
Member

jhamman commented May 11, 2019

> Also, @jhamman you may be interested in these changes for Pangeo.

We use repo2docker to build all the images in pangeo-stacks. We recently made some changes upstream to reduce the size of our images, taking a similar approach.

@mrocklin
Member

@jcrist @jakirkham is there anything left to do here? Should this be merged? If so, would one of you mind merging?

@jakirkham
Member

@jcrist, please feel free to merge when you are ready.

@jcrist
Member Author

jcrist commented May 28, 2019

I'm working on this today, will push updates later.

This does a few cleanups after a conda install to reduce image size. In
order of importance:
- Remove `*.pyc` files
- Remove static libraries
- Remove package cache
- Remove `*.js.map` files (more important with jupyter-lab extensions)

Overall this drops the image size from `834 MB` to `616 MB`.
- Use --freeze-installed to not update base images
- Use tini in dask base image instead of dumb-init
- Reduce image size of notebook image as well.
@jcrist jcrist force-pushed the shrink-images-slightly branch from 184ca00 to 94288a0 Compare May 28, 2019 17:55
@jcrist
Member Author

jcrist commented May 28, 2019

This may be ready for merge. A few additional changes:

This got things down to:

  • 593 MB for the base image
  • 1.35 GB for the notebook image

@jcrist
Member Author

jcrist commented May 28, 2019

The notebook image could be made even smaller with the following optional changes:

  • Remove vim (why is vim installed?)
  • Remove git (I assume git is installed for pip install git+... behavior?)
  • Remove usage of jupyter's base image and use our own instead. The jupyter base notebook image installs everything needed for jupyterhub/jupyter/jupyterlab, which we may not need.

The last 2 may still be wanted features, but I can't think of a good reason to have vim installed so we may want to still handle that here.

@jcrist
Member Author

jcrist commented May 28, 2019

@mrocklin : do you remember why you added vim in ceb7abf?

@jcrist
Member Author

jcrist commented May 28, 2019

I went ahead and added a commit removing vim (can revert if needed). Down to 1.3 GB now for the notebook image.

I think this PR is done for now, looking for review.

@jcrist
Member Author

jcrist commented May 28, 2019

@jakirkham - if I was to create a general issue for splitting out static libraries in conda-forge, where would I put that?

@mrocklin
Member

> @mrocklin : do you remember why you added vim in ceb7abf?

I found that the benefits of including vim for debugging outweighed the relative package size increase. I'm more than happy to be overruled here though.

@jcrist
Member Author

jcrist commented May 28, 2019

> I found that the benefits of including vim for debugging outweighed the relative package size increase. I'm more than happy to be overruled here though.

It can always be installed later via conda when debugging.

- Remove unminified bokeh js
- Cleanup jupyterlab staging files
@jcrist
Member Author

jcrist commented May 28, 2019

Couldn't resist, got things even smaller (lots smaller for the notebook image):

With these changes we're down to:

  • daskdev/dask-notebook: 910MB
  • daskdev/dask: 578MB

I'm happy with this for now.
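For reference, the relative savings on the base image can be worked out from the sizes reported in this thread (834 MB originally, 616 MB after the first round of cleanups, 578 MB at this point). `percent_saved` is a hypothetical helper used only for the arithmetic.

```shell
# Sketch: percentage reduction between two image sizes given in the same unit.
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

percent_saved 834 616   # first round of cleanups: prints 26.1
percent_saved 834 578   # base image at this point: prints 30.7
```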

@jcrist
Member Author

jcrist commented May 28, 2019

I have verified that things still work fine with the docker-compose setup. Planning on merging tomorrow if no more comments.

@mrocklin
Member

  • daskdev/dask-notebook: 910MB
  • daskdev/dask: 578MB

Woot.

Make prepare.sh scripts match.
@jakirkham
Member

> @jakirkham - if I was to create a general issue for splitting out static libraries in conda-forge, where would I put that?

The webpage repo is the best place for this sort of thing.

@jcrist jcrist merged commit ea6c8c5 into dask:master May 29, 2019
@jcrist jcrist deleted the shrink-images-slightly branch May 29, 2019 17:02
@jcrist
Member Author

jcrist commented May 29, 2019

Since people are already using these images, I'm not sure we can make significant changes. I played around with using Alpine Linux last night, as well as not depending on the Jupyter docker-stacks. These images are around 1/2 to 1/3 the size of their counterparts here, and should be drop-in replacements. Repo is here: https://github.com/jcrist/alpine-dask-docker

@mrocklin
Member

mrocklin commented May 29, 2019 via email

5 participants