[MRG] Remove the conda package cache as we can't hardlink to it #666

Merged

betatim
Member

@betatim betatim commented May 4, 2019

This shrinks our "base" image by about 50MB, plus another 40MB from the second commit.

You can see the size of the "install miniconda" layer in various scenarios here:

Maybe there is more saving potential somewhere; this was the low-hanging fruit. If someone knows a tool to see which files get created by each layer, we could check for other leftover temporary files that can be deleted.
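
Short of a dedicated tool, one rough way to spot files a step leaves behind is to compare a file listing taken before the step with one taken after it. The sketch below is only an illustration on a temporary directory: the `snapshot` helper, the directory layout and the file names are all hypothetical stand-ins, not anything repo2docker does.

```shell
#!/bin/sh
# Sketch: list files that a build step leaves behind by diffing two listings.
set -eu
export LC_ALL=C   # make sort(1) and comm(1) agree on ordering

# Print every regular file under a directory, sorted so comm(1) can compare.
snapshot() {
    find "$1" -type f | sort
}

root=$(mktemp -d)      # hypothetical stand-in for the image filesystem
before=$(mktemp)
after=$(mktemp)

echo "already here" > "$root/existing.txt"
snapshot "$root" > "$before"

# Hypothetical stand-in for an install step that leaves a temporary file behind.
mkdir -p "$root/tmp"
echo "installed" > "$root/app.bin"
echo "leftover" > "$root/tmp/download.tar"

snapshot "$root" > "$after"

# comm -13 prints only the lines that are unique to the second listing.
new_files=$(comm -13 "$before" "$after")
printf '%s\n' "$new_files"
```

Anything in the output that is not needed at runtime is a candidate for deletion within the same step.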

@betatim
Member Author

betatim commented May 4, 2019

Found https://github.com/wagoodman/dive as a way to look at layers and through that found 3a6e4b4.

@@ -69,9 +69,12 @@ fi

# Clean things out!
conda clean -tipsy
rm -rf /srv/conda/pkgs
Member

I wonder if this entry is the same as adding the -f or --force-pkgs-dirs command line argument to conda clean. I'm really not confident about this, but see #667.

Member Author

@betatim betatim May 4, 2019

Good pointers. I did some googling about caches and temporary files used by conda for #638. Mostly you find "disinformation" or confused people :-/

My thoughts from back then are in #638 (comment). I think we can't use hardlinks within a docker image (maybe we could within a single layer?) so there is no benefit to having /srv/conda/pkgs available for later layers that install the same conda package. On a normal filesystem we'd just hardlink to /srv/conda/pkgs if it contains the package we are installing, but on docker we can't do that :(
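
For readers new to hardlinks, here is a minimal sketch of the "normal filesystem" case described above. It runs in a temporary directory; the `pkgs` and `env` directory names and `lib.so` are hypothetical stand-ins for a package cache and an environment, and nothing here calls conda.

```shell
#!/bin/sh
# Sketch: two hardlinked names share one inode and one copy of the data.
set -eu

workdir=$(mktemp -d)
mkdir "$workdir/pkgs" "$workdir/env"

# A 1MB file standing in for a file in the package cache (hypothetical name).
dd if=/dev/zero of="$workdir/pkgs/lib.so" bs=1024 count=1024 2>/dev/null

# "Install" it into an environment as a hardlink rather than a copy.
ln "$workdir/pkgs/lib.so" "$workdir/env/lib.so"

# ls -i prints the inode number first; equal numbers mean the same data.
inode_cache=$(ls -i "$workdir/pkgs/lib.so" | awk '{print $1}')
inode_env=$(ls -i "$workdir/env/lib.so" | awk '{print $1}')

# The second column of ls -l is the link count (names pointing at the inode).
link_count=$(ls -l "$workdir/env/lib.so" | awk '{print $2}')

# Deleting the cache name leaves the data reachable through the other name.
rm "$workdir/pkgs/lib.so"
size_after_rm=$(( $(wc -c < "$workdir/env/lib.so") ))

echo "inodes: $inode_cache $inode_env, links: $link_count, size: $size_after_rm"
```

Because both names point at the same inode, the second name costs no extra space, and removing one name frees nothing as long as the other still exists.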

In conclusion I think we should try out -f to see if that has the same effect as rm -rf /srv/conda/pkgs. Relevant discussion: jupyter/docker-stacks#861

@betatim
Member Author

betatim commented May 4, 2019

Investigating hardlinks in docker images a bit.

FROM busybox
# create a file of 10MB (file1), then copy it (using hardlinks) to file2,
# file3 and file33. If hardlinking works this should only consume ~10MB
# when it doesn't work we will consume ~40MB
RUN dd if=/dev/urandom of=file1 bs=1024 count=10240 && cp -l file1 file2 && cp -l file1 file3 && cp -l file1 file33 && ls -ial file*

# this is a version that doesn't use hardlinks. If you use this instead of
# the command with hardlinks the resulting image is bigger (~40MB).
#RUN dd if=/dev/urandom of=file1 bs=1024 count=10240 && cp file1 file2 && cp file1 file3 && cp file1 file33 && ls -ial file*

# this is a new layer where we attempt to hardlink to file1 again, this
# will increase the image size by 10MB because we can't hardlink
# from one layer to another. Instead we make a copy of the file
RUN cp -l file1 file4 && cp -l file1 file5 && ls -ial file*

This is a Dockerfile that creates a 10MB file, hardlinks it a few times in the same layer, and then hardlinks it again in a new layer. Try it with docker build --rm=true --no-cache=true - < Dockerfile-hardlinks. If hardlinking worked everywhere, the resulting image would be about 10MB. However, we can't link across layers, so we get a 20MB image. So I think it is safe to delete the conda package cache at the end of each layer in repo2docker, as we can't use it as a source for a hardlink in the next layer anyway.
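
The same effect can be approximated without Docker, as an analogy only. Image layers are shipped as tar archives, and a tar archive can record hardlinks between its own entries but cannot point at data stored in a different archive. The sketch below scales the experiment down to 1MB and assumes that the second layer ends up holding its own copy of `file1` before linking to it; the `layer1` and `layer2` directory names are hypothetical.

```shell
#!/bin/sh
# Sketch: hardlinks are kept inside one tar archive, but a second archive
# cannot reference data stored in the first one.
set -eu

workdir=$(mktemp -d)
mkdir "$workdir/layer1" "$workdir/layer2"

# "Layer 1": a 1MB file plus two hardlinks to it.
dd if=/dev/urandom of="$workdir/layer1/file1" bs=1024 count=1024 2>/dev/null
ln "$workdir/layer1/file1" "$workdir/layer1/file2"
ln "$workdir/layer1/file1" "$workdir/layer1/file3"

# "Layer 2": assumed to need its own copy of the data before it can link to it.
cp "$workdir/layer1/file1" "$workdir/layer2/file1"
ln "$workdir/layer2/file1" "$workdir/layer2/file4"
ln "$workdir/layer2/file1" "$workdir/layer2/file5"

# Archive each "layer" separately, as uncompressed tar files.
tar -cf "$workdir/layer1.tar" -C "$workdir/layer1" .
tar -cf "$workdir/layer2.tar" -C "$workdir/layer2" .

data_size=$(( $(wc -c < "$workdir/layer1/file1") ))
layer1_size=$(( $(wc -c < "$workdir/layer1.tar") ))
layer2_size=$(( $(wc -c < "$workdir/layer2.tar") ))

echo "data: $data_size layer1: $layer1_size layer2: $layer2_size"
```

Each archive comes out at roughly the size of one copy of the data, so the two together hold about twice the data, which mirrors the 10MB versus 20MB result above.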

Now I am not completely sure any more why removing the package cache actually reduces the image size. Maybe conda doesn't actually use hardlinks after all?
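
One way to probe this question would be to measure how much of the cache is not hardlinked anywhere, since only files with a single name occupy space that deleting them would give back. The sketch below uses a hypothetical helper, `count_unshared_bytes`, on a made-up directory with hypothetical file names; pointing it at a real /srv/conda/pkgs inside a build is left as an untested assumption.

```shell
#!/bin/sh
# Sketch: sum the bytes of files that have exactly one name (link count 1).
set -eu

# Hypothetical helper: total size in bytes of regular files under "$1" that
# are not hardlinked anywhere else.
count_unshared_bytes() {
    find "$1" -type f -links 1 -exec wc -c {} \; |
        awk '{sum += $1} END {print sum + 0}'
}

demo=$(mktemp -d)
mkdir "$demo/pkgs" "$demo/env"

# A 4KB cached file that is hardlinked into an environment (shared data).
dd if=/dev/zero of="$demo/pkgs/linked.so" bs=1024 count=4 2>/dev/null
ln "$demo/pkgs/linked.so" "$demo/env/linked.so"

# An 8KB cached file that nothing links to (hypothetical name).
dd if=/dev/zero of="$demo/pkgs/archive.tar.bz2" bs=1024 count=8 2>/dev/null

unshared=$(count_unshared_bytes "$demo/pkgs")
echo "bytes only the cache holds: $unshared"
```

A large number here would suggest that much of the cache is extra data rather than hardlinked files, which would be consistent with the image shrinking when it is removed.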

@consideRatio
Member

I'm out of my depth on this. For starters, I don't understand hardlinks yet. I need to read up :p

@betatim
Member Author

betatim commented May 4, 2019

I think we should merge this, track the discussion in jupyter/docker-stacks#861, and then make a new PR to implement the things learned from the discussion there. Alternatively, if someone who is confident they understand how all this works chimes in, we can implement "the right thing" straight away :)

@parente
Member

parente commented May 4, 2019

The section "Linking packages from package cache into environments" on https://www.anaconda.com/understanding-and-improving-condas-performance/ explains the conda behavior.

> So I think it is safe to delete the conda package cache at the end of each layer in repo2docker as we can't use it as a source for a hardlink in the next layer anyway.

I believe this is correct after reading the above documentation and the results of your experiment.

@consideRatio
Member

It is my understanding that the added rm -rf /srv/conda/pkgs step does what -f does, so I vote for using --all -f straight away.

@betatim
Member Author

betatim commented May 5, 2019

Using --all -f everywhere now.

@consideRatio
Member

consideRatio commented May 5, 2019

Hmmm... One R pytest failed in build 1407

E                           docker.errors.BuildError: The command '/bin/sh -c R --quiet -e "install.packages('devtools', repos='https://mran.microsoft.com/snapshot/2018-02-01', method='libcurl')" && R --quiet -e "devtools::install_github('IRkernel/IRkernel', ref='0.8.11')" && R --quiet -e "IRkernel::installspec(prefix='$NB_PYTHON_PREFIX')"' returned a non-zero code: 1

@consideRatio
Member

consideRatio commented May 5, 2019

And now both stencila-r and R pytest failed in build 1408:

stencila-r logs

In install.packages("devtools", repos = "https://mran.microsoft.com/snapshot/2018-02-01",  :
  installation of package ‘httr’ had non-zero exit status
> 
> 
> devtools::install_github('IRkernel/IRkernel', ref='0.8.11')
Error in loadNamespace(name) : there is no package called ‘devtools’
Calls: :: ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
Removing intermediate container ee8b36183ca1
The command '/bin/sh -c R --quiet -e "install.packages('devtools', repos='https://mran.microsoft.com/snapshot/2018-02-01', method='libcurl')" && R --quiet -e "devtools::install_github('IRkernel/IRkernel', ref='0.8.11')" && R --quiet -e "IRkernel::installspec(prefix='$NB_PYTHON_PREFIX')"' returned a non-zero code: 1

@betatim
Member Author

betatim commented May 5, 2019

Restarted both failed builds. They have been failing over the last few days due to (I think) networking problems or slow servers. So for now I am just hitting "restart"; if this keeps happening, we need to investigate how to make them more resilient.

@yuvipanda
Collaborator

Checks pass now!

@consideRatio consideRatio merged commit fb5c4ef into jupyterhub:master May 5, 2019
@betatim betatim deleted the clean-up-during-conda-install branch May 5, 2019 21:10
markmo pushed a commit to markmo/repo2docker that referenced this pull request Jan 22, 2021
…install

[MRG] Remove the conda package cache as we can't hardlink to it