
conda clean command - what flags should we use? #861

Closed
consideRatio opened this issue May 4, 2019 · 22 comments
Labels
type:Maintenance A proposed enhancement to how we maintain this project

Comments

@consideRatio
Collaborator

consideRatio commented May 4, 2019

@echowhisky noted that conda clean with the -f or --force-pkgs-dirs flag can significantly reduce image sizes, but we do not fully grasp the consequences of doing so. I created this issue to gather discussion about this and about what flags we should pass to the command!

I found the following documentation about the flag:
https://conda.io/projects/conda/en/latest/commands/clean.html#Removal%20Targets

In the docs they write:

Remove all writable package caches. This option is not included with the --all flag. WARNING: This will break environments with packages installed using symlinks back to the package cache.

The concrete question we need to answer is: what flags do we pass to the clean command? The flags determine what gets cleaned up.

Currently we are using the -tipsy flag, but I fail to find any documentation about it; I think it is deprecated in favor of other flags. So we should probably update this no matter what.

Usage of conda clean building up to datascience-notebook

Base notebook

conda clean -tipsy && \
    rm -rf /home/$NB_USER/.cache/yarn && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
# Install Tini
RUN conda install --quiet --yes 'tini=0.18.0' && \
    conda list tini | grep tini | tr -s ' ' | cut -d ' ' -f 1,2 >> $CONDA_DIR/conda-meta/pinned && \
    conda clean -tipsy && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
# Install Jupyter Notebook, Lab, and Hub
# Generate a notebook server config
# Cleanup temporary files
# Correct permissions
# Do all this in a single RUN command to avoid duplicating all of the
# files across image layers when the permissions change
RUN conda install --quiet --yes \
    'notebook=5.7.8' \
    'jupyterhub=0.9.6' \
    'jupyterlab=0.35.5' && \
    conda clean -tipsy && \

Minimal notebook

None

Scipy notebook

conda clean -tipsy && \

Datascience notebook

conda clean -tipsy && \

UPDATE

Whoops, so tipsy is simply the combined list of flags -t -i -p -s -y.

  • -t or --tarballs
  • -i or --index-cache
  • -p or --packages
  • -s or ... eh that was not documented, it probably relates to --source-cache and this issue
  • -y or --yes

Then I ask myself.

  1. Should we also include the -l or --lock flag that is part of the --all flag?
  2. Should we also include the -f or --force-pkgs-dirs flag that is not part of the --all flag?
  3. Should we remove the -s or --source-cache flag, which is deprecated?
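For clarity, here is my understanding of the current command spelled out in long form, alongside the candidate replacement (hedged: based on the conda 4.x CLI help, not verified against every conda version):

```shell
# Current invocation, with -tipsy expanded to long options:
conda clean --tarballs --index-cache --packages --source-cache --yes

# Candidate replacement: everything covered by --all, plus force-removal
# of writable package caches (-f), which is NOT included in --all:
conda clean --all --force-pkgs-dirs --yes
```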
@echowhisky
Contributor

echowhisky commented May 4, 2019

You beat me to it. I had planned on making this issue, and had also noticed the -s flag seems deprecated. It's likely that conda clean --all -y is what is wanted, but it's worth testing that against including -f as well and seeing how it goes.

@parente
Member

parente commented May 4, 2019

IIRC, @jakirkham and I set up some of the initial clean commands a long time ago, probably when we were still on conda 3.x? I have little doubt that there are better options today.

#406 has been open for some time as a thought about how to test the behavior of the various packages and language kernels. Maybe putting some lightweight notebook tests in place before making the change would provide some confidence that the changes are not detrimental? Or maybe that's too much work. Perhaps some simple unit test additions against the base container to ensure that a package can be imported (e.g., jupyter) and a new package can be installed and then imported in the built image would be enough?

@consideRatio
Collaborator Author

Perhaps we could add some tests to the datascience-notebook where we do something similar to the test below from base-notebook, but run a command that verifies various import statements work.

def test_uid_change(container):
    """Container should change the UID of the default user."""
    c = container.run(
        tty=True,
        user='root',
        environment=['NB_UID=1010'],
        command=['start.sh', 'bash', '-c', 'id && touch /opt/conda/test-file']
    )
    # usermod is slow so give it some time
    c.wait(timeout=120)
    assert 'uid=1010(jovyan)' in c.logs(stdout=True).decode('utf-8')

echowhisky added a commit to echowhisky/docker-stacks that referenced this issue May 4, 2019
Following the discussion in issue jupyter#861, this changes the `conda clean`
flags.
@echowhisky
Contributor

With the -tipsy flags, because of -s, the following warning shows up in the build output for the base container.

WARNING: 'conda clean --source-cache' is deprecated.
Use 'conda build purge-all' to remove source cache files.

The problem is conda build isn't installed, and I'm not sure it makes sense to install something extra to try to reduce size.

I have a branch for this issue now with the initial changes. I'll have an additional commit to test the -f option, but for now wanted to validate just the --all flag as a replacement for -tipsy. It builds to exactly the same size in my testing.

@consideRatio
Collaborator Author

@echowhisky I reduced my 3.6 GB image that is a bloated datascience-notebook image built from scratch to 3.4 GB by going for --all -f as command line arguments instead of -tipsy.

@parente parente added the type:Maintenance A proposed enhancement to how we maintain this project label May 4, 2019
@consideRatio
Collaborator Author

consideRatio commented May 4, 2019

@parente got an idea on how to test various Python import statements as part of a test? If I could get a bit more confident about the right way to go about it, I'd be happy to add such a test along with a PR to change the conda clean command parameters.

Furthermore, would import statements for various packages be a good enough test to catch breakage? I'm not confident about these things at all...

@betatim - it is my understanding from what you said, without technically understanding it, that you think it would make sense to have the -f flag for docker images, is this correct?

@consideRatio
Collaborator Author

My docker image built with --all -f shrank in file size while remaining functional for various packages with lots of dependencies.

@parente
Member

parente commented May 4, 2019

@consideRatio If you're planning on writing a pytest, you could mimic what we do in the base notebook tests to launch a docker container and run ipython -c "import whatever" or the equivalent to test what you want. If you want to take the approach in #406, you'd write a notebook and we'd add some mechanism to invoke nbconvert with the execute flag to run the notebook top to bottom and error if any cell failed.

@consideRatio
Collaborator Author

@parente hmmm, and the #406 approach would be to do something within the Makefile? I wonder how we would make the notebook accessible to the container, whether we use Makefile machinery or pytest.

Can you recommend an approach for me? ;)

echowhisky added a commit to echowhisky/docker-stacks that referenced this issue May 4, 2019
This commit adds the additional `-f` force command to all uses of `conda
clean --all` through the repo. Size should be smaller, but still testing
if anything breaks. See issue jupyter#861.
@parente
Member

parente commented May 4, 2019

@consideRatio I would start with a single pytest setup like https://github.com/jupyter/docker-stacks/blob/master/base-notebook/test/test_container_options.py#L8 where the test body is something like the following pseudocode:

def test_pandas(container):
    c = container.run(
        command=['start.sh', 'ipython', '-c', 'import pandas']
    )
    rv = c.wait(timeout=5)
    assert rv == 0 or rv["StatusCode"] == 0

def test_conda_install(container):
    """Should install a conda library and then import it without error"""
    # exercise for the reader :P

This page in the doc (https://jupyter-docker-stacks.readthedocs.io/en/latest/contributing/tests.html) describes where to put the test file so that it executes against the correct image. You'll need to create a pipenv/venv/conda env containing the contents of the requirements-dev.txt file at the root of the project in order to run the tests locally.
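As a container-free analogue of the import test above (a sketch only; the real test would go through the repo's `container` fixture and `start.sh`), the core check is simply that a fresh interpreter exits 0 when the import succeeds:

```python
import subprocess
import sys

def can_import(module: str) -> bool:
    """Return True if a fresh interpreter can import `module`.

    Mirrors what the container test checks via `ipython -c`:
    a failed import makes the process exit non-zero.
    """
    rv = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        stderr=subprocess.DEVNULL,  # hide the traceback on failure
    )
    return rv.returncode == 0

print(can_import("json"))                 # stdlib module: True
print(can_import("no_such_module_xyz"))   # absent module: False
```

The same exit-code convention is what `assert rv == 0 or rv["StatusCode"] == 0` relies on in the container version.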

@betatim
Member

betatim commented May 4, 2019

@betatim - it is my understanding from what you said, without technically understanding it, that you think it would make sense to have the -f flag for docker images, is this correct?

We have been doing what I think is the equivalent of -f in repo2docker for a few weeks(?) now, and so far no one has complained. This means that, at least for repo2docker, I think we will keep deleting the package directory with the explicit rm command we have. Eventually we should switch to --all -y -f, but it isn't urgent. However, take into account that my last comment, after poking around docker images, layers, and hard-links, concluded that there is something I don't understand, because the world is inconsistent: hard-links work within a layer, but deleting one link reduces the image size, which shouldn't happen, because I think all the files affected here have at least two links to them.

@jakirkham
Member

I would not be surprised if there is a bug in how Docker is reporting hard-links in image size. This has happened before ( moby/moby#9283 ).

@consideRatio
Collaborator Author

I finally learned more about hard-links: https://en.wikipedia.org/wiki/Hard_link

@echowhisky
Contributor

For comparison, in the base-notebook container run without -f the /opt/conda/ directory has the following sizes:

jovyan@6e6dbadc1fe3:/opt/conda$ du -sch *      
182M    bin                                    
2.4M    compiler_compat                        
8.0K    condabin                               
7.3M    conda-meta                             
4.0K    envs                                   
44K     etc                                    
13M     include                                
312M    lib                                    
8.0K    LICENSE.txt                            
56K     man                                    
135M    pkgs                                   
412K    sbin                                   
3.5M    share                                  
12K     shell                                  
24K     ssl                                    
12K     x86_64-conda_cos6-linux-gnu            
655M    total                                                                            

And then when I built the same container with the -f flag:

jovyan@145c591d5eea:/opt/conda$ du -sch *                     
182M    bin                                                   
2.4M    compiler_compat                                       
8.0K    condabin                                              
7.3M    conda-meta                                            
4.0K    envs                                                  
44K     etc                                                   
13M     include                                               
312M    lib                                                   
8.0K    LICENSE.txt                                           
56K     man                                                   
796K    sbin                                                  
39M     share                                                 
24K     shell                                                 
336K    ssl                                                   
12K     x86_64-conda_cos6-linux-gnu                           
556M    total        

Even though the pkgs directory is gone in the second one, some of the other directories appear to increase in size (sbin, share, ssl). They didn't really. This is due to hard-links. du is smart enough to only count a hard-linked file once when summarizing a directory. In the first example, it found the hard-linked files in pkgs first and counted their size, ignoring it as part of the summaries for the later directories. With pkgs gone in the second instance, du counts them where it finds them causing an appearance of size increase.

Not sure any of this is super important, but it was a hard-link-related artifact that tripped me up at first.
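The du behavior described above can be reproduced with a tiny sketch (hedged: this imitates du's inode-deduplication in Python rather than calling du itself, uses apparent file sizes instead of disk blocks, and needs a filesystem that supports hard links; the file name and size are made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Stand-ins for /opt/conda/pkgs and /opt/conda/lib
    pkgs = os.path.join(root, "pkgs")
    lib = os.path.join(root, "lib")
    os.mkdir(pkgs)
    os.mkdir(lib)

    # One 4096-byte file in pkgs, hard-linked into lib
    src = os.path.join(pkgs, "libfoo.so")
    with open(src, "wb") as f:
        f.write(b"\0" * 4096)
    os.link(src, os.path.join(lib, "libfoo.so"))

    def du(dirs):
        """Sum file sizes per directory, counting each inode only once
        across the whole scan -- the way `du` deduplicates hard links."""
        seen, totals = set(), {}
        for d in dirs:
            size = 0
            for entry in os.scandir(d):
                st = entry.stat()
                if st.st_ino not in seen:
                    seen.add(st.st_ino)
                    size += st.st_size
            totals[d] = size
        return totals

    before = du([pkgs, lib])   # pkgs is scanned first, so it "owns" the bytes
    os.remove(src)             # simulate `conda clean -f` wiping pkgs
    after = du([pkgs, lib])    # the same bytes now show up under lib

    print(before[pkgs], before[lib])   # 4096 0
    print(after[pkgs], after[lib])     # 0 4096
```

So directories like sbin and share never grew; du just started attributing the shared inodes to them once pkgs was gone.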

In the end, using -f makes the size of the base-notebook container (on disk, not the compressed container size) about 100M less. That seems good.

The issues with the R environment failing seem to have resolved themselves, but if they turn back up, they might be fixable by using the --copy flag with conda install, in case they are caused by some soft-linking issue.

I've updated my fork's commits (167c686) to be in sync with the latest updates to the master (2662627) but haven't submitted a PR yet, waiting to see if there's additional testing coming. If I can get my head around it a bit better and carve out the time, I'll try to contribute to the testing pieces as well.

@jakirkham
Member

@echowhisky, do you think we could raise a simplified version of this to the Docker team as an issue and see if they have any thoughts?

@parente
Member

parente commented May 6, 2019

@echowhisky @consideRatio Given what we found in repo2docker, I think it's OK to proceed with the -f addition without waiting for tests.

@betatim
Member

betatim commented May 6, 2019

@jakirkham check out jupyterhub/repo2docker#666 (comment) for a "minimal" example on hard-links. Seems like docker images reports the numbers I'd expect it to report.

I think I read somewhere in the conda docs that it will try to hard-link what it can from pkgs, but that not everything in that directory can be hard-linked; the rest will be copied or symlinked. So it could well be that cleaning out that directory "only" saves 100MB.

@echowhisky
Contributor

echowhisky commented May 6, 2019

@jakirkham - the info I quoted above wasn't a docker issue, just the normal reduction from tweaking the conda clean flags and standard behavior of hard-links. I haven't tested beyond the base-notebook container for size savings, but went ahead and submitted the PR.

There are very likely some additional size savings that can be squeezed out, possibly using some other techniques like cleaning out all of the compiled python (.pyc) files and some of the other pieces mentioned in the conversation you referenced, but that probably deserves its own issue thread.

@parente
Member

parente commented May 12, 2019

If the Docker Hub tag size reports are to be believed, PR #867 caused the following changes in image size, starting in tag 4d7dd95017ed:

image                  new_size_mb  old_size_mb  delta_mb
all-spark-notebook         2145.16      2145.16     0
base-notebook               216.035      259.499  -43.4638
datascience-notebook       2047.63      2115.08   -67.4586
minimal-notebook           1040.8       1083.59   -42.7885
pyspark-notebook           1809.2       1809.2      0
r-notebook                 1472.44      1530.85   -58.4122
scipy-notebook             1384.29      1444.94   -60.6465
tensorflow-notebook        1629.26      1558.54   +70.7198

The Spark images have not changed in size because they failed to rebuild on Docker Hub (#871 is addressing this). I don't have an explanation for how the tensorflow image increased in size, but maybe there's evidence in the build manifests (https://github.com/jupyter/docker-stacks/wiki/tensorflow-notebook-4d7dd95017ed vs https://github.com/jupyter/docker-stacks/wiki/tensorflow-notebook-2662627f26e0).

I used this notebook / binder to fetch the stats from Docker Hub.

@consideRatio
Collaborator Author

consideRatio commented May 12, 2019

Thanks for the summary @parente ! Closing time?

No clue about how tensorflow-notebook increased in size; perhaps some package got itself a version bump?

@parente
Member

parente commented May 12, 2019

Thanks for the discussion and work put in here, folks. Less is more FTW!

huweiATgithub added a commit to huweiATgithub/deep that referenced this issue Jul 31, 2019
change flags on "conda clean" from "-tipsy" to "--all -f -y". See jupyter/docker-stacks#861.
thaume pushed a commit to metriqteam/docker-stacks that referenced this issue Mar 8, 2020
Following the discussion in issue jupyter#861, this changes the `conda clean`
flags.
thaume pushed a commit to metriqteam/docker-stacks that referenced this issue Mar 8, 2020
This commit adds the additional `-f` force command to all uses of `conda
clean --all` through the repo. Size should be smaller, but still testing
if anything breaks. See issue jupyter#861.
@jmdelahanty

Hello everyone! I just discovered this thread after running into the conda-build suggestion in my terminal. I wanted to message here to double-check that using the --all -y flags is now best practice over the -tipsy flags.
