
Optimize Dockerfile #674

Merged: 1 commit merged into docker:master on Nov 29, 2016

Conversation

thaJeztah
Member

This optimizes the Dockerfile a bit by:

  • putting the docs archives at the top to optimize caching and to
    prevent having to clone the GitHub repository on each build. Note
    that '--no-cache' is needed to forcefully break the cache, but the
    archives should not change frequently
  • grouping RUN lines to reduce image size (adds/deletes happen in the same layer).
  • using a loop for the archived versions to reduce the amount of
    duplicated code (see the sketch further down).
  • using the local files for the current version of the docs instead
    of a git clone from GitHub. This also makes the build use the right source
    instead of "master"
  • adding a .dockerignore to prevent busting the cache when not needed,
    and to prevent uploading the '.git' repository, which is not used for
    the "current" docs

Difference in size before/after:

REPOSITORY     TAG     IMAGE ID      CREATED         SIZE
docs           latest  36f6ad029e6a  3 minutes ago   1.722 GB
docs-orig      latest  4f1a3e3fda4f  16 minutes ago  3.344 GB
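
To make the list above concrete, here is a rough sketch of the resulting shape (not the actual Dockerfile from this PR; the base image, archive URL, dependency install, and version list are placeholders):

    # Sketch only: base image, archive URL, and version list are placeholders.
    FROM ruby:2.3

    RUN gem install github-pages    # stand-in for however the site's dependencies are installed

    # Archived versions first, in a single RUN with a loop: the archives rarely change,
    # so these layers stay cached, and download + extract + cleanup share one layer.
    RUN set -e; \
        for VERSION in v1.10 v1.11; do \
            wget -q -O "/tmp/docs-${VERSION}.tar.gz" "https://example.com/docs-${VERSION}.tar.gz"; \
            mkdir -p "/usr/src/app/${VERSION}"; \
            tar -xzf "/tmp/docs-${VERSION}.tar.gz" -C "/usr/src/app/${VERSION}"; \
            rm "/tmp/docs-${VERSION}.tar.gz"; \
        done

    # The *current* docs come from the build context (the commit being built),
    # not from a fresh git clone of master; .dockerignore keeps '.git' out of the context.
    COPY . /usr/src/app
    WORKDIR /usr/src/app

    EXPOSE 4000
    CMD ["jekyll", "serve", "--host", "0.0.0.0", "--port", "4000"]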

/cc @johndmulhausen PTAL, I built this and I think everything works as before, but let me know if there are issues with this change

@johndmulhausen

johndmulhausen commented Nov 21, 2016

Deploy preview ready!

Built with commit 721f9d2

https://deploy-preview-674--docker-docs.netlify.com

@johndmulhausen

johndmulhausen commented Nov 21, 2016

@thaJeztah
It certainly seems to finish its business quickly, but there are a few things:

  1. If I run these two commands with your proposed Dockerfile in the current directory:
docker build -t newdocs .
docker run -ti -p 4000:4000 newdocs

And then visit http://localhost:4000...

I do not get served the docs. :( I get a directory listing that only seems to have the svn/wget downloads showing:

Index of /

Name              Last modified       Size
Parent Directory  2016/11/21 22:36    -
Dockerfile        2016/11/21 22:28    1408
engine/           2016/11/21 22:36    -

WEBrick/1.3.1 (Ruby/2.3.3/2016-11-21) at 0.0.0.0:4000
  2. This is the Dockerfile that runs in production, and it is absolutely correct that it pulls from master and uses the checked-in files rather than anything local; it should pull the latest from GitHub every time. That goes for the archives as well.

@thaJeztah
Member Author

I do not get served the docs. :( I get a directory listing that only seems to have the svn/wget downloads showing:

Hm, that's weird, let me try again

it should pull the latest from GitHub every time.

For the current version of the docs: in the old situation, Docker Cloud does a checkout of the repository at commit X. It then builds the Dockerfile, which does a checkout of the GitHub repository at commit "X + random", so there's no relation whatsoever between the commit being built from and the docs that are being published.

That goes for the archives as well.

Afaik, Docker Cloud / Docker Hub automated builds never use caching, so they will pull the latest version of the archives every time.

@sanscontext
Contributor

Afaik, Docker Cloud / Docker Hub automated builds never use caching, so they will pull the latest version of the archives every time.

Caching is available in Docker Cloud builds now, FWIW. :) (Recently, as of the 8th.)

@thaJeztah
Member Author

@sanscontext ah, thanks! In that case, we need to make sure to disable it (because there's no way to detect changes in the remote repository from the Dockerfile, we should always assume the cache is invalid)

@thaJeztah
Member Author

@johndmulhausen I could not reproduce your issue. I tried building again and the current docs worked for me; however, the archives redirected to docs.docker.com. I changed the configuration so that building the current docs doesn't remove the archive (which is now in the output directory)

@bfirsh
Contributor

bfirsh commented Nov 22, 2016

Just adding a +1 for the approach of copying code from the build context rather than pulling master from somewhere remote. This is the standard way of building Docker images.

The point of building images like this is so that you can consistently turn some code into an image with that code inside it. If you're pulling from master, you have no control over what code goes inside your image.

Docker Cloud will pull the latest code, then build an image with that latest code in the build context. So, when Docker Cloud is building from master, a git clone of github.com/docker/docker.github.io and COPY . /... are equivalent.

However, the advantage of doing COPY . /... is that you can build images which aren't github.com/docker/docker.github.io#master – different branches, code from your local checkout, and so on.
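
To illustrate the difference (the path is illustrative, not the exact one used in this Dockerfile):

    # Old pattern: the image contents depend on whatever master happens to be at build time.
    RUN git clone https://github.com/docker/docker.github.io.git /usr/src/app

    # New pattern: the image contents are exactly the commit in the build context.
    COPY . /usr/src/app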

@johndmulhausen

Okay, gotcha re: using the local files vs. re-cloning; this does seem a lot faster if I pull the whole PR down and run it with the docs in place alongside the Dockerfile.

So how do we ensure that caching is disabled in Hub and Cloud's autobuild? If we're sure that caching won't happen, then I'm reasonably comfortable proceeding. @sanscontext ?

@bfirsh
Contributor

bfirsh commented Nov 23, 2016

You can disable this in the build settings:

[screenshot of the build settings, 2016-11-23]

@mdlinville left a comment

+1 to this approach! I forgot that you could use && to chain multiple commands together in a RUN invocation. Does this have any performance benefits? What happens if one of the commands fails? I assume it short-circuits the rest of the script and the docker build will fail.

I don't know if we document this for loop trick anywhere, but we definitely should!
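
For reference, a grouped RUN looks roughly like the following (package names are just an example). The shell stops at the first failing command, that RUN instruction returns a non-zero exit code, and docker build aborts, so nothing after the failure runs:

    RUN apt-get update \
     && apt-get install -y --no-install-recommends subversion wget \
     && rm -rf /var/lib/apt/lists/*
    # The main benefit is size rather than speed: files that are added and deleted
    # within the same RUN never get persisted into a separate layer.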

@bfirsh
Contributor

bfirsh commented Nov 24, 2016

Oh – even better, the Dockerfile could specify a commit to use from docker/docker, and changes can be vendored manually. That way, it'll always bust/use caches correctly, and a particular commit from docker/docker.github.io always builds exactly the same set of documentation. That's rather useful for keeping track of/rolling back deploys, and so on.
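
A minimal sketch of what that pinning could look like (the commit value and paths below are placeholders, not something this PR actually does):

    # Pin the engine docs to a known commit; the cache only busts when the pin changes.
    ENV ENGINE_DOCS_COMMIT=0123456789abcdef0123456789abcdef01234567
    RUN git clone https://github.com/docker/docker.git /tmp/docker \
     && git -C /tmp/docker checkout ${ENGINE_DOCS_COMMIT} \
     && mkdir -p /usr/src/app \
     && cp -r /tmp/docker/docs /usr/src/app/engine \
     && rm -rf /tmp/docker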

@thaJeztah
Member Author

thaJeztah commented Nov 24, 2016

We had the specific commit in the previous docs, so I'm not sure that's
what the team wants to have again.

I think ideally we should have a separate container for each archived
version of the docs.

There is no reason to rebuild each archive version if only "master"
(or another branch) changes, and by putting them in separate images,
a trigger can be set on each branch to only rebuild those images if an
actual change is committed to these branches.

Something like;

  • master -> docs/docs.docker.com:latest + docs/docs.docker.com:v1.12
  • v1.11 -> docs/docs.docker.com:v1.11
  • v1.10 -> docs/docs.docker.com:v1.10
  • etc.

We can have a Dockerfile and a Dockerfile.deploy in master; the
Dockerfile is optimized for contributors: it builds fast, (e.g.)
has --watch enabled to allow live editing / preview, sets "no-index",
does not enable (Google) Analytics scripts, etc.

The Dockerfile.deploy is optimized for publication, and may have
additional configuration set (indexing enabled, Analytics, minified
scripts, images, etc).
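
For the contributor-oriented Dockerfile, the serve step could hypothetically look like this (the split into two files is only a proposal at this point):

    # Contributor image: rebuild on change for live editing/preview; indexing and
    # analytics would be switched off through the Jekyll configuration.
    CMD ["jekyll", "serve", "--watch", "--host", "0.0.0.0", "--port", "4000"]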

We can use a simple proxy to route archive URLs to the correct
archive container:

  server {
      listen 80;
      server_name docs.docker.com;
      location / {
        proxy_pass http://docs-current;
      }
      location /v1.11/ {
        proxy_pass http://docs-archive-1-11;
      }
      location /v1.10/ {
        proxy_pass http://docs-archive-1-10;
      }
      # or, matching the minor version with a regex:
      location ~ ^/v1\.([0-9]+)/ {
        proxy_pass http://docs-archive-1-$1;
      }
  }

@mdlinville

The reason we don't have a container per Dockerfile is that everyone wanted to be able to have archive folders. Before, we were trying to route them to different containers by port. It was ugly. I don't think we want to go back to that. @johndmulhausen ?

@johndmulhausen

A few things:

  1. I do not have the ability to disable caching for repo-connected autobuilds in Hub, because they're auto-wired to just pull the repo contents at build time, so that looks to be all good.
  2. We are already auto-building images of each of the archive versions, as well as any releases we tag. For example, we just released the Datacenter docs, which you can see with
docker run -ti -p 4000:4000 docs/docker.github.io:ucp2.0.1-dtr2.1.0

Change that tag to v1.4 etc and there you have it.

However, using these images in production was problematic for various reasons, the most important being that infra specifically told us it was not a good idea to use a proxy to forward to those container endpoints, because each endpoint is a full Ruby/Jekyll stack. That creates quite a big RAM footprint once you stack up 8 or 9 instances (each about 350 MB at runtime), and since there is no end in sight for new versions, it becomes needlessly over-engineered quickly.

It would be one thing if the end result of this were discrete services meant to run in parallel and we were a "real application" but we're just a static site trying to have versioned folders, which in infra's opinion isn't the best use case for a distributed-system-style, multi-container architecture. It was their recommendation, which I agreed with, that at runtime we should just have one server instance serving up a concatenated-together site that had everything in it. This also simplified the deployment/image story.

This all said, I think there is probably even more slimming we can do to this image, and I would always be happy to see someone step in and implement such a solution. But for now, I think I'm ready to give this a shot and see how it goes. :)

Thanks @thaJeztah @bfirsh @mstanleyjones -- let's merge and see how this goes!

@johndmulhausen johndmulhausen merged commit eebf584 into docker:master Nov 29, 2016
@johndmulhausen johndmulhausen self-assigned this Nov 29, 2016
@thaJeztah thaJeztah deleted the optimize-dockerfile branch November 29, 2016 00:54
@johndmulhausen

Note: The build time before this PR was merged was 38 minutes (!). After it was merged: 14 minutes. Definite huge improvement -- thanks so much to @thaJeztah !

@bfirsh
Contributor

bfirsh commented Nov 29, 2016

@johndmulhausen About the memory footprint - can't the docs be served by nginx or similar in production, or does it depend on the Jekyll server for some reason?

@thaJeztah
Member Author

I was thinking the same yesterday; once the docs are built, they're just static files, so any server can be used (even a simple static server, such as https://github.com/mholt/caddy)
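
A minimal sketch of that idea, assuming the Jekyll output ends up in the default _site directory (image and paths are illustrative):

    # Serve the pre-built static site with a small web server instead of running
    # Jekyll/WEBrick at runtime.
    FROM nginx:alpine
    COPY _site/ /usr/share/nginx/html/
    EXPOSE 80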

@johndmulhausen

That's right, any server can, though we've stuck with the GitHub Pages gem for compatibility w/GitHub Pages and simplicity (bifurcating the build and authoring image structures, just to get the memory footprint down, just to shoehorn in multi-container architecture for a static site, just to save a few minutes of deploy time -- wasn't a super-compelling use case for me).

Staying within the lanes of a widely-adopted authoring package kept things easy when constructing the solution. Optimization, where the devil is in the details, is a whole other matter. I personally don't have much time for or interest in pursuing a multi-container solution for the archives, but I also always welcome contributors who want to help. If this is something that interests you, ping me privately and I'll get you set up.

A big bonus to me in all this is that our build/production systems are auditable enough for more people to pitch in, as you have so wonderfully here. 👍
