
Optimize Dockerfile #674

Merged: 1 commit merged into docker:master on Nov 29, 2016

Conversation

thaJeztah
Member

This optimizes the Dockerfile a bit by:

  • putting the docs archives at the top to optimize caching and to
    prevent having to clone the GitHub repository on each build. Note
    that '--no-cache' is needed to forcefully break the cache, but the
    archives should not change frequently
  • grouping RUN lines to reduce image size (adds/deletes happen in the same layer).
  • using a loop for the archived versions to reduce the amount of
    duplicated code (see the sketch further down).
  • using the local files for the current version of the docs instead
    of a git clone from GitHub. This also makes the build use the right source
    instead of "master"
  • adding a .dockerignore to prevent busting the cache when not needed,
    and to prevent uploading the '.git' repository, which is not used for
    the "current" docs

Difference in size before/after:

REPOSITORY     TAG     IMAGE ID      CREATED         SIZE
docs           latest  36f6ad029e6a  3 minutes ago   1.722 GB
docs-orig      latest  4f1a3e3fda4f  16 minutes ago  3.344 GB
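
To make the list above concrete, here is a rough sketch of the resulting shape (not the actual Dockerfile from this PR; the base image, archive URL, dependency install, and version list are placeholders):

    # Sketch only: base image, archive URL, and version list are placeholders.
    FROM ruby:2.3

    RUN gem install github-pages    # stand-in for however the site's dependencies are installed

    # Archived versions first, in a single RUN with a loop: the archives rarely change,
    # so these layers stay cached, and download + extract + cleanup share one layer.
    RUN set -e; \
        for VERSION in v1.10 v1.11; do \
            wget -q -O "/tmp/docs-${VERSION}.tar.gz" "https://example.com/docs-${VERSION}.tar.gz"; \
            mkdir -p "/usr/src/app/${VERSION}"; \
            tar -xzf "/tmp/docs-${VERSION}.tar.gz" -C "/usr/src/app/${VERSION}"; \
            rm "/tmp/docs-${VERSION}.tar.gz"; \
        done

    # The *current* docs come from the build context (the commit being built),
    # not from a fresh git clone of master; .dockerignore keeps '.git' out of the context.
    COPY . /usr/src/app
    WORKDIR /usr/src/app

    EXPOSE 4000
    CMD ["jekyll", "serve", "--host", "0.0.0.0", "--port", "4000"]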

/cc @johndmulhausen PTAL, I built this and I think everything works as before, but let me know if there are issues with this change

@johndmulhausen

johndmulhausen commented Nov 21, 2016

Deploy preview ready!

Built with commit 721f9d2

https://deploy-preview-674--docker-docs.netlify.com

@johndmulhausen

johndmulhausen commented Nov 21, 2016

@thaJeztah
It certainly seems to finish its business quickly, but there are a few things:

  1. If I run these two commands with your proposed Dockerfile in the current directory:
docker build -t newdocs .
docker run -ti -p 4000:4000 newdocs

And then visit http://localhost:4000...

I do not get served the docs. :( I get a directory listing that only seems to have the svn/wget downloads showing:

Index of /

Name              Last modified       Size
Parent Directory  2016/11/21 22:36    -
Dockerfile        2016/11/21 22:28    1408
engine/           2016/11/21 22:36    -

WEBrick/1.3.1 (Ruby/2.3.3/2016-11-21) at 0.0.0.0:4000
  2. This is the Dockerfile that runs in production, and it is absolutely correct that it pulls from master and uses the checked-in files rather than anything local; it should pull the latest from GitHub every time. That goes for the archives as well.

@thaJeztah
Member Author

I do not get served the docs. :( I get a directory listing that only seems to have the svn/wget downloads showing:

Hm, that's weird, let me try again

it should pull the latest from GitHub every time.

For the current version of the docs: in the old situation, Docker Cloud does a checkout of the repository at commit X. It then builds the Dockerfile, which does a checkout of the GitHub repository at commit "X + random", so there's no relation whatsoever between the commit being built from and the docs that are being published.

That goes for the archives as well.

Afaik, Docker Cloud / Docker Hub automated builds never use caching, so they will pull the latest version of the archives every time.

@sanscontext
Contributor

Afaik, Docker Cloud / Docker Hub automated builds never use caching, so they will pull the latest version of the archives every time.

Caching is available in Docker Cloud builds now, FWIW. :) (Recently, as of the 8th.)

@thaJeztah
Member Author

@sanscontext ah, thanks! In that case, we need to make sure to disable it (because there's no way to detect changes in the remote repository from the Dockerfile, we should always assume the cache is invalid)

@thaJeztah
Member Author

@johndmulhausen I could not reproduce your issue. I tried building again and the current docs worked for me; however, the archives redirected to docs.docker.com. I changed the configuration so that building the current docs doesn't remove the archive (which is now in the output directory)

@bfirsh
Contributor

bfirsh commented Nov 22, 2016

Just adding a +1 for the approach of copying code from the build context rather than pulling master from somewhere remote. This is the standard way of building Docker images.

The point of building images like this is so that you can consistently turn some code into an image with that code inside it. If you're pulling from master, you have no control over what code goes inside your image.

Docker Cloud will pull the latest code, then build an image with that latest code in the build context. So, when Docker Cloud is building from master, a git clone of github.com/docker/docker.github.io and COPY . /... are equivalent.

However, the advantage of doing COPY . /... is that you can build images which aren't github.com/docker/docker.github.io#master – different branches, code from your local checkout, and so on.
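
To illustrate the difference (the path is illustrative, not the exact one used in this Dockerfile):

    # Old pattern: the image contents depend on whatever master happens to be at build time.
    RUN git clone https://github.com/docker/docker.github.io.git /usr/src/app

    # New pattern: the image contents are exactly the commit in the build context.
    COPY . /usr/src/app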

@johndmulhausen

Okay, gotcha re: using the local files vs. re-cloning; this does seem a lot faster if I pull the whole PR down and run it with the docs in place alongside the Dockerfile.

So how do we ensure that caching is disabled in Hub and Cloud's autobuild? If we're sure that caching won't happen, then I'm reasonably comfortable proceeding. @sanscontext ?

@bfirsh
Contributor

bfirsh commented Nov 23, 2016

You can disable this in the build settings:

[screenshot of the build settings, 2016-11-23]

@mdlinville left a comment

+1 to this approach! I forgot that you could use && to chain multiple commands together in a RUN invocation. Does this have any performance benefits? What happens if one of the commands fails? I assume it short-circuits the rest of the script and the docker build will fail.

I don't know if we document this for loop trick anywhere, but we definitely should!
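
For reference, a grouped RUN looks roughly like the following (package names are just an example). The shell stops at the first failing command, that RUN instruction returns a non-zero exit code, and docker build aborts, so nothing after the failure runs:

    RUN apt-get update \
     && apt-get install -y --no-install-recommends subversion wget \
     && rm -rf /var/lib/apt/lists/*
    # The main benefit is size rather than speed: files that are added and deleted
    # within the same RUN never get persisted into a separate layer.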

@bfirsh
Contributor

bfirsh commented Nov 24, 2016

Oh – even better, the Dockerfile could specify a commit to use from docker/docker, and changes can be vendored manually. That way, it'll always bust/use caches correctly, and a particular commit from docker/docker.github.io always builds exactly the same set of documentation. That's rather useful for keeping track of/rolling back deploys, and so on.
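
A minimal sketch of what that pinning could look like (the commit value and paths below are placeholders, not something this PR actually does):

    # Pin the engine docs to a known commit; the cache only busts when the pin changes.
    ENV ENGINE_DOCS_COMMIT=0123456789abcdef0123456789abcdef01234567
    RUN git clone https://github.com/docker/docker.git /tmp/docker \
     && git -C /tmp/docker checkout ${ENGINE_DOCS_COMMIT} \
     && mkdir -p /usr/src/app \
     && cp -r /tmp/docker/docs /usr/src/app/engine \
     && rm -rf /tmp/docker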

@thaJeztah
Member Author

thaJeztah commented Nov 24, 2016

We had the specific commit in the previous docs, so I'm not sure that's
what the team wants to have again.

I think ideally we should have a separate container for each archived
version of the docs.

There is no reason to rebuild each archive version if only "master"
(or another branch) changes, and by putting them in separate images,
a trigger can be set on each branch to only rebuild those images if an
actual change is committed to these branches.

Something like;

  • master -> docs/docs.docker.com:latest + docs/docs.docker.com:v1.12
  • v1.11 -> docs/docs.docker.com:v1.11
  • v1.10 -> docs/docs.docker.com:v1.10
  • etc.

We can have a Dockerfile and a Dockerfile.deploy in master; the
Dockerfile is optimized for contributors: it builds fast, (e.g.)
has --watch enabled to allow live editing / preview, sets "no-index",
does not enable (Google) Analytics scripts, etc.

The Dockerfile.deploy is optimized for publication, and may have
additional configuration set (indexing enabled, Analytics, minified
scripts, images, etc).
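
For the contributor-oriented Dockerfile, the serve step could hypothetically look like this (the split into two files is only a proposal at this point):

    # Contributor image: rebuild on change for live editing/preview; indexing and
    # analytics would be switched off through the Jekyll configuration.
    CMD ["jekyll", "serve", "--watch", "--host", "0.0.0.0", "--port", "4000"]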

We can use a simple proxy to route archive URLs to the correct
archive container:

  server {
      listen 80;
      server_name docs.docker.com;
      location / {
        proxy_pass http://docs-current;
      }
      location /v1.11/ {
        proxy_pass http://docs-archive-1-11;
      }
      location /v1.10/ {
        proxy_pass http://docs-archive-1-10;
      }
      # or, matching the minor version with a regex:
      location ~ ^/v1\.([0-9]+)/ {
        proxy_pass http://docs-archive-1-$1;
      }
  }

@mdlinville

The reason we don't have a container per Dockerfile is that everyone wanted to be able to have archive folders. Before, we were trying to route them to different containers by port. It was ugly. I don't think we want to go back to that. @johndmulhausen ?

@johndmulhausen

A few things:

  1. I do not have the ability to disable caching for repo-connected autobuilds in Hub, because they're auto-wired to just pull the repo contents at build time, so that looks to be all good.
  2. We are already auto-building images of each of the archive versions, as well as any releases we tag. For example, we just released the Datacenter docs, which you can see with
docker run -ti -p 4000:4000 docs/docker.github.io:ucp2.0.1-dtr2.1.0

Change that tag to v1.4 etc and there you have it.

However, using these images in production was problematic for various reasons, the most important being that infra specifically told us it was not a good idea to use a proxy to forward to those container endpoints, because each endpoint is a full Ruby/Jekyll stack. That creates quite a big RAM footprint once you stack up 8 or 9 instances (each about 350 MB at runtime), and since there is no end in sight for new versions, it becomes needlessly over-engineered quickly.

It would be one thing if the end result of this were discrete services meant to run in parallel and we were a "real application" but we're just a static site trying to have versioned folders, which in infra's opinion isn't the best use case for a distributed-system-style, multi-container architecture. It was their recommendation, which I agreed with, that at runtime we should just have one server instance serving up a concatenated-together site that had everything in it. This also simplified the deployment/image story.

This all said, I think there is probably even more slimming we can do to this image, and I would always be happy to see someone step in and implement such a solution. But for now, I think I'm ready to give this a shot and see how it goes. :)

Thanks @thaJeztah @bfirsh @mstanleyjones -- let's merge and see how this goes!

@johndmulhausen johndmulhausen merged commit eebf584 into docker:master Nov 29, 2016
@johndmulhausen johndmulhausen self-assigned this Nov 29, 2016
@thaJeztah thaJeztah deleted the optimize-dockerfile branch November 29, 2016 00:54
@johndmulhausen

Note: The build time before this PR was merged was 38 minutes (!). After it was merged: 14 minutes. Definite huge improvement -- thanks so much to @thaJeztah !

@bfirsh
Contributor

bfirsh commented Nov 29, 2016

@johndmulhausen About the memory footprint - can't the docs be served by nginx or similar in production, or does it depend on the Jekyll server for some reason?

@thaJeztah
Member Author

I was thinking the same yesterday; once the docs are built, they're just static files, so any server can be used (even a simple static server, such as https://github.com/mholt/caddy)
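
A minimal sketch of that idea, assuming the Jekyll output ends up in the default _site directory (image and paths are illustrative):

    # Serve the pre-built static site with a small web server instead of running
    # Jekyll/WEBrick at runtime.
    FROM nginx:alpine
    COPY _site/ /usr/share/nginx/html/
    EXPOSE 80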

@johndmulhausen

That's right, any server can, though we've stuck with the GitHub Pages gem for compatibility w/GitHub Pages and simplicity (bifurcating the build and authoring image structures, just to get the memory footprint down, just to shoehorn in multi-container architecture for a static site, just to save a few minutes of deploy time -- wasn't a super-compelling use case for me).

Staying within the lanes of a widely-adopted authoring package kept things easy when constructing the solution. Optimization, where the devil is in the details, is a whole other matter. I personally don't have much time for or interest in pursuing a multi-container solution for the archives, but I also always welcome contributors who want to help. If this is something that interests you, ping me privately and I'll get you set up.

A big bonus to me in all this is that our build/production systems are auditable enough for more people to pitch in, as you have so wonderfully here. 👍
