
[AIRFLOW-4115] Multi-staging Airflow Docker image [Step 1/3] #4936

Merged · 1 commit · Apr 13, 2019
100 changes: 100 additions & 0 deletions .dockerignore
@@ -0,0 +1,100 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# NOTE! This .dockerignore uses the recommended technique
# where everything is excluded by default and you deliberately
# add only those directories/files you need. This is very useful
# to make sure that the Docker context is always the same on any machine,
# so that generated files are not accidentally added to the context.
# This allows Docker's `COPY .` to behave in a predictable way.

# Ignore everything
**
Member:

Why ignore everything and then add a few things back in? The list of things we don't want in isn't that big.

Member Author:

I have actually already given a talk on that, and I plan to write a blog post :).

It's a practice I now absolutely recommend. I have seen it recommended in many places (for example here: https://youknowfordevs.com/2018/12/07/getting-control-of-your-dockerignore-files.html). But I experienced the effect of it in Airflow myself. Besides a longer time to build, there is the problem of cache invalidation when you do 'COPY .'.

The thing is that the whole context has to match when you rebuild the image, and if you include everything by default then every single file generated or added by accident in your source dir will cause context invalidation.

We already generate a lot of stuff in the sources (node_modules, 'static' and a number of others). When you do local development you often run stuff (such as documentation generation) locally, and it introduces a lot of garbage into your context and invalidates the cache (thus pretty much every single time you restart the Dockerfile from the COPY . step, no matter whether the actual sources changed or not).

Moreover - you are not able to foresee what other stuff will be generated in the sources in the future. So no matter how hard you try to exclude everything you do not want, somebody a few months from now might add a single generated file (or, more likely, directory) that will cause subsequent frequent cache invalidations. Therefore "exclude everything" and then "add what you need" works much better. There is no chance you forget to add something important then - because your tests will fail.

This is actually the reason I now deliberately build the Docker image and run tests from the binary image rather than using mounted sources (as it was before). This way your full Docker image is tested. Previously some files could be ignored on build but then mounted from sources - which is quite a bad practice - because it does not test the Docker image but some hybrid of the Docker image and mounted sources (which are not necessarily the same).

One hugely important additional thing - making the context as minimal as possible is a huge time saver for local builds. What happens under the hood when a Docker build runs is that the whole context is compressed/packed and sent to the engine rather than used from local sources. This is the reason why a Docker build sometimes pauses for many seconds before the build starts - if you see that, it usually means you have not done a good job of excluding generated binaries. If you don't ignore such generated files (especially node_modules, which is huge and contains lots of files) then your context might grow very large (and uncontrollably).

Excluding ** all but guarantees that the context will not grow accidentally and uncontrollably.
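
A quick way to observe this (a sketch, assuming the classic builder, which reports the context size at the start of every build):

    # The first line of the build output shows how much data is packed up
    # and sent to the Docker engine as the build context
    docker build . 2>&1 | grep "Sending build context"
    # e.g. "Sending build context to Docker daemon  25.3MB"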

Member Author:

Correct about the duplicates. I reviewed and removed them.

There is little worry that we will forget something - the tests will catch it (and it should rarely be needed - adding a new folder or file at the top level should be a really rare event).

And yes - you need those extra specs, because they exclude what should not be part of the included dirs above. The pattern I follow (and I have seen it suggested by several people) is as follows - a minimal sketch follows this comment:

  1. ** -> exclude everything
  2. !/airflow -> re-include only those top-level folders/top-level files that you need
  3. **/__pycache__/ -> from the folders included in 2), exclude all the generated content that follows common patterns for generated stuff (like *.pyc files or __pycache__ folders)

I feel very strongly about this - I spent many hours debugging context changes (figuring out what the hell changed to trigger copying the files again). Example: only after a while did I realise that there is a generated folder in docs, _build, next to _template (which is not generated, though the name kind of suggests it). Just generating the docs locally triggered a rebuild of the Dockerfile (that's why I did not include the whole docs folder :) ). Otherwise I would have had to specifically exclude that particular folder (if I had known I had to). I really think adding stuff deliberately to the Docker context makes much more sense.

The alternative is to move ALL generated content to a specific directory and exclude that directory - but this is not always possible (node_modules) and, from my experience, people will not "stick" to it - they will start generating random files in different folders. Of course we are not protected if someone starts generating an "xyz" file/folder inside the "airflow" folder. But it's rather unlikely someone will do something that crazy. I think generating something in the sources is rather bad practice and you do it only if you cannot do otherwise (that's one of the parts I don't like about Python - .pyc files are generated next to the actual source files. It reminds me of the times when I used the RCS versioning system and we had source.c,v next to every file... BRRR)

But to be honest - it was a much bigger problem before, and I eventually introduced checksums for important files (setup/Docker) to check whether we need to run docker build at all.

But still, I think this makes it much more resilient to future changes that might produce uncontrollable and unpredictable context "growth".
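
A minimal sketch of the three-layer pattern described in the list above (the included names are just examples):

    # 1) Exclude everything
    **
    # 2) Re-include only the top-level folders/files you need
    !airflow
    !setup.py
    # 3) Exclude generated content inside the re-included folders
    **/__pycache__/
    **/*.pyc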

Member Author:

I can give some more examples where things could go south. When I run unit tests from the IDE, the logs are by default written to the working directory - which is by default Airflow's root folder. They are now ignored via .gitignore, but when docker build is executed those logs are potentially big and should not be sent as part of the context. It's safer to exclude them by default.


# Allow only these directories
!airflow
!dags
!dev
#!/docs - we do not need docs in the Docker!
!licenses
!scripts
!tests

!.coveragerc
!.rat-excludes
!LICENSE
!MANIFEST.in
!NOTICE

# Avoid triggering context change on README change (new companies using Airflow)
# So please do not uncomment this line ;)
# !README.md

# Setup/version configuration
!setup.cfg
!setup.py

# Entrypoint script
!scripts/docker/entrypoint.sh

# Now - ignore unnecessary files inside allowed directories
# This goes after the allowed directories

# Git version is dynamically generated
airflow/git_version

# Exclude static www files generated by NPM
airflow/www/static/coverage
airflow/www/static/dist
airflow/www/node_modules

# Exclude link to docs
airflow/www/static/docs

# Exclude python generated files
**/__pycache__/
**/*.py[cod]
**/*$py.class
**/.pytest_cache/
**/env/
**/build/
**/develop-eggs/
**/dist/
**/downloads/
**/eggs/
**/.eggs/
**/lib/
**/lib64/
**/parts/
**/sdist/
**/var/
**/wheels/
**/*.egg-info/
**/.installed.cfg
**/*.egg

# Exclude temporary vi files
**/*~

# Exclude output files
**/*.out
**/hive_scratch_dir/

# Exclude auto-generated Finder files on Mac OS
**/.DS_Store
**/Thumbs.db
23 changes: 23 additions & 0 deletions .hadolint.yaml
@@ -0,0 +1,23 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
ignored:
# DL3006: Always tag the version of an image explicitly
- DL3006
# DL3008: Pin versions in apt-get install
- DL3008
# DL3005: Do not use apt-get upgrade or dist-upgrade
- DL3005
# DL3013: Pin versions in pip
- DL3013
5 changes: 4 additions & 1 deletion .travis.yml
@@ -51,11 +51,14 @@ jobs:
stage: pre-test
install: skip
script: scripts/ci/6-check-license.sh
- name: Lint Dockerfile
stage: pre-test
install: skip
script: scripts/ci/ci_lint_dockerfile.sh
- name: Check docs
stage: pre-test
install: pip install -e .[doc]
script: docs/build.sh

cache:
directories:
- $HOME/.wheelhouse/
205 changes: 172 additions & 33 deletions Dockerfile
@@ -13,46 +13,185 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# WARNING: THIS DOCKERFILE IS NOT INTENDED FOR PRODUCTION USE OR DEPLOYMENT.
#
# Base image for the whole Docker file
ARG APT_DEPS_IMAGE="airflow-apt-deps"
ARG PYTHON_BASE_IMAGE="python:3.6-slim"
############################################################################################################
# This is base image with APT dependencies needed by Airflow. It is based on a python slim image
# Parameters:
# PYTHON_BASE_IMAGE - base python image (python:x.y-slim)
############################################################################################################
FROM ${PYTHON_BASE_IMAGE} as airflow-apt-deps

SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

# Need to repeat the argument here, otherwise it will not be set for this stage,
# but the default value carries over from the one set before FROM
ARG PYTHON_BASE_IMAGE
ENV PYTHON_BASE_IMAGE=${PYTHON_BASE_IMAGE}

ARG AIRFLOW_VERSION="2.0.0.dev0"
ENV AIRFLOW_VERSION=$AIRFLOW_VERSION

# Print versions
RUN echo "Base image: ${PYTHON_BASE_IMAGE}"
RUN echo "Airflow version: ${AIRFLOW_VERSION}"

# Make sure noninteractive debian install is used and language variables are set
ENV DEBIAN_FRONTEND=noninteractive LANGUAGE=C.UTF-8 LANG=C.UTF-8 LC_ALL=C.UTF-8 \
LC_CTYPE=C.UTF-8 LC_MESSAGES=C.UTF-8

# By increasing this number we can force a rebuild of all dependencies
ARG DEPENDENCIES_EPOCH_NUMBER="1"
# Increase the value below to force reinstalling all dependencies
ENV DEPENDENCIES_EPOCH_NUMBER=${DEPENDENCIES_EPOCH_NUMBER}

# Install curl and gnupg2 - needed to download nodejs in the next step
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
curl gnupg2 \
&& apt-get autoremove -yqq --purge \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*


# Install basic apt dependencies
RUN curl -sL https://deb.nodesource.com/setup_10.x | bash - \
Member:

Since we are splitting out to multi-stage images, it would be nice to have node and the asset build as its own stage, so that in the final image we have

COPY --from=airflow-assets /opt/airflow/airflow/www/static/dist /opt/airflow/airflow/www/static/dist

No node, no node modules in final image that way.
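
A rough sketch of what that split might look like (the stage names and the node base image here are illustrative, not part of this PR):

    FROM node:10 as airflow-assets
    WORKDIR /opt/airflow/airflow/www
    # Install NPM dependencies and build the production assets
    COPY airflow/www/package.json airflow/www/package-lock.json ./
    RUN npm ci
    COPY airflow/www/ ./
    RUN npm run prod

    FROM python:3.6-slim as main
    # Only the compiled assets are copied - node and node_modules
    # never make it into the final image
    COPY --from=airflow-assets /opt/airflow/airflow/www/static/dist /opt/airflow/airflow/www/static/dist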

Member Author:

But that would prevent you from re-running npm ci/npm run prod when you work with mounted sources. It's not really acceptable for local development with the CI image.

I think it's good to keep it - at the very least - for the CI image. I can think about some optional step to build it as another stage for the "Airflow" image - but I don't think it's really a good idea. I will think a bit more about it, but as I see it, it would complicate the whole Dockerfile and image a lot for little gain. I'd rather leave it in for now.

Especially since we have a bright future ahead. I think it's a good candidate for the upcoming changes in Docker - especially BuildKit. There is a new, experimental feature with a mount type in the RUN command that could be used for that purpose. The secret type seems especially interesting, as it prevents the layer generated by a RUN command from ever appearing in the final image (it will only be present during the build):
https://github.com/moby/buildkit/blob/b521aae3ea1e1bb73298481ac538f9880ccd3854/frontend/dockerfile/docs/experimental.md#run---mounttypesecret

I already saw people commenting that they would use secret for the very purpose you described. It was foreseen for another purpose (to keep secrets from leaking) but maybe in the final BuildKit release there will be a new type "ignored" or smth. :)
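
For reference, a sketch of the experimental syntax from the document linked above (requires BuildKit; the secret id and paths are illustrative):

    # syntax = docker/dockerfile:experimental
    FROM python:3.6-slim
    # The secret is mounted only while this RUN command executes and is
    # never committed to any layer of the final image
    RUN --mount=type=secret,id=mysecret,dst=/tmp/mysecret cat /tmp/mysecret

The build would then be invoked with something like DOCKER_BUILDKIT=1 docker build --secret id=mysecret,src=mysecret.txt .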

Member:

Fair, on CI this makes sense.

&& apt-get update \
&& apt-get install -y --no-install-recommends \
# Packages to install \
libsasl2-dev freetds-bin build-essential sasl2-bin \
libsasl2-2 libsasl2-dev libsasl2-modules \
default-libmysqlclient-dev apt-utils curl rsync netcat locales \
freetds-dev libkrb5-dev libssl-dev libffi-dev libpq-dev git \
nodejs gosu sudo \
&& apt-get autoremove -yqq --purge \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

RUN adduser airflow \
&& echo "airflow ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/airflow \
&& chmod 0440 /etc/sudoers.d/airflow

############################################################################################################
# This is the target image - it installs PIP and NPM dependencies with efficient caching
# mechanisms - it might be used to build the bare airflow build or the CI build
# Parameters:
# APT_DEPS_IMAGE - image with APT dependencies. It might either be base deps image with airflow
# dependencies or CI deps image that contains also CI-required dependencies
############################################################################################################
FROM ${APT_DEPS_IMAGE} as main

SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

WORKDIR /opt/airflow

RUN echo "Airflow version: ${AIRFLOW_VERSION}"

ARG APT_DEPS_IMAGE
ENV APT_DEPS_IMAGE=${APT_DEPS_IMAGE}

ARG AIRFLOW_HOME=/opt/airflow
ENV AIRFLOW_HOME=${AIRFLOW_HOME}

FROM python:3.6-slim
SHELL ["/bin/bash", "-xc"]
RUN mkdir -pv ${AIRFLOW_HOME} \
&& chown -R airflow.airflow ${AIRFLOW_HOME}

ENV AIRFLOW_HOME=/usr/local/airflow
ARG AIRFLOW_DEPS="all"
ARG PYTHON_DEPS=""
ARG BUILD_DEPS="freetds-dev libkrb5-dev libssl-dev libffi-dev libpq-dev git"
ARG APT_DEPS="libsasl2-dev freetds-bin build-essential default-libmysqlclient-dev apt-utils curl rsync netcat locales"
# Increase the value here to force reinstalling Apache Airflow pip dependencies
ARG PIP_DEPENDENCIES_EPOCH_NUMBER="1"
ENV PIP_DEPENDENCIES_EPOCH_NUMBER=${PIP_DEPENDENCIES_EPOCH_NUMBER}

ENV PATH="$HOME/.npm-packages/bin:$PATH"
# Optimize the installation of the Cassandra driver
# It speeds up building the image - building the cassandra driver without Cython saves around 10 minutes
ARG CASS_DRIVER_NO_CYTHON="1"
# Build cassandra driver on multiple CPUs
ARG CASS_DRIVER_BUILD_CONCURRENCY="8"

RUN set -euxo pipefail \
&& apt update \
&& if [ -n "${APT_DEPS}" ]; then apt install -y $APT_DEPS; fi \
&& curl -sL https://deb.nodesource.com/setup_10.x | bash - \
&& apt update \
&& apt install -y nodejs \
&& apt autoremove -yqq --purge \
&& apt clean
ENV CASS_DRIVER_BUILD_CONCURRENCY=${CASS_DRIVER_BUILD_CONCURRENCY}
ENV CASS_DRIVER_NO_CYTHON=${CASS_DRIVER_NO_CYTHON}

COPY . /opt/airflow/
# By default PIP install is run without cache to make the image smaller
ARG PIP_NO_CACHE_DIR="true"
ENV PIP_NO_CACHE_DIR=${PIP_NO_CACHE_DIR}
RUN echo "Pip no cache dir: ${PIP_NO_CACHE_DIR}"

# PIP version used to install dependencies
ARG PIP_VERSION="19.0.1"
ENV PIP_VERSION=${PIP_VERSION}
RUN echo "Pip version: ${PIP_VERSION}"

RUN pip install --upgrade pip==${PIP_VERSION}

# Airflow sources change frequently but the dependency configuration won't change that often
# We copy setup.py and other files needed to perform the setup of dependencies
# This way the cache here will only be invalidated if the
# version/setup configuration changes, but not when airflow sources change
COPY --chown=airflow:airflow setup.py /opt/airflow/setup.py
COPY --chown=airflow:airflow setup.cfg /opt/airflow/setup.cfg

COPY --chown=airflow:airflow airflow/version.py /opt/airflow/airflow/version.py
COPY --chown=airflow:airflow airflow/__init__.py /opt/airflow/airflow/__init__.py
COPY --chown=airflow:airflow airflow/bin/airflow /opt/airflow/airflow/bin/airflow

# Airflow Extras installed
ARG AIRFLOW_EXTRAS="all"
ENV AIRFLOW_EXTRAS=${AIRFLOW_EXTRAS}
RUN echo "Installing with extras: ${AIRFLOW_EXTRAS}."

# First install only the dependencies, but not Apache Airflow itself
# This way regular changes to the Airflow sources will not trigger reinstallation of all dependencies
# and this Docker layer will be reused between builds.
RUN pip install --no-use-pep517 -e ".[${AIRFLOW_EXTRAS}]"

COPY --chown=airflow:airflow airflow/www/package.json /opt/airflow/airflow/www/package.json
COPY --chown=airflow:airflow airflow/www/package-lock.json /opt/airflow/airflow/www/package-lock.json

WORKDIR /opt/airflow/airflow/www
RUN npm install \
&& npm run prod

# Install necessary NPM dependencies (triggered by changes in package-lock.json)
RUN gosu airflow npm ci

COPY --chown=airflow:airflow airflow/www/ /opt/airflow/airflow/www/

# Package NPM for production
RUN gosu airflow npm run prod

WORKDIR /opt/airflow
RUN set -euxo pipefail \
&& apt update \
&& if [ -n "${BUILD_DEPS}" ]; then apt install -y $BUILD_DEPS; fi \
&& if [ -n "${PYTHON_DEPS}" ]; then pip install --no-cache-dir ${PYTHON_DEPS}; fi \
&& pip install --no-cache-dir --upgrade pip==19.0.1 \
&& pip install --no-cache-dir --no-use-pep517 -e .[$AIRFLOW_DEPS] \
&& apt purge --auto-remove -yqq $BUILD_DEPS \
&& apt autoremove -yqq --purge \
&& apt clean

WORKDIR $AIRFLOW_HOME
RUN mkdir -p $AIRFLOW_HOME
COPY scripts/docker/entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]

# Always apt-get update/upgrade here to get latest dependencies before
# we redo pip install
RUN apt-get update \
&& apt-get upgrade -y --no-install-recommends \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Cache for this line will be automatically invalidated if any
# of airflow sources change
COPY --chown=airflow:airflow . /opt/airflow/

# Always add-get update/upgrade here to get latest dependencies before
Member:

Typo:

Suggested change
# Always add-get update/upgrade here to get latest dependencies before
# Always apt-get update/upgrade here to get latest dependencies before

# we redo pip install
RUN apt-get update \
&& apt-get upgrade -y --no-install-recommends \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
Member:

Having this after copying source files seems out of place/not where I would have put this.

Is there a reason this is in this image rather than, say, before the first COPY in this image, or in the airflow-apt-deps image?

Member Author:

The main reason was to make sure we always use the latest dependencies and update frequently. I think that originated from our original (long time ago :) ) discussion with @Fokko - he was concerned that we would get "stuck" with some older versions of apt-get dependencies rather than the latest ones.

There must be some trigger to invalidate such an apt-get upgrade, and we can play a bit with it.

I also don't find it particularly good to run apt-get upgrade after adding the sources. I'd rather do it after copying setup.py and related files - this way we would always upgrade to the latest apt dependencies together with updates to the pip dependencies (which makes perfect sense).

This will be less frequent on the one hand, but also more stable (less risk that the build starts failing because of some dependency-triggered issue independent of any code change). As with the PIP dependencies, that gives IMHO the best of both worlds - fairly frequent, automated updates of dependencies, and relative stability - the dependencies will always be "fixed" between changes to setup.py - so no more broken builds out of the blue.

Paired with a regular daily cron-triggered fresh rebuild from scratch, I think we might have the perfect solution - one that detects dependency-triggered errors quickly but does not stop contributors from working and running Travis CI tests.

I think we are aligned on that @ashb and I am happy to move it there.

@Fokko - what do you think?
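
A sketch of the ordering proposed above (reusing lines from this Dockerfile; not what is merged in this PR):

    # Copy only the dependency definitions first...
    COPY --chown=airflow:airflow setup.py setup.cfg /opt/airflow/
    # ...then upgrade the apt dependencies, so OS packages refresh whenever
    # the pip dependencies do, but not on every source change
    RUN apt-get update \
        && apt-get upgrade -y --no-install-recommends \
        && apt-get clean \
        && rm -rf /var/lib/apt/lists/*
    RUN pip install --no-use-pep517 -e ".[${AIRFLOW_EXTRAS}]"
    # Source changes invalidate the cache only from this point on
    COPY --chown=airflow:airflow . /opt/airflow/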

Member:

Should we move this to before the COPY --chown=airflow:airflow . /opt/airflow/ line then?


# Additional python deps to install
ARG ADDITIONAL_PYTHON_DEPS=""

RUN if [ -n "${ADDITIONAL_PYTHON_DEPS}" ]; then \
pip install ${ADDITIONAL_PYTHON_DEPS}; \
fi

USER airflow

WORKDIR ${AIRFLOW_HOME}

COPY --chown=airflow:airflow ./scripts/docker/entrypoint.sh /entrypoint.sh

EXPOSE 8080
ENTRYPOINT ["/usr/local/bin/dumb-init", "--", "/entrypoint.sh"]
CMD ["--help"]
27 changes: 27 additions & 0 deletions scripts/ci/ci_lint_dockerfile.sh
@@ -0,0 +1,27 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
set -xeuo pipefail

MY_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

cd "${MY_DIR}/../../"


docker run -v "$(pwd)/Dockerfile:/root/Dockerfile" -v "$(pwd)/.hadolint.yaml:/root/.hadolint.yaml" \
    -w /root hadolint/hadolint /bin/hadolint Dockerfile
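
The same check can be run locally from the repository root (assuming Docker is available; the hadolint image is pulled on first use):

    ./scripts/ci/ci_lint_dockerfile.sh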