prod-airflow


prod-airflow is designed to help get you started running Airflow in production.

This repository was originally forked from Puckel's docker-airflow repository.

Features

  • Unit / integration testing with Docker.
    • Includes a smoke test that checks the basics of all your DAGs.
    • Easily add more tests of your own.
  • Pre-made Airflow DAGs and charts to monitor Airflow performance and uptime.
  • Easy debugging of a production-like environment using docker-compose.
  • Basic authentication setup for running in production.
  • A Makefile for convenient Docker commands.

Installation

Pull the image from the Docker repository.

docker pull rkells/prod-airflow

Pull a specific version. The image version uses the format <airflow version>-<prod-airflow version>.

docker pull rkells/prod-airflow:1.10.3-0.0.1

Makefile Configuration Options

  • ENV_FILE
  • EXECUTOR

ENV_FILE: Environment Variable Handling

We use .env files to manage Docker environment variables. This is configurable by setting the environment variable ENV_FILE. The default file is dev.env; prod.env is also included.

make <command> ENV_FILE=prod.env

EXECUTOR: Executor Type

The default executor type for make test and make debug is the LocalExecutor.

make <command> EXECUTOR=Celery

Build

make build 

To optionally install extra Airflow packages, modify the Dockerfile.

Test

The Dockerfile mounts your /test, /dags, and /plugins directories into $AIRFLOW_HOME. This helps run your tests in an environment similar to production.

By default, we use the docker-compose-LocalExecutor.yml to start the webserver and scheduler in the same container, and Postgres in another.

Therefore you can easily have tests that interact with the database.
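For example, a test along these lines can query the Postgres container directly (a hypothetical sketch; the file name and connection id are assumptions, not files shipped in this repository):

# test/test_db_connection.py (hypothetical): checks the Postgres container is reachable.
import unittest

from airflow.hooks.postgres_hook import PostgresHook


class TestPostgresConnection(unittest.TestCase):

    def test_select_one(self):
        # "postgres_default" is assumed to point at the docker-compose Postgres service.
        hook = PostgresHook(postgres_conn_id="postgres_default")
        records = hook.get_records("SELECT 1")
        self.assertEqual(records[0][0], 1)


if __name__ == "__main__":
    unittest.main()

Run the whole suite with: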

make test

To use the Celery Executor:

make test EXECUTOR=Celery

Included tests
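The smoke test mentioned under Features loads every DAG and fails on import errors or missing basics. A minimal sketch of that pattern (illustrative only; the bundled test may differ in its details):

# test/test_dag_smoke.py (illustrative sketch of a DAG smoke test).
import unittest

from airflow.models import DagBag


class TestDagIntegrity(unittest.TestCase):

    def setUp(self):
        self.dagbag = DagBag()

    def test_no_import_errors(self):
        # Any DAG that fails to parse shows up in import_errors.
        self.assertEqual(len(self.dagbag.import_errors), 0,
                         "DAG import failures: {}".format(self.dagbag.import_errors))

    def test_dags_have_owners(self):
        for dag_id, dag in self.dagbag.dags.items():
            self.assertTrue(dag.default_args.get("owner"),
                            "DAG {} has no owner".format(dag_id))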

Coverage

make test runs the unit tests with coverage, then prints the results.

Debug

Similar to testing, we run Airflow with docker-compose to replicate a production environment.

make debug
# inspect logs
docker logs -f <containerId>
# jump into the running container
docker exec -it <containerId> bash

To debug the CeleryExecutor:

make debug EXECUTOR=Celery

Run

By default, prod-airflow runs Airflow with the SequentialExecutor. This can be changed by configuring the executor in an .env file. Keep in mind that if you set EXECUTOR to Local or Celery, entrypoint.sh will expect a database connection to be available.

To start the container in detached mode:

make run 

To run an arbitrary airflow command on the image:

make cmd SERVICE="airflow list_dags"

Monitoring

The init_airflow.py DAG automatically sets up Airflow charts for monitoring.

Airflow Charts: localhost:8080/admin/chart/

The Canary DAG

The canary DAG runs every 5 minutes.

It should run a connection check with a simple SQL query (e.g. "SELECT 1") against every critical data source. By default it is connected to the default Postgres setup.

The “canary” DAG helps answer the following questions (a sketch of such a DAG follows the list):

  • Do all critical connections work?
  • How long does it take the Airflow scheduler to schedule the task (scheduled execution_time vs. current time)?
  • How long does the task run?
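A minimal sketch of a canary DAG of this kind, assuming a connection id of my_postgres (the operator and layout here are illustrative, not the exact DAG shipped in this repository):

# dags/canary_dag.py (illustrative sketch).
from datetime import datetime

from airflow import DAG
from airflow.operators.check_operator import CheckOperator

default_args = {"owner": "airflow", "start_date": datetime(2019, 1, 1)}

with DAG(dag_id="canary",
         default_args=default_args,
         schedule_interval="*/5 * * * *",  # every 5 minutes
         catchup=False) as dag:

    # One cheap query per critical data source; CheckOperator fails the task
    # if the query returns no rows or a falsy first value.
    check_postgres = CheckOperator(
        task_id="check_postgres",
        conn_id="my_postgres",
        sql="SELECT 1",
    )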

Configuring Airflow

Environment Variables

Add Airflow environment variables to the .env files and reference them with Docker. See the Airflow documentation for more details.

Fernet Key

To use encrypted connection passwords (with the Local or Celery executor), every container must share the same fernet_key. By default prod-airflow generates a fernet_key at startup, so you have to set an environment variable in the docker-compose file (e.g. docker-compose-LocalExecutor.yml) to use the same key across containers. To generate a fernet_key:

docker run rkells/prod-airflow python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)"

Authentication

The prod.env config enables basic password authentication for Airflow. Even if you are behind other security walls, this authentication is useful because it lets you filter DAGs by owner.

See the documentation for setup and other details.
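To create a user for the password backend, a snippet along these lines can be run inside the webserver container (a sketch based on the standard Airflow 1.10 password_auth setup; the username, email, and password are placeholders, and the password extras must be installed):

# Run inside the container, e.g. `docker exec -it <containerId> python`.
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser

user = PasswordUser(models.User())
user.username = "admin"              # placeholder credentials; choose your own
user.email = "admin@example.com"
user.password = "changeme"

session = settings.Session()
session.add(user)
session.commit()
session.close()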

Ad hoc query / Connections

If you want to use the Ad Hoc Query feature, make sure you have configured connections. By default, the init_airflow.py DAG sets up a connection to Postgres.

To add other connections, go to Admin -> Connections, click Edit, and set the values (equivalent to the values in airflow.cfg / docker-compose*.yml):

  • Host : postgres
  • Schema : airflow
  • Login : airflow
  • Password : airflow

The init_airflow.py DAG

The init_airflow.py DAG runs once and is intended to bootstrap a new Airflow installation or set up an environment for testing.

As currently configured:

  1. Creates a Postgres connection called my_postgres, usable from the Ad Hoc Query UI.
  2. Creates a pool mypool with 10 slots.
  3. Creates the monitoring charts.

You are encouraged to extend this DAG for reproducible setup.
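For illustration, the kind of setup this DAG performs might look like the following (a sketch using the Airflow 1.10 ORM models; the real DAG wraps similar logic in tasks, so treat the details as assumptions):

# Illustrative sketch of bootstrap logic like init_airflow.py's.
from airflow import settings
from airflow.models import Connection, Pool

session = settings.Session()

# 1. A Postgres connection usable from the Ad Hoc Query UI.
if not session.query(Connection).filter(Connection.conn_id == "my_postgres").first():
    session.add(Connection(conn_id="my_postgres", conn_type="postgres",
                           host="postgres", schema="airflow",
                           login="airflow", password="airflow"))

# 2. A pool with 10 slots.
if not session.query(Pool).filter(Pool.pool == "mypool").first():
    session.add(Pool(pool="mypool", slots=10, description="example pool"))

session.commit()
session.close()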

Custom Airflow plugins

Documentation on plugins can be found here.

An example plugin can be found here, along with its unit tests.
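For reference, the general shape of an Airflow 1.10 plugin (a generic sketch, not the example plugin bundled with this repository):

# plugins/example_plugin.py (generic sketch).
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults


class HelloOperator(BaseOperator):
    """Logs a greeting; exists only to show the plugin structure."""

    @apply_defaults
    def __init__(self, name="world", *args, **kwargs):
        super(HelloOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        self.log.info("Hello, %s", self.name)


class ExamplePlugin(AirflowPlugin):
    name = "example_plugin"
    operators = [HelloOperator]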

Install custom python package

  • Create a file "requirements.txt" with the desired Python packages.
  • The entrypoint.sh script will execute the pip install command (with the --user option).

Alternatively, build your image with your desired packages pre-installed.