This repository has been archived by the owner on May 19, 2022. It is now read-only.

Orchestration of data processing tasks to power the reporting TPOT ETP API


tpot-airflow



Installation

  1. Clone the project and cd into the folder.

    git clone https://github.com/workforce-data-initiative/tpot-airflow.git && cd tpot-airflow

    To try it out quickly using Docker, run:

    docker-compose up

    and explore the UI at localhost:8080.

    Then start the scheduler in the same container:

    docker-compose exec web airflow scheduler

  2. Install the requirements (preferably in a virtual environment):

    pip install -r requirements.txt

    Note that the project is developed against Python 3.6.2.

  3. Prepare the home for airflow:

    export AIRFLOW_HOME=$(pwd)
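
Step 3 matters because Airflow falls back to ~/airflow when AIRFLOW_HOME is not set. A minimal sketch of that lookup follows, in case you want to confirm which directory will be used before running any airflow command. The function name resolve_airflow_home is hypothetical and is not part of this project or of Airflow.

```python
import os
from pathlib import Path


def resolve_airflow_home(environ=None):
    """Return the directory Airflow is expected to use as its home.

    Hypothetical helper for illustration only: it mirrors the documented
    behaviour of honouring AIRFLOW_HOME and defaulting to ~/airflow.
    """
    env = os.environ if environ is None else environ
    configured = env.get("AIRFLOW_HOME")
    if configured:
        # Expand a leading "~" in case the variable was set unexpanded.
        return Path(configured).expanduser()
    return Path.home() / "airflow"
```

Calling resolve_airflow_home() after the export in step 3 should return the project folder; if it returns ~/airflow instead, the variable was not exported in the current shell.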

Usage

Follow steps 1 to 3 below.

Alternatively, running sh setup.sh performs steps 1, 2 and 3 in a single script. Then open localhost:8080.

  1. Initialize the meta database by running:

    airflow initdb

  2. Set up airflow (the last two commands are optional; APP can be TPOT or some other name):

    python config/remove_airflow_examples.py
    airflow resetdb -y
    export APP=TPOT
    python config/customize_dashboard.dev.py

    Running python config/customize_dashboard.dev.py customizes the dashboard to read TPOT - Airflow instead of Airflow.

  3. Start the airflow webserver and explore the UI at localhost:8080:

    airflow webserver

The webserver accepts optional arguments:

  • -p=8080, --port=8080 to specify the port the server listens on
  • -w=4, --workers=4 to specify the number of workers the webserver runs
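
Once the webserver is started, it can be handy to check from a script that the UI answers on the chosen port. The sketch below is one way to do that with the Python standard library. The function name webserver_is_up is hypothetical, and the default URL assumes the port 8080 used throughout this README.

```python
import urllib.error
import urllib.request


def webserver_is_up(url="http://localhost:8080", timeout=3.0):
    """Return True if an HTTP server answers at `url`, False otherwise.

    Hypothetical helper: it only checks that something responds with a
    successful (or redirected) status, not that it is actually Airflow.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False
```

If the webserver was started with a different port, pass the matching URL, for example webserver_is_up("http://localhost:8081").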

Deployment

Docker

Run:

    docker build -t tpot-airflow -f Dockerfile.dev .

Heroku

Run:

    sh heroku.sh

AWS EC2

  1. Set up an EC2 instance in AWS (ensure that you download the .pem file)

  2. Authorise inbound traffic for this instance by adding a rule to the security group to accept traffic on port 8080 (explained here)

  3. Connect to the instance via ssh (explained here).

    Run the following:

    • sudo yum install git
    • git clone https://github.com/workforce-data-initiative/tpot-airflow.git
    • cd tpot-airflow
    • sh aws_setup.sh
    • sh docker_setup.sh
    • logout - then ssh into the instance again to pick up the new docker group permissions
    • tmux
    • docker-compose up -d

    It is advised to modify the codebase on GitHub rather than on the instance. Pull any updates to the codebase by running:

    • git pull origin master - or the relevant branch

To ssh into an already running instance, ask for the .pem file and run:

ssh -i "<>.pem" ec2-user@<Public DNS>

For example: ssh -i "airflow.pem" ec2-user@random.compute-1.amazonaws.com

You'll need to ssh in to set up keys that are intentionally not included in the codebase.
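
If the connection is scripted, the ssh command above can be assembled from its parts. The following is a minimal sketch, assuming the default ec2-user login shown in this README. The helper names build_ssh_command and as_shell_string are hypothetical, and the sketch only builds the command; it does not connect.

```python
import shlex


def build_ssh_command(pem_path, public_dns, user="ec2-user"):
    """Build the ssh argument list for connecting to the EC2 instance.

    Hypothetical helper: pem_path is the downloaded .pem key file and
    public_dns is the instance's Public DNS name.
    """
    return ["ssh", "-i", pem_path, f"{user}@{public_dns}"]


def as_shell_string(argv):
    """Render an argument list as a single, safely quoted shell command."""
    return " ".join(shlex.quote(part) for part in argv)
```

The returned list can be passed directly to subprocess.run, while as_shell_string is useful for printing the command before running it.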

Support

Please check that the issue has not already been raised, then open an issue for support.

Contributing

Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.
