This project leverages Apache Airflow to ingest NYC taxi data, perform simple transformations, load raw data into Google Cloud Storage (GCS), and create a table in BigQuery.
Airflow is run locally in a Docker container using a modified, lighter version of the official `docker-compose.yaml`. Unnecessary services have been removed to enable seamless local execution without continuous restarts.
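For context, the whole pipeline boils down to a single DAG with a handful of tasks. The sketch below only illustrates the general shape (download the raw data, push it to GCS, expose it in BigQuery as an external table); it is not the exact DAG in this repo. The DAG id, bucket, dataset, table name, file name, and source URL are placeholders, and the simple transformation step is omitted for brevity.

```python
# Minimal sketch of a download -> GCS -> BigQuery DAG. Not the DAG shipped in
# this repo: all names and the source URL below are illustrative placeholders.
import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
)

BUCKET = os.environ.get("GCP_GCS_BUCKET", "your-bucket")        # placeholder
DATASET = os.environ.get("BIGQUERY_DATASET", "nyc_taxi")        # placeholder
FILE_NAME = "yellow_tripdata_2021-01.parquet"                   # placeholder
URL = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{FILE_NAME}"  # example source
LOCAL_PATH = f"/opt/airflow/{FILE_NAME}"


def download_file() -> None:
    """Download one month of NYC taxi data into the Airflow container."""
    resp = requests.get(URL, timeout=300)
    resp.raise_for_status()
    with open(LOCAL_PATH, "wb") as f:
        f.write(resp.content)


def upload_to_gcs() -> None:
    """Upload the raw file into the GCS bucket."""
    GCSHook().upload(
        bucket_name=BUCKET, object_name=f"raw/{FILE_NAME}", filename=LOCAL_PATH
    )


with DAG(
    dag_id="nyc_taxi_ingestion",                                # placeholder
    start_date=datetime(2021, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_file", python_callable=download_file)
    upload = PythonOperator(task_id="upload_to_gcs", python_callable=upload_to_gcs)
    create_table = BigQueryCreateExternalTableOperator(
        task_id="create_external_table",
        table_resource={
            "tableReference": {
                "projectId": os.environ.get("GCP_PROJECT_ID", "your-project"),  # placeholder
                "datasetId": DATASET,
                "tableId": "yellow_taxi_external",                              # placeholder
            },
            "externalDataConfiguration": {
                "sourceFormat": "PARQUET",
                "sourceUris": [f"gs://{BUCKET}/raw/{FILE_NAME}"],
            },
        },
    )

    download >> upload >> create_table
```

The Google provider hooks and operators authenticate through the `google_cloud_default` connection and fall back to application default credentials, so this sketch assumes your `docker-compose.yaml` mounts the credentials directory from the steps below and points `GOOGLE_APPLICATION_CREDENTIALS` at the key file inside the containers.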
- You will need a Google Cloud Storage bucket and a BigQuery dataset in your Google Cloud project.
- If you don’t have these set up, refer to my other Terraform GCP repo for a quick and easy setup using a few commands (or create them directly with the Python client libraries, as sketched after these prerequisites).
- Save your GCP credentials in the root directory under `/.google/credentials/` for simplicity.
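If you would rather not use Terraform, the bucket and dataset can also be created with the Google Cloud Python client libraries. This is only a rough sketch and not part of this repo: the project, bucket, and dataset names are placeholders, and it assumes `GOOGLE_APPLICATION_CREDENTIALS` already points at your key file.

```python
# Hypothetical alternative to the Terraform setup: create the GCS bucket and
# BigQuery dataset directly. All names below are placeholders.
from google.cloud import bigquery, storage

PROJECT_ID = "your-gcp-project"        # placeholder
BUCKET_NAME = "your-taxi-data-bucket"  # placeholder, must be globally unique
DATASET_ID = "nyc_taxi"                # placeholder

# Create the GCS bucket that will hold the raw taxi files.
storage_client = storage.Client(project=PROJECT_ID)
storage_client.create_bucket(BUCKET_NAME, location="US")

# Create the BigQuery dataset that the DAG will write its table into.
bq_client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(f"{PROJECT_ID}.{DATASET_ID}")
dataset.location = "US"
bq_client.create_dataset(dataset, exists_ok=True)
```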
- Clone this repository and create a `.env` file based on the `.env.example`.
- Run the following command to create a directory for Airflow logs:
```bash
mkdir -p ./logs
```
- Build the Docker containers:
```bash
docker compose build
```
- Initialize the Airflow services:
```bash
docker compose up airflow-init
```
- After the previous command completes, run the services with:
```bash
docker compose up
```
- Open another terminal and check if everything is running smoothly:
```bash
docker compose ps
```
- Open your browser and navigate to http://localhost:8080/. Log in using the default credentials (username: `airflow`, password: `airflow`), unless you changed them in the `docker-compose.yaml` file.
- In the Airflow UI, find the DAG, click on it, and trigger it using the button on the top right (or trigger it from outside the UI, as sketched after this list).
- If any part of the pipeline fails, you can check the logs to debug the issue.
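If you prefer triggering runs from outside the UI, the Airflow REST API can do the same thing as the trigger button. This is a hypothetical sketch, not something shipped in this repo: it assumes basic-auth is enabled for the API in your `docker-compose.yaml` (the official file enables it) and that you replace the placeholder `dag_id` with the real one shown in the Airflow UI.

```python
# Trigger a DAG run via the Airflow REST API instead of the UI button.
import requests

DAG_ID = "nyc_taxi_ingestion"  # placeholder: use the dag_id listed in the UI

resp = requests.post(
    f"http://localhost:8080/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),  # the default credentials from above
    json={"conf": {}},
)
resp.raise_for_status()
print(resp.json()["state"])  # "queued" means the run was accepted
```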