austin-bikeshare-trips-etl-airflow

This project creates an ETL pipeline using Apache Airflow to manage the end-to-end data flow. The pipeline extracts bikeshare data from a public BigQuery dataset (austin_bikeshare), transforms it and stores it in Google Cloud Storage (GCS) in a partitioned format, and then creates an external table in BigQuery to facilitate querying and analysis.
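
A minimal sketch of such a DAG is shown below. It assumes the apache-airflow-providers-google package and uses placeholder project, dataset, and bucket names; the repository's actual dags/bikeshare_etl.py may differ.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
    BigQueryInsertJobOperator,
)
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

PROJECT_ID = "my-gcp-project"   # placeholder
DATASET = "bikeshare"           # placeholder
BUCKET = "my-bikeshare-bucket"  # placeholder

with DAG(
    dag_id="bikeshare_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Extract: copy one day of trips from the public dataset into a staging table.
    extract = BigQueryInsertJobOperator(
        task_id="extract_trips",
        configuration={
            "query": {
                "query": (
                    "SELECT * "
                    "FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` "
                    "WHERE DATE(start_time) = '{{ ds }}'"
                ),
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": PROJECT_ID,
                    "datasetId": DATASET,
                    "tableId": "staging_trips_{{ ds_nodash }}",
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )

    # Transform/store: export the staging table to GCS as Parquet, partitioned by date.
    export_to_gcs = BigQueryToGCSOperator(
        task_id="export_to_gcs",
        source_project_dataset_table=f"{PROJECT_ID}.{DATASET}.staging_trips_{{{{ ds_nodash }}}}",
        destination_cloud_storage_uris=[
            f"gs://{BUCKET}/bikeshare/date={{{{ ds }}}}/trips-*.parquet"
        ],
        export_format="PARQUET",
    )

    # Expose: create an external table in BigQuery over the Parquet files in GCS.
    create_external_table = BigQueryCreateExternalTableOperator(
        task_id="create_external_table",
        table_resource={
            "tableReference": {
                "projectId": PROJECT_ID,
                "datasetId": DATASET,
                "tableId": "ext_bikeshare_trips",
            },
            "externalDataConfiguration": {
                "sourceFormat": "PARQUET",
                "sourceUris": [f"gs://{BUCKET}/bikeshare/*"],
            },
        },
    )

    extract >> export_to_gcs >> create_external_table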

Prerequisites

Before you begin, ensure you have the following:

- Docker and Docker Compose
- A Google Cloud project with access to BigQuery and GCS
- A Google Cloud service account key (see Step 2)

Project Structure

austin-biketrips-etl
├── dags/
│   ├── bikeshare_etl.py
│   ├── bikeshare_etl.yaml
│   └── service_account.json
├── logs/
├── plugins/
├── Dockerfile
└── docker-compose.yaml

- dags/: Contains the DAG scripts, the YAML configuration file, and the Google Cloud service account JSON file.
- logs/: Directory for Airflow logs.
- plugins/: Directory for Airflow plugins.
- Dockerfile: Defines the Docker image for the Airflow environment.
- docker-compose.yaml: Docker Compose file to set up Airflow services.

- You can replace austin-biketrips-etl with the name of your Docker working directory.

Setup Instructions

Step 1: Clone the Repository

Clone this repository to your local machine:
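
For example, assuming the repository is hosted at github.com/wagdySamy/austin-bikeshare-trips-airflow:

git clone https://github.com/wagdySamy/austin-bikeshare-trips-airflow.git
cd austin-bikeshare-trips-airflow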

Step 2: Configure Google Cloud Credentials

Place your Google Cloud service account JSON file in the dags/ directory and name it service_account.json. The service account should have the following roles: BigQuery Admin, Storage Admin, and Viewer.
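
The roles can be granted with gcloud, for example (the project ID and service-account email below are placeholders; repeat the command for roles/storage.admin and roles/viewer):

gcloud projects add-iam-policy-binding MY_PROJECT_ID \
  --member="serviceAccount:airflow-etl@MY_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"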

Step 3: Build the Docker Image

Build the Docker image using the provided Dockerfile:

docker-compose build

Step 4: Initialize the Airflow Database

Initialize the Airflow database:

docker-compose run airflow-init

Step 5: Start Airflow Services

Start the Airflow web server and scheduler:

docker-compose up -d

Step 6: Access the Airflow Web UI

Open your browser and go to http://localhost:8080 to access the Airflow web interface.

Step 7: DAG Verification

After the DAG runs, verify the pipeline output by checking:

  • The DAG graph view in the Airflow UI
  • The Parquet files in the GCS bucket
  • The external BigLake table in BigQuery
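
To spot-check the external table from the command line, you can run a quick count with the bq CLI (project and dataset names below are placeholders):

bq query --use_legacy_sql=false 'SELECT COUNT(*) AS trips FROM `my-gcp-project.bikeshare.ext_bikeshare_trips`'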

Step 8: Data Analysis Queries

The SQL script Data Analysis Script\austin_bikeshare_data_analysis_sql_script.sql contains queries that answer the following questions against the new BigLake table ext_bikeshare_trips (a sample query is sketched after the list):

  1. Find the total number of trips for each day.
  2. Calculate the average trip duration for each day.
  3. Identify the top 5 stations with the highest number of trip starts.
  4. Find the average number of trips per hour of the day.
  5. Determine the most common trip route (start station to end station).
  6. Calculate the number of trips each month.
  7. Find the station with the longest average trip duration.
  8. Find the busiest hour of the day (most trips started).
  9. Identify the day with the highest number of trips.
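
As an illustration, question 1 (total trips per day) can be answered along these lines, assuming the external table keeps the public dataset's start_time column and using placeholder project/dataset names:

SELECT
  DATE(start_time) AS trip_date,
  COUNT(*) AS total_trips
FROM `my-gcp-project.bikeshare.ext_bikeshare_trips`
GROUP BY trip_date
ORDER BY trip_date;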
