Create an ETL pipeline using Apache Airflow to manage the end-to-end data flow. The goal is to extract Bikeshare data from a public dataset (austin_bikeshare) in BigQuery, transform and store the data in Google Cloud Storage (GCS) in a partitioned format, and then create an external table in BigQuery to facilitate querying and analysis
Before you begin, ensure you have the following installed:
- Docker: Install Docker
- Docker Compose: Install Docker Compose
austin-biketrips-etl
├── dags/
│ ├── bikeshare_etl.py
│ ├── bikeshare_etl.yaml
│ └── service_account.json
├── logs/
├── plugins/
├── Dockerfile
└── docker-compose.yaml
- dags/: Contains the DAG scripts, the YAML configuration file, and the Google Cloud service account JSON file.
- logs/: Directory for Airflow logs.
- plugins/: Directory for Airflow plugins.
- Dockerfile/: Defines the Docker image for the Airflow environment.
- docker-compose.yaml: Docker Compose file to set up Airflow services.
- You can replace austin-biketrips-etl with the working directory of docker.
Clone this repository to your local machine:
Place your Google Cloud service account JSON file in the dags/
directory and name it service_account.json
.
the service account should have the following roles: BigQuery Admin, Storage Admin, and Viewer
Build the Docker image using the provided Dockerfile
:
docker-compose build
Initialize the Airflow database:
docker-compose run airflow-init
Start the Airflow web server and scheduler:
docker-compose up -d
Open your browser and go to http://localhost:8080 to access the Airflow web interface.
- External BigLake Table
The \Data Analysis Script\austin_bikeshare_data_analysis_sql_script.sql, conatins a SQL queries to answer the following questions using the new BigLake table
: ext_bikeshare_trips
- Find the total number of trips for each day.
- Calculate the average trip duration for each day.
- Identify the top 5 stations with the highest number of trip starts.
- Find the average number of trips per hour of the day.
- Determine the most common trip route (start station to end station).
- Calculate the number of trips each month.
- Find the station with the longest average trip duration.
- Find the busiest hour of the day (most trips started).
- Identify the day with the highest number of trips.