This repo contains an example dbt project that can be run locally using public transit GTFS data, to help people accustomed to working with transit data become familiar with dbt. The functionality in this repo is designed to run entirely locally, without needing any cloud infrastructure, to focus specifically on learning dbt.
The project generally follows dbt's guide on how to structure a dbt project.
- Python 3.11+ (see Python Beginner's Guide)
- uv package and virtual environment manager (see uv instructions)
- dbt 1.9+ (see dbt instructions)
- DuckDB CLI (see DuckDB instructions)
- Clone this repository
- Initialize the virtual environment and install dependencies from the lockfile:
uv venv
uv pip sync
- Install dbt packages:
uv run dbt deps
The project is configured to use a local DuckDB database file. No additional database configuration is required.
To verify your setup:
uv run dbt debug
Before running dbt models, you need to load GTFS data into your local DuckDB database:
- Run the load_gtfs.py script with a GTFS feed URL or local filepath (to the zipped GTFS feed):
uv run python load_gtfs.py "[your GTFS feed URL or filepath]"
You can optionally include a `feed_name` for labeling purposes:
uv run python load_gtfs.py "[your GTFS feed URL or filepath]" --feed_name "my_feed"
This will:
- Download the GTFS feed
- Create a local DuckDB database file (transit_dbt_sandbox.duckdb)
- Load core GTFS tables (routes, agency, trips, stops, stop_times, calendar, and calendar_dates) into the 'raw' schema in that DuckDB database
- Verify the data load using DuckDB CLI:
duckdb transit_dbt_sandbox.duckdb -c "SELECT count(*) FROM raw.routes;"
- Run `dbt build` to run and test all the defined models:
uv run dbt build
- To explore the project's models, generate and serve the dbt docs locally:
uv run dbt docs generate && uv run dbt docs serve
By default, the load script drops the entire DuckDB database and recreates it, so existing raw tables are overwritten and any dbt-created views and tables are removed.
If you want to append new data into the database while keeping the data you've already loaded, use the `--append` flag:
uv run python load_gtfs.py --append "[your additional GTFS feed URL or filepath]"
If you run with `--append`, the new feed will be loaded into the `raw` schema but dbt will not be run automatically, so your raw schema and any dbt-created tables will be out of sync.
Loaded feeds are listed in the `raw.feed_metadata` table in the DuckDB database.
Once you have the project set up, please use it as your sandbox! Make changes to models, add tests, etc. Below are some handy commands you can use to test your work and some concepts you may want to test out.
See the dbt CLI reference for a guide to available dbt commands. Preface commands with `uv run` to use the uv virtual environment in the repo. For example, you can:
- Run all models:
uv run dbt run
- Run specific models:
uv run dbt run --select staging
uv run dbt run --select marts
- Test your models:
uv run dbt test
You can query the DuckDB database as follows:
duckdb transit_dbt_sandbox.duckdb -c "[your query here, ex: SELECT * FROM raw.agency;]"
Some things you could try out:
- Add some additional tests: What are some data integrity guarantees that are not tested yet? (A sketch of one possible test follows this list.)
- Create new models (tables/views): For example, try making a new mart model that summarizes the number of trips per service date per route, with first and last arrival times (sketched after this list).
- Enhance the functionality: For example, the repo does not correctly use `agency.agency_timezone` to localize the dates and times in the feed (see the sketch after this list).
- Test out incremental logic: The `fct_service_stops` and `fct_service_stops_incremental` models are identical, except that the second one is incremental and has an `_insert_time` column recording when records were inserted. You can use these models to try out different approaches to making a model incremental (see the sketch after the note below).
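For the testing idea, here is a minimal sketch of a singular test you could drop into the project's `tests/` directory. It checks one guarantee GTFS implies but the project may not yet test: that each `(trip_id, stop_sequence)` pair in `stop_times` appears only once. The file name is hypothetical, and the `source()` call assumes the project defines a `raw` source for the loaded tables; adjust it (or swap in a `ref()` to a staging model) to match the actual project.

```sql
-- tests/assert_stop_times_sequence_unique.sql (hypothetical file name)
-- A singular test: dbt fails the test if this query returns any rows.
-- Each stop_sequence should appear at most once per trip.
select
    trip_id,
    stop_sequence,
    count(*) as n_rows
from {{ source('raw', 'stop_times') }}
group by trip_id, stop_sequence
having count(*) > 1
```

Singular tests can be run individually by file name, e.g. uv run dbt test --select assert_stop_times_sequence_unique.

For the new mart model idea, a sketch like the following could work. It assumes `fct_service_stops` exposes `service_date`, `route_id`, `trip_id`, and `arrival_time` columns; check the actual column names in the dbt docs first.

```sql
-- models/marts/fct_daily_route_summary.sql (hypothetical model name)
-- Trips per service date per route, with first and last arrival times.
select
    service_date,
    route_id,
    count(distinct trip_id) as n_trips,
    min(arrival_time) as first_arrival_time,
    max(arrival_time) as last_arrival_time
from {{ ref('fct_service_stops') }}
group by service_date, route_id
```

For the timezone enhancement, one possible approach in DuckDB is to combine the service date with the stop time (GTFS times can exceed 24:00:00, so treat them as intervals rather than times) and then interpret the result in the agency's timezone. This is only a sketch: the upstream model and its columns are placeholders, and AT TIME ZONE relies on DuckDB's icu extension.

```sql
-- Hypothetical sketch, not the repo's implementation; assumes an upstream
-- model that joins service_date, arrival_time, and agency_timezone together.
select
    trip_id,
    stop_id,
    -- date + interval handles GTFS times past midnight (e.g. '25:10:00')
    (service_date + arrival_time::interval)
        at time zone agency_timezone as arrival_at
from {{ ref('int_service_stop_times') }}
```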
Note: At the time of writing, the DuckDB dbt adapter only supports the `append` and `delete+insert` incremental strategies.
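To see what that looks like in practice, here is a minimal sketch of an incremental model using the `delete+insert` strategy. It is not the repo's actual `fct_service_stops_incremental` implementation; the upstream model name and the `service_date` column are placeholders.

```sql
-- Hypothetical incremental model sketch.
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='service_date'
    )
}}

select
    *,
    current_timestamp as _insert_time
from {{ ref('stg_service_stops') }}  -- placeholder upstream model

{% if is_incremental() %}
-- on incremental runs, only process service dates newer than what's loaded
where service_date > (select max(service_date) from {{ this }})
{% endif %}
```

With `delete+insert`, rows in the target whose `unique_key` matches the incoming batch are deleted before the new rows are inserted, which makes rerunning a partially loaded service date idempotent.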