
create statewide analysis tables in Google BigQuery #31

Closed
hunterowens opened this issue Mar 19, 2021 · 6 comments

@hunterowens (Member)

Currently, after we download the data, we should aggregate each of the GTFS tables into "statewide" tables suitable for analysis (a sketch follows the list):

  1. stops
  2. schedule / stop times
  3. routes
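
A minimal sketch of the stops aggregation, assuming feeds sit on disk under the `schedule/{execution_date}/{itp_id}/{url_number}/` layout described below; dataset/table names like `statewide.stops` are placeholders, and the same pattern would apply to stop times and routes:

```python
# Sketch only: build a statewide stops table by stacking every feed's
# stops.txt and tagging each row with where it came from.
from pathlib import Path

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
frames = []

for path in Path("schedule/2021-04-05").glob("*/*/stops.txt"):
    itp_id, url_number = path.parts[-3], path.parts[-2]
    df = pd.read_csv(path)
    # Carry the feed identifiers through, so (itp_id, url_number, stop_id)
    # works as a logical primary key in the statewide table.
    df.insert(0, "itp_id", int(itp_id))
    df.insert(1, "url_number", int(url_number))
    frames.append(df)

statewide = pd.concat(frames, ignore_index=True)

# WRITE_TRUNCATE rebuilds the table from scratch on each run.
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
client.load_table_from_dataframe(
    statewide, "statewide.stops", job_config=job_config
).result()
```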
@machow (Contributor) commented Apr 5, 2021

Some thoughts for once we merge #51.

Background

AFAIK the warehouse will have 2 jobs here:

  • storing reports on GTFS data collection - e.g. which agencies are being ingested and their validation results
  • storing GTFS data itself - e.g. using table schemas like in https://github.com/remix/partridge
  • currently, data is stored using the path `schedule/{execution_date}/{itp_id}/{url_number}/[{table_name} or validation.json]`
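
For concreteness, a toy illustration of unpacking that path convention (the example object name is hypothetical):

```python
# Toy illustration only: pull the components back out of the storage path.
path = "schedule/2021-04-05/105/0/stops.txt"

_, execution_date, itp_id, url_number, filename = path.split("/")
print(execution_date, itp_id, url_number, filename)
# 2021-04-05 105 0 stops.txt
```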

Proposal (in order of priority)

Note that everything below assumes we'd be loading only the most current day's data. That might be the easiest place to start investigating / kicking things off with BigQuery. For keeping the full history and updating incrementally, we'd need to add something like `execution_date` to the primary keys listed below.

  • Load agencies.yml agencies in tabular form (e.g. gtfs_agency)
    • Nest agency URLs, so they can be transformed into the next table
  • Load agencies.yml GTFS URLs in tabular form (e.g. gtfs_data_agency_urls)
    • PK: itp_id, url_number
  • Load validation results (see the sketch after this list)
    • PK: itp_id, url_number
    • data column holding JSON
    • this will let us get started on the tricky work of modeling/transforming gtfs-validator results
  • Load GTFS data itself
    • e.g. use a model like https://github.com/remix/partridge
    • PK: itp_id, url_number, plus whatever PK partridge uses (e.g. agency_id for the agency table)
    • we'll need a strategy for tables with differing validation results (e.g. should we load the ones that fail?)
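
A minimal sketch of the validation-results load, assuming the local layout above; the fully qualified table name is a placeholder, and BigQuery won't enforce the primary key, so (itp_id, url_number) is only a logical key:

```python
# Sketch: one row per (itp_id, url_number) with the raw validation.json held
# in a STRING column, to be modeled/transformed in SQL later.
from pathlib import Path

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("itp_id", "INTEGER"),
    bigquery.SchemaField("url_number", "INTEGER"),
    bigquery.SchemaField("data", "STRING"),  # raw gtfs-validator output
]

rows = []
for path in Path("schedule/2021-04-05").glob("*/*/validation.json"):
    itp_id, url_number = path.parts[-3], path.parts[-2]
    rows.append(
        {"itp_id": int(itp_id), "url_number": int(url_number), "data": path.read_text()}
    )

# "my-project" is a hypothetical GCP project id.
table = client.create_table(
    bigquery.Table("my-project.views.validation_results", schema=schema),
    exists_ok=True,
)
errors = client.insert_rows_json(table, rows)
assert not errors, errors
```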

@e-lo (Contributor) commented Apr 5, 2021

[not sure if this is helpful/relevant]
There is a GTFS Frictionless data schema and there is [new and untested] Pandera support for validating frictionless schemas.
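
A rough sketch of what that pandera validation could look like on a GTFS stops table, hand-writing the schema rather than using the new frictionless integration (since that's untested); the checks are illustrative, not the full GTFS spec:

```python
# Rough sketch: validate a GTFS stops table with pandera.
import pandas as pd
import pandera as pa

stops_schema = pa.DataFrameSchema(
    {
        "stop_id": pa.Column(str),
        "stop_lat": pa.Column(float, pa.Check.in_range(-90, 90)),
        "stop_lon": pa.Column(float, pa.Check.in_range(-180, 180)),
    }
)

stops = pd.read_csv("stops.txt")  # hypothetical local feed file
stops_schema.validate(stops)  # raises SchemaError if a check fails
```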

@machow (Contributor) commented Apr 5, 2021

@e-lo these are super helpful--thanks! I'll definitely use these repos (and scan their issues) to try and figure out how to load the data / common snags people hit :o.

@machow (Contributor) commented Apr 12, 2021

Going to try loading schedules over the next day. Now that I've dug a bit more into some of the schema formats out there, it seems like taking a two-step strategy is useful. That is:

This GTFS frictionless data schema defines two levels of validation:

  • checks needed to load into the warehouse (e.g. column types)
  • checks needed for transforming the data (e.g. valid stop latitudes)

A big advantage of separating these out is that the step-2 validations can run directly inside the warehouse, so we avoid looping over files and can expose the task to analysts. For example:
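
A minimal sketch of such an in-warehouse check, flagging out-of-range latitudes in the (hypothetical) statewide stops table:

```python
# Sketch: run a "step 2" validation as plain SQL in BigQuery instead of
# looping over files. The table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT itp_id, url_number, stop_id, stop_lat
FROM statewide.stops
WHERE stop_lat NOT BETWEEN -90 AND 90
"""

for row in client.query(query).result():
    print(row.itp_id, row.url_number, row.stop_id, row.stop_lat)
```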

@hunterowens (Member, Author) commented

also @machow should we close?

@machow (Contributor) commented May 5, 2021

ah, yeah!
