
create statewide analysis tables in Google BigQuery #31

Closed
hunterowens opened this issue Mar 19, 2021 · 6 comments

@hunterowens (Member)

Currently, after we download the data, we should aggregate each of the GTFS tables into "statewide" tables suitable for analysis (a sketch follows the list):

  1. stops
  2. schedule / stop times
  3. routes
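
A minimal sketch of the stops aggregation, assuming feeds sit on disk under the `schedule/{execution_date}/{itp_id}/{url_number}/` layout described below; dataset/table names like `statewide.stops` are placeholders, and the same pattern would apply to stop times and routes:

```python
# Sketch only: build a statewide stops table by stacking every feed's
# stops.txt and tagging each row with where it came from.
from pathlib import Path

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
frames = []

for path in Path("schedule/2021-04-05").glob("*/*/stops.txt"):
    itp_id, url_number = path.parts[-3], path.parts[-2]
    df = pd.read_csv(path)
    # Carry the feed identifiers through, so (itp_id, url_number, stop_id)
    # works as a logical primary key in the statewide table.
    df.insert(0, "itp_id", int(itp_id))
    df.insert(1, "url_number", int(url_number))
    frames.append(df)

statewide = pd.concat(frames, ignore_index=True)

# WRITE_TRUNCATE rebuilds the table from scratch on each run.
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
client.load_table_from_dataframe(
    statewide, "statewide.stops", job_config=job_config
).result()
```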
@machow (Contributor) commented Apr 5, 2021

Some thoughts for once we merge #51.

Background

AFAIK the warehouse will have 2 jobs here:

  • storing reports on GTFS data collection - e.g. which agencies are being ingested and their validation results
  • storing GTFS data itself - e.g. using table schemas like in https://github.com/remix/partridge
  • currently, data is stored using the path `schedule/{execution_date}/{itp_id}/{url_number}/[{table_name} or validation.json]`
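
For concreteness, a toy illustration of unpacking that path convention (the example object name is hypothetical):

```python
# Toy illustration only: pull the components back out of the storage path.
path = "schedule/2021-04-05/105/0/stops.txt"

_, execution_date, itp_id, url_number, filename = path.split("/")
print(execution_date, itp_id, url_number, filename)
# 2021-04-05 105 0 stops.txt
```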

Proposal (in order of priority)

Note that everything below assumes we'd be loading only the most current day's data. That might be the easiest place to start investigating / kicking things off with BigQuery. For keeping the full history and updating incrementally, we'd need to add something like `execution_date` to the primary keys listed below.

  • Load agencies.yml agencies in tabular form (e.g. gtfs_agency)
    • Nest agency URLs, so they can be transformed into the next table
  • Load agencies.yml GTFS URLs in tabular form (e.g. gtfs_data_agency_urls)
    • PK: itp_id, url_number
  • Load validation results (see the sketch after this list)
    • PK: itp_id, url_number
    • data column holding JSON
    • this will let us get started on the tricky work of modeling/transforming gtfs-validator results
  • Load GTFS data itself
    • e.g. use a model like https://github.com/remix/partridge
    • PK: itp_id, url_number, plus whatever PK partridge uses (e.g. agency_id for the agency table)
    • we'll need a strategy for tables with differing validation results (e.g. should we load the ones that fail?)
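
A minimal sketch of the validation-results load, assuming the local layout above; the fully qualified table name is a placeholder, and BigQuery won't enforce the primary key, so (itp_id, url_number) is only a logical key:

```python
# Sketch: one row per (itp_id, url_number) with the raw validation.json held
# in a STRING column, to be modeled/transformed in SQL later.
from pathlib import Path

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("itp_id", "INTEGER"),
    bigquery.SchemaField("url_number", "INTEGER"),
    bigquery.SchemaField("data", "STRING"),  # raw gtfs-validator output
]

rows = []
for path in Path("schedule/2021-04-05").glob("*/*/validation.json"):
    itp_id, url_number = path.parts[-3], path.parts[-2]
    rows.append(
        {"itp_id": int(itp_id), "url_number": int(url_number), "data": path.read_text()}
    )

# "my-project" is a hypothetical GCP project id.
table = client.create_table(
    bigquery.Table("my-project.views.validation_results", schema=schema),
    exists_ok=True,
)
errors = client.insert_rows_json(table, rows)
assert not errors, errors
```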

@e-lo (Contributor) commented Apr 5, 2021

[not sure if this is helpful/relevant]
There is a GTFS Frictionless data schema and there is [new and untested] Pandera support for validating frictionless schemas.
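
A rough sketch of what that pandera validation could look like on a GTFS stops table, hand-writing the schema rather than using the new frictionless integration (since that's untested); the checks are illustrative, not the full GTFS spec:

```python
# Rough sketch: validate a GTFS stops table with pandera.
import pandas as pd
import pandera as pa

stops_schema = pa.DataFrameSchema(
    {
        "stop_id": pa.Column(str),
        "stop_lat": pa.Column(float, pa.Check.in_range(-90, 90)),
        "stop_lon": pa.Column(float, pa.Check.in_range(-180, 180)),
    }
)

stops = pd.read_csv("stops.txt")  # hypothetical local feed file
stops_schema.validate(stops)  # raises SchemaError if a check fails
```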

@machow (Contributor) commented Apr 5, 2021

@e-lo these are super helpful--thanks! I'll definitely use these repos (and scan their issues) to try and figure out how to load the data / common snags people hit :o.

@machow (Contributor) commented Apr 12, 2021

Going to try loading schedules over the next day. Now that I've dug a bit more into some of the schema formats out there, it seems like taking a two-step strategy is useful. That is:

This GTFS frictionless data schema defines two levels of validation:

  • checks needed to load into the warehouse (e.g. column types)
  • checks needed for transforming the data (e.g. valid stop latitudes)

A big advantage of separating these out is that the step-2 validations can run directly inside the warehouse, so we avoid looping over files and can expose the task to analysts. For example:
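
A minimal sketch of such an in-warehouse check, flagging out-of-range latitudes in the (hypothetical) statewide stops table:

```python
# Sketch: run a "step 2" validation as plain SQL in BigQuery instead of
# looping over files. The table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT itp_id, url_number, stop_id, stop_lat
FROM statewide.stops
WHERE stop_lat NOT BETWEEN -90 AND 90
"""

for row in client.query(query).result():
    print(row.itp_id, row.url_number, row.stop_id, row.stop_lat)
```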

@hunterowens (Member, Author) commented

also @machow should we close?

@machow (Contributor) commented May 5, 2021

ah, yeah!
