GTFS Schedule optimizations: require partition filters #2138

Merged
merged 6 commits into from
Feb 15, 2023

Conversation

atvaccaro
Contributor

@atvaccaro atvaccaro commented Jan 9, 2023

Description

NOTE: this must wait until #2103 is merged

Closes #2134

This is non-trivial to test for correctness because of a lack of true primary keys in the final dimension tables, but I've verified re-running this does not produce duplicate rows. The underlying macro first filters the possible GTFS data by day, but still inner joins against feed versions.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation
  • agencies.yml

How has this been tested?

Testing as best I can, though we may still have some queries that fail. We can monitor the query access logs for failures.

Running and testing with -m +mart.gtfs_schedule_latest passes about as much as we would expect. dim_translations fails; we may just want to disable it?

Screenshots (optional)

@atvaccaro atvaccaro self-assigned this Jan 9, 2023
@atvaccaro atvaccaro changed the title just require partition filters everywhere require partition filters and adjust accordingly Jan 9, 2023
@atvaccaro atvaccaro force-pushed the require-partition-filters branch 2 times, most recently from 0a9644c to e5da898 Compare January 9, 2023 22:57
@atvaccaro atvaccaro marked this pull request as ready for review January 10, 2023 16:46
@atvaccaro atvaccaro force-pushed the require-partition-filters branch from af0fe71 to 1a4d8ce Compare January 10, 2023 17:07
@atvaccaro atvaccaro added the do-not-merge Do not merge, even if approved label Jan 10, 2023
@atvaccaro atvaccaro force-pushed the require-partition-filters branch 2 times, most recently from 6faf0ab to 51e3b8e Compare January 13, 2023 20:06
@atvaccaro atvaccaro removed the do-not-merge Do not merge, even if approved label Jan 13, 2023
@atvaccaro atvaccaro force-pushed the require-partition-filters branch from b2750a7 to 3fc865e Compare January 13, 2023 22:10
Contributor

@lauriemerrell lauriemerrell left a comment

I am still not sold on the need for this. I think that we can afford to run all of history for most schedule files with no problems, and I am particularly unsure that 120 days is enough for schedule. Like I think that when we do a full refresh we will basically always end up overriding to get all of history since 2021-04-16.

Can you add more documentation in the ticket or here about the rationale for doing this, and how you picked the defaults? Personally I think that having a mechanism for different lookback between prod and dev is actually more important than changing prod.
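
One way to get a different lookback in prod and dev is to branch on dbt's `target.name`. This is only a sketch: the macro name `schedule_lookback_start` and the target name `prod` are assumptions, and the 120-day and 2021-04-16 values are simply the ones mentioned in this comment.

```sql
{# Hypothetical macro: returns a BigQuery date expression for the start of the lookback window. #}
{% macro schedule_lookback_start() %}
    {% if target.name == 'prod' %}
        {# Assumed prod target name; read all of history. #}
        date '2021-04-16'
    {% else %}
        {# Dev and other targets: only read a recent window. #}
        date_sub(current_date(), interval 120 day)
    {% endif %}
{% endmacro %}
```

A model could then filter its source with `where dt >= {{ schedule_lookback_start() }}`.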

@@ -1,12 +1,39 @@
{% macro make_schedule_file_dimension_from_dim_schedule_feeds(dim_schedule_feeds, gtfs_file_table) %}
{{ config(materialized='incremental') }}
Contributor

It feels a bit weird to me to have the materialization in a macro. I guess it is ok? But I am a bit 🤔

Contributor Author

Yeah it's a bit weird but ideally we just force anything using this macro to be incremental, since the macro itself only works in an incremental model.

Luckily, multiple calls to config are explicitly supported: dbt-labs/dbt-core#1150
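
A minimal sketch of what that looks like, with hypothetical macro, model, and column names (only `config`, `is_incremental()`, `this`, and `ref` are dbt built-ins, and `cluster_by` is a dbt-bigquery model config). The macro sets the materialization, and the calling model adds its own settings in a second config call.

```sql
{# macros/make_incremental_dimension.sql (hypothetical) #}
{% macro make_incremental_dimension(source_relation) %}
{{ config(materialized='incremental') }}
select *
from {{ source_relation }}
{% if is_incremental() %}
{# On incremental runs, only pick up partitions newer than what is already loaded. #}
where dt > (select max(dt) from {{ this }})
{% endif %}
{% endmacro %}

{# models/dim_example.sql (hypothetical): a second config call from the model itself #}
{{ config(cluster_by=['feed_key']) }}
{{ make_incremental_dimension(ref('stg_example')) }}
```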

@atvaccaro
Contributor Author

I am still not sold on the need for this. I think that we can afford to run all of history for most schedule files with no problems, and I am particularly unsure that 120 days is enough for schedule. Like I think that when we do a full refresh we will basically always end up overriding to get all of history since 2021-04-16.

Yes that's fair, I'll change this at minimum.

Can you add more documentation in the ticket or here about the rationale for doing this, and how you picked the defaults? Personally I think that having a mechanism for different lookback between prod and dev is actually more important than changing prod.

The immediate cause of this work was Metabase fully scanning stop_times for auto-complete. Even though it queries individual columns, any query that hits an external table must read the full file, and Metabase was doing no filtering on dt (which we shouldn't expect it to do). More broadly, I want us to start protecting our external tables from expensive queries that we may not realize are occurring unless we look at audit logs. Even if we read in all of Schedule history (i.e. set the lookback to 2021-04-16), we'd still be protecting ourselves against expensive reads on stop_times, etc. via Metabase/Jupyter/whatever, human or computer. Having to filter in dbt is more a consequence than a goal. If you would prefer, I can split this PR so that it only requires filters, with a dummy where clause in dbt, and either skip incremental materializations or do them in a separate PR.
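
For reference, BigQuery can enforce this on hive-partitioned external tables with the `require_hive_partition_filter` option. The DDL below is an illustrative sketch only: the dataset, bucket, and file format are placeholders, and the external tables in this repo may be defined through a different mechanism than raw DDL.

```sql
-- Placeholder dataset and bucket names; not the project's real ones.
create or replace external table example_dataset.stop_times
with partition columns  -- partition keys such as dt are detected from the gs:// paths
options (
    format = 'NEWLINE_DELIMITED_JSON',
    uris = ['gs://example-bucket/stop_times/*'],
    hive_partition_uri_prefix = 'gs://example-bucket/stop_times',
    require_hive_partition_filter = true
);

-- With the option enabled, a query must filter on a partition column,
-- so an unfiltered "select stop_id from example_dataset.stop_times" is rejected.
select stop_id
from example_dataset.stop_times
where dt = '2023-01-19';
```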

@atvaccaro atvaccaro added the do-not-merge Do not merge, even if approved label Jan 18, 2023
@atvaccaro atvaccaro force-pushed the require-partition-filters branch from 3fc865e to fa7196f Compare January 19, 2023 16:29
@atvaccaro atvaccaro changed the base branch from main to historical-airtable-try2 January 19, 2023 16:29
@atvaccaro atvaccaro changed the title require partition filters and adjust accordingly require partition filters and adjust models and tests accordingly Jan 19, 2023
Base automatically changed from historical-airtable-try2 to main January 19, 2023 18:30
@SorenSpicknall
Contributor

NOTE: must wait until #2138 is merged

Is this supposed to be self-referencing, or is there another PR that should be referenced as a dependency here?

@atvaccaro
Contributor Author

NOTE: must wait until #2138 is merged

Is this supposed to be self-referencing, or is there another PR that should be referenced as a dependency here?

Ah no, good catch; it should've referenced #2103.

@atvaccaro atvaccaro force-pushed the require-partition-filters branch from 440194b to a3e4849 Compare January 19, 2023 22:52
@atvaccaro
Contributor Author

atvaccaro commented Jan 20, 2023

Here are some example queries re: the Slack discussion.

select *
from cal-itp-data-infra-staging.andrew_mart_gtfs.dim_stop_times
where '2023-01-19' between _valid_from and _valid_to; -- 460 gb

select *
from cal-itp-data-infra-staging.andrew_mart_gtfs.dim_stop_times_clustered_by_valids
where '2023-01-19' between _valid_from and _valid_to; -- 460 gb

select *
from cal-itp-data-infra-staging.andrew_mart_gtfs.dim_stop_times_clustered_by_valids
where _valid_from < '2023-01-19' AND '2023-01-19' < _valid_to; -- 13 gb

select *
from cal-itp-data-infra.mart_gtfs.dim_stop_times st
inner join cal-itp-data-infra.mart_gtfs.dim_schedule_feeds sf
  on st.feed_key = sf.key
where '2023-01-19' between sf._valid_from and sf._valid_to; -- 461 gb

select *
from cal-itp-data-infra-staging.andrew_mart_gtfs.dim_stop_times_clustered_by_feed_key st
inner join cal-itp-data-infra.mart_gtfs.dim_schedule_feeds sf
  on st.feed_key = sf.key
where '2023-01-19' between sf._valid_from and sf._valid_to; -- 33 gb
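
The thread does not show how the clustered test tables above were built. One plausible way, sketched here with a placeholder dataset name, is a `create table ... cluster by` statement; in a dbt-bigquery model the equivalent would be the `cluster_by` config.

```sql
-- Placeholder dataset; copies the dimension table and clusters it on feed_key
-- so that filters on feed_key can skip unrelated storage blocks.
create or replace table example_dataset.dim_stop_times_clustered_by_feed_key
cluster by feed_key
as
select *
from example_dataset.dim_stop_times;
```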

@atvaccaro atvaccaro changed the title require partition filters and adjust models and tests accordingly GTFS Schedule optimizations: require partition filters and change our clustering on dimension tables Jan 20, 2023
@atvaccaro atvaccaro force-pushed the require-partition-filters branch from 5f4734c to eeec617 Compare January 24, 2023 02:24
@atvaccaro atvaccaro removed the do-not-merge Do not merge, even if approved label Jan 24, 2023
Contributor

@lauriemerrell lauriemerrell left a comment

There's still YAML in _int_gtfs for int_gtfs_schedule__incremental_stop_times and the others (GH won't let me comment inline), can those be deleted?

warehouse/dbt_project.yml Outdated
warehouse/macros/get_latest_schedule_data.sql Outdated
@atvaccaro atvaccaro force-pushed the require-partition-filters branch from 4667eed to 147af26 Compare January 25, 2023 15:51
Contributor

@lauriemerrell lauriemerrell left a comment

See my response to the comment on the macro.

@atvaccaro atvaccaro force-pushed the require-partition-filters branch from 147af26 to 67d12ea Compare January 26, 2023 20:22
@atvaccaro atvaccaro added the do-not-merge Do not merge, even if approved label Jan 27, 2023
@atvaccaro
Contributor Author

Going to wait post-reports since this changes external tables and has the potential to cause unexpected problems.

@atvaccaro atvaccaro force-pushed the require-partition-filters branch 2 times, most recently from a1fdeb8 to 0eaa17a Compare January 31, 2023 18:18
@atvaccaro atvaccaro removed the do-not-merge Do not merge, even if approved label Jan 31, 2023
@atvaccaro atvaccaro force-pushed the require-partition-filters branch from 0eaa17a to 0c20180 Compare January 31, 2023 18:46
@atvaccaro atvaccaro changed the title GTFS Schedule optimizations: require partition filters and change our clustering on dimension tables GTFS Schedule optimizations: require partition filters Jan 31, 2023
@atvaccaro atvaccaro force-pushed the require-partition-filters branch from 0c20180 to 2af659b Compare February 8, 2023 21:45
@atvaccaro atvaccaro force-pushed the require-partition-filters branch from 2af659b to 277911d Compare February 15, 2023 19:06
@atvaccaro atvaccaro merged commit 45b159d into main Feb 15, 2023
@atvaccaro atvaccaro deleted the require-partition-filters branch February 15, 2023 19:08
Development

Successfully merging this pull request may close these issues.

Change GTFS Schedule external tables to require partition filters
3 participants