-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GTFS Schedule optimizations: require partition filters #2138
Conversation
0a9644c
to
e5da898
Compare
af0fe71
to
1a4d8ce
Compare
6faf0ab
to
51e3b8e
Compare
b2750a7
to
3fc865e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still not sold on the need for this. I think that we can afford to run all of history for most schedule files with no problems, and I am particularly unsure that 120 days is enough for schedule. Like I think that when we do a full refresh we will basically always end up overriding to get all of history since 2021-04-16
.
Can you add more documentation in the ticket or here about the rationale for doing this, and how you picked the defaults? Personally I think that having a mechanism for different lookback between prod and dev is actually more important than changing prod.
@@ -1,12 +1,39 @@ | |||
{% macro make_schedule_file_dimension_from_dim_schedule_feeds(dim_schedule_feeds, gtfs_file_table) %} | |||
{{ config(materialized='incremental') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a bit weird to me to have the materialization in a macro. I guess it is ok? But I am a bit 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's a bit weird but ideally we just force anything using this macro to be incremental, since the macro itself only works in an incremental model.
Multiple calls to config are explicitly supported, luckily dbt-labs/dbt-core#1150
Yes that's fair, I'll change this at minimum.
The immediate cause of this work was Metabase fully scanning stop_times for purposes of auto-complete; even though it queries individual columns, any query that hits an external table must read the full file, and Metabase was doing no filtering on |
3fc865e
to
fa7196f
Compare
Is this supposed to be self-referencing, or is there another PR that should be referenced as a dependency here? |
440194b
to
a3e4849
Compare
Here's some example queries re: Slack discussion.
|
5f4734c
to
eeec617
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still YAML in _int_gtfs
for int_gtfs_schedule__incremental_stop_times
and the others (GH won't let me comment inline), can those be deleted?
4667eed
to
147af26
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see response on comment on the macro
147af26
to
67d12ea
Compare
Going to wait post-reports since this changes external tables and has the potential to cause unexpected problems. |
a1fdeb8
to
0eaa17a
Compare
0eaa17a
to
0c20180
Compare
0c20180
to
2af659b
Compare
2af659b
to
277911d
Compare
Description
NOTE: must wait until #2103 is mergedthis is mergedCloses #2134
This is non-trivial to test for correctness because of a lack of true primary keys in the final dimension tables, but I've verified re-running this does not produce duplicate rows. The underlying macro first filters the possible GTFS data by day, but still inner joins against feed versions.
Type of change
How has this been tested?
Testing as best I can, though we may still have some queries that fail. We can monitor the query access logs for failures.
-m +mart.gtfs_schedule_latest
run/test passes as much as we would expect.dim_translations
fails, we may just want to disable it?Screenshots (optional)