dbt snapshots to handle append tables #3878
Comments
@mpavillet This is a really interesting issue. Thanks for the thoughtful & thorough write-up! In some sense, you're actually in a better position than most folks using snapshots. Those folks need to use snapshots, because they don't have access to a full roster of historical values; you have them in spades. For me, this begs a question: which is the more important feature of snapshots?
I think it's the latter. If you have guaranteed access to all historical values, with a full table copy from every single day, and you want to summarize those values into slowly changing dimensions, you can write the SQL for it and wrap that SQL in an incremental model. That SQL is not trivial, but (I think) it's also less complicated than you'd expect. Plus, that incremental model can include logic to grab only the newest days of data, determined dynamically from a bookmark of the model's own last run, so there's no need to manually reprocess day by day. If at any time you need to change the logic, you can simply rebuild the model from scratch (--full-refresh), since the full history is still sitting in the source. Here's some (untested) code I'm imagining for that model:

{{ config(
materialized = 'incremental',
unique_key = 'scd_id',
merge_update_columns = ['valid_to']
) }}
with all_or_new as (
select * from {{ source('my_date_clustered', 'source_table') }}
{% if is_incremental() %}
-- get the latest day of data from when this model was last run.
-- always reprocess that day and the day before (just to be safe)
-- plus anything newer, of course
where date_day >= {{ dbt_utils.dateadd(
    datepart='day', interval=-1,
    from_date_or_timestamp="(select max(date_day) from " ~ this ~ ")"
) }}
{% endif %}
),
hashed as (
select *,
{{ dbt_utils.surrogate_key(['id', 'updated_at']) }} as scd_id
from all_or_new
),
windowed as (
select *,
min(date_day) over (partition by scd_id) as valid_from,
-- valid_to should be null if this scd_id appears in the most recent day of data
-- (i.e. the record is still valid)
nullif(
max(date_day) over (partition by scd_id),
max(date_day) over ()
) as valid_to
from windowed
)
select * from windowed
-- deduplicate: only take the latest day of data for each scd_id
qualify row_number() over (partition by scd_id order by date_day desc) = 1

There are a few significant things to call out here.
Curious to hear what you think! Does that logic feel unwieldy? Have I missed a few thorny details of your specific use case?
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
@jtcohen6 can you revise your logic? I don't think it's correct. For scd2 we need to update the old row (set its valid_to) and insert the new row, and I don't see that logic here: the merge may update the old record, but what about the new one? The way the surrogate key is created is also difficult to understand. And the windowed CTE selects from itself as written; it should select from hashed.
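For context on the update-versus-insert question above, here is a minimal sketch of the merge that dbt's incremental merge strategy would run for the config in the earlier snippet (unique_key = 'scd_id', merge_update_columns = ['valid_to']). It is not dbt's verbatim output; the model name my_scd_model, the schema, and the column list are assumptions for illustration, and the __dbt_tmp relation stands in for the staging relation built from the model's select.

-- matched scd_ids only get valid_to overwritten (per merge_update_columns);
-- brand-new scd_ids, i.e. records that changed, arrive as fresh rows
merge into analytics.my_scd_model as dest
using analytics.my_scd_model__dbt_tmp as src
    on dest.scd_id = src.scd_id
when matched then update set
    valid_to = src.valid_to
when not matched then insert
    (scd_id, id, updated_at, date_day, valid_from, valid_to)
values
    (src.scd_id, src.id, src.updated_at, src.date_day, src.valid_from, src.valid_to);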
Is this the same issue as the one described here?
Describe the feature
A dbt snapshot feature to handle scd2 changes when a given id exists several times (data is appended to the source table, as opposed to overwritten). Currently, dbt snapshots assume that a given id appears at most once in the source table, i.e. that rows are updated in place rather than appended.
Describe alternatives you've considered
The way our data lake works: dbt reads from the data lake and writes to our data warehouse; those are two distinct databases in Snowflake.
We get a FULL copy of the prod table every day, clustered by date, so we have a full table for every day. The delta between day and day-1 is what we need for scd2. The only way I've made it work so far is to have snapshots pull from the latest version of the table in prod (see the sketch below).
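For illustration, one way to express that day-over-day delta in SQL is sketched below; the table and column names (source_table, id, updated_at, date_day) are borrowed from the snippet earlier in the thread and may not match the real schema.

with today as (
    select * from source_table
    where date_day = current_date
),
yesterday as (
    select * from source_table
    where date_day = dateadd(day, -1, current_date)
)
-- rows that are new or changed relative to yesterday's full copy
select t.*
from today as t
left join yesterday as y
    on t.id = y.id
    and t.updated_at = y.updated_at
where y.id is null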
Problem: if we lose a day or two, we need to re-run for the missing days, which I've made work by passing a variable, e.g. dbt snapshot --vars 'ref_date: my date'
Sadly, snapshots in dbt don't allow iterating over an array of dates.
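For reference, the workaround described above might look roughly like this inside the snapshot's select, with the date supplied at run time via the --vars flag; the source and variable names here are illustrative, not confirmed from the issue.

select *
from {{ source('my_date_clustered', 'source_table') }}
-- only snapshot the daily copy for the requested date;
-- re-running with a different ref_date backfills one missed day at a time
where date_day = '{{ var("ref_date") }}'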
Additional context
It's not database-specific, but Snowflake does throw a nondeterministic merge error (controlled by the ERROR_ON_NONDETERMINISTIC_MERGE parameter) if an id exists several times. When the parameter is set to false, Snowflake doesn't throw an error, but it does not handle the scd2 changes properly (see dates below).
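For anyone reproducing this, the Snowflake parameter mentioned above can be toggled at the session (or account) level; the session-level form is shown below.

-- with the parameter at its default of true, Snowflake errors when a merge matches
-- the same target row more than once; with false it silently applies one of the rows
alter session set ERROR_ON_NONDETERMINISTIC_MERGE = false;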
Who will this benefit?
Anyone who wishes to have scd2 snapshots derived from append tables, or to create an scd2 table off an incremental table where an id can exist several times.
Are you interested in contributing this feature?
Yes. I am happy to look into the SQL, figure out what needs to be done, and update the macro.