Replies: 16 comments 28 replies
-
-
Regarding the naïve timestamp approach: I feel this is a pretty big problem because it can result in the wrong data being written.

In our case, we use Fivetran to load Snowflake. We use a timestamp snapshot strategy with the `_FIVETRAN_SYNCED` column as the `updated_at` config. We also include `where _fivetran_deleted = FALSE` in our snapshots, so when a row is deleted in the source, the snapshot will indicate the row is no longer valid. The `_FIVETRAN_SYNCED` column is defined as a `timestamp_tz`, and therefore when the initial snapshot occurs the `dbt_valid_from` and `dbt_valid_to` columns are also defined as `timestamp_tz`. When a record is then deleted from the source, the dbt snapshot sets the `dbt_valid_to` column to `snapshot_get_time()`. However, `snapshot_get_time()` returns a `timestamp_ntz` value in the UTC time zone. This UTC time is then implicitly converted to a `timestamp_tz`, which stores the UTC value and adds the time zone offset of the session - in our case, currently -0500. The resulting time is 5 hours in the future, which is not correct.

As mentioned, I feel this is a bug and should be addressed in dbt-core. Some ideas follow; I realize some of them are breaking changes, so maybe a new snapshot strategy should be introduced that allows users to make the switch when ready.

Workarounds: We found that overriding `snapshot_get_time()` was the best workaround for us. I don't like the solution and it is a little risky, but we changed the Snowflake command to: `to_varchar(convert_timezone('UTC', current_timestamp()), 'YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM')`. This converts the timestamp to a varchar which retains the offset information, so when it is stored to a `timestamp_tz` or `timestamp_ltz` column the offset is correct. When stored to a `timestamp_ntz` column it is also correct and in UTC, because the offset is trimmed away. As mentioned, this is risky because I am not sure of all the places `snapshot_get_time()` is used, and changing the data type could be problematic.
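For reference, a minimal sketch of that override, assuming dbt's adapter dispatch will pick up a project-level `snowflake__snapshot_get_time` macro in place of the built-in one:

```sql
-- macros/snapshot_get_time.sql
-- Sketch of the workaround described above; verify dispatch behavior on
-- your dbt version before relying on this, since it affects every snapshot.
{% macro snowflake__snapshot_get_time() %}
    to_varchar(convert_timezone('UTC', current_timestamp()), 'YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM')
{% endmacro %}
```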
-
@graciegoheen & @dbeatty10 - A few more features to add to the list, as requested by some of our high profile dbt-ers...
-
TL;DR: Snapshots do not support renaming their metafields (e.g. `dbt_valid_from`, `dbt_valid_to`, `dbt_scd_id`, `dbt_updated_at`). Adding another feature here - some users want to customize those dbt metafields, for example by renaming them.
Some more advanced dbt users may even know the "trick" of modifying the behaviour of dbt's built-in materializations - i.e. overwriting the macros that come with dbt (https://github.com/dbt-labs/dbt-core/tree/f65e4b6940a0775a0c7fca1a54d9754ef954e926/core/dbt/include/global_project/macros/materializations/snapshots). However, this DOES NOT WORK, because dbt specifically checks for the existence of those columns (dbt-core/core/dbt/adapters/base/impl.py, lines 689 to 721 at f65e4b6). The typical/recommended method currently is to create a view on top of the snapshot which renames those metafields to whatever you want them to be, as sketched below.
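For illustration, a minimal sketch of that view-on-top workaround (the snapshot name, columns, and renamed metafields are all hypothetical):

```sql
-- models/my_snapshot_history.sql
-- Hypothetical view exposing the snapshot's metafields under friendlier names.
{{ config(materialized='view') }}

select
    id,
    first_name,
    dbt_valid_from as valid_from,
    dbt_valid_to   as valid_to
from {{ ref('my_snapshot') }}
```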
-
@T-Dunlap, @graciegoheen, and I got a chance to discuss snapshots on a video call today. Some of the items we discussed:
-
Hi, are there any plans for handling data type changes to a snapshotted table? Reading this thread, the solutions suggested are fairly manual. It would be amazing if dbt handled this more gracefully - e.g. some ability to set a default behavior, such as renaming the column with the previous data type and creating a new column: https://discourse.getdbt.com/t/snapshots-when-column-data-type-changes/10452/2
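For concreteness, a hedged sketch of that manual workaround (Snowflake-style DDL; the table and column names are hypothetical):

```sql
-- Keep the old data under a renamed column, then add a fresh column with
-- the new type so subsequent snapshot runs can populate it.
alter table analytics.snapshots.orders_snapshot rename column amount to amount_old_varchar;
alter table analytics.snapshots.orders_snapshot add column amount number(38, 2);
```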
-
We’ve launched an initiative this quarter to give snapshots some love.

We'll be running a community feedback session in a couple of weeks about dbt snapshots:

Thursday, 13 June, 8am Pacific: dbt snapshots as a first-class citizen

Please join us to see the problems we're trying to solve and the designs we're considering. Some supporting resources: 👉 Register here 👈
-
Jumping in on this discussion - I'm not sure if I have a snapshot problem/solution or more of a general question, but I was wondering if this post from Claire is still the best/most recent recommendation for handling snapshots across different environments. Basically, what is the recommended config for snapshots when (for example) you have 3 different databases for dev, staging, and prod? And how does defer play into this? Since snapshots are more like sources than models, is there special behavior around how snapshots are handled... or should there be?
-
Hi! Jumping in a bit late to the discussion, but here are some changes to snapshots that we've had to make (see the sketch after this list for the first one):

- Running snapshots per batch in a loop.
- Some sources have hard deletes happening inside a window; in essence, they are partial loads of a source table: "always extract the previous 7 days in one batch".
- For some really huge tables (on the order of tens of billions of rows), we've added yet another optimization (which doesn't allow for hard deletes).
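A rough sketch of the per-batch pattern from the first bullet, assuming a hypothetical `orders` source with an `updated_at` column and a `batch_date` var supplied by the orchestrator (e.g. `dbt snapshot --select orders_snapshot --vars '{batch_date: 2024-06-01}'`, once per batch):

```sql
-- snapshots/orders_snapshot.sql
-- Each invocation snapshots a single batch; the orchestrator loops over dates.
{% snapshot orders_snapshot %}

{{
    config(
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select *
from {{ source('app', 'orders') }}
where updated_at >= '{{ var("batch_date") }}'::date
  and updated_at < '{{ var("batch_date") }}'::date + interval '1 day'

{% endsnapshot %}
```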
-
Randomly came back here and couldn't believe no one has called out the fact that we don't yet have a good CI workflow for snapshots. This one always catches folks out, because they have some logic that varies the generated schema name by environment, like this:
```sql
-- macros/gsn.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if target.name == 'prod' -%}
        {{ custom_schema_name }}
    {%- elif target.name == 'ci' -%}
        {{ default_schema }}
    {%- else -%}
        {# fallback for other targets, e.g. dev #}
        {{ default_schema }}
    {%- endif -%}
{%- endmacro %}
```
```sql
-- snapshots/snappy.sql
{% snapshot snappy %}

{{
    config(
        target_schema=generate_schema_name('snapshots'),
        unique_key='id',
        strategy='check',
        check_cols='all'
    )
}}

select 1 as id, 'alice' as first_name

{% endsnapshot %}
```
The outcome of something like this is that snapshots are then always marked `state:modified` in CI runs, and because snapshots are usually at the start of the DAG, users are always confused as to why so many models are being executed in their CI run.
(The pattern above is adapted from https://discourse.getdbt.com/t/using-dynamic-schemas-for-snapshots/1070/5.) In any case, snapshots are usually snapping data that is ephemeral or changing without history; that in itself makes it difficult to have a CI workflow, unlike a normal model.
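One partial mitigation I've seen (assuming a dbt version that supports the `resource_type` selection method) is to exclude snapshots from stateful CI jobs entirely, e.g. `dbt build --select state:modified+ --exclude resource_type:snapshot`, so the false-positive `state:modified` flag on snapshots doesn't fan out across the DAG.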
-
Jumping into the discussion to add one suggestion/idea - sorry if this was already discussed; I couldn't find it. One complaint I hear from time to time is that you can't create a snapshot if the source has duplicates. Imagine we have a source that already contains multiple historical rows per unique key, each with its own update timestamp.

It would be nice if the snapshot, in its first run, could read this source and build the full history from it - one snapshot row per source version.

We could have something similar to incremental models, where the first run is handled differently from subsequent runs.

It would only work for the timestamp strategy, because the snapshot must know what is older and what is newer (see the sketch below). Just an idea - maybe there are other ways to do it - but it is a shame that we can't create a snapshot on sources that already have some history. By the way, loved that snapshots are in the spotlight! 🧡
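For what it's worth, that first-run history could plausibly be derived with a window function. A rough sketch, assuming a hypothetical source with `id`, `status`, and `updated_at` columns:

```sql
-- Hypothetical first-run logic: turn pre-existing duplicates into history,
-- deriving validity ranges from the updated_at timestamps.
select
    id,
    status,
    updated_at as dbt_valid_from,
    lead(updated_at) over (partition by id order by updated_at) as dbt_valid_to  -- null = current version
from {{ source('app', 'orders') }}
```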
-
Problem: dbt snapshots build in prod, even when people are working in dev.

The rationale: https://docs.getdbt.com/faqs/Snapshots/snapshot-target-schema

Thoughts: This never made sense to me. If you need prod-like data in your dev schema, you can clone or copy it from prod yourself.

Solution: We need to consider backwards compatibility. Abruptly changing the default behavior would be disruptive.
-
I don't have time to read this thread today, sadly, but I have a design pattern (or two) relying on the current timestamp behavior. I hope sharing it sooner rather than later is useful. I presented it at a Chicago Meetup earlier this year. Slide deck. Example code.
-
At one client, we had to modify the snapshot macro due to this:
-
I asked my network for suggestions and here is what they said:
Again, thanks for opening this discussion!
-
I've seen an antipattern where incremental models are used when snapshots should be used instead. Since incremental models act so much like "upserts", it is very tempting to use them to accumulate data over time (with the intention of never fully refreshing them). But using incremental models in this way breaks idempotency, and this is where snapshots should be used. I wonder if, in the same way that model versions get a view reflecting the current version, every snapshot could also get an auto-created view offering a more user-friendly way to query the accumulated data - functionally what the incremental model would look like: just the most recent version of each row.
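For instance, a minimal sketch of such an auto-created "current" view, written here as an ordinary dbt model (the snapshot name is hypothetical):

```sql
-- Only the latest version of each row: open records have dbt_valid_to = null.
{{ config(materialized='view') }}

select *
from {{ ref('my_snapshot') }}
where dbt_valid_to is null
```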
-
Some things people have run into:

- Since `updated_at` is configurable, it is possible to run snapshots out of order (as seen here).
- `updated_at` can be a data type that is "aware" while that given by the `snapshot_get_time()` macro is "naive" (like described here), which can lead to overlapping data.
  - What if the `snapshot_get_time()` macro always gave a data type that was the most rich/aware available within the database?
  - What if users could override the value used for `snapshot_get_time()`? Maybe a config named `snapshot_time` that can be an expression or a call to a macro (that yields an expression)? See `hard_deletes_updated_at` below for an alternative.
- Customizing the timestamp that dbt records (via the `updated_at` config) is possible for the `check` strategy, but it is not possible for the `timestamp` strategy. Rather, the only option for the `timestamp` strategy is to override the `snapshot_get_time()` macro, which applies to all snapshots.
  - What if there was a `hard_deletes_updated_at` configuration? See `snapshot_time` above for an alternative.

Naive vs. Aware timestamps
One of the trickiest things that can happen within snapshots is when there are naive timestamps (rather than aware). In those cases, we need clear ways for the user to configure a "mutual agreement" with the dbt system on how to interpret those timestamps when they are actively involved in the snapshot configuration. Up until this point, the approach has been that naive timestamps must be UTC, and there is no way to specify anything else. Some users may want to use an adjacent column that contains the relevant UTC offset or time zone, or to configure the time zone globally for the dbt project (like this) or specifically for one model.
Comparison of `current_timestamp` and `snapshot_get_time` macros across adapters
There are 4 different data types for timestamps observed across databases and programming languages. Java 8’s Time API (JSR-310) provides exemplars of each type (ordered here from most to least precise):

- ZonedDateTime - a point in time plus a full time zone region
- OffsetDateTime - a point in time plus a UTC offset
- Instant - a point in time in UTC
- LocalDateTime - a "naive" date-time with no time zone or offset

Snowflake

Snowflake has 3 of the 4 types (but not the most precise one): TIMESTAMP_TZ ("aware", with an offset), TIMESTAMP_LTZ ("aware", displayed in the session time zone), and TIMESTAMP_NTZ ("naive").

Postgres

Postgres has 2 of the 4 types (one "aware" and one "naive"): timestamptz and timestamp.

Redshift

Redshift has 2 of the 4 types (one "aware" and one "naive"): TIMESTAMPTZ and TIMESTAMP.

BigQuery

BigQuery has 2 of the 4 types (one "aware" and one "naive"): TIMESTAMP and DATETIME.

Spark

Spark only has 1 of the 4 types (and it is "aware"): TIMESTAMP.