
[CT-3252] [Feature] Add a way to refresh Snapshots in the database #8885

Closed
3 tasks done
Tracked by #10151
Josersanvil opened this issue Oct 24, 2023 · 6 comments
Labels
enhancement (New feature or request), snapshots (Issues related to dbt's snapshot functionality), wontfix (Not a bug or out of scope for dbt-core)

Comments

@Josersanvil

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Introduce a --full-refresh option specifically for snapshot models, analogous to the existing --full-refresh feature for incremental models. This would enable rebuilding snapshots from scratch by dropping the existing target table in the database and re-executing the snapshot query.

Use cases:

  • There may be scenarios where one would want to change the snapshot strategy. A full-refresh would facilitate alignment with the new strategy without causing data inconsistencies.
  • Adding or removing columns from the original snapshot query is another case where a full refresh would be useful to ensure the new rows don't deviate significantly from the historical data.

Just like with incremental models, using full refresh could mean that all historical data in the snapshot table would be permanently deleted. However, there are use cases where such a refresh is not only acceptable but necessary for maintaining data integrity.

Describe alternatives you've considered

The only way today is to drop the tables outside dbt; however, this is not very efficient if you want to refresh multiple snapshots, since it requires running a separate command against the database for each table. With this option one could run something like
dbt snapshot --select my-folder --target my-target --full-refresh and dbt would handle rebuilding all the snapshot tables.

Who will this benefit?

No response

Are you interested in contributing this feature?

Happy to contribute if I'm given some pointers on where to look.

Anything else?

No response

@Josersanvil added the enhancement and triage labels on Oct 24, 2023
@github-actions bot changed the title from "[Feature] Add a way to refresh Snapshots in the database" to "[CT-3252] [Feature] Add a way to refresh Snapshots in the database" on Oct 24, 2023
@dbeatty10 added the snapshots label on Oct 24, 2023
@alison985

What I don't understand about this request is how dbt would know what the source data was on 11/1/22, 11/2/22, 11/3/22, and so on, so that it could re-compute the snapshot from a start timestamp (e.g. 11/1/22) through the current timestamp.

Ignorable thoughts:

  • I thought snapshots already handled adding and removing columns fine as long as you use select *? (Yes, that breaks the don't-use-select-* rule.)
  • FWIW, rebuild-a-snapshot would, by my understanding, basically be dbt clone for snapshots. If so, [CT-3162] [Feature] Ability to dbt clone sources #8722 may be worth considering in conjunction. Me citing Dave again: 🎶 A node should be a node should be a node. 🎶
  • If you haven't already seen it, Christine Dixon's "Could you defend your data in court?" talk at Coalesce this year was incredible. I finally have a place to send people to learn these concepts. It's worth seeing regardless, though it does briefly touch on this rebuild-a-snapshot concept. It's on the Coalesce 2023 website, or it will be available on YouTube two weeks after the conference.

@Josersanvil
Author

Josersanvil commented Oct 25, 2023

@alison985 Not sure what you mean by:

re-compute the snapshot from a start timestamp (e.g. 11/1/22) through the current timestamp.

This request is not about re-computing the snapshot from a start timestamp. It's about rebuilding the snapshot while disregarding the previously snapshotted data, by basically dropping the table and re-creating it just as when you run dbt snapshot for the first time (very similar to how full_refresh works for incremental models).

As I mentioned above, users of this feature should be aware that it can potentially delete important historic data if not used correctly, just as is currently the case with incremental models. However, if your snapshot definition was flawed from the start, then it makes sense to want to start from a clean slate.

About the points you suggested:

  • As a matter of fact, snapshots do not support removing columns, nor changing the data type of a column (understandably). You can read about it here.
  • From my understanding, dbt clone just copies a table to a target schema; I'm not sure how that would solve the issue of refreshing a snapshot. You could also change the target schema of the snapshot so it builds from scratch in a new place, but this might affect consumers (outside dbt), so performing a full refresh still makes sense in a lot of cases.
  • Thanks for the recommendation about the Coalesce talk; I will definitely take a look.

@alison985

As I mentioned above, users of this feature should be aware that it can potentially delete important historic data if not used correctly

I apologize; I missed reading that part. As long as that's okay in a given scenario, then this makes sense. (I always touch accounting data, and it would be a no, no, no go there. At least, not without a full archive and extra audit work.)

  • As a matter of fact, snapshots do not support removing columns, nor changing the data type of a column (understandably). You can read about it here.

Yes, a snapshot doesn't ever remove columns, which is what you'd want in a snapshot since its purpose is to not lose data. I admit, I did think a column type change would also create a new column (with a number appended?) if it was anything other than a cast to varchar, so thank you for helping teach me! That makes a huge difference, and the maintenance overhead to handle it makes me cringe. Off the top of my head, you'd have to turn off running the snapshot, rename the existing snapshot table, give the new snapshot table a second new name, turn snapshotting back on from there, then have a model with the original name of the snapshot that unions the old and new snapshot tables together, then manually update the original snapshot table's timestamps so there's no lost time between the old snapshot and the new one. Wow! I feel like filing a feature request about handling this.
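
To make that workflow concrete, here is a rough sketch of the union model it describes. The names my_snapshot_legacy (the renamed old snapshot table, registered as a source) and my_snapshot_v2 (the new snapshot) are hypothetical, and the cast assumes the type change was to varchar; this is an illustration, not a tested recipe.

```sql
-- models/my_snapshot.sql (hypothetical model reusing the original snapshot's name)
-- Stitches the renamed legacy snapshot onto the new snapshot so downstream
-- consumers keep selecting from a single relation.
select
    id,
    cast(amount as varchar) as amount,  -- align the changed column type with the new snapshot
    dbt_valid_from,
    dbt_valid_to
from {{ source('legacy', 'my_snapshot_legacy') }}

union all

select
    id,
    amount,
    dbt_valid_from,
    dbt_valid_to
from {{ ref('my_snapshot_v2') }}
```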

  • From my understanding, dbt clone just copies a table to a target schema; I'm not sure how that would solve the issue of refreshing a snapshot. You could also change the target schema of the snapshot so it builds from scratch in a new place, but this might affect consumers (outside dbt), so performing a full refresh still makes sense in a lot of cases.
  • Just for knowledge: a dbt clone (doc link) can sometimes just be a select * view, which wouldn't be what you want here.
  • The page mentions "blue/green continuous deployment (on data warehouses that support zero-copy cloning tables)", so to your point it could be used, in some circumstances, to build the refreshed snapshot in a new place. Then, to oversimplify, "flip" the naming so the new snapshot keeps the same name, which would presumably solve for the affected downstream consumers. You could also do one-time table renames if you don't have a "zero-copy cloning" database.
  • (Note: I don't think any alternative built around versions of snapshots to handle the downstream issue would be a good idea. That would be a tremendous amount of space and complexity. Also, the point of snapshots is to keep all the data in the same place.)

I now understand your request @Josersanvil. Thank you for helping clarify it for me. There are definitely valid use cases that make this feature request worth considering.

@dbeatty10
Contributor

Thanks for opening this @Josersanvil!

And thanks for the additional thoughts and insights @alison985. "Could you defend your data in court?" was an amazing talk 🤩

Not planned

After discussing this with @graciegoheen, @jtcohen6, and @dataders, we're going to close this as "not planned".

We recognize that there are situations where one might need to start afresh; however, we actually want to keep some barriers here to ensure that it's not done by mistake.

Recommended approaches

We'd encourage you to choose one of these options (listed in no particular order):

  1. Manually execute a drop table ... statement in your database via your preferred tool
  2. Rename your snapshot model to something else. The renamed snapshot will start over from scratch, but you'll still retain the old data in a table in the database until you drop it (in case you want to go back to it at a later date).
  3. Write a macro that performs a drop table ... and then execute it with run-operation (a rough sketch follows below)
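
A minimal sketch of option 3, using the adapter.get_relation and adapter.drop_relation methods from dbt's Jinja context; the macro name drop_snapshot_table and its arguments are hypothetical, so adjust the schema and identifier handling for your project and adapter.

```sql
-- macros/drop_snapshot_table.sql (hypothetical file and macro name)
{% macro drop_snapshot_table(schema_name, table_name) %}
    {# Look up the snapshot's relation in the current target database #}
    {% set relation = adapter.get_relation(
        database=target.database,
        schema=schema_name,
        identifier=table_name
    ) %}
    {% if relation is not none %}
        {% do log("Dropping snapshot table " ~ relation, info=True) %}
        {% do adapter.drop_relation(relation) %}
    {% else %}
        {% do log("No table found at " ~ schema_name ~ "." ~ table_name, info=True) %}
    {% endif %}
{% endmacro %}
```

It could then be invoked with something like dbt run-operation drop_snapshot_table --args '{"schema_name": "snapshots", "table_name": "my_snapshot"}', and the next dbt snapshot run would rebuild the table from scratch.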

@dbeatty10 closed this as not planned on Nov 1, 2023
@dbeatty10 added the wontfix label and removed the triage label on Nov 1, 2023
@dbeatty10 removed their assignment on Nov 1, 2023
@walker-philips

@dbeatty10

While preventing dropped data is a desirable barrier, I have had scenarios where snapshot data was over-snapshotted (values not really changing but a new row snapshotted anyway), where the columns involved in a check strategy had to change to adapt to new business logic, or where a weird scenario left the dbt timestamps out of alignment.

Could there be an approach that instead runs a lag/lead query across the existing snapshot's data, one that merely updates the dbt valid_from/valid_to values and condenses any repeated filler between them based on the new check strategy?
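
Loosely sketched, and assuming a hypothetical snapshot my_snapshot with key id and a single checked column status, that condensing pass might look something like this (is distinct from is available on most, but not all, warehouses):

```sql
-- Sketch only: drop rows whose checked column did not actually change,
-- then rebuild dbt_valid_to from the rows that remain.
with ordered as (
    select
        *,
        lag(status) over (partition by id order by dbt_valid_from) as prev_status
    from {{ ref('my_snapshot') }}
),

condensed as (
    select *
    from ordered
    where prev_status is null                    -- first version of each id
       or status is distinct from prev_status    -- value actually changed
)

select
    id,
    status,
    dbt_valid_from,
    lead(dbt_valid_from) over (partition by id order by dbt_valid_from) as dbt_valid_to
from condensed
```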

A bad check strategy on a highly dynamic table could inadvertently spawn millions of essentially duplicated rows. Storage is cheap, but ....

Also, such a feature could potentially allow seamless backfilling into a snapshot table. Say you dig up some old Excel files of business data and transform them to match the schema of the snapshot table. You could insert these rows into the snapshot and still have linear dbt valid_from/valid_to values.

I have a working approach to introducing backfills into snapshot tables; it's not exactly pretty, but I'm still happy to share it.

@dbeatty10
Contributor

Sounds interesting @walker-philips -- could you open up a new feature request for the scenario you are describing?
