
snapshots should handle hard deletes #249

Closed
jthandy opened this issue Dec 13, 2016 · 18 comments · Fixed by #2749
Labels
enhancement (New feature or request), snapshots (Issues related to dbt's snapshot functionality)

Comments

@jthandy
Member

jthandy commented Dec 13, 2016

Currently, dbt archive stores a series of state changes for records in covered tables, but it does not store a state change for deleted records. Hard deletes in an OLTP database will likely be one of the primary use cases for archival, and we therefore want to make sure that we're capturing record deletion as a state change.

@jthandy jthandy added the enhancement New feature or request label Dec 13, 2016
@jthandy jthandy changed the title dbt archive should handle deletes archival should handle hard deletes Dec 13, 2016
@jthandy
Member Author

jthandy commented Dec 13, 2016

this might need to be a completely separate run through the table, which would eat up additional resources. if that turns out to be true, we should allow people to opt out of this in configuration if they know the source data never performs deletes.

@jthandy jthandy modified the milestone: Archival Updates Dec 13, 2016
@drewbanin
Contributor

let me have a think about this. definitely a good idea

@drewbanin drewbanin added the help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors label Feb 24, 2017
@jthandy jthandy removed the help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors label Mar 17, 2017
@drewbanin drewbanin modified the milestone: 0.8.2 Release May 19, 2017
@HarlanH

HarlanH commented Mar 7, 2018

Just a note that this is a big problem for us, for some of our tables, but the issue isn't likely to be solvable by dbt archive. One example is a table with checklist preferences per user, 10 million rows or so, I think. If we use our usual ELT pattern, this would require us to copy that entire table into our warehouse, multiple times per day, which is prohibitively expensive to do with Stitch or whatever.

To get fully reliable data, we'll almost certainly have to solve the problem upstream, either by converting the table to soft deletes, or adding an audit table of deleted rows so we can do an anti-join later.

@drewbanin drewbanin added the snapshots Issues related to dbt's snapshot functionality label Jul 11, 2018
@drewbanin drewbanin added this to the Wilt Chamberlain milestone Nov 28, 2018
@drewbanin
Contributor

drewbanin commented Dec 5, 2018

@jthandy is this something you're still interested in? My instinct is that there are many different ways for hard deletes to be represented in the warehouse. Depending on your replication, you might see:

  • a __deleted_at column value from your ETL tool
    • you wouldn't need to capture this with archival
  • the record never gets updated again
    • there's no way to know that a hard-delete happened in the source
  • the record is actually deleted from the table in the warehouse
    • we could determine this in archive, but I'm unsure how common this paradigm is

Lmk if you have a different idea of how this would work

@jthandy
Member Author

jthandy commented Dec 6, 2018

@drewbanin it wasn't super-clear in my original issue writeup, but I meant hard deletes in the table on the warehouse side. this can occur for two potential reasons:

  1. the table is fully re-replicated from scratch from the source data on a periodic basis and there are hard deletes on the source side, so records disappear from the table within the warehouse.
  2. the table is incrementally replicated using binlogs and the deletes are actually propagated. while this isn't something that the ETL providers we use do (instead they would set a deleted flag), it's certainly possible that an ETL pipeline would be designed this way.

I don't think this is an issue with burning criticality, but I have personally run into this a couple of times in client work. if we're touching archive right now we should consider this for inclusion. I won't be offended if we decide not to implement it, but if we don't implement it then we should probably #wontfix it.

@drewbanin
Contributor

A regular archival job coupled with a join back to the source table will be able to detect hard deletes that happen on the warehouse side. Might be worth revisiting this after archival blocks are fully implemented. The new archival materialization will work with pluggable "strategies", and this could very well be one of them
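
For the curious, a minimal sketch of what that join back might look like (Postgres-flavored; the snapshot table, source table, and `order_id` key are all invented for illustration):

```sql
-- Close out any still-open snapshot rows whose key no longer exists in the source.
-- analytics.orders_snapshot, raw.orders, and order_id are placeholder names.
update analytics.orders_snapshot as snap
set dbt_valid_to = current_timestamp
where snap.dbt_valid_to is null          -- only rows that are currently "open"
  and not exists (
      select 1
      from raw.orders as src
      where src.order_id = snap.order_id -- join back on the snapshot's unique key
  );
```

In a real snapshot strategy this would presumably be folded into the statements dbt already generates, rather than run as a separate update.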

@alangsbo

Hi!

Pitching in from a question I asked in the #support Slack channel: https://getdbt.slack.com/archives/C2JRRQDTL/p1563527481312500

I believe this feature would add great value in certain use cases. I've been involved in multiple cases where this would solve a common problem (and it would also make the case for using dbt stronger, since the platform would handle this instead of the extract processes).

I've built many analytic platforms on top of ERP systems (often industry-specific niche ERPs). A common process nowadays is to extract/replicate the source systems on a daily basis and build downstream analytic models from those replicas (perfect use cases for dbt...). Often there is a need for history tracking of source-system objects (orders, customers, actions), including tracking hard deletes, since the source systems more often than not do not track history.

The data volumes I've come across in these cases are small to moderate. For tracking hard deletes you'll have to replicate the entire source table and compare it to what is already loaded, so it might get tough with huge amounts of data. However, in these cases, full extracts/replicas are made anyway.

So I'll definitely vote for further investigation of this issue :) Although I agree with jthandy above that it might not be a burning issue.

@drewbanin drewbanin reopened this Jul 22, 2019
@drewbanin
Contributor

Thanks for following up @alangsbo!

I'm imagining this query taking some sort of time range as an argument. If a given record was last loaded more than X hours/days before the most recent loaded_at date, then dbt can consider it "deleted". Is that approximately what you had in mind?
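
Something roughly like this is what I'm picturing (sketch only; `raw.customers`, `customer_id`, `loaded_at`, and the 24-hour window are all placeholders, and the interval syntax is Postgres-flavored):

```sql
-- Keys whose latest loaded_at lags the newest load by more than the window
-- are treated as hard-deleted. All names and the interval are placeholders.
with latest_per_key as (
    select customer_id, max(loaded_at) as last_loaded_at
    from raw.customers
    group by customer_id
),
high_water_mark as (
    select max(loaded_at) as max_loaded_at
    from raw.customers
)
select k.customer_id
from latest_per_key k
cross join high_water_mark h
where k.last_loaded_at < h.max_loaded_at - interval '24 hours'
```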

@alangsbo

I was thinking about a simple join back to the landed/replicated table from the source system, assuming that table gets dropped and reloaded every day so that deleted rows are simply missing. If the backwards join on the specified key columns does not yield any rows, the record is deleted.

I'm guessing your suggestion assumes some kind of persistence of rows in the landing area; that would also be a good solution, where the history would be kept in the landing layer as well.

So maybe both? :)

Br/A

@drewbanin
Contributor

Ah! Ok - most of the ETL tools we use here will upsert new records into source tables. I hadn't thought about the drop + reload approach. I think that a join from the snapshot table back to the source table is a good approach here.

@drewbanin drewbanin changed the title archival should handle hard deletes snapshots should handle hard deletes Jul 29, 2019
@drewbanin drewbanin removed this from the Wilt Chamberlain milestone Jul 29, 2019
@drewbanin
Contributor

I had cause to play around with some approaches this evening. There's a world where this is a supported snapshot strategy, but it will require some material investments in how those strategies work. Check it out here if you're interested: https://github.com/fishtown-analytics/dbt/compare/feature/snapshot-hard-deletes?expand=1

@davehowell

@drewbanin I had a look at that world, it looks good.

I have one tool that subscribes to a postgres WAL replication slot - equivalent to MySQL binlog - and loads to Redshift replicating all transactions using a delete + insert strategy. It does that because upserts are not a real thing in RS and behind the scenes an update involves marking pages dirty, inserting new pages and waiting for vacuum to clean up the mess, so we cut to the chase.

So yeah we do exactly what Tristan described here:

  2. the table is incrementally replicated using binlogs and the deletes are actually propagated. while this isn't something that the ETL providers we use do (instead they would set a deleted flag), it's certainly possible that an ETL pipeline would be designed this way.

I personally think that soft-deletes are the best option all round, but failing that, what those other ETL providers do is very sensible, i.e. construct a soft-delete flag or, ideally, a deleted_at timestamp. It implies adding that field to every table. In retrospect we should have implemented the "constructed deleted_at"; however, we are in control of the upstream database, so we have applied soft-deletes to all the tables. Not everyone is in that position.

As an absolute last resort, the left join from snapshot to source works fine, and the strategy does work for both fully reloaded tables as well as propagated deletes.

I think a worse position to be in, and where this can't help you, is to have a naive ETL tool that can't replicate hard deletes because they can't be detected using watermarks. The only solutions to that are soft-deletes, full table reloads, or switching to a transaction log shipping tool.

The other challenge with deletes, hard or soft, is propagating them up a chain of incremental dbt models that aggregate above the deleted-row granularity. For example, if you incrementally aggregate to a day level and some timestamped rows are subsequently deleted, how do you flush the deleted contribution out of those aggregates? Naively it implies periodically running a --full-refresh, which is less than ideal given the main use case for incremental models - complex, long-running queries. A more complicated but targeted way of handling it might be to track deletes at the aggregation level, e.g. which days contain deleted rows, and then reload just those days, similar to reloading specific partitions.
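
To make the "reload only the affected days" idea concrete, here's a hedged sketch of an incremental dbt model; the `app.events` source, the `deleted_events` audit model, and the column names are all hypothetical:

```sql
-- Daily aggregate that reprocesses (a) new days and (b) days known to contain
-- hard-deleted rows, per a hypothetical deleted_events audit model.
{{ config(materialized='incremental', unique_key='event_day') }}

select
    date_trunc('day', event_ts) as event_day,
    count(*)                    as event_count
from {{ source('app', 'events') }}
{% if is_incremental() %}
where date_trunc('day', event_ts) >= (select max(event_day) from {{ this }})
   or date_trunc('day', event_ts) in (select event_day from {{ ref('deleted_events') }})
{% endif %}
group by 1
```

With a unique_key on the day, the affected days get replaced rather than double-counted on each run.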

@jrandrews

We are definitely running into this scenario. We do regular full reloads from our source systems into our warehouse. Our source systems either don't support tracking history on source data, or for whatever reason the feeds we are getting just query the current state of various entities rather than tracking history. They are SaaS platforms which we query via APIs -- we don't have access to anything like SQL querying or the underlying transaction logs. The volumes of data aren't very large - usually somewhere between a thousand and tens of thousands of rows - so it's entirely practical to do a truncate-and-load of the entities from the various source systems at least daily.

We are creating the tables in our warehouse (BigQuery in this case), configuring them as sources in dbt, and then snapshotting the sources. But it would be great to have a comparison against the source data so that, if a row disappears, the current "valid" record in the snapshot gets end-dated to reflect the hard delete.

We don't have soft-deletes from our current source systems, but I can also see the need for differing strategies for handling a delete in the snapshot: detecting a row that has disappeared versus looking for a soft-delete flag. Probably also a strategy that doesn't handle deletions at all. So maybe these options?

  1. Default "Detect source hard-delete" -- A row deletion in the source table that the snapshot points to, based on a unique key, marks the corresponding row in the snapshot as end-dated without creating a new row.
  2. Alternate Option #1 "Detect source soft-delete via timestamp" -- Specify a column containing a timestamp in the source data marking when the row was deleted. Although this could get complicated, because it might impact not just the last record in the string of records associated with a particular unique id. In order to make this idempotent, however, if a new record with the same key/id appeared, we would probably need to create yet another record with a new "dbt_valid_from" timestamp.
  3. Alternate Option #2 "Detect source soft-delete via flag" -- Specify a column containing a flag, maybe a binary column, which indicates whether a given row has been soft-deleted (see the sketch below).
  4. Alternate Option #3 "Ignore soft deletes" -- Don't detect missing/deleted rows in the source table that the snapshot is pointed at in any way.

In the snapshot itself for any of these options, of course the "deletes" should always be soft. We would never want to actually hard-delete a record from the snapshot per se.
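
For what it's worth, the "detect source soft-delete via flag" option is close to something the existing check strategy can already approximate, with the caveat that dbt records the deletion as a new snapshot row rather than only end-dating the old one. A rough sketch, with the snapshot name, source, and columns invented:

```sql
{% snapshot users_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='check',
        check_cols=['email', 'status', 'is_deleted']
    )
}}

-- Hypothetical source with a soft-delete flag. When is_deleted flips,
-- dbt closes the current snapshot row and inserts a new one reflecting
-- the deleted state.
select id, email, status, is_deleted
from {{ source('app', 'users') }}

{% endsnapshot %}
```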

@mikaelene
Contributor

If you can choose, of course soft deletes are preferable. My client works in a very regulated business; in our use case we are not allowed to store data of "unsubscribed" users. We would very much like the hard-delete snapshots, where a delete in the source wipes all history of a user. I guess the use cases are all different, which makes this harder to implement.

@elikastelein
Contributor

We've been running into this same scenario at GitLab too. We have custom ETL dumping postgres data into snowflake, but the application does constant hard deletes. My current workaround is to keep all of the historical ETL data (3 runs per day) and parse through all of the historical runs in dbt to figure out when a record was deleted. This is not very scalable as the tables are getting very large.

It'd be amazing if snapshots could support this. What @jrandrews describes in option 1 is what I'd be interested in. If a record is included in Airflow jobs A,B,C but not D, I would want valid_to for that snapshot row to be the timestamp of job C.
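
In case it's useful, the "last job that contained the record" logic over the retained runs could look roughly like this (a sketch only; `raw.users_history`, `id`, and `run_ts` are invented names, assuming one row per run per id is kept from each ETL run):

```sql
-- For ids missing from the most recent run, valid_to is the timestamp of the
-- last run that still contained them (job C in the A/B/C/D example).
with last_seen as (
    select id, max(run_ts) as last_seen_at
    from raw.users_history
    group by id
),
latest_run as (
    select max(run_ts) as max_run_ts
    from raw.users_history
)
select s.id, s.last_seen_at as valid_to
from last_seen s
cross join latest_run r
where s.last_seen_at < r.max_run_ts
```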

@davehowell

@CaptainEli I'd love to hear more about what you do with custom ETL from postgres to snowflake. This is the wrong forum though, can I pm you? I'm currently migrating our (similar) custom tool from targeting Redshift to now targeting Snowflake.

The more I think about it, the less I think dbt should be responsible for this. Further to my comments on propagating deletes through a dependency chain of aggregated tables, I think for historical aggregates it is both unnecessary and wrong. For example, if you have users/customer records that you are obliged to delete for PII reasons, the fact that the records are now gone doesn't mean they weren't valid records at the time the aggregates were calculated. Better than hard deleting is scrubbing the PII, if reproducing old aggregate calculations is important.

This is not a new problem in the sphere of data warehousing; slowly-changing-dimensions is an old Kimball idea, and data-vault methodology handles this very well.

@elikastelein
Contributor

@davehowell yea for sure or you can email me at ekastelein@gitlab.com. I didn't build it, but the ETL code lives here https://gitlab.com/gitlab-data/analytics/-/tree/master/extract/postgres_pipeline

@pempey

pempey commented Mar 19, 2020

I came across source-system hard deletes with one of my clients, and I am currently in the process of manually checking for and adding rows for the deleted source rows. Having dbt be able to handle this would definitely be an improvement for me in this use case.

  1. Default "Detect source hard-delete" -- A row deletion in the source table that the snapshot points to, based on a unique key, marks the corresponding row in the snapshot as end-dated without creating a new row

@drewbanin drewbanin added this to the Marian Anderson milestone May 13, 2020
@jtcohen6 jtcohen6 removed this from the Marian Anderson milestone Jul 6, 2020