
Include hard-deletes when making snapshot #2749

Merged

Conversation

@joelluijmes
Contributor

joelluijmes commented Sep 11, 2020

Continues the work of the closed PR #2355, which resolves #249.

Description

The idea is to select ids from the snapshotted table which no longer exist in the source table, and update those rows to set dbt_valid_to to the current timestamp, marking them as hard-deleted.
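In plain SQL, that boils down to something like the sketch below (the table and column names here are illustrative, not the actual macro code):

    -- close out snapshot rows whose key has disappeared from the source
    update my_snapshot
    set dbt_valid_to = current_timestamp
    where dbt_valid_to is null                     -- only still-current rows
      and id not in (select id from my_source);   -- key no longer in source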

I haven't written any integration tests yet, as I'm having difficulty running them locally; any pointers on how these work would be appreciated. Running the Postgres integration tests locally currently fails, but I'm unsure why.

scheduling tests via LoadScheduling

test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestSimpleSnapshotFiles::test__postgres_ref_snapshot 
test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestSimpleSnapshotFiles::test__postgres__simple_snapshot 
test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestCustomSnapshotFiles::test__postgres__simple_custom_snapshot 
test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestSimpleColumnSnapshotFiles::test_postgres_renamed_source 
[gw1] PASSED test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestSimpleSnapshotFiles::test__postgres_ref_snapshot 
test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestNamespacedCustomSnapshotFiles::test__postgres__simple_custom_snapshot_namespaced 
[gw2] FAILED test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestSimpleColumnSnapshotFiles::test_postgres_renamed_source 

============================================================================================ FAILURES =============================================================================================
___________________________________________________________________ TestSimpleColumnSnapshotFiles.test_postgres_renamed_source ____________________________________________________________________
[gw2] linux -- Python 3.6.9 /usr/app/.tox/integration-postgres-py36/bin/python

self = <test_simple_snapshot.TestSimpleColumnSnapshotFiles testMethod=test_postgres_renamed_source>

    @use_profile('postgres')
    def test_postgres_renamed_source(self):
>       self._run_snapshot_test()

test/integration/004_simple_snapshot_test/test_simple_snapshot.py:158: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/integration/004_simple_snapshot_test/test_simple_snapshot.py:135: in _run_snapshot_test
    self.run_dbt(['snapshot', '--vars', '{seed_name: seed_newcol}'])
test/integration/base.py:578: in run_dbt
    "dbt exit state did not match expected")
E   AssertionError: False != True : dbt exit state did not match expected
-------------------------------------------------------------------------------------- Captured logbook call --------------------------------------------------------------------------------------
[DEBUG] dbt: Acquiring new postgres connection "__test".
[DEBUG] dbt: Acquiring new postgres connection "__test".
[DEBUG] dbt: test connection "__test" executing: DROP SCHEMA IF EXISTS "test15998113637441561482_simple_snapshot_004" CASCADE
[DEBUG] dbt: Opening a new connection, currently in state init
[DEBUG] dbt: On __test: Close
[DEBUG] dbt: Acquiring new postgres connection "__test".
[DEBUG] dbt: Acquiring new postgres connection "__test".
[DEBUG] dbt: test connection "__test" executing: CREATE SCHEMA "test15998113637441561482_simple_snapshot_004"
[DEBUG] dbt: Opening a new connection, currently in state closed
[DEBUG] dbt: On __test: Close
[INFO] dbt: Invoking dbt with ['--strict', '--test-new-parser', 'seed', '--profiles-dir', '/tmp/dbt-int-test-1dn56bze', '--log-cache-events']
[INFO] dbt: Invoking dbt with ['--strict', '--test-new-parser', 'snapshot', '--profiles-dir', '/tmp/dbt-int-test-1dn56bze', '--log-cache-events']
[DEBUG] dbt: Acquiring new postgres connection "__test".
[DEBUG] dbt: test connection "__test" executing: select * from dbt.test15998113637441561482_simple_snapshot_004.my_snapshot
[DEBUG] dbt: Opening a new connection, currently in state init
[DEBUG] dbt: On __test: Close
[INFO] dbt: Invoking dbt with ['--strict', '--test-new-parser', 'snapshot', '--vars', '{seed_name: seed_newcol}', '--profiles-dir', '/tmp/dbt-int-test-1dn56bze', '--log-cache-events']
[DEBUG] dbt: Acquiring new postgres connection "__test".
[DEBUG] dbt: Acquiring new postgres connection "__test".
[DEBUG] dbt: test connection "__test" executing: DROP SCHEMA IF EXISTS "test15998113637441561482_simple_snapshot_004" CASCADE
[DEBUG] dbt: Opening a new connection, currently in state closed
[DEBUG] dbt: On __test: Close
[DEBUG] dbt: Connection '__test' was properly closed.
===================================================================================== slowest test durations ======================================================================================
9.03s call     test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestSimpleColumnSnapshotFiles::test_postgres_renamed_source
6.44s call     test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestSimpleSnapshotFiles::test__postgres_ref_snapshot

(0.00 durations hidden.  Use -vv to show these durations.)
===================================================================================== short test summary info =====================================================================================
FAILED test/integration/004_simple_snapshot_test/test_simple_snapshot.py::TestSimpleColumnSnapshotFiles::test_postgres_renamed_source - AssertionError: False != True : dbt exit state did not m...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================================================= 1 failed, 1 passed in 60.78s (0:01:00) ==============================================================================
ERROR: InvocationError for command /bin/bash -c '/usr/app/.tox/integration-postgres-py36/bin/python -m pytest --durations 0 -v -m profile_postgres -s -x -m profile_postgres test/integration/004_simple_snapshot_test/test_simple_snapshot.py -n4 test/integration/*' (exited with code 2)
_____________________________________________________________________________________________ summary _____________________________________________________________________________________________
ERROR:   integration-postgres-py36: commands failed

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot (bot) added the cla:yes label Sep 11, 2020
@beckjake
Contributor

That error message usually means that the command failed when dbt expected it to succeed.

A good way to run just that test is to use the -k flag. Here's what I would run for this case:

docker-compose run --rm test tox -e explicit-py36 -- -s -m profile_postgres -k test_postgres_renamed_source test/integration/004_simple_snapshot_test/test_simple_snapshot.py

That will run just the failing test and spit the full output to stdout instead of capturing it, which will probably be more informative given the error.

@joelluijmes
Contributor Author

That error message usually means that the command failed when dbt expected it to succeed.

A good way to run just that test is to use the -k flag. Here's what I would run for this case:

docker-compose run --rm test tox -e explicit-py36 -- -s -m profile_postgres -k test_postgres_renamed_source test/integration/004_simple_snapshot_test/test_simple_snapshot.py

That will run just the failing test and spit the full output to stdout instead of capturing it, which will probably be more informative given the error.

Thanks! The log showed that the disk was full :)

Anyhow, now that I'm able to run the tests, I'm having difficulty writing a test for this. I thought it could be done easily (read: hackishly) by deleting one of the seed rows in the invalidate.sql file. However, the snapshot_expected table is also built from the seed table.

The next issue is that the test asserts by checking the actual rows/values. This PR, however, sets dbt_valid_to to the current timestamp (we don't know exactly when the record was deleted, just that it happened somewhere between the last run and now). How can I assert on that, given there may be a delay of some (milli)seconds between writing the snapshot and the assertion?

Maybe the easiest option is to write a completely new test, but I'm not sure how to do that in the dbt project (and I have no experience with tox). Any pointers?

@beckjake
Contributor

beckjake commented Sep 14, 2020

There's a library called freezegun that we use in a few tests; maybe you can use that to fix the date? Check out the 025_timezone_tests test case, which uses @freeze_time to lock the current time. I'm not 100% sure that will work.

You might just have to write a new test and assert that the time is between two times you collect (before and after the dbt snapshot call). To add a new test, just create a new method whose name starts with test_. If you need a new class, you can make that as well; it just has to have a name that starts with Test, subclass DBTIntegrationTest, and override models and schema.

@joelluijmes
Contributor Author

joelluijmes commented Sep 15, 2020

Okay, I wrote a test for checking the deleted records. I just used the same seed data, deleted the last records, and had the test validate that their dbt_valid_to is set 👍
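For illustration, the check boils down to something like this (the schema, table name, and deleted id range here are hypothetical):

    -- after the second snapshot run, the hard-deleted rows should be invalidated
    select id, dbt_valid_to
    from test_schema.snapshot_actual
    where id between 11 and 20;  -- the ids deleted from the seed (hypothetical)
    -- expectation: every returned dbt_valid_to is non-null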

The tests under test/rpc still fail. The reason is that in test/rpc/test_snapshots.py, the updated_at value is a string rather than a real timestamp, which results in mixing string and timestamp types:

UNION types text and timestamp without time zone cannot be matched

For now, I just changed the data type to an actual timestamp. This got me thinking, though: my solution will only work if the source updated_at is an actual timestamp (since I'll be filling in now() for deleted records). To me this seems fine, since you explicitly select the timestamp strategy. However, I believe that (in theory) you can use any data type as long as the less-than/greater-than operators work.

What is your opinion on that? (Edit: and if you disagree, any suggestions on how to mitigate the problem of mixing types?)
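To make the type clash concrete, here is a minimal Postgres reproduction of the error above (illustrative, not the actual test data):

    create temporary table src (updated_at text);

    -- fails: UNION types text and timestamp without time zone cannot be matched
    select updated_at from src
    union all
    select localtimestamp;

    -- works once the column is (or is cast to) a real timestamp
    select cast(updated_at as timestamp) from src
    union all
    select localtimestamp;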

--

Additionally, I wanted to add a config option to make this feature opt-in, since it changes the current behavior (and may clash with existing expectations; this was also mentioned on the earlier PR).

@jtcohen6
Contributor

jtcohen6 left a comment


Additionally, I wanted to add a config option to make this feature opt-in, since it changes the current behavior

Strongly agree! Users are already quite accustomed to configuring snapshots, so one more config (remove_hard_deletes?) makes perfect sense.
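(For reference: the option eventually merged under the name invalidate_hard_deletes. A snapshot opting in might look like the sketch below; the schema, source, and key names are illustrative.)

    {% snapshot orders_snapshot %}

        {{
            config(
                target_schema='snapshots',
                unique_key='id',
                strategy='timestamp',
                updated_at='updated_at',
                invalidate_hard_deletes=True
            )
        }}

        select * from {{ source('jaffle_shop', 'orders') }}

    {% endsnapshot %}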

(Review comment on test/rpc/test_snapshots.py; outdated, resolved)
@joelluijmes
Contributor Author

joelluijmes commented Sep 15, 2020

Okay, I made it opt-in; hopefully this also fixes the failing Redshift test 🤞

I think the PR is ready for code review 😇 .

EDIT: what should the base branch be before merging? The new default branch, or dev/0.18.1?

@joelluijmes marked this pull request as ready for review September 15, 2020 14:20
@joelluijmes changed the base branch from dev/marian-anderson to dev/kiyoshi-kuromiya September 15, 2020 14:47
@joelluijmes
Contributor Author

Okay, the Redshift tests still seem to fail. However, they fail at test_docs_generate.py; unfortunately the logs don't really make sense to me, and I don't have access to Redshift..

Can someone check to see what the problem might be?

@beckjake
Contributor

Don't worry about those redshift failures right now. Amazon pushed an update out that automatically adds dist and sort keys to our tables, which breaks the docs tests that check for dist/sort keys and expect them to not exist. We'll probably just merge without that test passing, I have a fix for it in another open PR.

@beckjake
Contributor

beckjake left a comment


This looks great! I've made a couple of small suggestions, but I think this is generally good to go.

Should this work for other adapters? Even if this is postgres/redshift only, I'm totally fine with merging it, but if we expect it to work on snowflake/bigquery we should test it.

@joelluijmes
Contributor Author

This looks great! I've made a couple of small suggestions, but I think this is generally good to go.

Done ✅

Should this work for other adapters? Even if this is postgres/redshift only, I'm totally fine with merging it, but if we expect it to work on snowflake/bigquery we should test it.

Yeah, it should be database agnostic, but to prove it... I added tests for the other databases (BigQuery, Snowflake, and Redshift) 🥳.

@beckjake
Contributor

beckjake left a comment


It looks like the new redshift test is failing! I wonder if it's because redshift uses the postgres code you changed in snapshot_merge.sql, and redshift doesn't support text as a real type? I don't see any actual errors in the output though...

_________ TestSnapshotHardDelete.test__redshift__snapshot_hard_delete __________
[gw0] linux -- Python 3.6.9 /home/dbt_test_user/project/.tox/integration-redshift-py36/bin/python

self = <test_simple_snapshot.TestSnapshotHardDelete testMethod=test__redshift__snapshot_hard_delete>

    @use_profile('redshift')
    def test__redshift__snapshot_hard_delete(self):
        self._seed_and_snapshot()
        self._delete_records()
>       self._invalidate_and_assert_records()

test/integration/004_simple_snapshot_test/test_simple_snapshot.py:856: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/integration/004_simple_snapshot_test/test_simple_snapshot.py:831: in _invalidate_and_assert_records
    self.assertIsInstance(result[-1], datetime)
E   AssertionError: None is not an instance of <class 'datetime.datetime'>

@joelluijmes
Contributor Author

Oef, good catch! That is a weird error, though: None implies that it didn't actually snapshot the hard-deleted data.

I don't think it's due to the changes in snapshot_merge.sql, as that is basically a copy of the original logic. In my test I tried to imitate the logic of TestSimpleSnapshotFiles (i.e., using test-snapshots-pg for non-BigQuery targets, and the same seed data)...

I need to dig into this; maybe I'll open an AWS trial to replicate it locally 🤓

@beckjake
Contributor

beckjake commented Sep 16, 2020

Ok! You can also feel free to set it up so that Redshift raises an error when this config is enabled, and a future contributor who wants Redshift support (or a team member) can add it later, as long as the existing behavior continues to work ok. That could involve copy-pasting the postgres macro into redshift, but without the deleted clause; a sketch of such a guard is below. There's plenty of precedent for user-contributed features to only apply to the databases their authors personally care about.

It's not reasonable on our part to require you to create a redshift account for this.
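A minimal sketch of that guard, assuming the adapter-dispatch naming convention of the time and that postgres__snapshot_merge_sql is the macro Redshift currently inherits (both are assumptions, not code from this PR):

    {% macro redshift__snapshot_merge_sql(target, source, insert_cols) -%}
        {% if config.get('invalidate_hard_deletes', default=false) %}
            {# hard deletes are not supported on Redshift yet: fail loudly #}
            {{ exceptions.raise_compiler_error(
                "invalidate_hard_deletes is not supported on Redshift"
            ) }}
        {% endif %}
        {# otherwise fall back to the existing postgres implementation #}
        {{ postgres__snapshot_merge_sql(target, source, insert_cols) }}
    {%- endmacro %}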

@joelluijmes
Contributor Author

Okay, good to know; we kinda need this feature soon 😬
Anyhow, I'll probably spend a little time debugging this (as I'm kinda intrigued about why it doesn't work); if I can't make it work without too much effort, I'll take your suggestion 👍

@joelluijmes
Contributor Author

Okay, fixed 😄 The test failed because the results were not sorted as I expected during the assertions.

@beckjake
Contributor

beckjake left a comment


This looks great to me! Thank you for contributing to dbt, @joelluijmes!

@joelluijmes
Contributor Author

Noice! This config probably needs to be included in the documentation. Is that something I should do, or do you (Fishtown) generally update the documentation? I'm fine with doing it myself, but my English isn't perfect 😇

@jtcohen6
Contributor

@joelluijmes If you're up for taking a first pass at the docs for this, I'd be more than happy to review and help out! I think an additional configuration + FAQ on this page would be a good place to start.

@joelluijmes
Contributor Author

@joelluijmes If you're up for taking a first pass at the docs for this, I'd be more than happy to review and help out! I think an additional configuration + FAQ on this page would be a good place to start.

dbt-labs/docs.getdbt.com#377

@clrcrl
Contributor

clrcrl commented Sep 18, 2020

Chiming in here! Wowee, I closed that original PR and never thought someone else would pick it up! Nice work @joelluijmes 👍

@jtcohen6
Contributor

jtcohen6 left a comment


Nice one @joelluijmes! This has been a long time coming, and I know several members of the community are thrilled to see it. Thanks for writing some excellent docs as well.

You can pull in changes from dev/kiyoshi-kuromiya to get fixes for failing Redshift tests, or @beckjake can merge this in as-is as an admin. I'm happy with either.

@joelluijmes
Contributor Author

Just merged as suggested, but uh oh, now it fails at test/integration/029_docs_generate_tests/test_docs_generate.py

@beckjake
Contributor

I hit that merge issue as well, and I don't know why! Weird git things. I think the fix is to add unrendered_config: {}, after this line: https://github.com/fishtown-analytics/dbt/blob/e96cf02561a2b50ccfbf7c02cbc9c1af6048ace1/test/integration/029_docs_generate_tests/test_docs_generate.py#L1438

@joelluijmes
Contributor Author

Is there something I can do to fix the build? Or will you be taking the PR over from here?

@beckjake
Contributor

I think this is fine; that test failure was just some ephemeral Windows thing. I'll merge this. Thanks again for contributing to dbt @joelluijmes!

@beckjake merged commit 8ee490b into dbt-labs:dev/kiyoshi-kuromiya Sep 23, 2020
@seunghanhong

Hi @joelluijmes, this does not reinstate deleted records when they reappear in the source data. Any suggestions?

@joelluijmes
Contributor Author

joelluijmes commented Oct 5, 2020

Hi @joelluijmes, this does not reinstate deleted records when they reappear in the source data. Any suggestions?

That is true. I kinda hoped it would be added back as an update, but that is not the case. Snippet:

updates as (

    select
        'update' as dbt_change_type,
        source_data.*,
        snapshotted_data.dbt_scd_id

    from updates_source_data as source_data
    join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
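    -- only still-open snapshot rows are compared here, so a record that was
    -- closed out by a hard delete never re-enters as an update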
    where snapshotted_data.dbt_valid_to is null
    and (
        {{ strategy.row_changed }}
    )
)

@beckjake @jtcohen6, what do you think? In theory, I think we could remove the where snapshotted_data.dbt_valid_to is null clause, but I'm not sure whether that breaks anything I might be missing.

@jtcohen6
Contributor

jtcohen6 commented Oct 5, 2020

Interesting problem. I'm not sure how common it is for hard-deleted records to return from oblivion, and I'd also be willing to document this as a known limitation of invalidate_hard_deletes.

That said, if we're interested in handling this case: my instinct would be to treat these "returning" records as insertions rather than updates. I'm thinking this might work just by altering the snapshotted_data CTE, which is used in the downstream joins that identify inserts and updates, by adding where dbt_valid_to is null.

That way, the previously hard-deleted record is not included for comparison, and its new record is treated like any other record with a net-new primary key. The time between the previous dbt_valid_to and the new dbt_valid_from is the time during which that record was in hard-deletion limbo.

@joellabes
Contributor

joellabes commented Oct 5, 2020

I'm not sure how common it is for hard-deleted records to return from oblivion

There are a couple of use cases for us:

  1. An EAV table, where the PK is a combination of org_id and attribute_key. Rows can be deleted and recreated over time.
  2. A table that links a class to its students and teachers. Users' class memberships change over time, and nothing says that once they leave, they can never return.

I'd love to see an is_deleted column as part of the snapshot, which would be just another slowly changing dimension (a sketch of deriving such a flag downstream follows below). Right now, we're making snapshot sandwiches to achieve that goal.
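Pending first-class support, such a flag can be approximated in a downstream model from the snapshot's metadata columns. A minimal sketch, assuming invalidate_hard_deletes is enabled and a unique key named id (the model and column names are illustrative):

    with latest as (

        select
            *,
            row_number() over (
                partition by id
                order by dbt_valid_from desc
            ) as row_num
        from {{ ref('my_snapshot') }}

    )

    -- a key's most recent snapshot row is "deleted" if it has been closed out
    select
        *,
        (dbt_valid_to is not null) as is_deleted
    from latest
    where row_num = 1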

@seunghanhong

I've implemented this on top of @joelluijmes's solution, and it works. (The snippets below are excerpts from the modified staging logic: the new CTE definitions, the deletes and reinsertions CTEs, and the additions to the final union all.)

    {%- if strategy.invalidate_hard_deletes %}

    deletes_source_data as (

        select 
            *,
            {{ strategy.unique_key }} as dbt_unique_key
        from snapshot_query
    ),

    deleted_snapshotted_data as (
        select
            {{ strategy.unique_key }} as dbt_unique_key
        from snapshotted_data
        group by
            {{ strategy.unique_key }}
        having count(dbt_valid_to) = count(*)
    ),
    {% endif %}
    {%- if strategy.invalidate_hard_deletes -%}
    ,

    deletes as (
    
        select
            'delete' as dbt_change_type,
            source_data.*,
            {{ snapshot_get_time() }} as dbt_valid_from,
            {{ snapshot_get_time() }} as dbt_updated_at,
            {{ snapshot_get_time() }} as dbt_valid_to,
            snapshotted_data.dbt_scd_id
    
        from snapshotted_data
        left join deletes_source_data as source_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where snapshotted_data.dbt_valid_to is null
        and source_data.dbt_unique_key is null
    ),

    reinsertions as (
        select
            'insert' as dbt_change_type,
            source_data.*
        from insertions_source_data as source_data
        join deleted_snapshotted_data on deleted_snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
    )
    {%- endif %}
    {%- if strategy.invalidate_hard_deletes %}
    union all
    select * from deletes
    union all
    select * from reinsertions
    {%- endif %}

@seunghanhong

seunghanhong commented Oct 6, 2020

In theory, I think we could remove the where snapshotted_data.dbt_valid_to is null clause, but I'm not sure whether that breaks anything I might be missing.

If it's removed, it will insert 'update' records based on historical records, because that filter is what selects the current snapshot records. In the following case, it would insert the record with id 2 from the source table into the target:

source table

| id | col1 |
| --- | --- |
| 1 | val1 |
| 2 | val2 |
| 3 | val3 |

target table

| id | col1 | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- |
| 1 | val1 | 2020-09-01 | null |
| 2 | val2-previous | 2020-09-01 | 2020-09-02 |
| 2 | val2 | 2020-09-02 | null |
| 3 | val3 | 2020-09-01 | null |

@jtcohen6
Contributor

jtcohen6 commented Oct 6, 2020

@joellabes Ah, that's helpful context. It makes sense that there's quite a bit of fluidity in many-to-many mapping tables.

@seunghanhong You're right about why we can't remove where snapshotted_data.dbt_valid_to is null. If anything, I think we should add that line further up.

I may not fully grasp all the logic in the approach you outlined above. At the moment, I don't see why all those additional CTEs are necessary; I think it would be sufficient to treat "returns" like any other insertions, rather than specifying a new category of reinsertions, by updating this CTE to read:

    snapshotted_data as (

        select *,
            {{ strategy.unique_key }} as dbt_unique_key

        from {{ target_relation }}
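        -- only compare against rows that are still current; closed-out
        -- (hard-deleted) rows drop out, so a returning key becomes an insert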
        where dbt_valid_to is null

    ),

In any case, I'd welcome a new issue on the subject of handling hard-delete "returns" so we can move the discussion there and determine the most effective and concise approach.

Successfully merging this pull request may close these issues:

  • snapshots should handle hard deletes