Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Request for Comment] Should destinations tombstone or remove CDC deletions? #31239

Closed
evantahler opened this issue Oct 10, 2023 · 1 comment
Closed
Assignees
Labels
team/destinations Destinations team's backlog

Comments

@evantahler
Copy link
Contributor

evantahler commented Oct 10, 2023

Hello Airbyte Community!

Now that we've launched Destinations V2 (#26028), we are able to more easily make changes to how data is represented in your data warehouse's final tables based on your feedback. The first area of investigation is how we represent deletions from CDC sources (e.g. from source-postgres).

Today, if a row is deleted in your source, we remove it in the Destination's final table as part of the deduplication process - the row will not be present in the final table. However, an alternative exists in which we leave the row in the final table, but have a "Tombstone" or "Soft Delete" column present that is either null (the row exists in the source) or non-null (the row has been deleted from the source), e.g.:

id name updatd_at _ab_cdc_deleted_at
1 Evan 2023-01-01 null
2 Edward 2023-01-02 2023-01-03

In the example above, "Evan" (user 1) exists in the source, and "Edward" (user 2) has been deleted. In most cases (depending on the CDC source), the tombstone column will be _ab_cdc_deleted_at, a timestamp. You would gain the additional information about when the row was deleted as well. Should you want a view of your data in the destination which more closely resembles the source (e.g. current behavior), you can filter out WHERE _ab_cdc_deleted_at IS NULL and either make a new table or view for your downstream analysis.

Of note, we currently only remove deleted rows for CDC database sources. There are many API source which also provide a deletion/tombstone column (e.g. source-salesforce) whose records will remain in your final table. Switching CDC deletes to soft-deletes would homogenize how Airbyte works for all sources. It also has the benefit of speeding up the Typing and Deduping process.

So... which do you prefer? Respond with a 👍 to change soft-deleting (leaving the row in the final tables + tombstone column) or a 👎 to keep the existing behavior and remove deleted rows from the final table.

@evantahler evantahler added the team/destinations Destinations team's backlog label Oct 10, 2023
@evantahler evantahler changed the title [Request for Comment] Should destinations soft-delete or hard-delete CDC deletions? [Request for Comment] Should destinations tombstone or remove CDC deletions? Oct 10, 2023
@evantahler evantahler self-assigned this Oct 10, 2023
@evantahler
Copy link
Contributor Author

Some of the reasons I'm on team "soft delete"

  • There are real use cases for auditing deletes which Airbyte today cannot support. Soft deletes allows that
  • There's a very tight coupling today between source and destination that is tricky and unnecessary. E.g if the source wants to change the name of the tombstone column from _ab_cdc_deleted_at to deleted_at, it requires breaking changes in the source and destination
  • I want to speed up Typing and Deduping

@airbytehq airbytehq locked and limited conversation to collaborators Oct 10, 2023
@evantahler evantahler converted this issue into discussion #31242 Oct 10, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
team/destinations Destinations team's backlog
Projects
None yet
Development

No branches or pull requests

1 participant