Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disable rows deduplication if Incremental is attached to a resource with merge write disposition #1131

Closed
rudolfix opened this issue Mar 22, 2024 · 1 comment · Fixed by #1877
Assignees
Labels
sprint Marks group of tasks with core team focus at this moment

Comments

@rudolfix
Copy link
Collaborator

rudolfix commented Mar 22, 2024

Feature description

Incremental will remove duplicating rows (only with the same cursor field value, read the docs) based on content hash or primary key. This is not expected behavior when merge key is present - destination should be allowed to merge changes coming from the new records.
We want to disable "deduplication" in Incremental and expose destination to all the rows from data source.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

Please see #971 (comment)

Proposed solution

    • disable deduplication by setting primary_key of Incremental to () (disable dedup) when resource has "merge" write disposition. we already propagate resource key to Incremental key so change should be easy
    • keep explicitly set Incremental primary key (should already work like that)
    • observe apply_hints changes in the resource (_set_hints) and if primary
    • make sure that using with_hints marker that is executed during resource execution is able to apply write disposition and disable the deduplication
    • also test for transformers that bind() the incremental late
    • warn if dedup state grows too big ie. > 200 entries

Related issues

No response

@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Apr 22, 2024
@rudolfix rudolfix moved this from In Progress to Planned in dlt core library Apr 22, 2024
@rudolfix rudolfix moved this from Planned to Todo in dlt core library Jun 3, 2024
@rudolfix rudolfix moved this from Todo to Planned in dlt core library Aug 25, 2024
@rudolfix rudolfix self-assigned this Aug 25, 2024
@rudolfix rudolfix moved this from Planned to In Progress in dlt core library Sep 11, 2024
@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Sep 11, 2024
@rudolfix rudolfix moved this from In Progress to Planned in dlt core library Sep 11, 2024
@rudolfix rudolfix removed their assignment Sep 23, 2024
@willi-mueller willi-mueller linked a pull request Sep 26, 2024 that will close this issue
@willi-mueller
Copy link
Collaborator

Implementation strategy:

  1. read the write_disposition from the table_schema_template in DltResource._set_hints()
  2. read the write_disposition from the hints_template in DltResourceHints._set_hints()
  3. set the primary key of the incremental object to () because if it's a Sequence and emty it causes no deduplication.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sprint Marks group of tasks with core team focus at this moment
Projects
Status: Done
3 participants