Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplication when using ReplicatedMergeTree with delete+insert incremental strategy #213

Closed
the4thamigo-uk opened this issue Nov 28, 2023 · 15 comments · Fixed by #214
Closed
Assignees
Labels
bug Something isn't working

Comments

@the4thamigo-uk
Copy link
Contributor

Describe the bug

Steps to reproduce

  1. build an incremental model using Replicated merge tree and delete+insert strategy
  2. populate the model as full refresh
  3. re-run as incremental
  4. re=run as incremental

Expected behaviour

step 4 should delete the records inserted in step 3 and reinsert them

Actual behaviour

The records are not reinserted in step 4

Explanation

It looks like when you insert exactly the same set of records in the incremental update, the replication deduplication mechanism kicks in and ignores the new block of rows.

I am working around this by adding an alter table... modify setting hook, but it is not ideal. It would be nice if I could specify this as a setting when the table is created, or alternatively that it is set automatically when using the delete+insert strategy, as otherwise it wouldnt be guaranteed to work.

@the4thamigo-uk the4thamigo-uk added the bug Something isn't working label Nov 28, 2023
@genzgd
Copy link
Contributor

genzgd commented Nov 28, 2023

Just to clarify, the problem is that in Step 4, the lightweight DELETE occurs, but the reinsert does not because of built in ClickHouse deduplication, so those records are just missing?

It seems reasonable to disable deduplication for the destination table, I'll take a look.

@the4thamigo-uk
Copy link
Contributor Author

Yep thats what seems to happen.

@genzgd
Copy link
Contributor

genzgd commented Nov 28, 2023

For the short term it seems like intead of using a hook you should be able to add the insert_deduplicate: 0 into the model config settings, which should be automatically picked up by this code:

 @available
    def get_model_settings(self, model):
        settings = model['config'].get('settings', dict())
        res = []
        for key in settings:
            res.append(f' {key}={settings[key]}')
        return '' if len(res) == 0 else 'SETTINGS ' + ', '.join(res) + '\n'

Is that a fix you can validate quickly?

My current thought is to add insert_deduplicate: 0 automatically into this method as a fix, and have an override configuration value if the user really wants to leave ClickHouse deduplication on.

@the4thamigo-uk
Copy link
Contributor Author

Ah that looks like a neater solution. Give me 15mins to try it out.

@the4thamigo-uk
Copy link
Contributor Author

the4thamigo-uk commented Nov 28, 2023

DB::Exception: Unknown setting insert_deduplicate: for storage ReplicatedMergeTree. (UNKNOWN_SETTING) (version 23.10.3.5 (official build)). (UNKNOWN_SETTING) (version 23.10.3.5 (official build))

@genzgd
Copy link
Contributor

genzgd commented Nov 28, 2023

Ugh, apparently it doesn't work for a table level setting. I'm curious how your ALTER TABLE statement works in that case.

I'm thinking an alternative solution is just to provide a unique insert_deduplication_token in the macro for each insert.

@the4thamigo-uk
Copy link
Contributor Author

For reference I added it like this :

{{
  config(
    ...
    settings={
      'insert_deduplicate':0
    }
  )
}}

@the4thamigo-uk
Copy link
Contributor Author

the4thamigo-uk commented Nov 28, 2023

Oh, these are table level settings... I figured these were settings on the INSERT? If so then replicated_deduplication_window=0 should work

@the4thamigo-uk
Copy link
Contributor Author

Yeah the above works... thanks... I wasnt aware you could set the table SETTINGS this way.

@the4thamigo-uk
Copy link
Contributor Author

Probably need the settings option in the readme?

@genzgd
Copy link
Contributor

genzgd commented Nov 28, 2023

cool, that seems like a reasonable solution. And yeah, I think that needs to be called out better in the documentation.

@the4thamigo-uk
Copy link
Contributor Author

Still maybe worth defaulting to this option for this strategy, do you think?

@genzgd
Copy link
Contributor

genzgd commented Nov 28, 2023

Yes, I think it's a good default for any of the dbt incremental strategies. The ClickHouse automatic deduplication is sort of an unexpected gotcha in such a high level tool.

@the4thamigo-uk
Copy link
Contributor Author

Yeah, definitely had me bamboozled earlier... thanks for speedy response. Ill leave this open for you to decide how you want to deal with this. Thanks again.

@genzgd
Copy link
Contributor

genzgd commented Nov 28, 2023

You're welcome, thanks for the report. It just so happens I'm in the middle of a lot of dbt-clickhouse work, so this is good timing :)

@genzgd genzgd self-assigned this Nov 28, 2023
@genzgd genzgd linked a pull request Nov 29, 2023 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants