
insert_overwrite (insert+replace) partitions incremental strategy #201

Merged
merged 10 commits into from
Aug 1, 2024

Conversation

bryzgaloff
Contributor

@bryzgaloff bryzgaloff commented Oct 26, 2023

Summary

This PR implements the insert+replace strategy discussed in #128, which does the following:

  • Creates a new staging table with the same structure as the target table.
  • Inserts data into the staging table.
  • Replaces partitions in the target table from the staging table.

Advantages:

  • Only the involved partitions are replaced: this is much cheaper than reinserting the full table, as the other strategies do.
  • If an insertion fails, the target table is not affected.
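In ClickHouse terms, the three steps above correspond roughly to the following statements (a hedged sketch with hypothetical table and partition names; the actual implementation is the clickhouse__incremental_insert_replace macro discussed in the review below):

```sql
-- Sketch only: table and partition names are hypothetical.

-- 1. Create a staging table with the same structure as the target.
CREATE TABLE target__staging AS target;

-- 2. Insert the new data into the staging table; a failure here leaves
--    the target table untouched.
INSERT INTO target__staging SELECT * FROM new_data;

-- 3. Replace only the affected partitions in the target from staging.
ALTER TABLE target REPLACE PARTITION '2023-10-26' FROM target__staging;
```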

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@CLAassistant

CLAassistant commented Oct 26, 2023

CLA assistant check
All committers have signed the CLA.

@bryzgaloff
Contributor Author

At the time of publishing, this PR is in a WIP (work in progress) state, since I need advice from the community. For that reason, neither documentation nor tests have been updated yet.

Comment on lines 45 to 47
inserts_only
or unique_key is none
and config.get('incremental_strategy', none) != 'insert+replace' -%}
Contributor Author

@bryzgaloff bryzgaloff Oct 26, 2023


Mixing the legacy (inserts_only) and new ("append" strategy) approaches to configuring the strategy introduces a counter-intuitive and even conflicting set of configuration options in dbt-clickhouse. To avoid breaking the current condition-checking flow, I have implemented it in this dirty way, though in general I would suggest refactoring this tree of conditions.

I would also introduce strict config-consistency checks: e.g., inserts_only should prohibit any incremental_strategy other than "append" (emitting a warning in that case, since it is redundant) or leaving it unspecified. Please let me know if any work here has already started, by referencing an issue or a PR.

To narrow the review scope, I won't blend those changes into this PR, so I suggest keeping this dirty-yet-working workaround here.

dbt/adapters/clickhouse/impl.py (outdated, resolved)
@@ -234,3 +244,36 @@
{% do adapter.drop_relation(new_data_relation) %}
{{ drop_relation_if_exists(distributed_new_data_relation) }}
{% endmacro %}

{% macro clickhouse__incremental_insert_replace(existing_relation, intermediate_relation, partition_by) %}
Copy link
Contributor


This strategy should also account for cluster setups and distributed tables.
We can insert data through the Distributed table and then run the replace on the local table on each shard. This part will be mostly the same as in the clickhouse__incremental_delete_insert macro.

Contributor Author

@bryzgaloff bryzgaloff Oct 27, 2023


Should we maybe defer the distributed implementation until the next iteration? My immediate requirement was to support the strategy for single-node setups (which my team and I currently use). We could leave it to others to implement and test the cluster-specific version; I won't have a quick way to test it, so I wouldn't be able to confirm it works.

Your suggestion with system.parts works completely fine, thanks for that! I plan to cover it with tests and document it in the next couple of weeks (I'm on vacation next week :) ).

My team will test the approach on our real use cases to make sure it works.

Copy link
Contributor


I'm not sure about merging without Distributed support: this is new and mostly experimental functionality, so I think this is a question for @genzgd, as he is a maintainer.

But I believe it should work with a cluster from the beginning, because this is core ClickHouse functionality, and most production users don't run a single-server setup.

Contributor Author

@bryzgaloff bryzgaloff Nov 17, 2023


We currently use dbt with ClickHouse without a cluster, so I have not had an opportunity to test it. I suggest merging the tested version of my PR now. Once someone needs a clustered version of the strategy, they can contribute it and test it themselves.

If I add the cluster-related logic, I cannot guarantee that it works. However, if the maintainers are OK with relying on their own review, I am completely fine with adding the related code snippet.

Please let me know which code snippet I should add, if it is required. Once again, my suggestion is to proceed with a tested version, leaving room for someone else's contribution.

{%- endcall %}
{% if execute %}
{% set select_changed_partitions %}
select distinct {{ partition_by|join(', ') }}
Copy link
Contributor


We can do something like this to get the partitions. This way we are guaranteed to select all partitions in the temp relation, and it will be faster:

{% set partitions = get_partitions(relation) %}
...
{% macro get_partitions(relation) %}
  {% set cluster = adapter.get_clickhouse_cluster_name() %}
  {% set source = 'system.parts' %}
  {% if cluster is not none %}
    {% set source = "cluster('" ~ cluster ~ "', system.parts)" %}
  {% endif %}
  {% set sql -%}
    SELECT DISTINCT partition_id
    FROM {{ source }}
    WHERE active AND database = '{{ relation.schema }}' AND table = '{{ relation.identifier }}'
  {%- endset -%}

  {{ return(run_query(sql)) }}
{% endmacro %}

@bryzgaloff
Contributor Author

Hi @simpl1g and @genzgd, I've wrapped up this PR for the insert+replace strategy implementation: I added documentation to the README and integration tests (modeled on lw-deletes). Motivation for the feature is given in the PR's description.

Some discussions above are still unresolved; I'm waiting for your input there :)

Please take a look at the PR when you get a chance and let me know your thoughts. I hope we can merge the feature soon and start enjoying the new strategy! :)

@bryzgaloff bryzgaloff changed the title [WIP] insert+replace partitions incremental strategy insert+replace partitions incremental strategy Nov 22, 2023
@bryzgaloff
Contributor Author

Hi @simpl1g and @genzgd! This is a kind reminder about my PR, which is ready for your review. We have successfully battle-tested it internally.

We currently install my version from GitHub. It would be nice if you could approve it and release it to PyPI. If any adjustments are required, please let me know! 🙏

@genzgd
Contributor

genzgd commented Dec 15, 2023

@bryzgaloff I apologize that we haven't yet had the resources to fully review this PR. As you may have noticed we've been focused on bug fixes and compatibility with the new dbt releases. Please know that we very much appreciate the contribution (especially with test cases and real world usage) and your work is next on the roadmap as we get time.

If you have a chance to resolve the conflicts over the next few weeks that would be appreciated and make the review just a bit easier.

create table {{ intermediate_relation }} as {{ existing_relation }}
{%- endcall %}
{% call statement('insert_new_data') -%}
insert into {{ intermediate_relation }} select * from {{ new_data_relation }}


@bryzgaloff as you said, you have a single-node deployment. There should be an {{ on_cluster_clause() }} here; otherwise, the error below occurs in my deployment. I'm currently working on adapting it to work on a cluster.

15:04:39    :HTTPDriver for http://10.100.0.106:8123 returned response code 400)
15:04:39     Code: 36. DB::Exception: Macro 'uuid' and empty arguments of ReplicatedMergeTree are supported only for ON CLUSTER queries with Atomic database engine. (BAD_ARGUMENTS) (version 24.2.1.2248 (official build))

Also, I had to configure the model with the allow_nullable_key setting, because otherwise another error occurs. Have you run into that?

14:18:56    :HTTPDriver for http://10.100.0.106:8123 returned response code 400)
14:18:56     Code: 44. DB::Exception: There was an error on [10.100.0.106:9000]: Code: 44. DB::Exception: Partition key contains nullable columns, but merge tree setting `allow_nullable_key` is disabled. (ILLEGAL_COLUMN) (version 24.2.1.2248 (official build)). (ILLEGAL_COLUMN) (version 24.2.1.2248 (official build))
{{
config(
    materialized = "incremental",
    partition_by = "transaction_date_part",
    incremental_strategy = "insert+replace",
    engine = 'ReplicatedMergeTree',
    order_by = ['city_id', 'transaction_date_part'],
    schema = "analytics",
    settings = {'allow_nullable_key': 1}
)
}}
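A hedged sketch of the suggested adjustment, based on the macro excerpt quoted above (assuming dbt-clickhouse's on_cluster_clause macro, which renders an ON CLUSTER clause when a cluster is configured and nothing otherwise; its exact signature may differ between versions):

```sql
{% call statement('main') -%}
    create table {{ intermediate_relation }} {{ on_cluster_clause() }} as {{ existing_relation }}
{%- endcall %}
{% call statement('insert_new_data') -%}
    insert into {{ intermediate_relation }} select * from {{ new_data_relation }}
{%- endcall %}
```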

Copy link
Contributor Author


Hi @dev-mkc19, thank you for your input! I plan to get back to working on the PR next week. For now, I will keep it without cluster support as I do not have quick-to-setup infrastructure to test it. Feel free to make your own PR adding cluster support 🤝 I may review it to ensure it does not break any of the insert+replace semantics. Tag me as a reviewer once it is published!

@BentsiLeviav BentsiLeviav self-assigned this Jul 3, 2024
@BentsiLeviav
Contributor

@bryzgaloff
Thanks again for your contribution!
I would like to review your PR and merge it within the next few days. Can you sync your fork with the main repo? We upgraded dbt-core to 1.8.0.

After syncing I'll review this one right away.

@bryzgaloff
Contributor Author

Hi @BentsiLeviav thank you (and the other reviewers, of course!) for your participation. I am not actively using the plugin right now, but I may get back to handling your review feedback late next week. If the conflicts are not too critical, I might be able to resolve them quickly.

…ment insert+replace strategy

TODO: convert partition_expression to ClickHouse literals
…ting affected partitions from system.parts

This avoids translating agate.Row values into ClickHouse literals.
…m.parts.partition_id -> partition

partition_id is a String field containing the internal partition ID; it cannot be used in a REPLACE PARTITION clause. The "partition" field is a string representation of the partition expression and can be used in a REPLACE PARTITION query as-is.
…ify partition by ID

The system.parts.partition field does not work for strings. ClickHouse allows manipulating partitions by referencing their IDs.
According to PR review comments.
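The last two commits can be illustrated with a short sketch (hypothetical database and table names; the system.parts columns are real): select the affected partition IDs from system.parts, then replace partitions by ID, which works regardless of the partition key's type:

```sql
-- Find the IDs of partitions present in the staging table.
SELECT DISTINCT partition_id
FROM system.parts
WHERE active
  AND database = 'analytics'
  AND table = 'target__staging';

-- Replace each affected partition in the target by its ID; PARTITION ID
-- avoids quoting issues with String (and other) partition expressions.
ALTER TABLE target REPLACE PARTITION ID '20231026' FROM target__staging;
```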
@bryzgaloff
Copy link
Contributor Author

Hi @BentsiLeviav I have rebased my contribution onto the latest main of this repository. Also, I have renamed the strategy to insert_replace. Please have a look! 👋

@bryzgaloff
Contributor Author

bryzgaloff commented Jul 15, 2024

Does this repo have automated tests? I see there are workflows awaiting approval.

I do not have quick infrastructure to retest the contribution after the rebase, but if there is no testing workflow, I will perform the manual testing.

@bryzgaloff bryzgaloff changed the title insert+replace partitions incremental strategy insert_overwrite (insert+replace) partitions incremental strategy Jul 15, 2024
@bryzgaloff
Contributor Author

bryzgaloff commented Jul 22, 2024

Hi @BentsiLeviav and @genzgd — will you have a chance to review and merge the PR soon? I have updated everything according to the review feedback, and all the checks have passed. I would like to avoid having to rebase it again 😅

@BentsiLeviav
Contributor

Hi @bryzgaloff
Huge thanks for this.
I will review and update you within the next few days.

@bryzgaloff
Contributor Author

Thank you @BentsiLeviav for the approval! What are the next steps for the PR to be merged?

@BentsiLeviav
Contributor

Hi @bryzgaloff
Sure, thank you for your contribution!

Before merging this, could you please add to the doc that this feature is experimental and wasn't tested with a cluster setup?
It is crucial to highlight these two points.

Once we are done with that, I'll merge your PR.

Thanks again for your work!

@BentsiLeviav
Contributor

Never mind, I'll take care of it :)

@BentsiLeviav BentsiLeviav merged commit 46904a4 into ClickHouse:main Aug 1, 2024
21 checks passed
@bryzgaloff
Contributor Author

Thank you for all your help and the merge, @BentsiLeviav! 🤝

6 participants