
[CT-1576] [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models #335

Closed
danielmast opened this issue Nov 30, 2022 · 9 comments
Labels: enhancement (New feature or request), Stale

@danielmast

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Updating the cluster_by config on an incremental dbt model does not trigger an ALTER TABLE statement on the table. As a result, the cluster keys of the Snowflake table are not updated.

Expected Behavior

Updating the cluster_by config should trigger an ALTER TABLE ... CLUSTER BY statement in Snowflake, updating the cluster keys.
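For example, after changing cluster_by in the reproduction steps below, the next dbt run would be expected to issue something like this (schema and table names are hypothetical):

alter table analytics.my_model cluster by (key1, key2, key3);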

Steps To Reproduce

  1. Create a new model:
{{
    config(
        enabled=true,
        materialized='incremental',
        cluster_by=['key1', 'key2']
    )
}}

SELECT 1 as key1
,      2 as key2
,      3 as key3
  2. Run dbt so that the table is created
  3. Verify in Snowflake that the cluster keys are key1 and key2
  4. Update the cluster_by value to ['key1', 'key2', 'key3']
  5. Run dbt again
  6. Verify that a second row of data has been added
  7. Notice in Snowflake that the cluster keys are still key1 and key2

(For a regular (non-incremental) table model this problem doesn't occur, because the entire table is recreated every time.)
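To inspect the cluster keys at the verification steps (3 and 7), you can run the following in Snowflake (schema name is hypothetical); the cluster_by column of the output shows the current clustering, e.g. LINEAR(key1, key2):

show tables like 'my_model' in schema analytics;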

Relevant log output

No response

Environment

- OS: Ubuntu 20.04.3 LTS
- Python: 3.8.10
- dbt-core: 1.3.1
- dbt-snowflake: 1.3.0

Which database adapter are you using with dbt?

snowflake

Additional Context

  • I'm willing to implement this change myself, but I first wanted to verify with the community whether this is indeed missing functionality, or whether I'm overlooking something.
  • I wasn't 100% sure whether this issue belongs in dbt-core or in dbt-snowflake. I chose dbt-core because the original issue 'Support cluster by on Snowflake' was also in dbt-core, as was the implementing PR.
@danielmast danielmast added bug Something isn't working triage labels Nov 30, 2022
@github-actions github-actions bot changed the title [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models Nov 30, 2022
@jtcohen6
Contributor

jtcohen6 commented Dec 1, 2022

@danielmast Thanks for opening! I am going to move this one over to dbt-snowflake, since I think that would be the right spot to implement the change. First, though, we should talk about whether this is really a bug.

Currently, the alter table statement for clustering happens within the create_table_as macro. It always runs during the "table" materialization. In the "incremental" materialization, it happens in two cases:

  • when creating the model (table) for the very first time
  • when --full-refresh is enabled

{%- if cluster_by_string is not none -%}
select * from (
    {{ compiled_code }}
) order by ({{ cluster_by_string }})
{%- else -%}
    {{ compiled_code }}
{%- endif %}

{% if cluster_by_string is not none and not temporary -%}
alter table {{ relation }} cluster by ({{ cluster_by_string }});
{%- endif -%}
{% if enable_automatic_clustering and cluster_by_string is not none and not temporary -%}
alter table {{ relation }} resume recluster;
{%- endif -%}

Clustering a Snowflake table is more than just a metadata update; it requires actually changing how the data is stored on disk. That's why we include the order by clause when creating the table for the first time. Otherwise, the alter table ... cluster by / resume recluster would be forced to shuffle and redistribute all the data right after having just written it.

My understanding is that adding a new clustering key during an incremental run would have the implicit effect of reshuffling all the preexisting data in the table, with significant costs in compute and time. That's a change better suited to an explicit --full-refresh.

I do see how it's confusing that the key is implicitly not added during the incremental run, though. We could look to detect the table's current clustering, and raise a warning if it differs from the one set in your config.
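A rough sketch of what such a check could look like, as a hypothetical macro (this is not existing dbt functionality; it assumes the cluster_by column returned by Snowflake's show tables, formatted like LINEAR(key1, key2)):

{% macro warn_on_cluster_by_change(relation, cluster_by) %}
    {% if execute %}
        {# Ask Snowflake for the table's current clustering key #}
        {% set result = run_query("show tables like '" ~ relation.identifier ~ "' in schema " ~ relation.schema) %}
        {% set current = result.columns['cluster_by'].values()[0] if result.rows else none %}
        {# Naive string comparison; a real implementation would normalize case/format #}
        {% set configured = 'LINEAR(' ~ cluster_by | join(', ') ~ ')' %}
        {% if current and current != configured %}
            {% do exceptions.warn("cluster_by config " ~ configured ~ " differs from the table's current clustering " ~ current ~ ". Run with --full-refresh to apply it.") %}
        {% endif %}
    {% endif %}
{% endmacro %}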

What do you think?

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Dec 1, 2022
@github-actions github-actions bot changed the title [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models [CT-1576] [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models Dec 1, 2022
@danielmast
Author

@jtcohen6 Thank you for your reply. I fully agree with your reasoning: reclustering is quite intensive, and users should in some way be made aware of that.

I think the most intuitive strategy would be: when someone changes the clustering config on an "incremental" materialization, raise a warning but also perform the change.

Warning: Changing the cluster_by config causes the table's data to be reshuffled. This can lead to significant compute costs.

(I'm not a native English speaker, so this text can probably be much better.)

I think it's most intuitive that a change to the cluster_by config actually performs the change. It can be expected of someone using Snowflake's clustering functionality that they are aware of the impact. Believing that the clustering change has been applied when it hasn't may have worse outcomes, or frustrate users.

I'm reading in the docs that:

If an incremental model is configured to use table clustering, then dbt will also order the staged dataset before merging it into the destination table.

I assume that the staged dataset will be ordered by the newly configured cluster keys, so that order won't match the actual clustering config of the table. All the more reason to perform the change, imo.
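To illustrate with a hypothetical sketch of the statements dbt would emit on that incremental run (simplified; with no unique_key the merge reduces to an insert):

create or replace temporary table analytics.my_model__dbt_tmp as (
    select * from (
        select 1 as key1, 2 as key2, 3 as key3
    ) order by (key1, key2, key3)  -- staged data ordered by the NEW cluster_by
);

insert into analytics.my_model (key1, key2, key3)
select key1, key2, key3 from analytics.my_model__dbt_tmp;
-- no "alter table ... cluster by" follows, so the table keeps LINEAR(key1, key2)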

What do you think of this proposal?

@danielmast
Author

I realize that this change should affect clustering behavior not only in Snowflake but also in BigQuery, as both use CLUSTER BY functionality. Do you agree?

@jtcohen6 jtcohen6 self-assigned this Jan 12, 2023
@Fleid
Contributor

Fleid commented Feb 11, 2023

@jtcohen6 I'm going to steal this one from you.

@danielmast I need to do some thinking here.

There is a larger topic emerging here, tied to materialized views (dynamic tables for Snowflake) and on_schema_change.
We may need additional settings for a user to let dbt know what to do when an operation requires full reprocessing of the model instead of the expected incremental update: should dbt force it, skip it, or fail?

I will get back to you next week.

@Fleid Fleid added enhancement New feature or request Refinement and removed triage bug Something isn't working labels Feb 11, 2023
@Fleid
Contributor

Fleid commented Feb 11, 2023

Linking to dbt-labs/dbt-core#6911

@jtcohen6 jtcohen6 removed their assignment Feb 11, 2023
@danielmast
Author

Thank you @Fleid. Since this issue will be part of a larger topic, I'll await that discussion for now. Please let me know if I can help with the implementation or anything else.

@Fleid Fleid self-assigned this Mar 9, 2023
@Fleid
Contributor

Fleid commented Mar 9, 2023

For MVs specifically, I just added a new parameter to the design called on_definition_change. It's the same principle as on_schema_change except it applies at the definition level - so operations that could trigger a full recompute are covered:

On dbt run:

  • If the definition is different from the current model, depending on on_definition_change:
    • ignore: Do nothing
    • fail: Fail the run
    • alter_available: Use an ALTER statement to try to apply the change; some unsupported changes may be ignored
    • create_replace: Use a CREATE (optionally CREATE OR REPLACE) statement to re-create the table
  • Refresh when necessary
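As a hypothetical example of how this could look in a model config (the parameter is a design proposal, not an implemented option):

{{
    config(
        materialized='materialized_view',
        on_definition_change='alter_available'
    )
}}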

Not sure how this would interact with on_schema_change.

But more importantly, I still have a big question mark about the feasibility of certain options.

Let's say we add something similar to incremental models and we want to cover cluster_by keys via alter_available. That means that at run time, dbt needs to ask Snowflake which cluster keys are currently set on the table. It looks like that's possible, but I'm not 100% positive about it.
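For reference, Snowflake does expose the current clustering key via the information schema, e.g. (database, schema, and table names are placeholders):

select clustering_key
from my_database.information_schema.tables
where table_schema = 'ANALYTICS'
  and table_name = 'MY_MODEL';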

Curious about your thoughts on all that.

@github-actions
Contributor

github-actions bot commented Sep 6, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 6, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 13, 2023