
[CT-1576] [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models #335

Closed
danielmast opened this issue Nov 30, 2022 · 9 comments
Labels: enhancement (New feature or request), Stale

@danielmast

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Updating the cluster_by config on an incremental dbt model does not trigger an ALTER TABLE statement on the table. As a result, the cluster keys of the Snowflake table are not updated.

Expected Behavior

Updating the cluster_by config should trigger an ALTER TABLE ... CLUSTER BY statement in Snowflake, updating the cluster keys.
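For example, after changing cluster_by in the reproduction steps below, the next dbt run would be expected to issue something like this (schema and table names are hypothetical):

alter table analytics.my_model cluster by (key1, key2, key3);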

Steps To Reproduce

  1. Create a new model:
{{
    config(
        enabled=true,
        materialized='incremental',
        cluster_by=['key1', 'key2']
    )
}}

SELECT 1 as key1
,      2 as key2
,      3 as key3
  2. Run dbt so that the table is created
  3. Verify in Snowflake that the cluster keys are key1 and key2
  4. Update the cluster_by value to ['key1', 'key2', 'key3']
  5. Run dbt again
  6. Verify that a second row of data has been added
  7. Notice in Snowflake that the cluster keys are still key1 and key2

(For a regular (non-incremental) table model this problem doesn't occur, because the entire table is recreated every time.)
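To inspect the cluster keys at the verification steps (3 and 7), you can run the following in Snowflake (schema name is hypothetical); the cluster_by column of the output shows the current clustering, e.g. LINEAR(key1, key2):

show tables like 'my_model' in schema analytics;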

Relevant log output

No response

Environment

- OS: Ubuntu 20.04.3 LTS
- Python: 3.8.10
- dbt-core: 1.3.1
- dbt-snowflake: 1.3.0

Which database adapter are you using with dbt?

snowflake

Additional Context

  • I'm willing to implement this change myself, but I first wanted to verify with the community whether this is indeed missing functionality, or whether I'm overlooking something.
  • I wasn't 100% sure whether this issue belongs in dbt-core or in dbt-snowflake. I chose dbt-core because the original issue 'Support cluster by on Snowflake' was also in dbt-core, as was the implementing PR.
@danielmast danielmast added bug Something isn't working triage labels Nov 30, 2022
@github-actions github-actions bot changed the title [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models Nov 30, 2022
@jtcohen6
Contributor

jtcohen6 commented Dec 1, 2022

@danielmast Thanks for opening! I am going to move this one over to dbt-snowflake, since I think that would be the right spot to implement the change. First, though, we should talk about whether this is really a bug.

Currently, the alter table statement for clustering happens within the create_table_as macro. It always runs during the "table" materialization. In the "incremental" materialization, it happens in two cases:

  • when creating the model (table) for the very first time
  • when --full-refresh is enabled

{%- if cluster_by_string is not none -%}
select * from (
    {{ compiled_code }}
) order by ({{ cluster_by_string }})
{%- else -%}
    {{ compiled_code }}
{%- endif %}

{% if cluster_by_string is not none and not temporary -%}
alter table {{ relation }} cluster by ({{ cluster_by_string }});
{%- endif -%}
{% if enable_automatic_clustering and cluster_by_string is not none and not temporary -%}
alter table {{ relation }} resume recluster;
{%- endif -%}

Clustering a Snowflake table is more than just a metadata update; it requires actually changing how the data is stored on disk. That's why we include the order by clause when creating the table for the first time. Otherwise, the alter table ... cluster by / resume recluster would be forced to shuffle and redistribute all the data right after having just written it.

My understanding is that adding a new clustering key during an incremental run would have the implicit effect of reshuffling all the preexisting data in the table, with significant costs in compute and time. That's a change better suited to an explicit --full-refresh.

I do see how it's confusing that the key is implicitly not added during the incremental run, though. We could look to detect the table's current clustering, and raise a warning if it differs from the one set in your config.
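A rough sketch of what such a check could look like, as a hypothetical macro (this is not existing dbt functionality; it assumes the cluster_by column returned by Snowflake's show tables, formatted like LINEAR(key1, key2)):

{% macro warn_on_cluster_by_change(relation, cluster_by) %}
    {% if execute %}
        {# Ask Snowflake for the table's current clustering key #}
        {% set result = run_query("show tables like '" ~ relation.identifier ~ "' in schema " ~ relation.schema) %}
        {% set current = result.columns['cluster_by'].values()[0] if result.rows else none %}
        {# Naive string comparison; a real implementation would normalize case/format #}
        {% set configured = 'LINEAR(' ~ cluster_by | join(', ') ~ ')' %}
        {% if current and current != configured %}
            {% do exceptions.warn("cluster_by config " ~ configured ~ " differs from the table's current clustering " ~ current ~ ". Run with --full-refresh to apply it.") %}
        {% endif %}
    {% endif %}
{% endmacro %}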

What do you think?

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Dec 1, 2022
@github-actions github-actions bot changed the title [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models [CT-1576] [CT-1572] [Bug] Snowflake cluster_by config does not update Snowflake table for incremental models Dec 1, 2022
@danielmast
Author

@jtcohen6 Thank you for your reply. I fully agree with your reasoning: reclustering is quite intensive, and users should in some way be made aware of that.

I think the most intuitive strategy would be: when someone changes the clustering config on an "incremental" materialization, raise a warning but also perform the change.

Warning: Changing the cluster_by config causes the table's data to be reshuffled. This can lead to significant compute costs.

(I'm not a native English speaker, so this text can probably be much better.)

I think it's most intuitive that a change to the cluster_by config actually performs the change. It can be expected of someone using Snowflake's clustering functionality that they are aware of the impact. Believing that the clustering change has been applied when it hasn't may have worse outcomes, or frustrate users.

I'm reading in the docs that:

If an incremental model is configured to use table clustering, then dbt will also order the staged dataset before merging it into the destination table.

I assume that the staged dataset will be ordered by the newly configured cluster keys, so that order won't match the actual clustering config of the table. All the more reason to perform the change, imo.
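To illustrate with a hypothetical sketch of the statements dbt would emit on that incremental run (simplified; with no unique_key the merge reduces to an insert):

create or replace temporary table analytics.my_model__dbt_tmp as (
    select * from (
        select 1 as key1, 2 as key2, 3 as key3
    ) order by (key1, key2, key3)  -- staged data ordered by the NEW cluster_by
);

insert into analytics.my_model (key1, key2, key3)
select key1, key2, key3 from analytics.my_model__dbt_tmp;
-- no "alter table ... cluster by" follows, so the table keeps LINEAR(key1, key2)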

What do you think of this proposal?

@danielmast
Author

I realize that this change should affect clustering behavior not only in Snowflake but also in BigQuery, as both use CLUSTER BY functionality. Do you agree?

@jtcohen6 jtcohen6 self-assigned this Jan 12, 2023
@Fleid
Contributor

Fleid commented Feb 11, 2023

@jtcohen6 I'm going to steal this one from you.

@danielmast I need to do some thinking here.

There is a larger topic emerging here, tied to materialized views (dynamic tables for Snowflake) and on_schema_change.
We may need additional settings for a user to let dbt know what to do when an operation requires full reprocessing of the model instead of the expected incremental update: should dbt force it, skip it, or fail?

I will get back to you next week.

@Fleid Fleid added enhancement New feature or request Refinement and removed triage bug Something isn't working labels Feb 11, 2023
@Fleid
Contributor

Fleid commented Feb 11, 2023

Linking to dbt-labs/dbt-core#6911

@jtcohen6 jtcohen6 removed their assignment Feb 11, 2023
@danielmast
Author

Thank you @Fleid. Since this issue will be part of a larger topic, I'll await that discussion for now. Please let me know if I can help with the implementation or anything else.

@Fleid Fleid self-assigned this Mar 9, 2023
@Fleid
Contributor

Fleid commented Mar 9, 2023

For MVs specifically, I just added a new parameter to the design called on_definition_change. It's the same principle as on_schema_change except it applies at the definition level - so operations that could trigger a full recompute are covered:

On dbt run:

  • If the definition is different from the current model, depending on on_definition_change:
    • ignore: Do nothing
    • fail: Fail the run
    • alter_available: Use an ALTER statement to try to apply the change; some unsupported changes may be ignored
    • create_replace: Use a CREATE (optionally CREATE OR REPLACE) statement to re-create the table
  • Refresh when necessary
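As a hypothetical example of how this could look in a model config (the parameter is a design proposal, not an implemented option):

{{
    config(
        materialized='materialized_view',
        on_definition_change='alter_available'
    )
}}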

Not sure how this would interact with on_schema_change.

But more importantly, I still have a big question mark about the feasibility of certain options.

Let's say we add something similar to incremental models and we want to cover cluster_by keys via alter_available. That means that at run time, dbt needs to ask Snowflake which cluster keys are currently set on the table. It looks like that's possible, but I'm not 100% positive about it.
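For reference, Snowflake does expose the current clustering key via the information schema, e.g. (database, schema, and table names are placeholders):

select clustering_key
from my_database.information_schema.tables
where table_schema = 'ANALYTICS'
  and table_name = 'MY_MODEL';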

Curious about your thoughts on all that.

@github-actions
Contributor

github-actions bot commented Sep 6, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 6, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 13, 2023