-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Snowflake clustering #1591
Implement Snowflake clustering #1591
Conversation
merge 0.12.latest to master
Merge 0.12.latest into master
… dev/wilt-chamberlain
test to see if changes are taken into account
Very much WIP. Just wanted to check this approach is in a good direction. I put it as part of the main materialisations, but also very much open to making explicit separate materialisations for those like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice PR @bastienboutonnet! Dropped a couple of comments in here - please let me know if you have any questions!
plugins/snowflake/dbt/include/snowflake/macros/materializations/incremental.sql
Outdated
Show resolved
Hide resolved
{% macro snowflake__alter_cluster(relation, sql) -%} | ||
{# TODO: Add support for more than one keys #} | ||
{%- set cluster_by_keys = config.get('cluster_by') -%} | ||
alter table {{relation}} cluster by ({{cluster_by_keys}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was news to me when I learned it: altering the table's clustering keys doesn't actually recluster the table. You'd either need to run alter table ... recluster
(which is now deprecated), or you'd need to enable automatic reclustering for these changes to take effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn that's awkward. Indeed I just checked the status of the test table I created and automatic_clustering
is off.
The only option is running a alter table foo.bar resume recluster
but that will only work if a cluster by key has been provided. So doing clustering solely via order by
at table creation will not engage automatic housekeeping and will become stale eventually. Again, I may have gotten this completely wrong.
This kind of leaves us with the following option:
- mimic clustering on tables that you create with order by.
- for incremental tables hope you're using a fairly monotonic cluster by key because most likely Snowflake won't ever to any housekeeping for you later on.
Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize that either! I thought automatic clustering needed to be enabled explicitly.
So, that's kind of good news for us - it means that as long as we specify the clustering keys for the table, Snowflake will automatically recluster it as needed. This should never happen for table
models since dbt doesn't run DML against those models.
For incremental
models, each subsequent invocation of dbt should trigger a recluster of the table, which sounds like the right behavior to me!
In my experience, automatic reclustering can be pretty expensive! It might be worth adding a config to suspend automatic reclustering as shown in the docs. I'm not exactly sure if there's a use case for setting clustering keys and also suspending automatic reclustering... I think this is an area where I'd need to learn a little bit more, or we'd need to pull in a real expert (like @jtcohen6) :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No no, you were right and maybe I expressed myself badly. Atomatic clustering does need to be called explicitly to have Snowflake monitor and perform housekeeping-like reclustering from what I understand and have seen in operation on our db.
I have no idea what happens, if you give clustering keys, but leave automatic reclustering off. Would it just trigger on DMLs? In that case, yeah we don't need to worry. We coulda alter the keys, and bank on the DMLs to trigger reclustering.
We have a bunch of tables on which we have automatic clustering on. Clustered by created_at
kind of timestamps, and in which we load incrementally. At first run automatic clustering was "expensive" as it was the first run, but on subsequent updates it got a lot cheaper since it only had to do incremental housekeeping. Or at least that's what we see. But my feeling is that if autoclustering is off, having a cluster by key isn't going to do anything which means we'd have to also make sure autoclustering is turned on in the tables that we want it on.
But yeah I have no idea if I'm right and I'd love to hear from someone who knows Snowflake in and out a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be honest re reading both the https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html and https://docs.snowflake.net/manuals/user-guide/tables-auto-reclustering.html confuses me. Somehow it's not clear whether you'll ever need automatic clustering but at the same time it seems strange that snowflake would keep things clustered without it turned on...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@drewbanin I just checked with my boss who has been using clustering a bit in Snowflake and his take on this is that, if you want to benefit from incremental re clustering, you need to specify the clustering keys and turn on reclustering on the table. The order by will take care of loading the data in the right order but if you do any further loading and you have clustering keys such as user_id
, created_at
incremental loads wouldn't take care of making the user_id
be correctly placed so your data would become progressively un clustered and you'd have to do a full-refresh with the order by to re arrange things in the right place.
I just made some changes to reflect this logic. Note that we only need to turn on clustering on incremental tables. For simple tables since they are always fully refreshed the order by is sufficient.
Let me know if you find out conflicting information but, to me, what I did seems to make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in weighing in! I don't think I'm going to say anything novel, but in the process of saying it, I lend my endorsement to this work, and significant appreciation to @bastienboutonnet.
I may be misremembering—because the last time I dived deeply into this was several months ago when Manual Reclustering was un-deprecated and Automatic Reclustering was still a preview feature—but I believe my memory echoes @drewbanin here:
it means that as long as we specify the clustering keys for the table, Snowflake will automatically recluster it as needed.
If Automatic Reclustering is enabled for a Snowflake account, as soon as a table is defined with cluster keys, Snowflake will start keeping tabs on it.
This should never happen for table models since dbt doesn't run DML against those models.
We found out that both DML (inserting/updating/deleting rows from an existing clustered table) and DDL (creating a new table with cluster keys) could trigger automatic clustering, if Snowflake identified that the newly created table was not optimally clustered, based on the keys defined for it post hoc
. This ended up being the primary expense of our experimentation :) and it's avoided, as this PR does, by creating the tables with an equivalent order by
. That's this line from the top of the docs:
Note that, after a clustered table is defined, reclustering does not necessarily start immediately. Snowflake only reclusters a clustered table if it will benefit from the operation.
The correctly ordered table has nothing to benefit from the operation; automatic clustering is still enabled, but finds itself with nothing to do.
For incremental models, each subsequent invocation of dbt should trigger a recluster of the table, which sounds like the right behavior to me!
Right on. As long as the original table is properly ordered, as are any subsequent temp tables for merging, the subsequent automatic clustering can be a fairly performant, fairly cheap interleaving.
So my only practical thought here is that, per my understanding of late 2018, the automatic_clustering
config and alter table {{relation}} resume recluster
DDL is not actually needed. When automatic clustering is turned on for the account, any table with cluster keys defined is presumed to have auto-clustering resumed until it is explicitly suspended.
That said, it's a nice idea for dbt to be verbose-and-beyond in its operations, especially ones this subtle, and so I've similarly included the post-hook in past implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your reply @jtcohen6 ❤. I think your comments make sense and indeed confirm most of the things we wanted to check. Regarding turning clustering on at the table level (after of course having it enabled for the account for it to work) we found that if we do not run it, the automatic_clustering
column for a table for which we have given clustering keys is marked as OFF
which suggests that automatic clustering may not be performed as you suggest. I'm curious how you had tested it previously.
As you say we don't risk a lot by being extra explicit and I actually like to be extra explicit but I'm curious to really make sure I understand before just hiding behind a "it won't hurt" logic. Always better not to implement stuff that's un-necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bastienboutonnet That could be! That morsel of knowledge was one I had picked up when this feature was in private release many months ago, and it struck me as odd at the time that automatic_clustering would be enabled on tables, by default, as soon as a cluster key was defined. So I'm not surprised and actually pleased to hear that the switch has flipped.
Snowflake's docs on the matter are still ambiguous IMO, so I think we err well on the side of being extra un-ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! then I guess it all makes sense. Yes the documentation around that feature is hella ambiguous. I should raise it as an issue to our technical rep or on the snowflake community as I think it probably leaves more than us confused.
I think you'll want to add dbt uses this list of configs to apply settings from the |
@@ -1,14 +1,33 @@ | |||
{% macro snowflake__create_table_as(temporary, relation, sql) -%} | |||
{%- set transient = config.get('transient', default=true) -%} | |||
{# TODO: Add support for more than one keys #} | |||
{%- set cluster_by_keys = config.get('cluster_by') -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like using a pattern like this which allows users to either specify:
- a string
- a list of strings
It's a tiny bit of extra work to marshal a string into a list that contains one string, but it's worth the UX benefit IMO!
{{ sql }} | ||
{%- endif %} | ||
); | ||
{% if cluster_by_keys is not none and is_incremental -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this check on is_incremental
here. Instead, I think it might make more sense to add a Snowflake-specific config called automatic_clustering
which controls whether or not dbt enables automatic clustering. In particular, I think this will be useful in conjunction with dev/prod environments. It's probably not super desirable to always enable automatic clustering in dev (it can be expensive!). I think it's less important to check that the model is incremental
, as automatic clustering simply won't run for non-incremental models (like tables
) because dbt never runs DML against them!
You buy this?
I'm not even sure that we need to add this check -- if the user specifies a cluster_by
for a non-incremental model (like a table
), then there shouldn't be any harm in turning on automatic clustering.
I am open to adding another config for Snowflake, like automatic_clustering
which controls specifically when the alter table {{relation}} resume recluster;
query is executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I buy it! It makes sense that if there is no DML then no automatic clustering should be triggered, but I'm not sure if it wouldn't actually try to cluster things when you turn it on, which could be costly but I'm not sure. We'd have to have a snowflake specialist confirm this.
So, yes, I think putting alter table {{relation}} resume recluster;
behind a config is a good safety nest and it gives transparency to the user about what is going on.
Curious, how would you implement an override of automatic clustering if you were running your model against a --target: dev
for example? You're hinting that that's configurable in the dbt_project.yml
? or would you put an override in your profile.yml
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(we discussed on Slack, but for posterity:)
Yeah - that’s a separate issue that would be really good to address soon. When dbt evaluates the models:
block in the dbt_project.yml
file, it doesn’t include the specified target
in the context. So, it would be great to do:
models:
automatic_clustering: "{{ target.name == 'prod' }}"
but unfortunately, target
is going to evaluate to none
in this context currently, so that won’t work at the moment. Unsure if we have an issue for this at the moment, but I can create one
@drewbanin I just implemented the changes you suggested. I think they make a lot of sense. Have a look and let me know whether it makes sense. I just realised my python formatter (black) caused a few stylistic diff. If you would rather not have these changes I can go back and undo it. I'm also gonna go ahead and turn the PR into a "ready for review" state as I think we're not going to debate much implementation logic so it should be a proper review. Also, the first time I updated my fork I brought master in. Let me know if you'd like me to cherry pick those commits or something like that |
…into feature/snowflake_cluster_by
{{ sql }} | ||
{%- endif %} | ||
); | ||
{% if cluster_by_keys is not none and is_incremental -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(we discussed on Slack, but for posterity:)
Yeah - that’s a separate issue that would be really good to address soon. When dbt evaluates the models:
block in the dbt_project.yml
file, it doesn’t include the specified target
in the context. So, it would be great to do:
models:
automatic_clustering: "{{ target.name == 'prod' }}"
but unfortunately, target
is going to evaluate to none
in this context currently, so that won’t work at the moment. Unsure if we have an issue for this at the moment, but I can create one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great @bastienboutonnet! Really nice work :)
One tiny change requested, but this should be good to merge after that!
@jtcohen6 any chance you'll be able to take a quick peek at this on Monday?
Awesome! Thanks for your help with that one too! (gosh where is the fidget spinner emoji when we need it) |
Looks like a couple of
|
@bastienboutonnet the tests are all passing here! I'm going to just do a couple more sanity checks on this over the weekend, but I think we should be a-ok to merge this early next week. If everything looks good, it'll ship for our imminent 0.14.1 release. Thanks also to @jtcohen6 for imparting his wisdom here! 🎉 ❄️ ✨ |
{%- set enable_automatic_clustering = config.get('automatic_clustering', default=false) -%} | ||
{%- if cluster_by_keys is not none and cluster_by_keys is string -%} | ||
{%- set cluster_by_keys = [cluster_by_keys] -%} | ||
{%- set cluster_by_string = cluster_by_keys|join(", ")-%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did one last scan here - I think this won't work properly if cluster_by_keys
is provided as a list. If that happens, then the code in this branch won't be executed. I think we should move this line outside of the if block here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohh dear... good catch! I can certainly fix that! BUT I just realised to cleaned my fork so I don't actually know how to can edit this on my side. I had to fork dbt again as we were experimenting and forgot this was still pending. Do you know what would be the best way to go? I could probably fork again, or do a new PR but then we'd loose the trace of our discussion (although we could refer to it even if the PR is closed). Let me know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh dear indeed! I've never seen this before!
There's a cool thing you can do on GitHub - if you add .patch
to the end of a PR url, you'll get a "patch" for the change. I think you should be able to make a clean branch off of dev/0.14.1
, then do something like:
curl https://patch-diff.githubusercontent.com/raw/fishtown-analytics/dbt/pull/1591.patch > 1591.patch
git apply 1591.patch
If git says that it can't apply the patch, you can try applying the patch from the commit 9e36ebd
.
Let me know how it goes!
Closing this one due to having had to create a new one after deleting my fork... New PR: #1689 |
Aim
Brings clustering to snowflake.
order by
via wrapping the SQL in aselect * from ( {{sql}} ) order by ( {{cluster_by_keys}}
alter table {{relation}} cluster by ({{cluster_by_keys}})
to leverage Snowflake's automatic clustering.This approach was discussed in #634