
replace partitionOverwriteMode inside merge strategy #117

Merged

Conversation

charlottevdscheun
Contributor

Problem:
Using incremental materialization with an insert_overwrite strategy without a unique_key doesn't work.

{{
  config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='insert_overwrite'
  )
}}

Error message
Table charlotte.test does not support dynamic overwrite in batch mode.

Fix:
Relocate the setting "spark.sql.sources.partitionOverwriteMode = DYNAMIC" inside the merge strategy, because it is only necessary when you give a unique_key; for insert_overwrite it is not a must, but for merge it is.
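
A rough sketch of the idea (not the literal incremental.sql source; the strategy variable and the run_query call here are only illustrative):

{# sketch: only set dynamic partition overwrite for the strategy that needs it #}
{% if strategy == 'merge' %}
  {% do run_query('set spark.sql.sources.partitionOverwriteMode = DYNAMIC') %}
{% endif %}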

@jtcohen6
Contributor

Hey @charlottevdscheun, long time! :)

I think the issue here is that your model does not have a partition_by config, rather than lacking a unique_key. The insert_overwrite strategy assumes that you have a partitioned table, that you want to fully replace a handful of partitions, and that Spark will dynamically determine the partitions for replacement based on the values returned by your incremental model SQL. E.g. you have a table partitioned by a date column, you use an is_incremental() filter to limit model results to the last three days, and fully replace the last three days of data.
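
For example, an insert_overwrite model along these lines (the source and column names are made up for illustration):

{{
  config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='insert_overwrite',
    partition_by=['date_day']
  )
}}

select * from {{ source('events', 'raw_events') }}
{% if is_incremental() %}
where date_day >= date_add(current_date(), -3)
{% endif %}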

To my knowledge, running insert overwrite on a non-partitioned table will atomically replace all the contents of the table (docs). That isn't really what we're after with incremental models. Is that what you need for your use case?

@charlottevdscheun
Contributor Author

Hey @jtcohen6, yeah I thought: let's follow in Fokko's footsteps and contribute too ;)

I'll explain my use case: we have a problem in our project where, when our pipeline runs, it drops and recreates the end table, leaving the table unavailable to our model for 15 minutes. That's why we started using the Delta file format so that we could overwrite the table and it would still be available to our model.

I saw that that functionality was already available in the incremental materialization, but for it to work without a partition or unique key it must not set spark.sql.sources.partitionOverwriteMode = DYNAMIC. I completely forgot about the partition columns, but I changed the code so that overwrite mode only gets turned on when there is a unique_key or a partition column, as sketched below.
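
Roughly, the changed guard looks like this (a sketch of the condition only, not the literal diff; unique_key and partition_by are assumed to be the resolved config values):

{# sketch: enable dynamic partition overwrite only when a unique_key or partition column is configured #}
{% if unique_key is not none or partition_by is not none %}
  {% do run_query('set spark.sql.sources.partitionOverwriteMode = DYNAMIC') %}
{% endif %}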

I think this overwrite strategy could maybe become a strategy function outside the incremental materialization, what do you think?

@Fokko
Contributor

Fokko left a comment

Thanks @charlottevdscheun for fixing this!

Review comment on dbt/include/spark/macros/materializations/incremental.sql (resolved):
Fokko's suggestion to remove unique key from the if statement

Co-authored-by: Fokko Driesprong <fokko@driesprong.frl>
@Fokko
Contributor

Fokko left a comment

shipit

@jtcohen6
Contributor

jtcohen6 commented Nov 2, 2020

We have a problem in our project where, when our pipeline runs, it drops and recreates the end table, leaving the table unavailable to our model for 15 minutes. That's why we started using the Delta file format so that we could overwrite the table and it would still be available to our model.

Ok, this makes sense! I agree with keeping this approach as part of the incremental materialization, because it still has a crucial limitation of incremental builds: if the column names or data types change, or if there are more or fewer columns, the insert overwrite won't work.

I think we'll want to explain this as: "If a partition_by config is not specified, dbt will overwrite the entire table as an atomic operation, replacing it with new data of the same schema. This is analogous to truncate + insert."
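
In Spark SQL terms, the non-partitioned case boils down to something like this (the table and relation names are only illustrative):

-- full-table overwrite: atomically replaces all existing rows with the new results
insert overwrite table analytics.my_model
select * from my_model_new_rows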

@charlottevdscheun Could I trouble you to add that piece to the "Incremental Models" section of the README as part of this PR? I can take care of updating the primary docs before the next dbt-spark release.

@charlottevdscheun
Contributor Author

@jtcohen6 No trouble at all! Thank you for looking at my first PR ;)
