
support schema evolution for delta lake #162

Closed
laiyuanliu opened this issue Apr 29, 2021 · 7 comments
Labels
enhancement New feature or request

Comments


laiyuanliu commented Apr 29, 2021

Describe the feature

We are using dbt + Spark on Delta for incremental loads. Since we ingest data from various sources, one of the key requirements is support for schema evolution. Delta Lake does support it with the MERGE command, as documented here.
Can dbt support this?
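
For context, a typical dbt incremental model on Delta looks roughly like this (model, source, and column names are hypothetical; `new_source_column` stands for a column that recently appeared upstream):

```sql
-- models/events_incremental.sql (illustrative sketch)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='delta',
    unique_key='event_id'
) }}

select
    event_id,
    event_ts,
    payload,
    -- a newly arrived source column; today the merge silently ignores it
    new_source_column
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

Without schema evolution, the new column never reaches the target table, and the only workaround is a full refresh.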

Describe alternatives you've considered

The current alternative is a full refresh, which has two issues:

  1. It is very time-consuming.
  2. As part of ingesting the data we keep history, and some of our sources don't maintain history themselves; a full refresh would lose these historical records.

Who will this benefit?

I saw another issue, #124, that was submitted for a similar case, but it was closed for some reason. Supporting schema evolution with Delta would be extremely helpful for anyone using the Delta incremental strategy.

@laiyuanliu laiyuanliu added enhancement New feature or request triage labels Apr 29, 2021

gumartinm commented Apr 29, 2021

I think it was implemented by means of CREATE OR REPLACE TABLE.
See this PR: #125

@laiyuanliu (Author)

I'm using the latest dbt-spark 0.19.1 with the merge incremental strategy, and I still have the same issue reported in #125: when the select statement returns more columns, nothing is changed on the target table, yet the run reports success.

Fokko (Contributor) commented Apr 30, 2021

Hi @laiyuanliu

Thanks for opening this ticket. It depends on what your aim is. The MERGE INTO statement is currently supported. Please check from 17 minutes onwards: https://www.youtube.com/watch?v=zoHoIGE6tPc&t=527s

Schema evolution for full refreshes (i.e. non-incremental mode) is implemented in #125 in an atomic way: we don't first drop the table and then recreate it, because the table would be unavailable while it was being recreated. However, when using MERGE INTO statements we're operating in incremental mode, and schema evolution isn't supported there yet. I see two options to fix this:

I would lean toward the second option: it is more predictable and fits nicely with the ELT way of thinking.

@laiyuanliu (Author)

Hi @Fokko, for the Delta format, can we just use the updateAll and insertAll actions supported by Databricks for schema evolution? It's documented here.

Fokko (Contributor) commented Apr 30, 2021

Thanks @laiyuanliu. I wasn't aware of that feature, thanks for pointing it out.

You should be able to test this using:

{{ config(
    pre_hook="SET spark.databricks.delta.schema.autoMerge.enabled=true"
) }}

Can you verify whether this works? We could also integrate this into the codebase easily by adding a block like:

{% call statement() %}
  set spark.databricks.delta.schema.autoMerge.enabled=true
{% endcall %}

in front of https://github.com/fishtown-analytics/dbt-spark/blob/6ad164b315748fef7c0ae0b87ff6b8292632f35e/dbt/include/spark/macros/materializations/incremental/incremental.sql#L34

@laiyuanliu (Author)

Setting spark.databricks.delta.schema.autoMerge.enabled=true in the pre-hook works like a charm.

With this setting, in incremental mode, new columns are automatically added, with a default value of NULL for records not modified by the merge.

More interestingly, if we remove some columns from our dbt code:

  1. For existing records, the merge only updates the columns included in the select statement and leaves the non-included columns' values as is.
  2. For new records, the non-included columns' values are set to NULL.

This is perfect for us. I'll go ahead and close the issue. Thanks all for the help!
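
The behavior described above matches Delta's `update set *` / `insert *` merge semantics. Roughly, the generated statement looks like this (table and column names are hypothetical):

```sql
-- With spark.databricks.delta.schema.autoMerge.enabled=true, a merge like
-- this adds a new source column to the target table's schema automatically.
merge into analytics.events as target
using events__dbt_tmp as source
on target.event_id = source.event_id
when matched then update set *    -- updates only the columns present in source
when not matched then insert *    -- columns absent from source become NULL
```

Without the autoMerge setting, columns present in the source but missing from the target are silently dropped by the merge, which is the behavior reported earlier in this thread.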

@cvsekhar

I have added the pre_hook, but in incremental mode new columns are still not being added.
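
For anyone hitting this, one thing to double-check is that the hook is attached to the incremental model itself, so the SET runs in the same Spark session as the merge. A complete config might look like this (the settings shown are illustrative, not a confirmed fix):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='delta',
    unique_key='id',
    pre_hook="SET spark.databricks.delta.schema.autoMerge.enabled = true"
) }}
```

If the SET statement is issued in a different session (for example, in a separate run-operation), the merge will not pick it up.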
