
support schema evolution for delta lake #162

Closed
laiyuanliu opened this issue Apr 29, 2021 · 7 comments
Labels
enhancement New feature or request

Comments


laiyuanliu commented Apr 29, 2021

Describe the feature

We are using dbt + Spark on Delta for incremental loads. Since we ingest data from various sources, one of the key requirements is support for schema evolution. Delta Lake does support it with the MERGE command, as documented here.
Can dbt support this?
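
For context, a typical dbt incremental model on Delta looks roughly like this (model, source, and column names are hypothetical; `new_source_column` stands for a column that recently appeared upstream):

```sql
-- models/events_incremental.sql (illustrative sketch)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='delta',
    unique_key='event_id'
) }}

select
    event_id,
    event_ts,
    payload,
    -- a newly arrived source column; today the merge silently ignores it
    new_source_column
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

Without schema evolution, the new column never reaches the target table, and the only workaround is a full refresh.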

Describe alternatives you've considered

The current alternative is a full refresh, which has two issues:

  1. It is very time-consuming.
  2. As part of ingesting the data we keep history, and some of our sources don't maintain history themselves; a full refresh would lose these historical records.

Who will this benefit?

I saw another issue, #124, that was submitted for a similar case, but it was closed for some reason. Supporting schema evolution with Delta would be extremely helpful for anyone using the Delta incremental strategy.

@laiyuanliu laiyuanliu added enhancement New feature or request triage labels Apr 29, 2021

gumartinm commented Apr 29, 2021

I think it was implemented by means of CREATE OR REPLACE TABLE.
See this PR: #125

@laiyuanliu (Author)

I'm using the latest dbt-spark 0.19.1 with the merge incremental strategy, and I still have the same issue reported in #125: when the select statement returns more columns, nothing is changed on the target table, yet the run reports success.

Fokko (Contributor) commented Apr 30, 2021

Hi @laiyuanliu

Thanks for opening this ticket. It depends on what your aim is. The MERGE INTO statement is currently supported. Please check from 17 minutes onwards: https://www.youtube.com/watch?v=zoHoIGE6tPc&t=527s

Schema evolution for full refreshes (i.e. non-incremental mode) is implemented in #125 in an atomic way: we don't first drop the table and then recreate it, because the table would be unavailable while it was being recreated. However, when using MERGE INTO statements we're operating in incremental mode, and schema evolution isn't supported there yet. I see two options to fix this:

I would lean toward the second option: it is more predictable and fits nicely with the ELT way of thinking.

@laiyuanliu (Author)

Hi @Fokko, for the Delta format, can we just use the updateAll and insertAll actions supported by Databricks for schema evolution? It's documented here.

Fokko (Contributor) commented Apr 30, 2021

Thanks @laiyuanliu. I wasn't aware of that feature, thanks for pointing it out.

You should be able to test this using:

{{ config(
    pre_hook="SET spark.databricks.delta.schema.autoMerge.enabled=true"
) }}

Can you verify whether this works? We could also integrate this into the codebase easily by adding a block like:

{% call statement() %}
  set spark.databricks.delta.schema.autoMerge.enabled=true
{% endcall %}

in front of https://github.com/fishtown-analytics/dbt-spark/blob/6ad164b315748fef7c0ae0b87ff6b8292632f35e/dbt/include/spark/macros/materializations/incremental/incremental.sql#L34

@laiyuanliu (Author)

Setting spark.databricks.delta.schema.autoMerge.enabled=true in the pre-hook works like a charm.

With this setting, in incremental mode, new columns are automatically added, with a default value of NULL for records not modified by the merge.

More interestingly, if we remove some columns from our dbt code:

  1. For existing records, the merge only updates the columns included in the select statement and leaves the non-included columns' values as is.
  2. For new records, the non-included columns' values are set to NULL.

This is perfect for us. I'll go ahead and close the issue. Thanks all for the help!
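
The behavior described above matches Delta's `update set *` / `insert *` merge semantics. Roughly, the generated statement looks like this (table and column names are hypothetical):

```sql
-- With spark.databricks.delta.schema.autoMerge.enabled=true, a merge like
-- this adds a new source column to the target table's schema automatically.
merge into analytics.events as target
using events__dbt_tmp as source
on target.event_id = source.event_id
when matched then update set *    -- updates only the columns present in source
when not matched then insert *    -- columns absent from source become NULL
```

Without the autoMerge setting, columns present in the source but missing from the target are silently dropped by the merge, which is the behavior reported earlier in this thread.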

@cvsekhar

I have added the pre_hook, but in incremental mode new columns are still not being added.
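
For anyone hitting this, one thing to double-check is that the hook is attached to the incremental model itself, so the SET runs in the same Spark session as the merge. A complete config might look like this (the settings shown are illustrative, not a confirmed fix):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='delta',
    unique_key='id',
    pre_hook="SET spark.databricks.delta.schema.autoMerge.enabled = true"
) }}
```

If the SET statement is issued in a different session (for example, in a separate run-operation), the merge will not pick it up.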
