
Support delta lake format #30

Closed

tongqqiu opened this issue Sep 20, 2019 · 4 comments

@tongqqiu
The Delta format supports normal merging:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html

I wish we could support something like this:

{{ config(
    materialized='incremental',
    file_format='delta'
) }}
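
For context, the MERGE statement Delta supports (per the doc linked above) looks roughly like this; the table and column names here are made up for illustration:

-- Hypothetical tables: upsert rows from analytics.updates into analytics.target
MERGE INTO analytics.target AS t
USING analytics.updates AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET *   -- overwrite matching rows with the incoming values
WHEN NOT MATCHED THEN
  INSERT *       -- append rows that have no match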
@drewbanin (Contributor)

Thanks for the suggestion @tongqqiu! I love the idea of being able to use Delta-specific DML in an execution environment like Databricks.

dbt has the ability to define incremental strategies that determine how incremental models should be built. I imagine the default could be insert_overwrite, but users could configure their models to use merge instead (a config sketch follows below). I like that this would support both vanilla Spark and Databricks runtimes.

So, the work to do here is really just adding the merge logic to the incremental flow. That should look something like: https://github.com/fishtown-analytics/dbt/blob/dev/louisa-may-alcott/core/dbt/include/global_project/macros/materializations/common/merge.sql#L12-L35

Is this something you're interested in contributing? We're super happy to help out if so!
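
To make that concrete, here is a rough sketch of what opting into a merge strategy might look like in a model; the incremental_strategy and unique_key config names are assumptions for illustration, not a settled interface:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',  -- hypothetical: choose merge over the insert_overwrite default
    file_format='delta',
    unique_key='id'                -- hypothetical: the join key for the MERGE ... ON clause
) }}

select * from {{ source('raw', 'events') }}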

@tongqqiu (Author)

@drewbanin When a model is a "table", the current behavior is to drop and recreate the table. Since Spark doesn't support transactions, it isn't good to drop the table first. An alternative is to use an "INSERT OVERWRITE" statement: https://docs.databricks.com/spark/latest/spark-sql/language-manual/insert.html. It is similar to what you did for the incremental type, just without needing partitions. It keeps the table live, and the Delta format ensures ACID at the single-table level as well. Any suggestions on how to make that change? BTW, setting the file format to delta works as well as the default parquet.
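
For illustration, the non-partitioned overwrite being described would look something like this in Spark SQL (table names are hypothetical):

-- Atomically replace the table's contents without dropping it first;
-- on a Delta table this is a single ACID transaction.
INSERT OVERWRITE TABLE analytics.my_model
SELECT * FROM analytics.my_model__tmp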

@jtcohen6 (Contributor) commented Apr 8, 2020

Hey @tongqqiu, to follow up on this issue:

As far as the table materialization goes:

  • I hear your point about wanting to use insert overwrite instead of drop + create for atomic table replacement. We discussed this a bit more here.
  • The main issue with using insert_overwrite in the general case is that it cannot handle changes to column names or data types. One of the core propositions of the table materialization is that it fully wipes the slate and creates the model from scratch, regardless of what the preexisting version looked like.
  • I think the atomic replacement you're suggesting is possible with the dbt-spark plugin today: if you know your model will not undergo any structural change, you can materialize the model as incremental, pick an arbitrary column to partition by, and re-select all the data in every run (see the first sketch after this list).
  • In the long run, I believe our best answer is to use create or replace table, which we understand to be coming in Spark 3.0 (see the second sketch after this list).
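
As a sketch of the workaround in the third bullet, under the assumption that an insert-overwrite flow with a constant partition column rewrites the whole table on every run (model and column names are hypothetical):

{{ config(
    materialized='incremental',
    file_format='delta',
    partition_by=['all_rows']   -- hypothetical constant partition column
) }}

select
    *,
    1 as all_rows   -- every row lands in the same partition,
                    -- so each run overwrites the full table
from {{ source('raw', 'events') }}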
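
And a sketch of the long-run answer from the last bullet, using the CREATE OR REPLACE TABLE syntax expected in Spark 3.0 (table names hypothetical):

-- Atomic, transactional table replacement on a Delta table
CREATE OR REPLACE TABLE analytics.my_model
USING delta
AS SELECT * FROM analytics.stg_events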

@tongqqiu (Author) commented Apr 8, 2020

@jtcohen6 Sounds all good to me.

tongqqiu closed this as completed Jun 5, 2020