
Support delta lake format #30

Closed

tongqqiu opened this issue Sep 20, 2019 · 4 comments

@tongqqiu
The Delta format supports normal merging:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html

I wish we could support something like this:

{{ config(
    materialized='incremental',
    file_format='delta'
) }}
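
For context, the MERGE statement Delta supports (per the doc linked above) looks roughly like this; the table and column names here are made up for illustration:

-- Hypothetical tables: upsert rows from analytics.updates into analytics.target
MERGE INTO analytics.target AS t
USING analytics.updates AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET *   -- overwrite matching rows with the incoming values
WHEN NOT MATCHED THEN
  INSERT *       -- append rows that have no match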
@drewbanin (Contributor)

Thanks for the suggestion @tongqqiu! I love the idea of being able to use Delta-specific DML in an execution environment like Databricks.

dbt has the ability to define incremental strategies that determine how incremental models should be built. I imagine the default could be insert_overwrite, but users could configure their models to use merge instead (a config sketch follows below). I like that this would support both vanilla Spark and Databricks runtimes.

So, the work to do here is really just adding the merge logic to the incremental flow. That should look something like: https://github.com/fishtown-analytics/dbt/blob/dev/louisa-may-alcott/core/dbt/include/global_project/macros/materializations/common/merge.sql#L12-L35

Is this something you're interested in contributing? We're super happy to help out if so!
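
To make that concrete, here is a rough sketch of what opting into a merge strategy might look like in a model; the incremental_strategy and unique_key config names are assumptions for illustration, not a settled interface:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',  -- hypothetical: choose merge over the insert_overwrite default
    file_format='delta',
    unique_key='id'                -- hypothetical: the join key for the MERGE ... ON clause
) }}

select * from {{ source('raw', 'events') }}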

@tongqqiu (Author)

@drewbanin When a model is a "table", the current behavior is to drop and recreate the table. Since Spark doesn't support transactions, it isn't good to drop the table first. An alternative is to use an "INSERT OVERWRITE" statement: https://docs.databricks.com/spark/latest/spark-sql/language-manual/insert.html. It is similar to what you did for the incremental type, just without needing partitions. It keeps the table live, and the Delta format ensures ACID at the single-table level as well. Any suggestions on how to make that change? BTW, setting the file format to delta works as well as the default parquet.
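
For illustration, the non-partitioned overwrite being described would look something like this in Spark SQL (table names are hypothetical):

-- Atomically replace the table's contents without dropping it first;
-- on a Delta table this is a single ACID transaction.
INSERT OVERWRITE TABLE analytics.my_model
SELECT * FROM analytics.my_model__tmp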

@jtcohen6 (Contributor) commented Apr 8, 2020

Hey @tongqqiu, to follow up on this issue:

As far as the table materialization goes:

  • I hear your point about wanting to use insert overwrite instead of drop + create for atomic table replacement. We discussed this a bit more here.
  • The main issue with using insert_overwrite in the general case is that it cannot handle changes to column names or data types. One of the core propositions of the table materialization is that it fully wipes the slate and creates the model from scratch, regardless of what the preexisting version looked like.
  • I think the atomic replacement you're suggesting is possible with the dbt-spark plugin today: if you know your model will not undergo any structural change, you can materialize the model as incremental, pick an arbitrary column to partition by, and re-select all the data in every run (see the first sketch after this list).
  • In the long run, I believe our best answer is to use create or replace table, which we understand to be coming in Spark 3.0 (see the second sketch after this list).
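
As a sketch of the workaround in the third bullet, under the assumption that an insert-overwrite flow with a constant partition column rewrites the whole table on every run (model and column names are hypothetical):

{{ config(
    materialized='incremental',
    file_format='delta',
    partition_by=['all_rows']   -- hypothetical constant partition column
) }}

select
    *,
    1 as all_rows   -- every row lands in the same partition,
                    -- so each run overwrites the full table
from {{ source('raw', 'events') }}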
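
And a sketch of the long-run answer from the last bullet, using the CREATE OR REPLACE TABLE syntax expected in Spark 3.0 (table names hypothetical):

-- Atomic, transactional table replacement on a Delta table
CREATE OR REPLACE TABLE analytics.my_model
USING delta
AS SELECT * FROM analytics.stg_events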

@tongqqiu (Author) commented Apr 8, 2020

@jtcohen6 Sounds all good to me.

tongqqiu closed this as completed Jun 5, 2020