Current behavior
set statements:
set spark.sql.sources.partitionOverwriteMode = DYNAMIC
set spark.sql.hive.convertMetastoreParquet = false
incremental_strategy: merge requires file_format: delta and unique_key
incremental_strategy: insert_overwrite does not work with file_format: delta + partition_by, because Delta does not support dynamic partition overwrite ([Feature Request] support for dynamic partition overwrite delta-io/delta#348, Fixes #348 Support Dynamic Partition Overwrite delta-io/delta#371)
incremental_strategy: insert_overwrite without partition_by just atomically replaces the entire table. This was a possibility introduced by replace partitionOverwriteMode inside merge strategy #117.
However, atomic replacement with file_format: delta is now possible in the table materialization via create or replace table (Enable create or replace sql syntax #125).
It doesn't make conceptual sense for the incremental materialization to replace an entire table, if the --full-refresh flag is not being passed. If anything, an incremental model without partition_by or unique_key should instead be append only (insert into)—this is closer to how the materialization works on other databases.
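For illustration, a minimal Spark SQL sketch of the two statement shapes under discussion; the table and staging-view names (analytics.events, events__dbt_tmp) are hypothetical, not taken from the dbt-spark codebase:

    -- insert_overwrite with partition_by: dynamic partition replacement
    set spark.sql.sources.partitionOverwriteMode = DYNAMIC;
    insert overwrite table analytics.events
    select * from events__dbt_tmp;

    -- append only, as suggested for models without partition_by or unique_key
    insert into table analytics.events
    select * from events__dbt_tmp;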
Questions
Do we still need these set statements?
Yes, only for incremental_strategy: insert_overwrite + partition_by
I don't know if we need set spark.sql.hive.convertMetastoreParquet = false anymore / at all. It's not clear to me what this was doing.
Given that the endpoint does not support set statements (Databricks SQL Analytics endpoint does not support set statements #133), should we let this fail with an unhelpful error from Databricks, or raise our own compilation error?
Should incremental_strategy: insert_overwrite be supported at all for Delta tables?
If no: Should dbt raise a compilation error if incremental_strategy: insert_overwrite + file_format: delta? Or should we defer to Delta to raise an error (Table ... does not support dynamic overwrite in batch mode.;;), until such time as they add support for dynamic partition replacement?
Does incremental_strategy: insert_overwrite ever make sense without partition_by?
If yes: Should it be a full table replacement? Or should we create an "append-only" version that just runs insert (instead of insert overwrite) when no partitions are specified?
Should we add an "append-only" version of the merge strategy that runs if a unique_key is not specified, rather than raising a compilation error?
Default values: Across the board, dbt-spark has incremental_strategy: insert_overwrite + file_format: parquet as its defaults. Should we change those defaults to incremental_strategy: merge + file_format: delta if a user is connecting via target.method = 'odbc' (i.e. to Databricks)?
My current thinking
Yes, but we should move the set statement to only run if incremental_strategy: insert_overwrite + partition_by. We should raise a compilation error if incremental_strategy: insert_overwrite + target.endpoint.
No, we should raise a compilation error if incremental_strategy: insert_overwrite + file_format: delta.
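If we do raise our own error, here is a minimal sketch of what that guard could look like in the materialization's Jinja; the macro name dbt_spark_validate_strategy is hypothetical, while exceptions.raise_compiler_error is dbt's standard way to fail at compile time:

    {% macro dbt_spark_validate_strategy(strategy, file_format) %}
      {# hypothetical guard: fail fast on an unsupported combination #}
      {% if strategy == 'insert_overwrite' and file_format == 'delta' %}
        {% do exceptions.raise_compiler_error(
          "incremental_strategy: insert_overwrite is not supported with file_format: delta"
        ) %}
      {% endif %}
    {% endmacro %}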
Originally I thought the insert_overwrite strategy should be append-only (simple insert) if no partition_by is specified. On reflection: the insert_overwrite strategy should continue replacing the entire table, as this is the standard behavior for INSERT OVERWRITE on Spark. However, a new strategy, append, will perform append-only inserts, partitions or no partitions; and it will be the default.
Yes, the merge strategy should work without unique_key, and change its merge condition to on false (as it is here): merge into [model] using [temp view] on false when not matched then insert *.
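A concrete sketch of that append-only merge, with hypothetical table and staging-view names (analytics.events, events__dbt_tmp); the on false condition guarantees no rows ever match, so every source row is inserted:

    merge into analytics.events as target
    using events__dbt_tmp as source
    on false
    when not matched then insert *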
I initially thought that would make a lot of sense, provided we did a good job documenting it, but it isn't possible in a straightforward way today. Instead, we'll change the default strategy to append.
The default behavior of insert overwrite on Spark is to replace, not append. It's confusing to have a strategy called insert_overwrite that actually appends via insert into.
However, the default behavior of dbt incremental models (if no unique_key/partition_by specified) should be to append only, rather than replace/overwrite.
Therefore, I think we should create a new incremental strategy that only appends by running insert into. We could call it something like append, append_only, insert, or insert_into. This strategy should be the default: it works on all file formats, platforms, and connection methods, and it is consistent with standard dbt behavior. Users are then encouraged to switch to one of two main strategies (see the config sketch after this list):
file_format: parquet + incremental_strategy: insert_overwrite + partition_by. Will not be supported on Databricks SQL Endpoint because it requires set statements. If no partition_by is specified, replaces the table (i.e. still runs insert overwrite).
file_format: delta + incremental_strategy: merge + unique_key. Only supported on Databricks runtime. (Not yet working on Databricks SQL Endpoint because it requires create temp view, which is coming soon.) If no unique_key is specified, append only via merge (which should be functionally equivalent to the default append strategy).
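For concreteness, a sketch of model-level configs for the two strategies above, shown as two separate model files; the file names, partition column (date_day), and unique key (id) are hypothetical:

    -- models/events_by_day.sql: partition replacement on Parquet
    {{ config(
        materialized = 'incremental',
        file_format = 'parquet',
        incremental_strategy = 'insert_overwrite',
        partition_by = ['date_day']
    ) }}

    -- models/events.sql: merge on Delta
    {{ config(
        materialized = 'incremental',
        file_format = 'delta',
        incremental_strategy = 'merge',
        unique_key = 'id'
    ) }}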