Current behavior
set statements:
set spark.sql.sources.partitionOverwriteMode = DYNAMIC
set spark.sql.hive.convertMetastoreParquet = false
incremental_strategy: merge requires file_format: delta and unique_key
incremental_strategy: insert_overwrite does not work with file_format: delta + partition_by, because Delta does not support dynamic partition overwrite ([Feature Request] support for dynamic partition overwrite delta-io/delta#348, Fixes #348 Support Dynamic Partition Overwrite delta-io/delta#371)
incremental_strategy: insert_overwrite without partition_by just atomically replaces the entire table. This was a possibility introduced by replace partitionOverwriteMode inside merge strategy #117.
However, atomic replacement with file_format: delta is now possible in the table materialization via create or replace table (Enable create or replace sql syntax #125).
It doesn't make conceptual sense for the incremental materialization to replace an entire table, if the --full-refresh flag is not being passed. If anything, an incremental model without partition_by or unique_key should instead be append only (insert into)—this is closer to how the materialization works on other databases.
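For illustration, a minimal Spark SQL sketch of the two statement shapes under discussion; the table and staging-view names (analytics.events, events__dbt_tmp) are hypothetical, not taken from the dbt-spark codebase:

    -- insert_overwrite with partition_by: dynamic partition replacement
    set spark.sql.sources.partitionOverwriteMode = DYNAMIC;
    insert overwrite table analytics.events
    select * from events__dbt_tmp;

    -- append only, as suggested for models without partition_by or unique_key
    insert into table analytics.events
    select * from events__dbt_tmp;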
Questions
Do we still need these set statements?
Yes, only for incremental_strategy: insert_overwrite + partition_by
I don't know if we need set spark.sql.hive.convertMetastoreParquet = false anymore / at all. It's not clear to me what this was doing.
Given that the endpoint does not support set statements (Databricks SQL Analytics endpoint does not support set statements #133), should we let this fail with an unhelpful error from Databricks, or raise our own compilation error?
Should incremental_strategy: insert_overwrite be supported at all for Delta tables?
If no: Should dbt raise a compilation error if incremental_strategy: insert_overwrite + file_format: delta? Or should we defer to Delta to raise an error (Table ... does not support dynamic overwrite in batch mode.;;), until such time as they add support for dynamic partition replacement?
Does incremental_strategy: insert_overwrite ever make sense without partition_by?
If yes: Should it be a full table replacement? Or should we create an "append-only" version that just runs insert (instead of insert overwrite) when no partitions are specified?
Should we add an "append-only" version of the merge strategy that runs if a unique_key is not specified, rather than raising a compilation error?
Default values: Across the board, dbt-spark has incremental_strategy: insert_overwrite + file_format: parquet as its defaults. Should we change those defaults to incremental_strategy: merge + file_format: delta if a user is connecting via target.method = 'odbc' (i.e. to Databricks)?
My current thinking
Yes, but we should move the set statement to only run if incremental_strategy: insert_overwrite + partition_by. We should raise a compilation error if incremental_strategy: insert_overwrite + target.endpoint.
No, we should raise a compilation error if incremental_strategy: insert_overwrite + file_format: delta.
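If we do raise our own error, here is a minimal sketch of what that guard could look like in the materialization's Jinja; the macro name dbt_spark_validate_strategy is hypothetical, while exceptions.raise_compiler_error is dbt's standard way to fail at compile time:

    {% macro dbt_spark_validate_strategy(strategy, file_format) %}
      {# hypothetical guard: fail fast on an unsupported combination #}
      {% if strategy == 'insert_overwrite' and file_format == 'delta' %}
        {% do exceptions.raise_compiler_error(
          "incremental_strategy: insert_overwrite is not supported with file_format: delta"
        ) %}
      {% endif %}
    {% endmacro %}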
Originally I thought the insert_overwrite strategy should be append-only (simple insert) if no partition_by is specified. On reflection: the insert_overwrite strategy should continue replacing the entire table, as this is the standard behavior for INSERT OVERWRITE on Spark. However, a new strategy, append, will perform append-only inserts, partitions or no partitions; and it will be the default.
Yes, the merge strategy should work without unique_key, and change its merge condition to on false (as it is here): merge into [model] using [temp view] on false when not matched then insert *.
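A concrete sketch of that append-only merge, with hypothetical table and staging-view names (analytics.events, events__dbt_tmp); the on false condition guarantees no rows ever match, so every source row is inserted:

    merge into analytics.events as target
    using events__dbt_tmp as source
    on false
    when not matched then insert *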
I initially thought that would make a lot of sense, provided we did a good job documenting it, but it isn't possible in a straightforward way today. Instead, we'll change the default strategy to append.
The default behavior of insert overwrite on Spark is to replace, not append. It's confusing to have a strategy called insert_overwrite that actually appends via insert into.
However, the default behavior of dbt incremental models (if no unique_key/partition_by specified) should be to append only, rather than replace/overwrite.
Therefore, I think we should create a new incremental strategy that only appends by running insert into. We could call it something like append, append_only, insert, or insert_into. This strategy should be the default: it works on all file formats, platforms, and connection methods, and it is consistent with standard dbt behavior. Users are then encouraged to switch to one of two main strategies (see the config sketch after this list):
file_format: parquet + incremental_strategy: insert_overwrite + partition_by. Will not be supported on Databricks SQL Endpoint because it requires set statements. If no partition_by is specified, replaces the table (i.e. still runs insert overwrite).
file_format: delta + incremental_strategy: merge + unique_key. Only supported on Databricks runtime. (Not yet working on Databricks SQL Endpoint because it requires create temp view, which is coming soon.) If no unique_key is specified, append only via merge (which should be functionally equivalent to the default append strategy).
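For concreteness, a sketch of model-level configs for the two strategies above, shown as two separate model files; the file names, partition column (date_day), and unique key (id) are hypothetical:

    -- models/events_by_day.sql: partition replacement on Parquet
    {{ config(
        materialized = 'incremental',
        file_format = 'parquet',
        incremental_strategy = 'insert_overwrite',
        partition_by = ['date_day']
    ) }}

    -- models/events.sql: merge on Delta
    {{ config(
        materialized = 'incremental',
        file_format = 'delta',
        incremental_strategy = 'merge',
        unique_key = 'id'
    ) }}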