Incremental models do not fail when schema changes #226

Closed
jnatkins opened this issue Oct 8, 2021 · 4 comments · Fixed by #229

jnatkins commented Oct 8, 2021

Describe the bug

In other dbt adapters, when the schema changes for a source query of an incremental model, the model fails, requiring a --full-refresh (or alternatively, in 0.21, some handling with on_schema_change). However, in the Spark adapter, it appears that the incremental materialization does not check the schema, and so has no opportunity to handle a schema change.

Steps To Reproduce

As a relatively trivial example, I have a source for my incremental model:

{{ config(materialized = 'table') }}

{% for i in range(10) %}

  select {{ i+1 }} as id, to_date('{{ '2021-01-%02d' % (i+1) }}') as date_day {% if not loop.last %} union all {% endif %}

{% endfor %}

The actual incremental model looks like this:

{{ config(
    materialized = 'incremental',
    on_schema_change = 'fail')
}}

select *
from {{ ref('incremental_source') }}

{% if is_incremental() %}
  -- this filter will only be applied on an incremental run
  where date_day > (select max(date_day) from {{ this }})
{% endif %}

If I run this, using dbt run -m incremental_source+, the model is created as expected the first time. Now, let's make a slight modification to the incremental_source.sql file:

{{ config(materialized = 'table') }}

{% for i in range(10) %}

  select {{ i+1 }} as id, to_date('{{ '2021-01-%02d' % (i+1) }}') as date_day, 'foo' as new_col {% if not loop.last %} union all {% endif %}

{% endfor %}

All I've done is add a new_col column to each record. In other adapters, running this would cause the incremental model to fail. In dbt-spark, it succeeds and silently ignores the change. The result is that Spark behaves differently from other relational warehouses, and on_schema_change has no effect.

Expected behavior

Incremental materializations fail, by default, in the event of a source query schema change.


The output of dbt --version:
0.21.0

The operating system you're using:
dbt Cloud

The output of python --version:
N/A

jnatkins commented Oct 8, 2021

Worth noting that I discovered #198 after I submitted this, so I can see that on_schema_change isn't supported yet. Still, there may be a bug here in the inconsistent default behavior of incremental models when a schema change does occur.


jtcohen6 commented Oct 14, 2021

@jnatkins Thanks for opening. The main reason this functionality is missing in dbt-spark==0.21.0 is that we need to call a few more macros within the Spark-specific incremental materialization in order to turn on on_schema_change support.

As far as why this isn't in place yet, and the reason for my hesitation in #198: I think you're on the money. The default behavior is inconsistent between other databases and Spark/Databricks (specifically Delta). The reason is that merge on Delta handles schema changes better than most of the others.

On Delta, we can run:

    merge into <target>
      using <source>
      on <condition>
      when matched then update set *
      when not matched then insert *

Those * are powerful, and they handle cases where the column schemas differ between source and target. By default, Delta won't add/update columns to the target, but if schema evolution is enabled, it will.
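
For reference, a minimal sketch of how that schema evolution gets switched on at the session level (this is the spark.databricks.delta.schema.autoMerge.enabled setting that comes up later in this thread):

    -- sketch: enable Delta's automatic schema evolution for merge statements
    -- in the current Spark session
    SET spark.databricks.delta.schema.autoMerge.enabled = true;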

So, on most other databases (where an incremental insert errors out on its own when the schema changes), the real on_schema_change features are all around append_new_columns and sync_all_columns, and the fail option is more or less synonymous with ignore. I think the real benefit on Spark/Databricks will be explicitly failing on fail, a.k.a. schema enforcement.

Next steps

Here's where I'm arriving at, conceptually, for the different on_schema_change options:

  • ignore (default): dbt doesn't do anything; it's up to the database. Some data platforms (including Delta) offer built-in capabilities around schema evolution and enforcement, and those will take effect.
  • fail: dbt should explicitly enforce the schema and raise a compiler error.
  • append_new_columns: dbt should attempt to evolve the schema, but only in additive ways, without dropping any existing data.
  • sync_all_columns: dbt should attempt full schema evolution, including dropping data from no-longer-used columns. This may not be supported on all databases/platforms, e.g. Delta, which doesn't support alter table drop column (see "REPLACE COLUMNS unsupported?", delta-io/delta#702 (comment))

Based on the way we implemented on_schema_change in dbt-core, I don't think the code changes to make that happen will be too complex. I'll open a quick PR that sketches out some of the initial changes.
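
For illustration only, here is a rough sketch (not the actual PR) of the kind of wiring the Spark incremental materialization would need, assuming the dbt-core 0.21 helper macros incremental_validate_on_schema_change and process_schema_changes; the exact placement and variable names here are assumptions:

    {#- sketch only: validate the configured on_schema_change value, then let the
        dbt-core helper compare the temp relation against the existing target and
        fail / append / sync columns accordingly. Variable names are assumptions. -#}
    {%- set on_schema_change = incremental_validate_on_schema_change(config.get('on_schema_change'), default='ignore') -%}

    {%- if existing_relation is not none and not should_full_refresh() -%}
      {%- do process_schema_changes(on_schema_change, tmp_relation, existing_relation) -%}
    {%- endif -%}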

What do you think?

jnatkins commented

@jtcohen6 The mention of spark.databricks.delta.schema.autoMerge.enabled is definitely worth thinking through, but what I've seen on some other tickets is that that config is session-based, and adding it as a pre-hook for a model does not seem to actually persist the setting when the table is created. I'm not sure if there's a Spark-specific config option to turn this on for the connection that actually executes the DDL, but that would potentially be a solution there.
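
For context, the pre-hook pattern being discussed looks roughly like this (a sketch only; as noted, the session-scoped setting does not seem to reliably carry over to the statement that actually builds the table):

{{ config(
    materialized = 'incremental',
    pre_hook = "SET spark.databricks.delta.schema.autoMerge.enabled = true"
) }}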

You can see #162 and #217 for more context there. Anecdotally, I've heard the same complaint from some other users I've been working with. Is there a correct way to handle this?

I'm open to either option: implementing something that creates parity with the behavior on other warehouses (probably a good idea, since dbt provides a useful abstraction layer that avoids code changes when migrating platforms and helps with future-proofing), or solving it at the platform level via first-class support for autoMerge in dbt-spark.

jtcohen6 commented

> You can see #162 and #217 for more context there. Anecdotally, I've heard the same complaint from some other users I've been working with. Is there a correct way to handle this?

Yes, this is something we need to do a much better job documenting. It sounds like the pre-hook method doesn't work reliably, so there are two other approaches that should be more reliable:
