Incremental models do not fail when schema changes #226

Closed
jnatkins opened this issue Oct 8, 2021 · 4 comments · Fixed by #229

jnatkins commented Oct 8, 2021

Describe the bug

In other dbt adapters, when the schema changes for a source query of an incremental model, the model fails, requiring a --full-refresh (or alternatively, in 0.21, some handling with on_schema_change). However, in the Spark adapter, it appears that the incremental materialization does not check the schema, and so has no opportunity to handle a schema change.

Steps To Reproduce

As a relatively trivial example, I have a source for my incremental model:

{{ config(materialized = 'table') }}

{% for i in range(10) %}

  select {{ i+1 }} as id, to_date('{{ '2021-01-%02d' % (i+1) }}') as date_day {% if not loop.last %} union all {% endif %}

{% endfor %}

The actual incremental model looks like this:

{{ config(
    materialized = 'incremental',
    on_schema_change = 'fail')
}}

select *
from {{ ref('incremental_source') }}

{% if is_incremental() %}
  -- this filter will only be applied on an incremental run
  where date_day > (select max(date_day) from {{ this }})
{% endif %}

If I run this, using dbt run -m incremental_source+, the model is created as expected the first time. Now, let's make a slight modification to the incremental_source.sql file:

{{ config(materialized = 'table') }}

{% for i in range(10) %}

  select {{ i+1 }} as id, to_date('{{ '2021-01-%02d' % (i+1) }}') as date_day, 'foo' as new_col {% if not loop.last %} union all {% endif %}

{% endfor %}

All I've done is add a new_col column to each record. In other adapters, running this would cause the incremental model to fail. In dbt-spark, it succeeds and silently ignores the change. The result is that Spark behaves differently from other relational warehouses, and on_schema_change has no effect.

Expected behavior

Incremental materializations fail, by default, in the event of a source query schema change.


The output of dbt --version:
0.21.0

The operating system you're using:
dbt Cloud

The output of python --version:
N/A

jnatkins commented Oct 8, 2021

Worth noting that I discovered #198 after I submitted this, so I can see that on_schema_change isn't supported yet. Still, there may be a bug here in the inconsistent default behavior of incremental models when a schema change does occur.


jtcohen6 commented Oct 14, 2021

@jnatkins Thanks for opening. The main reason this functionality is missing in dbt-spark==0.21.0 is that we need to call a few more macros within the Spark-specific incremental materialization in order to turn on on_schema_change support.

As far as why this isn't in place yet, and the reason for my hesitation in #198: I think you're on the money. The default behavior is inconsistent between other databases and Spark/Databricks (specifically Delta). The reason is that merge on Delta handles schema changes better than most of the others.

On Delta, we can run:

    merge into <target>
      using <source>
      on <condition>
      when matched then update set *
      when not matched then insert *

Those * are powerful, and they handle cases where the column schemas differ between source and target. By default, Delta won't add/update columns to the target, but if schema evolution is enabled, it will.
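
For reference, a minimal sketch of how that schema evolution gets switched on at the session level (this is the spark.databricks.delta.schema.autoMerge.enabled setting that comes up later in this thread):

    -- sketch: enable Delta's automatic schema evolution for merge statements
    -- in the current Spark session
    SET spark.databricks.delta.schema.autoMerge.enabled = true;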

So, on most other databases (where an incremental insert errors out on its own when the schema changes), the real on_schema_change features are all around append_new_columns and sync_all_columns, and the fail option is more or less synonymous with ignore. I think the real benefit on Spark/Databricks will be explicitly failing on fail, a.k.a. schema enforcement.

Next steps

Here's where I'm arriving at, conceptually, for the different on_schema_change options:

  • ignore (default): dbt doesn't do anything; it's up to the database. Some data platforms (including Delta) offer built-in capabilities around schema evolution and enforcement, and those will take effect.
  • fail: dbt should explicitly enforce the schema and raise a compiler error.
  • append_new_columns: dbt should attempt to evolve the schema, but only in additive ways, without dropping any existing data.
  • sync_all_columns: dbt should attempt full schema evolution, including dropping data from no-longer-used columns. This may not be supported on all databases/platforms, e.g. Delta, which doesn't support alter table drop column (see "REPLACE COLUMNS unsupported?", delta-io/delta#702 (comment))

Based on the way we implemented on_schema_change in dbt-core, I don't think the code changes to make that happen will be too complex. I'll open a quick PR that sketches out some of the initial changes.
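
For illustration only, here is a rough sketch (not the actual PR) of the kind of wiring the Spark incremental materialization would need, assuming the dbt-core 0.21 helper macros incremental_validate_on_schema_change and process_schema_changes; the exact placement and variable names here are assumptions:

    {#- sketch only: validate the configured on_schema_change value, then let the
        dbt-core helper compare the temp relation against the existing target and
        fail / append / sync columns accordingly. Variable names are assumptions. -#}
    {%- set on_schema_change = incremental_validate_on_schema_change(config.get('on_schema_change'), default='ignore') -%}

    {%- if existing_relation is not none and not should_full_refresh() -%}
      {%- do process_schema_changes(on_schema_change, tmp_relation, existing_relation) -%}
    {%- endif -%}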

What do you think?

jnatkins commented

@jtcohen6 The mention of spark.databricks.delta.schema.autoMerge.enabled is definitely worth thinking through, but what I've seen on some other tickets is that that config is session-based, and adding it as a pre-hook for a model does not seem to actually persist the setting when the table is created. I'm not sure if there's a Spark-specific config option to turn this on for the connection that actually executes the DDL, but that would potentially be a solution there.
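
For context, the pre-hook pattern being discussed looks roughly like this (a sketch only; as noted, the session-scoped setting does not seem to reliably carry over to the statement that actually builds the table):

{{ config(
    materialized = 'incremental',
    pre_hook = "SET spark.databricks.delta.schema.autoMerge.enabled = true"
) }}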

You can see #162 and #217 for more context there. Anecdotally, I've heard the same complaint from some other users I've been working with. Is there a correct way to handle this?

I'm open to either option: implementing something that creates parity with the behavior on other warehouses (probably a good idea, since dbt provides a useful abstraction layer that avoids code changes when migrating platforms and helps with future-proofing), or solving it at the platform level via first-class support for autoMerge in dbt-spark.

jtcohen6 commented

> You can see #162 and #217 for more context there. Anecdotally, I've heard the same complaint from some other users I've been working with. Is there a correct way to handle this?

Yes, this is something we need to do a much better job documenting. It sounds like the pre-hook method doesn't work reliably, so there are two other approaches that should be more reliable:
