Incremental models should handle column additions/deletions #1132
Comments
Well, this will open a can of worms. Personally, I am against any implicit smartness from dbt in this area. I feel this will only lead to confusion about why data has unexpected values. Some other considerations:
dbt handling schema changes automatically is akin to how fivetran/stitchdata ingestors behave. The upside is they don't stop ingesting; the downside is you inevitably end up chasing down issues and fixing things retroactively. I think I'd prefer being forced to deal with the discrepancy when it happens (e.g. dbt should throw an error). My only issue with the current behavior is that there seemed to be some cases where it would fail silently and others where it would throw an error. Perhaps there's still value so long as you have to opt in to this behavior?
I think it's undesirable that there is currently no warning whatsoever; however, I tend to agree with @elexisvenator that implicit smartness may not be the way to go. A happy medium might be to have dbt warn you and provide migration statements that need to be run. Something along the lines of:
^ Even that doesn't feel great though; it's just an idea.
I'd definitely be in favor of having this as opt-in behavior. What @clrcrl wrote up sounds great when you're not opting in, but if dbt knows what needs to be done, and is capable of doing it, why not provide that option? The alternative is that everyone who wants this ends up building it themselves in whatever tool they're using to schedule dbt (we've actually already built something like this in an Airflow operator).
Oh so maybe like:
Or maybe add it as an option in the model configuration when the model is incremental? That way model owners could decide on an individual basis what the behavior should be when the schema changes: default to doing nothing, but give the option to add/remove columns automatically, and also the option to do a full refresh when doing so. Putting it in the model config would also avoid adding yet another command-line flag, which I'm always in favor of.
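A minimal sketch of what such a per-model opt-in could look like in a model file. The option name `on_schema_change` follows the naming that comes up later in this thread, and the model name, key, and value here are purely illustrative:

```sql
-- models/my_incremental_model.sql  (illustrative sketch; names and values are hypothetical)
{{
    config(
        materialized='incremental',
        unique_key='id',
        -- per-model opt-in: what to do when the schema changes, e.g. do nothing,
        -- add/remove columns automatically, or trigger a full refresh
        on_schema_change='ignore'
    )
}}

select * from {{ ref('stg_events') }}
```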
Proposal

Problem

Right now, we (Convoy) use Airflow workflows to run DBT jobs that transform data across several different Snowflake schemas. It's fairly hands-off, with little need for any users to manually run DBT commands. The current, painful exception to this is when there is a schema change (mostly columns being added/removed), and a manual DBT command is needed to kick off a full refresh. Since this is predominantly the only DBT command that our users run, it is often missed as schema changes are rolled out, causing workflow failures until the command is run and the data is populated. If either DBT or the Airflow workflow were able to handle the logic of the schema change and full refresh, we could have a more reliable workflow. It's also worth noting that right now we don't have a use case where we add columns and do not want to do a backfill of the data (i.e. we never want to add columns that have null historical data).

Solution

Based on some brief brainstorming conversations with drewbanin and andriank-convoy, I think there are two potential DBT code changes that can help solve our problem. Both would require the addition of a new model config that allows incremental materialization users to opt in to the new functionality:
Where valid

Option 1

Add support for

Option 1 has the benefit of allowing users to choose between doing a full refresh or only changing the schema going forward, and also allows DBT to be the one system for handling all schema changes. However, this option requires three sets of work: implementing the logic for column addition, implementing the logic for column deletion, and implementing the logic for kicking off the full data refresh from an incremental materialization. The schema changes would need to be implemented for all supported databases, which increases the implementation cost, and right now the full-refresh logic is only kicked off at the beginning of the DBT code, so some inversion of logic might(?) be necessary.

Option 2 (Recommended)

Add support for

Option 2 allows users to opt for DBT to fail fast instead of using DBT to perform the required schema change. The user is allowed to make the decision about what to do immediately, instead of waiting for later data failures. Users will be able to differentiate requests that are using an incompatible schema from requests that have other errors, such as invalid SQL syntax. This enables workflows to automatically kick off a full refresh when they encounter the schema-changing use case. One risk is that this option will need a unique error code to be allocated and a clear error message to prevent user confusion. This option also may require work across all supported databases, and would not support the case where users want to add a new column but not include historical data.

Next steps

To unblock our use case, I'm hoping to:
This approach is smaller in scope than the full issue, but I think it would also lay scaffolding for future values as discussed in Option 1. They could be added as new values for the

I'd appreciate any feedback or comments on these approaches and how they might fit into DBT development. Thanks!
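To make Option 1 concrete: the schema-change handling it describes amounts to issuing DDL against the existing incremental table before the usual incremental insert/merge. A rough sketch, with the table and column names purely hypothetical:

```sql
-- Hypothetical DDL dbt would run when the model now produces `new_col`
-- and no longer produces `old_col`:
alter table analytics.my_incremental_model add column new_col varchar;
alter table analytics.my_incremental_model drop column old_col;

-- New rows are then inserted/merged as usual; historical rows keep null in
-- new_col unless a full refresh is run to backfill them.
```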
Thanks for the really thoughtful writeup @markgolazeski! I hadn't considered

One thing to be aware of here is that full refreshes account for two types of changes:
While the comments above all provide examples of how to account for schema changes, I'd also want to consider what we should do (if anything) about logic changes.

Schema changes

I'd be interested in supporting the following
I think the

I do think there's merit to

One last thing to say here: column types sometimes change in reconcilable ways. Imagine that the first time you run an incremental model, the longest

Logic changes

Schema changes are easy to wrap your head around, but I think that if we revisit full refreshes, we'll also need to account for logic changes too. You can imagine a model with logic like:
which is later changed to:
While the schema isn't changing here, this logical SQL change will take effect for all new records processed by the incremental model, but historical data will be incorrect! A full refresh ensures that "old" and "new" records are both processed with the same logic.

Just in the interest of brainstorming, I've always imagined that we'd handle cases like this by MD5ing the model code and inserting it as a column into the incremental model. The incremental materialization could have some mechanism to compare the MD5 that was used to process a record with the MD5 of the model that's running. If these two MD5s don't match up, then we'll know that some records were processed with a previous version of the incremental logic. This obviously isn't perfect (adding a single space to the end of a model will look like a more fundamental logic update...) but I'm not sure of any alternate methods for solving this problem.

In practice, this would look something like:

Table:

| id | dbt_model_md5 |
| --- | --- |

We can run a query like:
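A minimal sketch of the kind of comparison query being described, assuming the incremental table is `analytics.my_model` and that dbt would template in the MD5 of the currently compiled model SQL (both are assumptions for illustration):

```sql
-- Count records that were processed by an earlier version of the model logic.
-- 'abc123' stands in for the MD5 of the currently running model's SQL.
select count(*) as records_from_old_logic
from analytics.my_model
where dbt_model_md5 != 'abc123'
```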
If the resulting count is > 0, then we know that some sort of model change has occurred since the last time dbt ran. So, some questions for you @markgolazeski (and anyone else following along):
Thanks for the great response @drewbanin! I agree with your thoughts around the
I agree that handling logic changes would be great, but I think that's a separate (and much more complicated) issue. I like the idea of adding an on_schema_change per-model config option, and implementing the first option other than ignore: fail & return a specific error code. It's pretty simple, solves our immediate problem, and I think it establishes a good pattern for handling things like this going forward (both for additional on_schema_change options, and potentially for on_logic_change options).
Got it, I agree that

In thinking through this a little further, I'm not sure that setting a custom exit code is going to work for us here. Imagine that we assign error code

I think dbt does too many things in an invocation for error codes to be a sensible way of communicating specific failures back to the caller. Instead, I think we might be better served by adding additional information to the
Yeah, I can totally see the limitations of using an exit code because of DBT's nature of running multiple transactions. I can support making this information part of the

My initial thought is to make it the

Do you consider the

I think adding a new field is probably the least intrusive, but I'm not sure if you have thoughts around what you'd want the future API to look like, and whether moving towards that is worth doing now.
Some input on what I think is doable (copied from my Slack input):

1) Removing an attribute.
2) Adding a new attribute.
3) Data type change.

Trying anything more complex than this (i.e. an SCD-1 or SCD-2 type comparison where you need to start comparing the values of all model attributes to prevent dups) will become inherently more complex to do... Further, IMO we should not focus on supporting logic changes here; if you do that and you want to backfill that logic, you should make the conscious decision of still doing a full refresh. In addition, if you want to change your logic but still be able to do a deterministic full refresh of an incremental model without changing past data, one should make business rules time-sensitive.
discarding the fact that I probably would be against this / solve this by introducing a new measure with a different name...
Any movement on this? This is the last thing we're missing explicit handling for in our CI/CD pipeline.
Hello, I think schema evolution has merit within some rules (in my mind this is only applicable to incremental models and snapshots):
Of course, all of these are optional and should not be the default behaviour; they should only apply if an enterprise "opts in". The merits of having this option are:
Hello, this issue has been open for 2 years now, so I suppose it is dead, right?
@Vojtech-Svoboda This issue is very much alive. It's just not something we can prioritize right now, since we're focused on the stability and consistency improvements that we need to make before we can release dbt v1.0.0. I'm hopeful that we can give materializations some love next year, not just to refactor some code (which has grown quite gnarly), but also to add long-lived-and-popular features such as this one. In the meantime, the detailed proposals are seriously helpful in grounding that future work.
Hi! I used it to write a materialization for persistent staging area tables. Provided that you are tracking changes with a hash key of the columns (in table physical order) - something like the macro at the end of this post - you don't have to recompute existing rows' hash keys under these conditions:
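As a rough illustration of the kind of column-hash macro being described (a minimal sketch, not the author's original macro; the cast to varchar and the `'|'` separator are assumptions):

```sql
{% macro hash_columns(column_names) %}
    {#- Build an MD5 over the given columns, in the order provided -#}
    md5(
        {%- for col in column_names %}
        coalesce(cast({{ col }} as varchar), '')
        {%- if not loop.last %} || '|' || {% endif %}
        {%- endfor %}
    )
{% endmacro %}
```

Used in a model as, for example, `{{ hash_columns(['customer_id', 'status', 'amount']) }}`, which compiles to a single `md5(...)` expression over those columns.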
Thanks to the amazing work of @matt-winkler over the past few months, this will be coming in dbt v0.21 :) Docs forthcoming. Check out #3387 for details. |
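As context for readers arriving later: the feature that shipped exposes an `on_schema_change` config on incremental models. Usage looks roughly like the sketch below; see the linked PR and the official docs for the authoritative list of values and their exact semantics.

```sql
{{
    config(
        materialized='incremental',
        unique_key='id',
        -- other values include 'ignore' (the default), 'fail', and 'sync_all_columns'
        on_schema_change='append_new_columns'
    )
}}

select * from {{ ref('stg_orders') }}
```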
Feature
Feature description
dbt should add or remove columns in the destination table for incremental models to make the target schema match the schema produced by the model SQL. dbt already has some existing logic that's responsible for "expanding" column types to make model SQL and the target table share a compatible schema. dbt should use a similar mechanism to add new columns and delete extraneous columns as required to make the target schema and the model SQL identical.
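A minimal sketch of how that comparison might be expressed with dbt's adapter interface, assuming the incremental materialization has both the existing target relation and a freshly built temp relation in hand (the macro name and structure are hypothetical, not dbt's actual implementation):

```sql
{% macro get_column_changes(target_relation, tmp_relation) %}
    {#- Column names currently on the target table vs. those the model SQL now produces -#}
    {%- set target_cols = adapter.get_columns_in_relation(target_relation) | map(attribute='name') | list -%}
    {%- set model_cols = adapter.get_columns_in_relation(tmp_relation) | map(attribute='name') | list -%}

    {%- set added = [] -%}
    {%- set removed = [] -%}
    {%- for col in model_cols if col not in target_cols -%}
        {%- do added.append(col) -%}
    {%- endfor -%}
    {%- for col in target_cols if col not in model_cols -%}
        {%- do removed.append(col) -%}
    {%- endfor -%}

    {{ return({'added': added, 'removed': removed}) }}
{% endmacro %}
```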
Ideas:
Caveats:
Who will this benefit?
users of incremental models