-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CT-2460] [Feature] Infer schema from prod, to enforce contract & detect breaking changes in dev #7432
Comments
There's more to figure out here, but an immediate action we can take is revisiting this in light of the implementation of #7796. My sense is that will only be called by certain commands, whereas we would need |
i love this idea, i think it lends itself well to a new implementation of dbt and could be used as a check that the logic now spelled out in dbt code will in fact output data in the same format as the existing data. what i'd really love to see though is a column level contract to enable models that are only partially covered by a contract. all or nothing makes adoption of contracts cumbersome and time consuming, especially the data types requirement imo. |
Relevant follow-up work from the |
This would be a great addition. I have a use case where I would like to ensure that a subset of columns are present with the correct data_type, whilst being lenient about any others undeclared in the contract. Perhaps a simple contract setting at the model level such as enforce_undeclared_columns: false |
I'm also very much interested in the partial contracts feature of this ticket - this would be really useful for ensuring a given amount of a model conforms to the intended shape while allowing divergence on any other customizations implemented in a data meshy environment. |
We've started spiking this over in #8339, with some promising results! This is an "inferred" contract, not a "partial" contractI want to distinguish between two ideas that I'm seeing in the comments above: an "inferred" model contract is different from a "partial" model contract. With an "inferred" contract, every column is contracted, but only some columns are explicitly specified in yaml. The names and data types of other columns are inferred from the columns in production. A "partial" contract would mean, only some of the columns in this model are actually reliable guarantees for downstream consumers. I think that's a much trickier undertaking, requiring careful thinking about column-level documentation and access policies. (How can a downstream consumer reliably know which columns are "under contract," and which ones might go away at any moment?) The scope of this issue is "inferred" contracts only. As soon as a model is pushed into production, every column is "under contract," and dbt will raise warnings/errors if any is renamed/retyped/removed. The result is the same, but the input is what's different: users only need to specify columns if those columns also have constraints, tests, descriptions, etc. (In the current spike implementation, if you want to specify anything about a column, it needs to be fully specified, including a What about when deferral is not available?One more thing: Inferred contracts can only be enforced when a manifest is provided and
Does that feel right? Should we instead raise a warning, saying that the contract could not be fully enforced, but allowing the user to continue for now? I foresee some users switching the contract to |
@jtcohen6 re the split of partial vs inferred, it might be worth separating the two definitions into additional issues on here https://docs.getdbt.com/docs/collaborate/govern/model-contracts#can-i-define-a-partial-contract I would be interested in the "partial" definition as you've outlined in here, rather than the inferred one but at the moment I think there is a single issue for both :) |
## What are you changing in this pull request and why? Neither of these things exists yet — but they are distinct, and one is likely to exist much sooner than the other! Context: - dbt-labs/dbt-core#7432 (comment)
+1 for this feature but I still feel that one of the major issue with current implementation of data contracts within dbt is that the data contract can be modified unilaterally by one party. It goes against the idea of a "contract" that both parties agree upon and that can only be changed if both parties agree to change it. |
+1 |
1 similar comment
+1 |
…s" contract, follow dbt-labs/dbt-core#7432 for partial contract feature update
Wanted to drop some thoughts on the enforced component of contracts. Currently in order to use constraints you have to have a fully defined table to also enforce a contract. We typically like to define important columns and metadata comments but do not define their types within the yaml. Typing is done in the sql models with some additional macro support for custom logic. Constraints can provide a lot of capability outside of just enforcing a contract (such as defining PK/FK relationships within the dwh for BI tools). I'd really like to see the ability to apply constraints without having to fully implement the other requirements of a contact (defining each column and its datatype for example). I'd really appreciate being able to have constraints applied, without having to manage all of the other components that are currently required for a contract. |
The proposed mechanism seems to extend the existing one of the BTW, at least for Snowflake, dbt doesn't respect the run order of the models based on the FK relationships. This can lead to failing FK declaration queries, as the PK of the referred model may not be declared yet. Reason: the referred model is not an upstream model, of the one that declares the FK relationship. And explicitly declaring it as such via So in conclusion, I second @kmarq's remarks on constraints and I'd like to suggest that their execution order needs to have a dependency graph of its own, which cannot be reflected by existing dbt model dependencies in an efficient way. |
Warning! Negative but hopefully constructive feedback: With contracts, explicit is better than implicit and having inferred contracts is going to add a lot of headaches. I would like to suggest instead tooling to generate the yaml, so that contracts are always explicit, as more sustainable long term approach. This has the added benefit of being helpful outside of contracts as well, solving one pain point of extensive dbt usage. I believe creating/merging this “implicit contract” will be a footgun. Hope examples/details below are helpful:Currently, we use a variety of ways to generate more explicit non-dbt contracts from dbt manifest and yaml files, and it would be great if contracts were a true first class citizen of dbt itself for our chosen adapters. But making it optional or tying it “to the state in prod” is very dangerous and has not been thought through. For example, what should happen to an implicit contract upon database failure or data issues/type corruption etc.? How should contracts behave when migrating models to a new database or tool? How much can users rely on an implicit contract that’s not in version control but is state-dependent in a (likely distributed) DBMS? What about datalake or columnar analytics databases/DWs? How useful are contracts when they are essentially a Describe query on a database instead of yaml that can be consumed by various tooling/services in data platforms? Benefits of explicit YAML: Onr can have code review over yaml. One can have meetings with non-technical stakeholders over yaml, easily translating it to a myriad of consumable formats. Downstream users can read and trust it, without having to understand database internals. This is not the case for an implicit contract in the way being designed here. The whole name seems to be a misnomer as implicit is not enforceable in a contract. Perhaps you can call it “contract generation”, and look at it from that angle? Of course we want to minimize manual work but that can be achieved without sacrificing the core feature of a data contract. The reason I leave this feedback here instead of in a feature request is because I believe having non-explicit pseudo contracts will be actively harmful and create difficulties for us and our users down the line, and for dbt practitioners worldwide. It would be better to not have it than have it be implicit. Beside what I enumerated here, there are a lot of problems tying the desired state, contract, to the actual state (database). Dbt is already the tool to enforce this. Contracts should not assume the database to be valid, it may not be. They should be a contract and define the expected/desired state, and dbt should help validate this expectation. It shouldn’t make the mistake of giving false confidence in the data by having implicit contracts that don’t happen to fail during changes/disruptions/etc. I think the request here around using constraints would also be handled better without an implicit contract, but instead with generated yaml. Finally, these data contracts are producer owned data contracts, so they are one party. That is fine, as long as it is explicit 😊. But they are not merely the current API definition (DBMS’s have this feature). Users should be able to rely on the contract to only change when intended by the producer, not when someone sneezes at the database. This way, they get value and there is a concept of breaking changes that can be handled at the contract definition/code level, giving a platform for users concerns as well. Making contracts implicit would erode their value for users/consumers as well. |
in terms of partial contracts for constraints, i wonder if it's worth bringing this concept into the mix here as it would solve for both the zero-cost aspect of constraints and the only-build-if-valid aspect, but via a more generic route that doesn't fragment yaml between tests and constraints. That option also works for views. |
I really like emirko's yml code generator idea. Is this something that could be scoped into a v.2 of dbt assist? |
upvote |
+1 |
In v1.5, we added support for:
What if we could manage to do that, without requiring all that yaml to be defined upfront? We infer the contract on the basis of the current version of this model in production. In that way, we also marry together the contract’s enforcement and its breaking change detection.
This would require a mechanism similar to what's proposed in #7256 / #7258. If provided with a manifest for deferral / state comparison, we should store a
state_relation
associated with the prod counterpart of each node, in the current manifest.During the model’s materialization:
ContractBreakingChangeError
if an existing column is renamed/removed/retyped.Pseudo logic:
Considerations
We'd need some special handling when a model is being renamed, or created for the first time. If there is no
state_relation
, there is no contract to infer/enforce — it's whatever the model SQL says.We would need to do more work to support partial contract definitions—a mix of explicit & inferred—whereby a user could define certain columns with constraints, dynamic data masking rules (!), etc—but wouldn’t have to define them all.
dbt would still need to have access to the the full column spec, during
create table
DDL. How to get the names and data types of columns that are not defined in yaml?A lot of that work might be around type aliasing, which gets tricky quickly. Many database cursors return column types with a genuinely different naming convention from how that database supports receiving them.We know what thestate_relation
is, we should be able to just run adescribe
statement against it (adapter.get_columns_in_relation
) to get back data types that we can us within thecreate table
DDL.If push comes to shove, we could switch out the current implementation of the "preflight" check (run a single "empty" query and get the schema from the cursor) for an even-more-naive one, where we create a temp table and run adescribe
statement / information schema query against it.The text was updated successfully, but these errors were encountered: