-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: override materialization python models #8538
fix: override materialization python models #8538
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #8538 +/- ##
===========================================
- Coverage 89.22% 62.79% -26.43%
===========================================
Files 183 183
Lines 23457 23630 +173
===========================================
- Hits 20930 14839 -6091
- Misses 2527 8791 +6264
Flags with carried forward coverage won't be shown. Click here to find out more.
|
1dd087b
to
f4d3551
Compare
Hi @gshank, Could I please receive feedback on this PR? One aspect I'm unable to understand is why the codecov is reducing the percentage of coverage. This PR did not introduce any branch conditions inside the code. I've noticed a similar decrease in some other PRs as well; should I be concerned about this at the moment? |
I think the original "fix" (setting the materialization in the config_call_dict) was wrong, but I think this approach just doubles down on that wrong approach. This is a tricky area. I suspect that what's needed is some special setting in "update_parsed_node_config" in core/dbt/parser/base.py. The problem is detecting the difference between somebody actually setting the materialization vs it just defaulting. I'll have to poke around a bit more to see if I can find a good way to do this. |
Thank you for your feedback, @gshank. I agree that this change may not be the ideal one. It's a great idea to bring @aranke into this discussion, since they were responsible for the original commit that introduced this behavior. Note: As I mentioned in the PR, it seems to me that the solution from @aranke is easier to reconcile the differences between Python and SQL models. I've tried to rethink the behavior that @aranke introduced and try to place it elsewhere but couldn't, maybe there really isn't a cleaner and fast solution yet, especially since it's a behavior that breaks the convention of all versions post 1.4. Hey @aranke, could you share your thoughts to enrich our conversation? What do you think? |
Hey @devmessias, My original PR was just to default to |
A big issue is that removing the default table behavior could break numerous pipelines, as views are not supported in Python models. |
One option I am considering, given that @aranke only wanted to change the default materialization for Python (so I assume there is no point in considering future default changes), and as @gshank suggested it to be something in |
@devmessias I agree with your proposed solution. In update_parsed_node_config check that the language is python and that materialization is not set in the config dict, and if both of those are true, set it in the config dict. |
See dbt-labs#8538 discussion for more details
See dbt-labs#8538 discussion for more details
152ea5c
to
8f95385
Compare
I've updated the PR with the proposed solution @gshank. Could you review the changes to ensure we're on the same page? |
Ah, I forgot to mention one thing. The default is set before the update_node_config. As a result, the following test created by @aranke , which was passing in my previous suggestion, won't pass now. Would it be better to remove this test and create a new one that checks only after the update_node_config? |
Yes, that would be better. If you can find a way to do it in unit tests, that would be fine, but it might be easier to just look at the manifest for the python node in a functional test. |
Hi @gshank. I've managed to determine how to retain the unit test. |
Hi @aranke . What do you think about this new approach? |
Hi @gshank, just circling back on this pull request. Would appreciate your feedback when it's convenient for you. Thanks! |
Hi @dbeatty10, @gshank and @aranke, any updates on this bug/MR? I'm not sure about the dbt-core roadmap, or if bug #8520 was resolved in another MR, because if so, I can close this PR. |
@graciegoheen and @peterallenwebb when you take a look at this, could you also check if it resolves #9703? The reprex is here: #9703 (comment) |
|
||
resource = lookup.get(resource_type, NodeConfig) | ||
|
||
if language == "python" and (resource == NodeConfig): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use the enum value here instead of a string literal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the enum:
dbt-core/core/dbt/artifacts/resources/types.py
Lines 54 to 56 in a7af3b3
class ModelLanguage(StrEnum): | |
python = "python" | |
sql = "sql" |
Here's an example of using it:
dbt-core/core/dbt/parser/base.py
Line 25 in a7af3b3
from dbt.node_types import AccessType, ModelLanguage, NodeType |
dbt-core/core/dbt/parser/base.py
Line 218 in a7af3b3
language = ModelLanguage.python |
core/dbt/context/context_config.py
Outdated
@@ -82,8 +82,9 @@ def get_config_dict(self, resource_type: NodeType) -> Dict[str, Any]: | |||
|
|||
|
|||
class BaseContextConfigGenerator(Generic[T]): | |||
def __init__(self, active_project: RuntimeConfig): | |||
def __init__(self, active_project: RuntimeConfig, language: Optional[ModelLanguage] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gshank @ChenyuLInx Does this seem like the best way to do this? It seems odd to me that language would be the one element that needs to be passed separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devmessias Thanks for the contribution and engaging discussion, sorry we have been a behind on this patch.
I have the same feeling when reading this code.
I can see the appeal of having a proper PythonNode with default materialization being table
.
But because this node is buried deep under layers functions, I don't think having this language passed through is ideal.
I would recommend going back to what Gerda initially proposed (link)
In update_parsed_node_config check that the language is python and that materialization is not set in the config dict, and if both of those are true, set it in the config dict.
That would make this PR easier to follow. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank you for your feedback
It’s been quite some time since I worked on this, but if memory serves me right, the approach you’re referring to is indeed what I implemented in this commit: 8f95385. However, this change introduced another bug which was caught by a unit test, as documented here: unit test issue comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peterallenwebb @ChenyuLInx I believe the need to have a Python node is because the default materialization type in dbt is not supported for Python models (at least when I made this PR). This issue was not apparent before because it was being obscured by a materialization bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this the unit test is failing because the config_dict generated by config_dict = config.build_config_dict(patch_config_dict=patch_config_dict)
is what's being added to the parsed node in the end.
And I see why you have the current solution which is creating a place to define a default node.
I think passing the language through multiple layers of calls is not ideal, but if we want to do something better, we have two ways:
- do some large refactors to have this config consolidation function moved to a node and that would give us an easier way to define a default materialization on a python model node
- slightly twist the patch_config_dict(or create a new version of it) with any config that is different from the default default, and apply it. We should add some comments to it also. (I personally prefer this for the current case, and we can refactor later on if we see a pattern developing here).
So this would look like adding the following logic here, and update the comment a bit.
if not patch_config_dict:
patch_config_dict = {}
if parsed_node.resource_type == NodeType.Model and parsed_node.language == ModelLanguage.python:
if "materialized" not in patch_config_dict:
patch_config_dict["materialized"] = "table"
@devmessias @gshank what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree @ChenyuLInx . It's more similar to my first proposal and will solve the default materialization issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devmessias are you still interested in making an update to this? No worries if not, we can take it across the finish line. Thanks either way for putting so much effort into this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I’m still interested even though I’ve switched jobs. Now it’s a matter of honor 😆 . I will make the modifications this week and do a rebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @devmessias just wanted to check if you're still interested in completing this PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @graciegoheen @ChenyuLInx @gshank, my current role has kept me really busy for quite a while, and I haven't been using dbt lately.I̶'̶l̶l̶ ̶t̶r̶y̶ ̶t̶o̶ ̶g̶e̶t̶ ̶t̶o̶ ̶i̶t̶ ̶o̶v̶e̶r̶ ̶t̶h̶e̶ ̶w̶e̶e̶k̶e̶n̶d̶,̶ ̶t̶h̶o̶u̶g̶h̶.̶
By the way, I know this might not be the best channel to ask, I applied for a position for the dbt-core team, I noticed that the position was listed as Romania (Remote), but I wasn’t sure if I could apply from Brazil, so I went ahead and applied anyway. I’ve contributed to dbt in the past, if any of you have any tips or insights, I'd really appreciate it!
updated
Okay, since the solution was similar to the first one, it was really easy to revert everything and perform a rebase. 😌 I've updated it with the suggestions and also changed the test for the custom Python model from view to be incremental (@alex-hsp). I believe views are still not supported, right?
See dbt-labs#8538 discussion for more details
See dbt-labs#8538 discussion for more details
See dbt-labs#8538 discussion for more details
See dbt-labs#8538 discussion for more details
6faca6d
to
111ef65
Compare
ping everyone! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the patch!!
resolves #8520 , resolves #9703
Problem
Current Behavior
When working within my company's workflow, we outline all configurations, metadata, etc. in the schema.yml file, with broader configurations being described in the dbt_projects.yml. However, the materialization of Python models cannot be controlled through these files. This requires us to set the configuration within the model itself, as demonstrated below:
If this isn't done, the materialization always defaults to 'table' in the generated manifest.json.
commit that introduces this behavior: https://github.com/dbt-labs/dbt-core/blame/48d04e81417195393af5af1f78ef695f3398f193/core/dbt/parser/base.py#L217
Expected Behavior
I would expect that the materialization of Python models could be controlled through these configuration files, just as it is observed with SQL models. Ideally, without having to specify the materialization within the model itself. When defining configurations in these files, they should be respected, and the materialization shouldn't default to 'table' in the generated manifest.json unless explicitly set to do so. The current behavior introduces an inconsistency in the way dbt models are configured and managed.
Solution
I've considered three approaches to address this bug. Firstly, I'll outline the two solutions I discarded, followed by an explanation of why I chose the third.
First Solution: (discarded)
Remove the behavior introduced by commit 48d04e8.
While this is a straightforward, it could break many existing pipelines. This would be unfavorable for many dbt users because it's a behavior available since v1.4.
Second Solution: (discarded)
The proposal here is to add a condition where the default materialization value would only used if it hasn't been predefined in a .yml file or within the model's .py file itself.
However, this approach has two major issues:
1- The solution is messy and has a high cognitive load.
2- If the configuration for this materialization isn't defined in the model's .py or any .yml file, it defaults to creating a "view" materialization. This is because the NodeConfig class defaults to "view" as its materialization.
dbt-core/core/dbt/contracts/graph/model_config.py
Lines 438 to 441 in e5e1a27
It's just a hypothesis, but I believe this might be the very reason for commit 48d04e8 since "view" is not supported for Python models.
Third Solution (preferred):
In base.py, the proposal is to add a key-value pair to the config_call_dict.
https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/parser/base.py#L215-L224,
Parameters described in _dbt_node_type_configs are used to overwrite the initial node configuration:
https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/context/context_config.py#L136
Since we don't want the "_dbt_node_type_configs" to persist, I opted to remove it in the latest update.
https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/context/context_config.py#L152-L156
Checklist