Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: override materialization python models #8538

Merged

Conversation

devmessias
Copy link
Contributor

@devmessias devmessias commented Aug 31, 2023

resolves #8520 , resolves #9703

Problem

Current Behavior

When working within my company's workflow, we outline all configurations, metadata, etc. in the schema.yml file, with broader configurations being described in the dbt_projects.yml. However, the materialization of Python models cannot be controlled through these files. This requires us to set the configuration within the model itself, as demonstrated below:

def model(dbt, session):
    dbt.config(materialization='incremental')

If this isn't done, the materialization always defaults to 'table' in the generated manifest.json.

commit that introduces this behavior: https://github.com/dbt-labs/dbt-core/blame/48d04e81417195393af5af1f78ef695f3398f193/core/dbt/parser/base.py#L217

Expected Behavior

I would expect that the materialization of Python models could be controlled through these configuration files, just as it is observed with SQL models. Ideally, without having to specify the materialization within the model itself. When defining configurations in these files, they should be respected, and the materialization shouldn't default to 'table' in the generated manifest.json unless explicitly set to do so. The current behavior introduces an inconsistency in the way dbt models are configured and managed.

Solution

I've considered three approaches to address this bug. Firstly, I'll outline the two solutions I discarded, followed by an explanation of why I chose the third.

First Solution: (discarded)

Remove the behavior introduced by commit 48d04e8.

While this is a straightforward, it could break many existing pipelines. This would be unfavorable for many dbt users because it's a behavior available since v1.4.

Second Solution: (discarded)

The proposal here is to add a condition where the default materialization value would only used if it hasn't been predefined in a .yml file or within the model's .py file itself.

# core/dbt/parser/base.py:L192
if block.path.relative_path.endswith(".py"):
    language = ModelLanguage.python
    config.add_config_call({"_dbt_default_arg_materialized": "table"})
# core/dbt/context/context_config.py:L140
if patch_config_dict:
    result = self._update_from_config(result, patch_config_dict)

default_args = [k for k in config_call_dict if k.startswith("_dbt_default_arg_")]
result_keys = dir(result)
for key in default_args:
    config_key = key.replace("_dbt_default_arg_", "")
    if config_key in result_keys:
        del config_call_dict[key]
    else:
        config_call_dict[config_key] = config_call_dict[key]

However, this approach has two major issues:

1- The solution is messy and has a high cognitive load.
2- If the configuration for this materialization isn't defined in the model's .py or any .yml file, it defaults to creating a "view" materialization. This is because the NodeConfig class defaults to "view" as its materialization.

class NodeConfig(NodeAndTestConfig):
# Note: if any new fields are added with MergeBehavior, also update the
# 'mergebehavior' dictionary
materialized: str = "view"

It's just a hypothesis, but I believe this might be the very reason for commit 48d04e8 since "view" is not supported for Python models.

Third Solution (preferred):

In base.py, the proposal is to add a key-value pair to the config_call_dict.

https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/parser/base.py#L215-L224,

Parameters described in _dbt_node_type_configs are used to overwrite the initial node configuration:

https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/context/context_config.py#L136

Since we don't want the "_dbt_node_type_configs" to persist, I opted to remove it in the latest update.

https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/context/context_config.py#L152-L156

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@devmessias devmessias requested a review from a team as a code owner August 31, 2023 21:04
@devmessias devmessias requested a review from gshank August 31, 2023 21:04
@cla-bot cla-bot bot added the cla:yes label Aug 31, 2023
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@codecov
Copy link

codecov bot commented Aug 31, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.79%. Comparing base (7940ad5) to head (7d743da).
Report is 53 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (7940ad5) and HEAD (7d743da). Click for more details.

HEAD has 23 uploads less than BASE
Flag BASE (7940ad5) HEAD (7d743da)
unit 4 1
integration 20 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #8538       +/-   ##
===========================================
- Coverage   89.22%   62.79%   -26.43%     
===========================================
  Files         183      183               
  Lines       23457    23630      +173     
===========================================
- Hits        20930    14839     -6091     
- Misses       2527     8791     +6264     
Flag Coverage Δ
integration ?
unit 62.79% <100.00%> (+0.70%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Unit Tests 62.79% <100.00%> (+0.70%) ⬆️
Integration Tests 62.79% <100.00%> (-26.43%) ⬇️

@devmessias devmessias changed the title [W.I.P.] fix: override materialization python models fix: override materialization python models Aug 31, 2023
@devmessias devmessias force-pushed the fix/override_materialization_python_models branch from 1dd087b to f4d3551 Compare September 8, 2023 22:20
@devmessias
Copy link
Contributor Author

Hi @gshank,

Could I please receive feedback on this PR? One aspect I'm unable to understand is why the codecov is reducing the percentage of coverage. This PR did not introduce any branch conditions inside the code. I've noticed a similar decrease in some other PRs as well; should I be concerned about this at the moment?

@dataders dataders added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Sep 13, 2023
@gshank
Copy link
Contributor

gshank commented Sep 14, 2023

I think the original "fix" (setting the materialization in the config_call_dict) was wrong, but I think this approach just doubles down on that wrong approach. This is a tricky area. I suspect that what's needed is some special setting in "update_parsed_node_config" in core/dbt/parser/base.py. The problem is detecting the difference between somebody actually setting the materialization vs it just defaulting. I'll have to poke around a bit more to see if I can find a good way to do this.

@devmessias
Copy link
Contributor Author

devmessias commented Sep 18, 2023

I think the original "fix" (setting the materialization in the config_call_dict) was wrong, but I think this approach just doubles down on that wrong approach. This is a tricky area. I suspect that what's needed is some special setting in "update_parsed_node_config" in core/dbt/parser/base.py. The problem is detecting the difference between somebody actually setting the materialization vs it just defaulting. I'll have to poke around a bit more to see if I can find a good way to do this.

Thank you for your feedback, @gshank. I agree that this change may not be the ideal one.

It's a great idea to bring @aranke into this discussion, since they were responsible for the original commit that introduced this behavior.

Note: As I mentioned in the PR, it seems to me that the solution from @aranke is easier to reconcile the differences between Python and SQL models. I've tried to rethink the behavior that @aranke introduced and try to place it elsewhere but couldn't, maybe there really isn't a cleaner and fast solution yet, especially since it's a behavior that breaks the convention of all versions post 1.4.

Hey @aranke, could you share your thoughts to enrich our conversation? What do you think?

@aranke
Copy link
Member

aranke commented Sep 25, 2023

Hey @devmessias,

My original PR was just to default to table materialization for Python models, but I see that it's had this unintended side effect.
I agree with @gshank that we should probably take a step back and check what the expected behavior should be.

@devmessias
Copy link
Contributor Author

A big issue is that removing the default table behavior could break numerous pipelines, as views are not supported in Python models.

@devmessias
Copy link
Contributor Author

One option I am considering, given that @aranke only wanted to change the default materialization for Python (so I assume there is no point in considering future default changes), and as @gshank suggested it to be something in update_parsed_node_config, we could simply check if the language in parsed_node within the same function is a Python model and at the same time verify if the materialization is set as view, then change it to table. Since this is what causes the issue. What do you think? It would also have the advantage of being much simpler.

@gshank
Copy link
Contributor

gshank commented Oct 2, 2023

@devmessias I agree with your proposed solution. In update_parsed_node_config check that the language is python and that materialization is not set in the config dict, and if both of those are true, set it in the config dict.

devmessias added a commit to devmessias/dbt-core that referenced this pull request Oct 2, 2023
devmessias added a commit to devmessias/dbt-core that referenced this pull request Oct 2, 2023
@devmessias devmessias force-pushed the fix/override_materialization_python_models branch from 152ea5c to 8f95385 Compare October 2, 2023 17:36
@devmessias
Copy link
Contributor Author

I've updated the PR with the proposed solution @gshank. Could you review the changes to ensure we're on the same page?

@devmessias
Copy link
Contributor Author

Ah, I forgot to mention one thing. The default is set before the update_node_config. As a result, the following test created by @aranke , which was passing in my previous suggestion, won't pass now. Would it be better to remove this test and create a new one that checks only after the update_node_config?

image

@gshank
Copy link
Contributor

gshank commented Oct 2, 2023

Yes, that would be better. If you can find a way to do it in unit tests, that would be fine, but it might be easier to just look at the manifest for the python node in a functional test.

@devmessias
Copy link
Contributor Author

Hi @gshank. I've managed to determine how to retain the unit test.
My previous approach was somewhat messy as it also had to manage config.build_config_dict(patch_config_dict=patch_config_dict) to address all issues within update_parsed_node_config. However, I've modified update_parsed_node_config to pass the model language as an argument to ContextConfigGenerator, thereby creating an appropriate NodeConfig for the Python model. I believe this method is superior. Could you please review my new commit and share your thoughts?

@devmessias
Copy link
Contributor Author

Hi @aranke . What do you think about this new approach?

@devmessias
Copy link
Contributor Author

Hi @gshank, just circling back on this pull request. Would appreciate your feedback when it's convenient for you. Thanks!

@devmessias
Copy link
Contributor Author

Hi @dbeatty10, @gshank and @aranke, any updates on this bug/MR? I'm not sure about the dbt-core roadmap, or if bug #8520 was resolved in another MR, because if so, I can close this PR.

@dbeatty10 dbeatty10 added the community This PR is from a community member label Apr 3, 2024
@graciegoheen graciegoheen requested review from peterallenwebb and removed request for gshank April 10, 2024 18:51
@dbeatty10
Copy link
Contributor

@graciegoheen and @peterallenwebb when you take a look at this, could you also check if it resolves #9703?

The reprex is here: #9703 (comment)


resource = lookup.get(resource_type, NodeConfig)

if language == "python" and (resource == NodeConfig):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the enum value here instead of a string literal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the enum:

class ModelLanguage(StrEnum):
python = "python"
sql = "sql"

Here's an example of using it:

from dbt.node_types import AccessType, ModelLanguage, NodeType

language = ModelLanguage.python

@@ -82,8 +82,9 @@ def get_config_dict(self, resource_type: NodeType) -> Dict[str, Any]:


class BaseContextConfigGenerator(Generic[T]):
def __init__(self, active_project: RuntimeConfig):
def __init__(self, active_project: RuntimeConfig, language: Optional[ModelLanguage] = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gshank @ChenyuLInx Does this seem like the best way to do this? It seems odd to me that language would be the one element that needs to be passed separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmessias Thanks for the contribution and engaging discussion, sorry we have been a behind on this patch.

I have the same feeling when reading this code.
I can see the appeal of having a proper PythonNode with default materialization being table.
But because this node is buried deep under layers functions, I don't think having this language passed through is ideal.
I would recommend going back to what Gerda initially proposed (link)

In update_parsed_node_config check that the language is python and that materialization is not set in the config dict, and if both of those are true, set it in the config dict.

That would make this PR easier to follow. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for your feedback

It’s been quite some time since I worked on this, but if memory serves me right, the approach you’re referring to is indeed what I implemented in this commit: 8f95385. However, this change introduced another bug which was caught by a unit test, as documented here: unit test issue comment

Copy link
Contributor Author

@devmessias devmessias May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterallenwebb @ChenyuLInx I believe the need to have a Python node is because the default materialization type in dbt is not supported for Python models (at least when I made this PR). This issue was not apparent before because it was being obscured by a materialization bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this the unit test is failing because the config_dict generated by config_dict = config.build_config_dict(patch_config_dict=patch_config_dict) is what's being added to the parsed node in the end.
And I see why you have the current solution which is creating a place to define a default node.

I think passing the language through multiple layers of calls is not ideal, but if we want to do something better, we have two ways:

  • do some large refactors to have this config consolidation function moved to a node and that would give us an easier way to define a default materialization on a python model node
  • slightly twist the patch_config_dict(or create a new version of it) with any config that is different from the default default, and apply it. We should add some comments to it also. (I personally prefer this for the current case, and we can refactor later on if we see a pattern developing here).

So this would look like adding the following logic here, and update the comment a bit.

if not patch_config_dict:
    patch_config_dict = {}
if parsed_node.resource_type == NodeType.Model and parsed_node.language == ModelLanguage.python:
    if "materialized" not in patch_config_dict:
        patch_config_dict["materialized"] = "table"

@devmessias @gshank what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree @ChenyuLInx . It's more similar to my first proposal and will solve the default materialization issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmessias are you still interested in making an update to this? No worries if not, we can take it across the finish line. Thanks either way for putting so much effort into this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I’m still interested even though I’ve switched jobs. Now it’s a matter of honor 😆 . I will make the modifications this week and do a rebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @devmessias just wanted to check if you're still interested in completing this PR!

Copy link
Contributor Author

@devmessias devmessias Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @graciegoheen @ChenyuLInx @gshank, my current role has kept me really busy for quite a while, and I haven't been using dbt lately.I̶'̶l̶l̶ ̶t̶r̶y̶ ̶t̶o̶ ̶g̶e̶t̶ ̶t̶o̶ ̶i̶t̶ ̶o̶v̶e̶r̶ ̶t̶h̶e̶ ̶w̶e̶e̶k̶e̶n̶d̶,̶ ̶t̶h̶o̶u̶g̶h̶.̶

By the way, I know this might not be the best channel to ask, I applied for a position for the dbt-core team, I noticed that the position was listed as Romania (Remote), but I wasn’t sure if I could apply from Brazil, so I went ahead and applied anyway. I’ve contributed to dbt in the past, if any of you have any tips or insights, I'd really appreciate it!

updated

Okay, since the solution was similar to the first one, it was really easy to revert everything and perform a rebase. 😌 I've updated it with the suggestions and also changed the test for the custom Python model from view to be incremental (@alex-hsp). I believe views are still not supported, right?

tests/unit/test_parser.py Outdated Show resolved Hide resolved
devmessias added a commit to devmessias/dbt-core that referenced this pull request Oct 2, 2024
devmessias added a commit to devmessias/dbt-core that referenced this pull request Oct 2, 2024
devmessias added a commit to devmessias/dbt-core that referenced this pull request Oct 2, 2024
@devmessias devmessias force-pushed the fix/override_materialization_python_models branch from 6faca6d to 111ef65 Compare October 2, 2024 17:24
@devmessias
Copy link
Contributor Author

ping everyone!

Copy link
Contributor

@ChenyuLInx ChenyuLInx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the patch!!

core/dbt/context/context_config.py Outdated Show resolved Hide resolved
@ChenyuLInx ChenyuLInx merged commit 84230ce into dbt-labs:main Nov 14, 2024
53 of 54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes community This PR is from a community member ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
8 participants