Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select models which have tests with additional dependencies #751

Closed
wants to merge 14 commits into from

Conversation

adammarples
Copy link
Contributor

@adammarples adammarples commented Dec 6, 2023

Description

Tests such as the relationship test will fail to be selected if we are selecting models with tags and using the manifest loading method.

Related Issue(s)

closes #719

Breaking Change?

Almost all tests will have only one 'depends_on', but when they have two, ie. relationship, the true parent is the last one.

load_from_dbt_manifest now considers the primary selectable parent model of a test to be the last item in its 'depends_on' list, rather than the first.

This would break any tests being loaded by tag selection using load_from_dbt_manifest which somehow depended on a model being the first in that list rather than the last.

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

@adammarples adammarples requested a review from a team as a code owner December 6, 2023 15:30
@adammarples adammarples requested a review from a team December 6, 2023 15:30
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Dec 6, 2023
Copy link

netlify bot commented Dec 6, 2023

Deploy Preview for amazing-pothos-a3bca0 ready!

Name Link
🔨 Latest commit 213620f
🔍 Latest deploy log https://app.netlify.com/sites/amazing-pothos-a3bca0/deploys/657af27701122100082a33fc
😎 Deploy Preview https://deploy-preview-751--amazing-pothos-a3bca0.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@dosubot dosubot bot added area:selector Related to selector, like DAG selector, DBT selector, etc area:testing Related to testing, like unit tests, integration tests, etc dbt:test Primarily related to dbt test command or functionality parsing:dbt_manifest Issues, questions, or features related to dbt_manifest parsing labels Dec 6, 2023
Copy link

codecov bot commented Dec 7, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (82e8db9) 93.27% compared to head (6c4bfe4) 93.28%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #751      +/-   ##
==========================================
+ Coverage   93.27%   93.28%   +0.01%     
==========================================
  Files          55       55              
  Lines        2499     2503       +4     
==========================================
+ Hits         2331     2335       +4     
  Misses        168      168              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, @adammarples! I'm glad you're improving this part of the code.

I'll loop in @edgarasnavickas (@navedgaras), who previously worked on this, so we can get his thoughts as well.

I left some questions/comments inline

@@ -295,7 +295,14 @@ def _should_include_node(self, node_id: str, node: DbtNode) -> bool:
self.visited_nodes.add(node_id)

if node.resource_type == DbtResourceType.TEST:
node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
dependency_ids = [node_id for node_id in node.depends_on if node_id.startswith("model.")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we only care about model dependencies? Why don't we care about potential parent seeds/sources tags?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point and should be taken out

@@ -295,7 +295,14 @@ def _should_include_node(self, node_id: str, node: DbtNode) -> bool:
self.visited_nodes.add(node_id)

if node.resource_type == DbtResourceType.TEST:
node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
dependency_ids = [node_id for node_id in node.depends_on if node_id.startswith("model.")]
parent_id = dependency_ids[-1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to pick only one parent? Wouldn't it be better if we took into account all the parents?

In your PR description, you mention:

Almost all tests will have only one 'depends_on', but when they have two, ie. relationship, the true parent is the last one.

What is the definition of a true parent? If a test depends on two models, A and B, and they are run as independent Airflow tasks, why should the relationship with B be more relevant than the one with A?

Do we have any dbt docs on this?

Copy link
Contributor Author

@adammarples adammarples Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to pick one parent because tests in dbt implicitly belong to only one thing, they can only be declared inside a model/source/seed etc config block.

For example in the "model.jaffle_shop.orders" model, the relationship test with the customers model. This is a test on the orders model, defined under the - name: orders section of the model config, the test shouldn't run after the customers table is built too, the customers table hasn't failed just because someone wrote bad code in the orders model.

If there was explicit support in dbt for tests which belong to more than one model they would have to change it such that tests don't have to be declared only within a model config block such that they implicitly belong only to one model and the code dbt test --select orders runs the test but the code dbt test --select customers does not. I'm not aware of any such thing in dbt at the moment.

There was a bit of discussion about this here dbt-labs/dbt-core#6746

Copy link
Collaborator

@tatiana tatiana Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adammarples what about tests like the dbt_utils equality, and others, including:

  • equal_rowcount
  • cardinality_equality
  • fewer_rows_than
  • relationships_where

Example:

version: 2

models:
  - name: model_name
    tests:
      - dbt_utils.equality:
          compare_model: ref('other_table_name')
          compare_columns:
            - first_column
            - second_column

Even though they are declared part of the model_name block, they also depend on the other_table_name.

I didn't check how dbt handles these, but I'd expect the test to depend on both models if users use ref. Could you confirm the current behaviour and let me know your thoughts on these cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example

version: 2

models:
  - name: a

  - name: b
    tests:
      - dbt_utils.equality:
          compare_model: ref('a')
          compare_columns:
            - id

The test does depend on both tables a and b. In fact running both of these commands will trigger the test, which I didn't realise.

 0  ~/projects/demo/demo> dbt test --models a
11:02:13  Running with dbt=1.7.4
11:02:13  Registered adapter: duckdb=1.7.0
11:02:13  Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:13  
11:02:13  Concurrency: 1 threads (target='dev')
11:02:13  
11:02:13  1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:13  1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:13  
11:02:13  Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:13  
11:02:13  Completed successfully
11:02:13  
11:02:13  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

 0  ~/projects/demo/demo> dbt test --models b
11:02:17  Running with dbt=1.7.4
11:02:17  Registered adapter: duckdb=1.7.0
11:02:17  Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:17  
11:02:17  Concurrency: 1 threads (target='dev')
11:02:17  
11:02:17  1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:17  1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:17  
11:02:17  Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:17  
11:02:17  Completed successfully
11:02:17  
11:02:17  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

At the moment, at least when using the manifest load function combined with model tagging as a selection criteria, astronomer cosmos selects only one node to be the 'parent' of the test, which tells it which task to run after, the problem is that it's selecting the wrong one, imo, if we have to select one here it should be b but we're currently selecting a .

depends_on = manifest['nodes']['test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08']['depends_on']['nodes']
print(depends_on)
['model.demo.a', 'model.demo.b']

Where we select [0] instead of [-1]

What would be interesting would be then to include all of the dependent model tags, not just one of them, so that the test could appear after model a is run and after model b is run, but that seems like a disaster for the way astronomer cosmos works at the moment, where the graph is ordered by model dependencies, not tests dependencies (I think). That might be well outside of the scope of a change such as this, which really is just about picking the last from the list over the first in the list as being the more important model to test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the thorough analysis, @adammarples, and detailed explanation. I appreciate it.

Given the observed behaviour of dbt, wouldn't it be more accurate to keep all the test node parents in depends_on - not only the -1?

The downside would be that if the user doesn't apply filters, the test would be duplicated in multiple model task groups (if using TestBehavior.AFTER_EACH).

However, the benefit is we would not miss a test if the user selected 'model.demo.a', as you illustrated.

Copy link
Collaborator

@jbandoro jbandoro Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tatiana that it would be better to keep all of the parents assigned, and during execution when running the tests, the user can use the ExecutionConfig.test_indirect_selection if they want to configure the dbt test behavior when using indirect selection.

@adammarples in the example you provided above if ExecutionConfig.test_indirect_selection = TestIndirectSelection.CAUTIOUS then it would only run the tests on model b. If model b depended on model a then a user could use TestIndirectSelection.BUILDABLE which would also run the test only for model b.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dbt test --models a running both tests is interesting. I agree with what has already been said. Trying to stick to the behaviour of dbt and assigning both parents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbandoro it looks like this issue for me is actually solved with a piece of TestIndirectSelection config that I wasn't aware of then? Thank you, I guess there is no need for this PR?

cosmos/dbt/selector.py Outdated Show resolved Hide resolved
tests/dbt/test_graph.py Show resolved Hide resolved
@tatiana tatiana added the status:awaiting-author Issue/PR is under discussion and waiting for author's input label Dec 14, 2023
@ms32035
Copy link
Contributor

ms32035 commented Dec 20, 2023

@adammarples as you're working on this, I'd like to bring to your attention a case you're missing, and which we're running into recently:

  File "/home/airflow/.local/lib/python3.11/site-packages/cosmos/dbt/selector.py", line 298, in _should_include_node
    node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
                                       ~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Although not a dbt best practice, it's entirely possible to have a test with 0 dependencies, or model dependencies (test on sources)

@tatiana
Copy link
Collaborator

tatiana commented Jan 5, 2024

Hi @adammarples , this PR - and the discussion - was beneficial. Thank you. And thanks, @jbandoro, for giving an alternative solution to the problem.

Cosmos can improve its implementation towards tests with multiple parents - we should make it consistent with dbt. @adammarples, you make a great start - would you be interested in further contributing to this - or would you rather close this PR and let someone else take over the work?

@ms32035 I didn't realise that was possible! Although it relates to the original problem, I logged a separate ticket: #782. Would you be interested in contributing?

@tatiana
Copy link
Collaborator

tatiana commented May 17, 2024

Closing this PR in favour of the ticket: #978

@tatiana tatiana closed this May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:selector Related to selector, like DAG selector, DBT selector, etc area:testing Related to testing, like unit tests, integration tests, etc dbt:test Primarily related to dbt test command or functionality parsing:dbt_manifest Issues, questions, or features related to dbt_manifest parsing size:M This PR changes 30-99 lines, ignoring generated files. status:awaiting-author Issue/PR is under discussion and waiting for author's input
Projects
None yet
Development

Successfully merging this pull request may close these issues.

load_from_dbt_manifest is de-selecting valid test nodes
5 participants