Select models which have tests with additional dependencies #751

adammarples · 2023-12-06T15:30:30Z

Description

Tests with more than one dependency, such as the relationships test, fail to be selected when we select models by tag and use the manifest loading method.

Related Issue(s)

closes #719

Breaking Change?

Almost all tests will have only one 'depends_on', but when they have two, e.g. relationships, the true parent is the last one.

load_from_dbt_manifest now considers the primary selectable parent model of a test to be the last item in its 'depends_on' list, rather than the first.

This would break any tag-based test selection through load_from_dbt_manifest that relied on the parent model being the first item in that list rather than the last.

Checklist

I have made corresponding changes to the documentation (if required)
I have added tests that prove my fix is effective or that my feature works

This reverts commit dbe00df.

This reverts commit e9a0e0a.

netlify · 2023-12-06T15:30:36Z

✅ Deploy Preview for amazing-pothos-a3bca0 ready!

Name	Link
🔨 Latest commit	`213620f`
🔍 Latest deploy log	https://app.netlify.com/sites/amazing-pothos-a3bca0/deploys/657af27701122100082a33fc
😎 Deploy Preview	https://deploy-preview-751--amazing-pothos-a3bca0.netlify.app

codecov · 2023-12-07T11:09:13Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (82e8db9) 93.27% compared to head (6c4bfe4) 93.28%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #751      +/-   ##
==========================================
+ Coverage   93.27%   93.28%   +0.01%     
==========================================
  Files          55       55              
  Lines        2499     2503       +4     
==========================================
+ Hits         2331     2335       +4     
  Misses        168      168


tatiana

Thanks for the contribution, @adammarples! I'm glad you're improving this part of the code.

I'll loop in @edgarasnavickas (@navedgaras), who previously worked on this, so we can get his thoughts as well.

I left some questions/comments inline

tatiana · 2023-12-14T11:44:34Z

cosmos/dbt/selector.py

@@ -295,7 +295,14 @@ def _should_include_node(self, node_id: str, node: DbtNode) -> bool:
        self.visited_nodes.add(node_id)

        if node.resource_type == DbtResourceType.TEST:
-            node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
+            dependency_ids = [node_id for node_id in node.depends_on if node_id.startswith("model.")]


Why do we only care about model dependencies? Why don't we care about potential parent seeds/sources tags?

This is a good point; the model-only filter should be taken out.
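To illustrate the concern, here is a minimal sketch of how a `model.`-only prefix filter drops seed and source parents. The node ids and the `PARENT_PREFIXES` tuple are hypothetical and are not taken from the Cosmos code base.

```python
# Hypothetical dependency list for a single test node.
depends_on = [
    "source.demo.raw.customers",
    "seed.demo.countries",
    "model.demo.orders",
]

# The filter from the diff keeps only model parents.
model_only = [node_id for node_id in depends_on if node_id.startswith("model.")]

# Assumed alternative: accept several resource types (str.startswith takes a tuple).
PARENT_PREFIXES = ("model.", "seed.", "source.")
all_parents = [node_id for node_id in depends_on if node_id.startswith(PARENT_PREFIXES)]

print(model_only)
print(all_parents)
```

With the model-only filter, a test that depends solely on seeds or sources would end up with an empty parent list.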

tatiana · 2023-12-14T11:47:45Z

cosmos/dbt/selector.py

@@ -295,7 +295,14 @@ def _should_include_node(self, node_id: str, node: DbtNode) -> bool:
        self.visited_nodes.add(node_id)

        if node.resource_type == DbtResourceType.TEST:
-            node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
+            dependency_ids = [node_id for node_id in node.depends_on if node_id.startswith("model.")]
+            parent_id = dependency_ids[-1]


Why do we have to pick only one parent? Wouldn't it be better if we took into account all the parents?

In your PR description, you mention:

Almost all tests will have only one 'depends_on', but when they have two, e.g. relationships, the true parent is the last one.

What is the definition of a true parent? If a test depends on two models, A and B, and they are run as independent Airflow tasks, why should the relationship with B be more relevant than the one with A?

Do we have any dbt docs on this?

We have to pick one parent because tests in dbt implicitly belong to only one thing: they can only be declared inside the config block of a model, source, seed, etc.

For example, take the relationships test between the "model.jaffle_shop.orders" model and the customers model. This is a test on the orders model, defined under the - name: orders section of the model config. The test shouldn't also run after the customers table is built: the customers table hasn't failed just because someone wrote bad code in the orders model.

For dbt to explicitly support tests that belong to more than one model, it would have to stop requiring tests to be declared within a single model's config block. Today that declaration implicitly ties a test to one model, so dbt test --select orders runs the test but dbt test --select customers does not. I'm not aware of any such support in dbt at the moment.

There was a bit of discussion about this here: dbt-labs/dbt-core#6746

@adammarples what about tests like the dbt_utils equality, and others, including:

equal_rowcount

cardinality_equality

fewer_rows_than

relationships_where

Example:

version: 2

models:
  - name: model_name
    tests:
      - dbt_utils.equality:
          compare_model: ref('other_table_name')
          compare_columns:
            - first_column
            - second_column

Even though they are declared part of the model_name block, they also depend on the other_table_name.

I didn't check how dbt handles these, but I'd expect the test to depend on both models if users use ref. Could you confirm the current behaviour and let me know your thoughts on these cases?

In this example

version: 2

models:
  - name: a
  - name: b
    tests:
      - dbt_utils.equality:
          compare_model: ref('a')
          compare_columns:
            - id

The test does depend on both tables a and b. In fact running both of these commands will trigger the test, which I didn't realise.

0 ~/projects/demo/demo> dbt test --models a
11:02:13 Running with dbt=1.7.4
11:02:13 Registered adapter: duckdb=1.7.0
11:02:13 Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:13
11:02:13 Concurrency: 1 threads (target='dev')
11:02:13
11:02:13 1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:13 1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:13
11:02:13 Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:13
11:02:13 Completed successfully
11:02:13
11:02:13 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

0 ~/projects/demo/demo> dbt test --models b
11:02:17 Running with dbt=1.7.4
11:02:17 Registered adapter: duckdb=1.7.0
11:02:17 Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:17
11:02:17 Concurrency: 1 threads (target='dev')
11:02:17
11:02:17 1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:17 1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:17
11:02:17 Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:17
11:02:17 Completed successfully
11:02:17
11:02:17 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
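The output above is consistent with dbt's default "eager" indirect selection, where a test is picked up as soon as any one of its parents is selected. A minimal sketch of that rule follows; the helper name `eagerly_selected_tests` and the `test_parents` mapping are made up for illustration, and this is not how dbt implements selection internally.

```python
def eagerly_selected_tests(selected_models, test_parents):
    """Return the tests that reference at least one of the selected models."""
    selected = set(selected_models)
    return sorted(
        test_id
        for test_id, parents in test_parents.items()
        if selected & set(parents)  # any overlap is enough in eager mode
    )


# Parents of the equality test from the demo project above.
test_parents = {
    "test.demo.dbt_utils_equality_b_id__ref_a_": ["model.demo.a", "model.demo.b"],
}

selected_via_a = eagerly_selected_tests(["model.demo.a"], test_parents)
selected_via_b = eagerly_selected_tests(["model.demo.b"], test_parents)
selected_via_other = eagerly_selected_tests(["model.demo.c"], test_parents)
```

Selecting either a or b yields the same single test, while an unrelated (hypothetical) model c yields none.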

At the moment, at least when using the manifest load function combined with model tags as the selection criterion, astronomer cosmos selects only one node to be the 'parent' of the test, which tells it which task to run the test after. The problem is that it's selecting the wrong one, imo: if we have to select one here it should be b, but we're currently selecting a.

depends_on = manifest['nodes']['test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08']['depends_on']['nodes']
print(depends_on)

['model.demo.a', 'model.demo.b']

Where we select [0] instead of [-1]
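As a rough sketch of the difference, the snippet below resolves the tags a test would inherit from the first versus the last entry of its depends_on list. The manifest is a hand-written fragment that keeps only the keys used in this thread; the tags `tag_a` and `tag_b` and the helper `parent_tags_at` are assumptions for illustration.

```python
# Hand-written manifest fragment, reduced to the keys discussed here.
manifest = {
    "nodes": {
        "model.demo.a": {"tags": ["tag_a"]},
        "model.demo.b": {"tags": ["tag_b"]},
        "test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08": {
            "tags": [],
            "depends_on": {"nodes": ["model.demo.a", "model.demo.b"]},
        },
    }
}


def parent_tags_at(manifest, test_id, index):
    """Return the tags of the parent found at `index` in the test's depends_on list."""
    parents = manifest["nodes"][test_id]["depends_on"]["nodes"]
    parent = manifest["nodes"].get(parents[index], {})
    return parent.get("tags", [])


test_id = "test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08"
tags_from_first = parent_tags_at(manifest, test_id, 0)   # what [0] picks
tags_from_last = parent_tags_at(manifest, test_id, -1)   # what [-1] picks
```

With `[0]` the test inherits the tags of model a, so selecting by `tag_b` would miss a test that is declared on model b.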

It would be interesting to include the tags of all the dependent models, not just one of them, so that the test could appear both after model a is run and after model b is run. But that seems like a disaster for the way astronomer cosmos works at the moment, where the graph is ordered by model dependencies, not test dependencies (I think). That might be well outside the scope of a change such as this, which really is just about picking the last item in the list over the first as the more important model to test.

Thank you for the thorough analysis, @adammarples, and detailed explanation. I appreciate it.

Given the observed behaviour of dbt, wouldn't it be more accurate to keep all the test node parents in depends_on, not only the last one ([-1])?

The downside would be that if the user doesn't apply filters, the test would be duplicated in multiple model task groups (if using TestBehavior.AFTER_EACH).

However, the benefit is we would not miss a test if the user selected 'model.demo.a', as you illustrated.
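One way to picture the "keep all parents" idea is to let a test inherit the union of its parents' tags. The sketch below assumes plain dictionaries for nodes and a hypothetical helper `tags_from_all_parents`; it is not the Cosmos implementation.

```python
def tags_from_all_parents(depends_on, nodes):
    """Collect the tags of every parent, keeping first-seen order and dropping duplicates."""
    tags = []
    for parent_id in depends_on:
        # Parents that are not present in `nodes` contribute no tags.
        for tag in nodes.get(parent_id, {}).get("tags", []):
            if tag not in tags:
                tags.append(tag)
    return tags


# Hypothetical tags for the two demo models.
nodes = {
    "model.demo.a": {"tags": ["tag_a", "shared"]},
    "model.demo.b": {"tags": ["tag_b", "shared"]},
}

inherited = tags_from_all_parents(["model.demo.a", "model.demo.b"], nodes)
```

Under this approach, a selection on either `tag_a` or `tag_b` would match the test, which is also why the test could show up under more than one model.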

I agree with @tatiana that it would be better to keep all of the parents assigned, and during execution when running the tests, the user can use the ExecutionConfig.test_indirect_selection if they want to configure the dbt test behavior when using indirect selection.

@adammarples in the example you provided above if ExecutionConfig.test_indirect_selection = TestIndirectSelection.CAUTIOUS then it would only run the tests on model b. If model b depended on model a then a user could use TestIndirectSelection.BUILDABLE which would also run the test only for model b.

dbt test --models a running both tests is interesting. I agree with what has already been said: we should try to stick to the behaviour of dbt and assign both parents.

@jbandoro it looks like my issue is actually solved by a piece of TestIndirectSelection config that I wasn't aware of. Thank you, I guess there is no need for this PR?

cosmos/dbt/selector.py

tests/dbt/test_graph.py

ms32035 · 2023-12-20T00:53:36Z

@adammarples as you're working on this, I'd like to bring to your attention a case you're missing, and which we're running into recently:

  File "/home/airflow/.local/lib/python3.11/site-packages/cosmos/dbt/selector.py", line 298, in _should_include_node
    node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
                                       ~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Although not a dbt best practice, it's entirely possible to have a test with 0 dependencies, or with no model dependencies (e.g. a test on sources).
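A defensive version of the failing lookup could look roughly like the sketch below. It mirrors the `getattr(..., "tags", [])` pattern from the traceback, but the standalone function `parent_tags` and the `SimpleNamespace` stand-ins for nodes are assumptions made for illustration.

```python
from types import SimpleNamespace


def parent_tags(depends_on, nodes):
    """Return the first parent's tags, or [] when there is no resolvable parent."""
    if not depends_on:
        # A test with no dependencies would otherwise raise IndexError on depends_on[0].
        return []
    parent = nodes.get(depends_on[0])
    # A parent that is not present in `nodes` resolves to None and yields no tags.
    return list(getattr(parent, "tags", []))


# Hypothetical node registry with a single tagged model.
nodes = {"model.demo.a": SimpleNamespace(tags=["tag_a"])}

no_dependency = parent_tags([], nodes)
known_parent = parent_tags(["model.demo.a"], nodes)
unknown_parent = parent_tags(["source.demo.raw.customers"], nodes)
```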

tatiana · 2024-01-05T12:18:32Z

Hi @adammarples , this PR - and the discussion - was beneficial. Thank you. And thanks, @jbandoro, for giving an alternative solution to the problem.

Cosmos can improve its handling of tests with multiple parents, and we should make it consistent with dbt. @adammarples, you made a great start. Would you be interested in contributing further to this, or would you rather close this PR and let someone else take over the work?

@ms32035 I didn't realise that was possible! Although it relates to the original problem, I logged a separate ticket: #782. Would you be interested in contributing?

tatiana · 2024-05-17T10:31:35Z

Closing this PR in favour of the ticket: #978

adammarples added 6 commits December 6, 2023 15:24

add order tag to order table in manifest

03385a1

fix selector logic for test nodes

dbe00df

add tes for new logic

7037eba

Revert "fix selector logic for test nodes"

e9a0e0a

This reverts commit dbe00df.

add answer

fdf6ade

Revert "Revert "fix selector logic for test nodes""

53f78e0

This reverts commit e9a0e0a.

adammarples requested a review from a team as a code owner December 6, 2023 15:30

adammarples requested a review from a team December 6, 2023 15:30

dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Dec 6, 2023

adammarples had a problem deploying to external December 6, 2023 15:30 — with GitHub Actions Error

adammarples mentioned this pull request Dec 6, 2023

load_from_dbt_manifest is de-selecting valid test nodes #719

Open

🎨 [pre-commit.ci] Auto format from pre-commit.com hooks

e7d0a02

pre-commit-ci bot had a problem deploying to external December 6, 2023 15:31 Error

adammarples added 2 commits December 6, 2023 15:40

allow sources

e4caa7c

merge main

d5ed1c7

adammarples had a problem deploying to external December 6, 2023 15:43 — with GitHub Actions Error

adammarples added 3 commits December 7, 2023 09:46

switch tag order

92ff078

correct test assertion

c6bff40

Merge remote-tracking branch 'upstream/main' into test-select-model

f83c4bf

adammarples temporarily deployed to external December 7, 2023 09:49 — with GitHub Actions Inactive

tatiana reviewed Dec 14, 2023

View reviewed changes

tatiana added the status:awaiting-author Issue/PR is under discussion and waiting for author's input label Dec 14, 2023

allow seeds/sources

213620f

adammarples had a problem deploying to external December 14, 2023 12:18 — with GitHub Actions Error

Merge remote-tracking branch 'upstream/main' into test-select-model

6c4bfe4

adammarples temporarily deployed to external December 14, 2023 12:18 — with GitHub Actions Inactive

tatiana mentioned this pull request Jan 5, 2024

Error when there is a dbt test without any node dependency #782

Closed

tatiana mentioned this pull request May 17, 2024

Support associating tests to multiple parents, if they have multiple parents #978

Closed

tatiana closed this May 17, 2024


tatiana left a comment

tatiana Dec 14, 2023

adammarples Dec 14, 2023

tatiana Dec 14, 2023

adammarples Dec 14, 2023

tatiana Dec 15, 2023

adammarples Dec 15, 2023

tatiana Dec 15, 2023

jbandoro Dec 15, 2023

joppevos Dec 18, 2023

adammarples Jan 2, 2024
