Select models which have tests with additional dependencies #751
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #751 +/- ##
==========================================
+ Coverage 93.27% 93.28% +0.01%
==========================================
Files 55 55
Lines 2499 2503 +4
==========================================
+ Hits 2331 2335 +4
Misses 168 168
Thanks for the contribution, @adammarples! I'm glad you're improving this part of the code.
I'll loop in @edgarasnavickas (@navedgaras), who previously worked on this, so we can get his thoughts as well.
I left some questions/comments inline
cosmos/dbt/selector.py
@@ -295,7 +295,14 @@ def _should_include_node(self, node_id: str, node: DbtNode) -> bool:
         self.visited_nodes.add(node_id)

         if node.resource_type == DbtResourceType.TEST:
-            node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
+            dependency_ids = [node_id for node_id in node.depends_on if node_id.startswith("model.")]
Why do we only care about model dependencies? Why don't we care about potential parent seeds/sources tags?
This is a good point and should be taken out
cosmos/dbt/selector.py
@@ -295,7 +295,14 @@ def _should_include_node(self, node_id: str, node: DbtNode) -> bool:
         self.visited_nodes.add(node_id)

         if node.resource_type == DbtResourceType.TEST:
-            node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
+            dependency_ids = [node_id for node_id in node.depends_on if node_id.startswith("model.")]
+            parent_id = dependency_ids[-1]
Why do we have to pick only one parent? Wouldn't it be better if we took into account all the parents?
In your PR description, you mention:
Almost all tests will have only one 'depends_on', but when they have two, ie. relationship, the true parent is the last one.
What is the definition of a true parent? If a test depends on two models, A and B, and they are run as independent Airflow tasks, why should the relationship with B be more relevant than the one with A?
Do we have any dbt docs on this?
We have to pick one parent because tests in dbt implicitly belong to only one thing: they can only be declared inside a model/source/seed (etc.) config block.
For example, the "model.jaffle_shop.orders" model has a relationship test with the customers model. This is a test on the orders model, defined under the - name: orders section of the model config. The test shouldn't also run after the customers table is built; the customers table hasn't failed just because someone wrote bad code in the orders model.
If dbt had explicit support for tests that belong to more than one model, it would have to stop requiring that tests be declared inside a single model's config block. As it stands, that requirement means a test implicitly belongs to only one model, so dbt test --select orders runs the test but dbt test --select customers does not. I'm not aware of any such support in dbt at the moment.
There was a bit of discussion about this here: dbt-labs/dbt-core#6746
@adammarples what about tests like the dbt_utils equality, and others, including:
- equal_rowcount
- cardinality_equality
- fewer_rows_than
- relationships_where
Example:
version: 2
models:
  - name: model_name
    tests:
      - dbt_utils.equality:
          compare_model: ref('other_table_name')
          compare_columns:
            - first_column
            - second_column
Even though they are declared as part of the model_name block, they also depend on other_table_name.
I didn't check how dbt handles these, but I'd expect the test to depend on both models if users use ref. Could you confirm the current behaviour and let me know your thoughts on these cases?
In this example
version: 2
models:
  - name: a
  - name: b
    tests:
      - dbt_utils.equality:
          compare_model: ref('a')
          compare_columns:
            - id
The test does depend on both tables a and b. In fact, running either of these commands triggers the test, which I hadn't realised:
0 ~/projects/demo/demo> dbt test --models a
11:02:13 Running with dbt=1.7.4
11:02:13 Registered adapter: duckdb=1.7.0
11:02:13 Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:13
11:02:13 Concurrency: 1 threads (target='dev')
11:02:13
11:02:13 1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:13 1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:13
11:02:13 Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:13
11:02:13 Completed successfully
11:02:13
11:02:13 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
0 ~/projects/demo/demo> dbt test --models b
11:02:17 Running with dbt=1.7.4
11:02:17 Registered adapter: duckdb=1.7.0
11:02:17 Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:17
11:02:17 Concurrency: 1 threads (target='dev')
11:02:17
11:02:17 1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:17 1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:17
11:02:17 Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:17
11:02:17 Completed successfully
11:02:17
11:02:17 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
At the moment, at least when using the manifest load method combined with model tagging as a selection criterion, Astronomer Cosmos selects only one node to be the 'parent' of the test, which tells it which task to run the test after. The problem is that it's selecting the wrong one, in my opinion: if we have to select one here it should be b, but we're currently selecting a.
depends_on = manifest['nodes']['test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08']['depends_on']['nodes']
print(depends_on)
['model.demo.a', 'model.demo.b']
Here we currently select [0] rather than [-1].
What would be interesting would be to include the tags of all the dependent models, not just one of them, so that the test could appear both after model a is run and after model b is run. But that seems like a disaster for the way Astronomer Cosmos works at the moment, where the graph is ordered by model dependencies, not test dependencies (I think). That might be well outside the scope of a change such as this, which really is just about picking the last item in the list over the first as the more important model to test.
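A minimal sketch of the difference between the two index choices. The manifest fragment below is a hand-written stand-in that contains only the keys used in the snippet above; the tag values and the helper name parent_tags are hypothetical, and this is not Cosmos's implementation.

```python
# Hand-written stand-in for the relevant part of a dbt manifest.
# The tag values are made up for illustration.
manifest = {
    "nodes": {
        "model.demo.a": {"tags": ["upstream"], "depends_on": {"nodes": []}},
        "model.demo.b": {"tags": ["nightly"], "depends_on": {"nodes": []}},
        "test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08": {
            "tags": [],
            "depends_on": {"nodes": ["model.demo.a", "model.demo.b"]},
        },
    }
}


def parent_tags(test_id: str, index: int) -> list[str]:
    """Return the tags of the parent found at `index` in the test's depends_on list."""
    parents = manifest["nodes"][test_id]["depends_on"]["nodes"]
    parent = manifest["nodes"].get(parents[index], {})
    return parent.get("tags", [])


test_id = "test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08"
first = parent_tags(test_id, 0)   # tags inherited when picking depends_on[0]
last = parent_tags(test_id, -1)   # tags inherited when picking depends_on[-1]
print(first, last)
```

With these made-up tags, selecting by the tag on b only finds the test when the last parent is used, and selecting by the tag on a only finds it when the first parent is used.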
Thank you for the thorough analysis and detailed explanation, @adammarples. I appreciate it.
Given the observed behaviour of dbt, wouldn't it be more accurate to keep all the test node parents in depends_on, not only the one at index -1?
The downside would be that if the user doesn't apply filters, the test would be duplicated in multiple model task groups (if using TestBehavior.AFTER_EACH).
However, the benefit is that we would not miss a test if the user selected 'model.demo.a', as you illustrated.
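One way the "keep all parents" idea could look for tag inheritance is sketched below. The Node dataclass and the inherited_tags helper are hypothetical stand-ins, not Cosmos's DbtNode or selector code, and the tag values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, stripped-down stand-in for a parsed dbt node."""

    unique_id: str
    tags: list[str] = field(default_factory=list)
    depends_on: list[str] = field(default_factory=list)


def inherited_tags(test: Node, nodes: dict[str, Node]) -> list[str]:
    """Collect the tags of every known parent, keeping order and dropping duplicates."""
    collected: list[str] = []
    for parent_id in test.depends_on:
        parent = nodes.get(parent_id)
        if parent is None:  # parent is not present in the graph
            continue
        for tag in parent.tags:
            if tag not in collected:
                collected.append(tag)
    return collected


nodes = {
    "model.demo.a": Node("model.demo.a", tags=["upstream"]),
    "model.demo.b": Node("model.demo.b", tags=["nightly", "upstream"]),
}
test = Node("test.demo.equality", depends_on=["model.demo.a", "model.demo.b"])
tags = inherited_tags(test, nodes)
print(tags)
```

Because the test carries the tags of both parents, a tag selection on either model would match it, which is also why it could show up in more than one task group.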
I agree with @tatiana that it would be better to keep all of the parents assigned. During execution, when running the tests, users can set ExecutionConfig.test_indirect_selection if they want to configure the dbt test behaviour when using indirect selection.
@adammarples, in the example you provided above, if ExecutionConfig.test_indirect_selection = TestIndirectSelection.CAUTIOUS then it would only run the tests on model b. If model b depended on model a, then a user could use TestIndirectSelection.BUILDABLE, which would also run the test only for model b.
dbt test --models a running both tests is interesting. I agree with what has already been said: stick to the behaviour of dbt and assign both parents.
@jbandoro it looks like this issue is actually solved for me by a piece of TestIndirectSelection config that I wasn't aware of. Thank you! I guess there is no need for this PR, then?
@adammarples as you're working on this, I'd like to bring to your attention a case you're missing, which we've run into recently:
Although not a dbt best practice, it's entirely possible to have a test with 0 dependencies, or with no model dependencies (a test on sources).
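A small sketch of why these cases matter for any index-based lookup. Both helpers and the source id below are hypothetical illustrations, not code from Cosmos: indexing into an empty depends_on list raises IndexError, while iterating over the list handles empty and source-only parents alike.

```python
def naive_parent_tags(depends_on: list[str], tags_by_node: dict[str, list[str]]) -> list[str]:
    """Index-based lookup: raises IndexError when the test has no dependencies."""
    return tags_by_node.get(depends_on[0], [])


def safe_parent_tags(depends_on: list[str], tags_by_node: dict[str, list[str]]) -> list[str]:
    """Iterate over every parent, whatever its resource type; no parents yields []."""
    tags: list[str] = []
    for parent_id in depends_on:
        for tag in tags_by_node.get(parent_id, []):
            if tag not in tags:
                tags.append(tag)
    return tags


# Hypothetical ids and tags; the source id is made up for illustration.
tags_by_node = {"source.demo.raw.orders": ["raw"]}

no_deps = safe_parent_tags([], tags_by_node)
source_only = safe_parent_tags(["source.demo.raw.orders"], tags_by_node)

try:
    naive_parent_tags([], tags_by_node)
    naive_failed = False
except IndexError:
    naive_failed = True

print(no_deps, source_only, naive_failed)
```

Note that filtering depends_on down to ids starting with "model." before indexing would hit the same IndexError for a test that only depends on sources.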
Hi @adammarples, this PR and the discussion were beneficial. Thank you. And thanks, @jbandoro, for giving an alternative solution to the problem.
Cosmos can improve its implementation for tests with multiple parents; we should make it consistent with dbt. @adammarples, you made a great start. Would you be interested in contributing further, or would you rather close this PR and let someone else take over the work?
@ms32035 I didn't realise that was possible! Although it relates to the original problem, I logged a separate ticket: #782. Would you be interested in contributing?
Closing this PR in favour of the ticket: #978
Description
Tests such as the relationship test will fail to be selected if we are selecting models with tags and using the manifest loading method.
Related Issue(s)
closes #719
Breaking Change?
Almost all tests will have only one 'depends_on', but when they have two, i.e. relationship, the true parent is the last one.
load_from_dbt_manifest now considers the primary selectable parent model of a test to be the last item in its 'depends_on' list, rather than the first.
This would break any tests being loaded by tag selection using load_from_dbt_manifest which somehow depended on a model being the first in that list rather than the last.
Checklist