-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Select models which have tests with additional dependencies #751
Changes from 12 commits
03385a1
dbe00df
7037eba
e9a0e0a
fdf6ade
53f78e0
e7d0a02
e4caa7c
d5ed1c7
92ff078
c6bff40
f83c4bf
213620f
6c4bfe4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -295,7 +295,14 @@ def _should_include_node(self, node_id: str, node: DbtNode) -> bool: | |
self.visited_nodes.add(node_id) | ||
|
||
if node.resource_type == DbtResourceType.TEST: | ||
node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", []) | ||
dependency_ids = [node_id for node_id in node.depends_on if node_id.startswith("model.")] | ||
parent_id = dependency_ids[-1] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we have to pick only one parent? Wouldn't it be better if we took into account all the parents? In your PR description, you mention:
What is the definition of a true parent? If a test depends on two models, A and B, and they are run as independent Airflow tasks, why should the relationship with B be more relevant than the one with A? Do we have any dbt docs on this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have to pick one parent because tests in dbt implicitly belong to only one thing, they can only be declared inside a model/source/seed etc config block. For example in the "model.jaffle_shop.orders" model, the relationship test with the customers model. This is a test on the orders model, defined under the If there was explicit support in dbt for tests which belong to more than one model they would have to change it such that tests don't have to be declared only within a model config block such that they implicitly belong only to one model and the code There was a bit of discussion about this here dbt-labs/dbt-core#6746 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @adammarples what about tests like the dbt_utils equality, and others, including:
Example:
Even though they are declared part of the I didn't check how dbt handles these, but I'd expect the test to depend on both models if users use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In this example version: 2
models:
- name: a
- name: b
tests:
- dbt_utils.equality:
compare_model: ref('a')
compare_columns:
- id The test does depend on both tables a and b. In fact running both of these commands will trigger the test, which I didn't realise. 0 ~/projects/demo/demo> dbt test --models a
11:02:13 Running with dbt=1.7.4
11:02:13 Registered adapter: duckdb=1.7.0
11:02:13 Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:13
11:02:13 Concurrency: 1 threads (target='dev')
11:02:13
11:02:13 1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:13 1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:13
11:02:13 Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:13
11:02:13 Completed successfully
11:02:13
11:02:13 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
0 ~/projects/demo/demo> dbt test --models b
11:02:17 Running with dbt=1.7.4
11:02:17 Registered adapter: duckdb=1.7.0
11:02:17 Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:17
11:02:17 Concurrency: 1 threads (target='dev')
11:02:17
11:02:17 1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:17 1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:17
11:02:17 Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:17
11:02:17 Completed successfully
11:02:17
11:02:17 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1 At the moment, at least when using the manifest load function combined with model tagging as a selection criteria, astronomer cosmos selects only one node to be the 'parent' of the test, which tells it which task to run after, the problem is that it's selecting the wrong one, imo, if we have to select one here it should be depends_on = manifest['nodes']['test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08']['depends_on']['nodes']
print(depends_on)
Where we select [0] instead of [-1] What would be interesting would be then to include all of the dependent model tags, not just one of them, so that the test could appear after model a is run and after model b is run, but that seems like a disaster for the way astronomer cosmos works at the moment, where the graph is ordered by model dependencies, not tests dependencies (I think). That might be well outside of the scope of a change such as this, which really is just about picking the last from the list over the first in the list as being the more important model to test. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the thorough analysis, @adammarples, and detailed explanation. I appreciate it. Given the observed behaviour of dbt, wouldn't it be more accurate to keep all the test node parents in The downside would be that if the user doesn't apply filters, the test would be duplicated in multiple model task groups (if using However, the benefit is we would not miss a test if the user selected There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with @tatiana that it would be better to keep all of the parents assigned, and during execution when running the tests, the user can use the @adammarples in the example you provided above if There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbandoro it looks like this issue for me is actually solved with a piece of TestIndirectSelection config that I wasn't aware of then? Thank you, I guess there is no need for this PR? |
||
if len(dependency_ids) > 1: | ||
logger.warning( | ||
f"Test node {node.name} has more than one model dependency {dependency_ids}, selected tags from parent assumed to be {parent_id}." | ||
tatiana marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
parent_tags = getattr(self.nodes.get(parent_id), "tags", []) | ||
node.tags = list(set(node.tags + parent_tags)) | ||
|
||
if not self._is_tags_subset(node): | ||
return False | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we only care about model dependencies? Why don't we care about potential parent seeds/sources tags?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point and should be taken out