Select models which have tests with additional dependencies #751

Closed
wants to merge 14 commits into from
9 changes: 8 additions & 1 deletion cosmos/dbt/selector.py
@@ -295,7 +295,14 @@ def _should_include_node(self, node_id: str, node: DbtNode) -> bool:
         self.visited_nodes.add(node_id)

         if node.resource_type == DbtResourceType.TEST:
-            node.tags = getattr(self.nodes.get(node.depends_on[0]), "tags", [])
+            dependency_ids = [node_id for node_id in node.depends_on if node_id.startswith("model.")]
Collaborator

Why do we only care about model dependencies? Why don't we also consider the tags of parent seeds and sources?

Contributor Author

This is a good point; the "model." filter should be taken out.
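
To illustrate the concern, here is a minimal sketch of how the "model." prefix filter behaves on hypothetical depends_on lists. The node IDs and the model_dependencies helper are made up for illustration and are not part of Cosmos. The point is only that a test whose parents are all seeds or sources ends up with an empty list, so indexing that list with [-1] would raise IndexError.

```python
# Hypothetical depends_on lists; the IDs are illustrative, not taken from a real manifest.
mixed_parents = ["seed.jaffle_shop.raw_orders", "model.jaffle_shop.orders"]
source_only_parents = ["source.jaffle_shop.raw.customers"]


def model_dependencies(depends_on):
    """Keep only the parents whose unique ID starts with 'model.' (the filter under review)."""
    return [parent_id for parent_id in depends_on if parent_id.startswith("model.")]


print(model_dependencies(mixed_parents))        # the seed parent is dropped
print(model_dependencies(source_only_parents))  # nothing is left

try:
    parent_id = model_dependencies(source_only_parents)[-1]
except IndexError:
    print("no model parent: [-1] on an empty list raises IndexError")
```

Dropping the prefix filter avoids both problems: seed and source tags are considered, and the list is never emptied by the filter itself.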

+            parent_id = dependency_ids[-1]
Collaborator

Why do we have to pick only one parent? Wouldn't it be better if we took into account all the parents?

In your PR description, you mention:

Almost all tests will have only one 'depends_on', but when they have two, ie. relationship, the true parent is the last one.

What is the definition of a true parent? If a test depends on two models, A and B, and they are run as independent Airflow tasks, why should the relationship with B be more relevant than the one with A?

Do we have any dbt docs on this?

Contributor Author

@adammarples Dec 14, 2023

We have to pick one parent because tests in dbt implicitly belong to only one thing: they can only be declared inside the config block of a model, source, seed, etc.

For example, take the relationships test between the "model.jaffle_shop.orders" model and the customers model. This is a test on the orders model, defined under the - name: orders section of the model config. The test shouldn't also run after the customers table is built: the customers table hasn't failed just because someone wrote bad code in the orders model.

Explicit support in dbt for tests that belong to more than one model would require a change. Today tests can only be declared within a model config block, so they implicitly belong to only one model: dbt test --select orders runs the test, but dbt test --select customers does not. I'm not aware of any such support in dbt at the moment.

There was a bit of discussion about this in dbt-labs/dbt-core#6746.

Collaborator

@tatiana Dec 15, 2023

@adammarples what about tests like the dbt_utils equality, and others, including:

  • equal_rowcount
  • cardinality_equality
  • fewer_rows_than
  • relationships_where

Example:

version: 2

models:
  - name: model_name
    tests:
      - dbt_utils.equality:
          compare_model: ref('other_table_name')
          compare_columns:
            - first_column
            - second_column

Even though they are declared as part of the model_name block, they also depend on other_table_name.

I didn't check how dbt handles these, but I'd expect the test to depend on both models if users use ref. Could you confirm the current behaviour and let me know your thoughts on these cases?

Contributor Author

In this example

version: 2

models:
  - name: a

  - name: b
    tests:
      - dbt_utils.equality:
          compare_model: ref('a')
          compare_columns:
            - id

The test does depend on both tables a and b. In fact, running either of these commands triggers the test, which I didn't realise.

 0  ~/projects/demo/demo> dbt test --models a
11:02:13  Running with dbt=1.7.4
11:02:13  Registered adapter: duckdb=1.7.0
11:02:13  Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:13  
11:02:13  Concurrency: 1 threads (target='dev')
11:02:13  
11:02:13  1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:13  1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:13  
11:02:13  Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:13  
11:02:13  Completed successfully
11:02:13  
11:02:13  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

 0  ~/projects/demo/demo> dbt test --models b
11:02:17  Running with dbt=1.7.4
11:02:17  Registered adapter: duckdb=1.7.0
11:02:17  Found 2 models, 1 test, 0 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
11:02:17  
11:02:17  Concurrency: 1 threads (target='dev')
11:02:17  
11:02:17  1 of 1 START test dbt_utils_equality_b_id__ref_a_ .............................. [RUN]
11:02:17  1 of 1 PASS dbt_utils_equality_b_id__ref_a_ .................................... [PASS in 0.03s]
11:02:17  
11:02:17  Finished running 1 test in 0 hours 0 minutes and 0.06 seconds (0.06s).
11:02:17  
11:02:17  Completed successfully
11:02:17  
11:02:17  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

At the moment, at least when using the manifest load function combined with model tagging as a selection criterion, astronomer cosmos selects only one node to be the 'parent' of the test, which tells it which task to run the test after. The problem is that it's selecting the wrong one, imo: if we have to select one here it should be b, but we're currently selecting a.

depends_on = manifest['nodes']['test.demo.dbt_utils_equality_b_id__ref_a_.54b4729a08']['depends_on']['nodes']
print(depends_on)
['model.demo.a', 'model.demo.b']

Here we currently select [0] (model.demo.a) instead of [-1] (model.demo.b).
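
A small sketch of the difference, using the depends_on list printed above. The tag values, the tags_by_node dictionary and the inherits_tag helper are assumptions made for illustration; they are not from the demo project or from Cosmos.

```python
# depends_on as printed from the manifest above.
depends_on = ["model.demo.a", "model.demo.b"]

# Hypothetical tags per model, for illustration only.
tags_by_node = {
    "model.demo.a": ["tag_a"],
    "model.demo.b": ["tag_b"],
}


def inherits_tag(tag, parent_index):
    """Return True if the test would inherit `tag` from the parent at `parent_index`."""
    parent_id = depends_on[parent_index]
    return tag in tags_by_node[parent_id]


# Selecting by model b's tag: the test is lost with [0] and kept with [-1].
print(inherits_tag("tag_b", 0))
print(inherits_tag("tag_b", -1))

# The reverse holds for model a's tag, so either single choice drops the test for one selection.
print(inherits_tag("tag_a", 0))
print(inherits_tag("tag_a", -1))
```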

It would be interesting to include the tags of all the dependent models, not just one of them, so that the test could appear both after model a is run and after model b is run. But that seems like a disaster for the way astronomer cosmos works at the moment, where the graph is ordered by model dependencies, not test dependencies (I think). That might be well outside the scope of a change such as this, which really is just about picking the last item in the list over the first as the more important model to test.

Collaborator

Thank you for the thorough analysis and detailed explanation, @adammarples. I appreciate it.

Given the observed behaviour of dbt, wouldn't it be more accurate to keep all of the test node's parents from depends_on, not only the one at index -1?

The downside would be that if the user doesn't apply filters, the test would be duplicated in multiple model task groups (if using TestBehavior.AFTER_EACH).

However, the benefit is we would not miss a test if the user selected 'model.demo.a', as you illustrated.
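
One way to picture this proposal is the sketch below, which gives a test the union of the tags of every parent in depends_on. The Node dataclass, the inherit_parent_tags helper, the tag values and the test ID are hypothetical stand-ins, not the actual Cosmos DbtNode or selector code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a dbt node."""
    unique_id: str
    depends_on: list = field(default_factory=list)
    tags: list = field(default_factory=list)


def inherit_parent_tags(test, nodes):
    """Return the test's own tags plus the tags of all known parents, sorted for stable output."""
    tags = set(test.tags)
    for parent_id in test.depends_on:
        parent = nodes.get(parent_id)
        if parent is not None:  # a parent may be missing from the graph
            tags.update(parent.tags)
    return sorted(tags)


nodes = {
    "model.demo.a": Node("model.demo.a", tags=["tag_a"]),
    "model.demo.b": Node("model.demo.b", tags=["tag_b"]),
}
equality_test = Node(
    "test.demo.hypothetical_equality_test",
    depends_on=["model.demo.a", "model.demo.b"],
)

print(inherit_parent_tags(equality_test, nodes))  # ['tag_a', 'tag_b']
```

With the union, selecting by either parent's tag keeps the test, at the cost of the duplication described above when no filter is applied.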

Collaborator

@jbandoro Dec 15, 2023

I agree with @tatiana that it would be better to keep all of the parents assigned. Then, when the tests run, the user can set ExecutionConfig.test_indirect_selection if they want to configure the dbt test behavior when using indirect selection.

@adammarples, in the example you provided above, if ExecutionConfig.test_indirect_selection = TestIndirectSelection.CAUTIOUS, then it would only run the tests on model b. If model b depended on model a, then a user could use TestIndirectSelection.BUILDABLE, which would also run the test only for model b.
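
As a rough mental model of the dbt indirect selection modes mentioned here, the sketch below decides whether a test is picked up for a given set of selected models. It is plain Python written for illustration, not dbt or Cosmos code: the is_test_selected function and its arguments are hypothetical, and the mode rules follow a reading of the dbt documentation (eager: any parent selected; cautious: all parents selected; buildable: all parents either selected or ancestors of a selected node).

```python
def is_test_selected(test_parents, selected, mode, ancestors=None):
    """Return True if a test with the given parents would be picked up.

    test_parents: set of node names the test depends on
    selected: set of node names the user selected
    ancestors: optional dict mapping a node name to the set of its upstream nodes
    """
    ancestors = ancestors or {}
    if mode == "eager":
        return bool(test_parents & selected)
    if mode == "cautious":
        return test_parents <= selected
    if mode == "buildable":
        allowed = set(selected)
        for node in selected:
            allowed |= ancestors.get(node, set())
        return test_parents <= allowed
    raise ValueError(f"unknown mode: {mode}")


equality_test_parents = {"a", "b"}

# Selecting only model a: eager picks up the test, cautious does not.
print(is_test_selected(equality_test_parents, {"a"}, "eager"))     # True
print(is_test_selected(equality_test_parents, {"a"}, "cautious"))  # False

# Assuming b is built from a, buildable picks up the test when b is selected but not when a is.
upstream = {"b": {"a"}}
print(is_test_selected(equality_test_parents, {"b"}, "buildable", upstream))  # True
print(is_test_selected(equality_test_parents, {"a"}, "buildable", upstream))  # False
```

The eager case matches the dbt test --models a output shown earlier in the thread, where the test ran even though only model a was selected.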

Contributor

The fact that dbt test --models a also runs the test is interesting. I agree with what has already been said: we should stick to dbt's behaviour and assign both parents.

Contributor Author

@jbandoro it looks like my issue is actually solved by a piece of TestIndirectSelection config that I wasn't aware of. Thank you. I guess there is no need for this PR, then?

+            if len(dependency_ids) > 1:
+                logger.warning(
+                    f"Test node {node.name} has more than one model dependency {dependency_ids}, selected tags from parent assumed to be {parent_id}."
tatiana marked this conversation as resolved.
+                )
+            parent_tags = getattr(self.nodes.get(parent_id), "tags", [])
+            node.tags = list(set(node.tags + parent_tags))

         if not self._is_tags_subset(node):
             return False
26 changes: 25 additions & 1 deletion tests/dbt/test_graph.py
@@ -665,11 +665,35 @@ def test_tag_selected_node_test_exist():
     assert len(dbt_graph.filtered_nodes) > 0

     for _, node in dbt_graph.filtered_nodes.items():
-        assert node.tags == ["test_tag"]
+        assert "test_tag" in node.tags
tatiana marked this conversation as resolved.
         if node.resource_type == DbtResourceType.MODEL:
             assert node.has_test is True


+def test_selects_relationship_test_from_depends_on():
+    project_config = ProjectConfig(
+        dbt_project_path=DBT_PROJECTS_ROOT_DIR / DBT_PROJECT_NAME, manifest_path=SAMPLE_MANIFEST
+    )
+    profile_config = ProfileConfig(
+        profile_name="test",
+        target_name="test",
+        profiles_yml_filepath=DBT_PROJECTS_ROOT_DIR / DBT_PROJECT_NAME / "profiles.yml",
+    )
+    render_config = RenderConfig(select=["tag:orders"])
+    execution_config = ExecutionConfig(dbt_project_path=project_config.dbt_project_path)
+    dbt_graph = DbtGraph(
+        project=project_config,
+        execution_config=execution_config,
+        profile_config=profile_config,
+        render_config=render_config,
+    )
+    dbt_graph.load_from_dbt_manifest()
+    assert (
+        "test.jaffle_shop.relationships_orders_customer_id__customer_id__ref_customers_.c6ec7f58f2"
+        in dbt_graph.filtered_nodes
+    ), "test was deselected"


 @pytest.mark.integration
 @pytest.mark.parametrize("load_method", ["load_via_dbt_ls", "load_from_dbt_manifest"])
 def test_load_dbt_ls_and_manifest_with_model_version(load_method):
3 changes: 2 additions & 1 deletion tests/sample/manifest.json
@@ -7757,7 +7757,8 @@
     "schema": "public",
     "sources": [],
     "tags": [
-        "test_tag"
+        "test_tag",
+        "orders"
     ],
     "unique_id": "model.jaffle_shop.orders",
     "unrendered_config": {