
Rationalize test selection #2203

Closed
drewbanin opened this issue Mar 16, 2020 · 9 comments
Labels
bug Something isn't working enhancement New feature or request


@drewbanin
Contributor

drewbanin commented Mar 16, 2020

Describe the challenge

Test selection is "novel and exciting" whereas it should instead be boring and straightforward. Let's fix that.

@drewbanin to complete this issue w/ info about what to change and how to change.

See also #2172, #2167

@drewbanin drewbanin added bug Something isn't working enhancement New feature or request labels Mar 16, 2020
@drewbanin drewbanin self-assigned this Mar 16, 2020
@drewbanin drewbanin added this to the Octavius Catto milestone Mar 19, 2020
@drewbanin drewbanin mentioned this issue Mar 31, 2020
@drewbanin
Contributor Author

How does test selection work?

Somewhat surprisingly, test selection is basically identical to model/snapshot/seed/etc selection! When selecting tests, dbt will:

  1. select all nodes based on the supplied selectors
  2. include all tests that reference those selected nodes
  3. filter for only nodes that match the specified resource types and tags (i.e. data or schema tags)
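The three steps above can be sketched roughly like this. This is a hedged illustration with a toy manifest shape and an invented function name, not dbt's actual internals:

```python
# Toy sketch of the three selection steps; the manifest shape and
# function name are invented for illustration, not dbt's real code.
def select_for_task(manifest, selected_names, resource_types):
    selected = set(selected_names)                        # 1. selector match
    selected |= {name for name, node in manifest.items()  # 2. add referencing tests
                 if node["resource_type"] == "test"
                 and set(node["refs"]) & selected}
    return {name for name in selected                     # 3. resource-type filter
            if manifest[name]["resource_type"] in resource_types}

manifest = {
    "my_model":    {"resource_type": "model", "refs": []},
    "unique_test": {"resource_type": "test",  "refs": ["my_model"]},
}
# dbt test keeps the tests; dbt run keeps the models
assert select_for_task(manifest, {"my_model"}, {"test"}) == {"unique_test"}
assert select_for_task(manifest, {"my_model"}, {"model"}) == {"my_model"}
```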

Step two here is really surprising in the general case! dbt will do extra work to find all tests that depend directly on selected nodes only to either:

  1. include the tests and throw away the nodes for dbt test invocations
  2. include the nodes and throw away the tests for dbt run|seed|snapshot invocations

Forgetting the implementation details for a moment, the big takeaway here is that test selection occurs fundamentally at a node-level, and then dbt does one extra hop to find the tests that depend on those selected nodes directly.

This is reflected in the name of the --models flag for dbt test. Fundamentally, we are selecting models to test, not tests themselves!

How could test selection work?

Part of this is a naming/UX problem. The --models flag will select from all nodes, not just models! I propose:

  • The current behavior of --models is remapped to a new --select flag
  • A --models flag is provided which only runs tests on selected models
  • --seeds and --snapshots flags are also added to the dbt test command

Beyond that, there are some substantive and wonky things about test selection that we should address:

Tags

Selecting schema vs. data tests by tags is a bad idea. We should add a first-class property to test nodes indicating if they are schema tests or data tests, and perform selection based on that property. Test nodes should no longer be supplied with these auto-tags. We could also make these proper selectors instead of CLI args. Coupled with #2167, I think we could replicate the existing functionality in dbt test.
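The first-class property idea above might look something like this. The `test_type` attribute name and node shapes are hypothetical:

```python
# Hypothetical: tests carry a first-class test_type attribute instead of
# auto-applied "schema"/"data" tags, and selection filters on it directly.
def filter_by_test_type(test_nodes, test_type):
    return [n["name"] for n in test_nodes if n["test_type"] == test_type]

test_nodes = [
    {"name": "unique_my_model",     "test_type": "schema"},
    {"name": "assert_totals_match", "test_type": "data"},
]
assert filter_by_test_type(test_nodes, "data") == ["assert_totals_match"]
assert filter_by_test_type(test_nodes, "schema") == ["unique_my_model"]
```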

Selectors

dbt should support selecting tests by:

  1. The models that they reference
  2. Their "path" -- that is, the path of the source file containing the tests, not the path of the models they reference
  3. The test type (for schema tests). It's probably less likely that someone would want to select all unique tests, but this could make a lot of sense for custom schema tests
  4. The configured test severity

Ideally, these would all be supported via the same system. Take this as a vignette and not a spec, but I'm picturing selectors like:

dbt test --select type:unique
dbt test --select severity:warning
dbt test --select references:my_model

If no selector is provided, dbt would default to the references: selector.
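A toy parser for that `method:value` vignette, defaulting to `references:` when no method is given. The method names come from the sketch above; this is not a real dbt API:

```python
def parse_selector(raw, default_method="references"):
    """Split 'method:value' into a (method, value) pair; a bare value
    falls back to the references: method, per the proposal above."""
    method, sep, value = raw.partition(":")
    if not sep:
        return default_method, raw
    return method, value

assert parse_selector("type:unique") == ("type", "unique")
assert parse_selector("severity:warning") == ("severity", "warning")
assert parse_selector("my_model") == ("references", "my_model")
```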

One big consideration is that the graph modifiers + and @ are typically applied after the selector is applied. For example, dbt run --models my_model+ will select my_model as well as all of its descendants. This is pretty insensible in test-land, as something like:

dbt test --select references:my_model+

would, semantically, select all of the tests that reference my_model, as well as all of the descendants of those tests. That's not really the desired behavior here though. Instead, we want to select all of the tests that reference my_model, or a descendant of my_model.

So, what do we do about this? One "fix" could be to inject tests inline into the DAG. Picture something like:

+-----------+    +-----------+    +------------+
| my_model  +--->| my_test   +--->| other_model|
+-----------+    +-----------+    +------------+

A key benefit here is that a selector like

dbt test --select references:my_model+

would "just work". In this example, tests applied to other_model would indeed be descendants of my_test. While this would be more internally consistent, I don't think it's a justifiable change for that reason alone. Instead, the much larger benefit here would be that we can (finally) combine dbt run and dbt test into a single command which runs models and tests in DAG-order. There is a lot more to consider there, but I do want to call out that it would be a good and helpful change of behavior for other reasons not totally related to test selection.

I think that our ultimate answer to test selection must include at least one of the following:

  1. A big change to how tests are embedded into the DAG (shown above)
  2. A realization that tests are special, and acceptance of the fact that dbt must make them a special case (the "do-nothing" approach)
  3. A verbose and user-unfriendly syntax for specifying test selectors

An example of a user-unfriendly test selector which is unambiguous and consistent:

# select tests on my_model, or descendants of my_model
dbt test --select references:(my_model+)

I think this syntax is decidedly worse than the current implementation dbt test --models my_model+, but I could be convinced that it's not so user-unfriendly, and I bet we could add some sugar for more typical use-cases.

There's certainly more thinking to do here, but I wanted to open up this line of inquiry to discussion before going further down the rabbit hole. @beckjake, @jtcohen6, what say you?

@drewbanin drewbanin removed their assignment Apr 6, 2020
@jtcohen6
Contributor

jtcohen6 commented Apr 7, 2020

On its face, without a full appreciation for how hard it would be to implement, I support the change to how tests participate in the DAG. Most implications of this feel intuitive:

  • Each schema test is a child of the model on which it's configured
  • Each data test is a child of the models it references (potentially one or many)
    • N.B. There are some schema tests (e.g. relationship, equality) which reference other models. Those are more like data tests in this sense.
  • If model_a is a parent to model_b, a schema test on model_a is also a parent to model_b:
+-----------+    +-----------+    +------------+
| model_a   +--->| unique    +--->| model_b    |
+-----------+    +-----------+    +------------+

A hypothetical dbt build --select model_a model_b would execute steps in the following order:

  1. Materialize model_a
  2. Run the schema test unique_model_a
  3. Materialize model_b

If the schema test failed, it would skip all downstream nodes, including model_b.

Some implications feel a little less intuitive.

In the same scenario above, let's say a data test references both model_a and model_b. If we have a model_c that depends on model_a, I suppose model_c would have to be a child of the data test, and thereby also a child of model_b. That means the build order for the DAG of all nodes would be model_a --> model_b --> data test --> model_c:

+-----------+    +-----------+    +------------+    +-----------+
| model_a   +--->| model_b   +--->| data_test  +--->| model_c   |
+-----------+    +-----------+    +------------+    +-----------+
      |                                 ^
      |                                 |
      +---------------------------------+

But the build order for a DAG that includes only model nodes could run model_b and model_c in parallel:

+-----------+    +-----------+
| model_a   +--->| model_b   +
+-----------+    +-----------+
      |
      |          +-----------+
      +--------->| model_c   |
                 +-----------+

@drewbanin
Contributor Author

@jtcohen6 I think your last example actually reflects the desired behavior!

+-----------+    +-----------+    +------------+    +-----------+
| model_a   +--->| model_b   +--->| data_test  +--->| model_c   |
+-----------+    +-----------+    +------------+    +-----------+
      |                                 ^
      |                                 |
      +---------------------------------+

If data_test fails here, we'd indeed want to skip running model_c in the sort of default/general case if we're doing a combined run + test. I also think that if we build a subdag that only includes models, dbt would automatically generate the ideal DAG:

+-----------+    +-----------+
| model_a   +--->| model_b   +
+-----------+    +-----------+
      |
      |          +-----------+
      +--------->| model_c   |
                 +-----------+

We do this by building a transitive closure over the DAG and then pruning nodes that we want to exclude from the run.
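A minimal, plain-Python sketch of that closure-then-prune step. dbt's real implementation differs; the node names are from the example above:

```python
def descendants(edges, node):
    """All nodes reachable from `node` in a dict-of-sets DAG."""
    seen, stack = set(), [node]
    while stack:
        for child in edges.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def prune_to(edges, keep):
    """Transitive closure restricted to `keep`: an edge a -> b survives
    whenever b is reachable from a, even through pruned test nodes."""
    return {a: {b for b in descendants(edges, a) if b in keep} for a in keep}

# model_a feeds data_test, which feeds model_c; pruning the test keeps
# model_c downstream of model_a.
dag = {"model_a": {"data_test"}, "data_test": {"model_c"}}
pruned = prune_to(dag, {"model_a", "model_c"})
assert pruned == {"model_a": {"model_c"}, "model_c": set()}
```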

Some other thoughts around edge-cases:

  • I think ephemeral models would work just fine in this approach. They would get baked into the DAG, but more or less "skipped" at run-time. Tests on the ephemeral model would run before models that depend on the ephemeral models.
  • We could totally make the build failure behavior configurable. By default, nodes will skip their descendants if they fail to build, but we can pretty easily turn that on/off for tests

@beckjake curious what you think about all of this

@beckjake
Contributor

beckjake commented Apr 8, 2020

I really don't love the idea of tests being inline in the DAG. I agree that it solves a number of problems around dbt run+test, but it makes the oft-requested #2052 behavior really hard to implement. If someone says dbt ls --select model_a+1, does that mean we only select model_a's tests? Even worse, does that mean that if you add tests to a model that didn't previously have them, the semantics of model_a+1 changes from model_b to model_a's tests? That sounds really undesirable!

I think a more compelling thing would be to note tests as "special children", and in the dbt run+test case we just require running all the test nodes of a given node before marking it "done". Basically, treat a node + its direct test descendants as a single node in the graph. To handle the case @jtcohen6 described, I'd say that model_a -> model_b -> data_test would get treated as a "single node" by the part of the code that manages the thread pool. Since there's a strict dependency ordering enforced anyway, it's no big deal to force all three of those nodes to be handled by the same thread.
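That bundling might look something like the sketch below; the data shapes and function name are invented for illustration:

```python
# Sketch: bundle each model with its direct test children into one unit,
# so a single thread runs the model and then its tests before the unit
# is marked "done". Shapes here are invented, not dbt's internals.
def build_units(children, resource_type):
    units = {}
    for node, kids in children.items():
        if resource_type[node] != "test":
            tests = sorted(k for k in kids if resource_type[k] == "test")
            units[node] = [node] + tests
    return units

children = {"model_a": {"unique_a", "model_b"}, "model_b": set()}
resource_type = {"model_a": "model", "model_b": "model", "unique_a": "test"}
units = build_units(children, resource_type)
assert units["model_a"] == ["model_a", "unique_a"]
assert units["model_b"] == ["model_b"]
```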

That's a bit more convoluted internally as we'd have to rework some of the "node done" completeness stuff, but I think it will result in a more intuitive/consistent outcome. This would be an opportunity to revisit how dbt handles model/job scheduling, and potentially make it less opaque.

@drewbanin
Contributor Author

Ok - I think that's fair - direct child selectors are a really good consideration! I'm happy for tests to remain special -- it just sounds like we need to make them more special than they currently are.

So, is the goal for this issue to change the current implementation of how tests are selected out of the DAG? While I'm in favor of the change you're describing, I don't think it necessarily changes any semantics of the command-line interface for selecting tests!

@beckjake
Contributor

beckjake commented Apr 9, 2020

Sorry for the essay, this took me a long time to think through and write about "intelligently".

I guess I've become convinced that tests are "special", but I do believe their specialness should get changed.

Let me lay out my reasoning here. In broad strokes, these are what I see as a representative sample of the key considerations:

  1. Someday we want to implement dbt run+test, which I will summarize as: During run+test, test nodes must run after all the nodes they depend upon are finished, but before any nodes they don't depend upon.
  2. dbt test --select my_model should work basically like it does now: everything that tests my_model is run.
  3. we want users to be able to select more fine-grained slices of tests than just "tests for this model" when they're using dbt test, e.g. "all uniqueness tests"
  4. We want to someday implement a parameter for model selection like Direct child model selector #2052. An assertion on my part: This parameter applying to tests themselves is not really very valuable.
  5. You should be able to do something like dbt test --select source:my_source.* and have it only run your source tests. Bonus: Being able to say "only run tests whose requirements are exclusively in the selected set". The benefit there would be skipping tests that assert relationships between sources and refs, for example. Maybe this is more of an "advanced node selection syntax" problem.
  6. dbt ls should reliably be able to tell me what nodes will be run (bonus: in what order!) for all of dbt run, dbt test, dbt run+test. Requiring parameters and type restrictions is acceptable (so dbt ls --resource-type=test is a fine way of saying whatever "dbt test" will do).
  7. It should be possible to implement Advanced node selection syntax #2172 and Model intersection syntax #2167 in any new design
  8. It would be nice if there was a simple, consistent explanation of how selecting your nodes worked. It would be ok (though less great) if that explanation started with "it depends if you're running or testing or both".

Thinking about all that, it seems to me that node selection should work similar to how it currently does, except it should be more context-dependent. I'd like to conceptually split it into two parts:

  • "node selection". I'm keeping this name!
  • "task-based graph creation". Horrible name!

"Node selection" would behave exactly like node selection today (barring any changes in #2172 and #2167), except:

  • it would not do the collected.update(tests) it currently does
  • it would not do an extra adding-in of ephemeral nodes.

That means that in the most common case node selection wouldn't pick up any tests. Node selection in this case returns the set of selected unique IDs.

"task-based graph creation" would be implemented per-task.

  • "dbt test" would find all the tests that descend from the selected nodes, and return those.
    • there would be a flag here for exclusivity, probably (filter out tests that descend from any models not selected)
    • note that the graph dbt test generates has no edges, because tests can't depend upon one another!
  • "dbt compile" / "dbt run" / "dbt snapshot" would create a graph the same way they currently do.
  • "dbt run+test" would create a graph like I describe below. I'm not sure this is exactly the transitive closure as there are some edges that don't exist in the regular transitive closure... see below

The wrinkle in my eyes is ephemerals. I know this is pretty vague and hand-wavy, but we should either lazily compile them or compute them up-front and rip through compiling them all immediately before we even start running. Either way, ephemeral nodes shouldn't be special-cased in node selection.

implementing dbt run+test's graph creation

Instead of the usual graph queue, you'd use a special one that inserts tests into the graph like this. Imagine you've selected nodes A, B, C, and D from the graph with your query, and they constitute a subgraph that looks like this:

A -> B -> D
|         ^
+--> C ---+

And then you have tests on:
A: T(A)
B: T(B)
C: T(C)
A and B: T(A, B)
A and C: T(A, C)
A and D: T(A, D)
B and C: T(B, C)
B and D: T(B, D)
C and D: T(C, D)

A -> T(A) -> B -> (T(B), T(A, B)) -> T(B, C) -> D -> (T(A, D), T(B, D), T(C, D))
      |                                 ^
      +----> C --> (T(C), T(A, C)) -----+

I'm sure there's a fancy graph theory word for this, too!
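One way to read that ordering rule: a test runs after all of its refs, and before any model that sits downstream of all of its refs. A hedged plain-Python sketch of that reading, with invented helper names:

```python
def ancestors(parents, node):
    """All upstream nodes in a dict-of-parent-sets DAG."""
    seen, stack = set(), [node]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def add_tests(parents, tests):
    """Return a combined parent map: each test depends on its refs, and
    any model downstream of *all* of a test's refs depends on the test."""
    graph = {m: set(ps) for m, ps in parents.items()}
    for test, refs in tests.items():
        graph[test] = set(refs)
        for m in parents:
            if refs <= ancestors(parents, m):
                graph[m].add(test)
    return graph

parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
graph = add_tests(parents, {"T(A)": {"A"}, "T(B,C)": {"B", "C"}})
assert graph["B"] == {"A", "T(A)"}     # T(A) runs before B
assert "T(B,C)" in graph["D"]          # T(B,C) runs before D
assert "T(B,C)" not in graph["C"]      # but not before C
```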

@drewbanin
Contributor Author

Thanks for this writeup @beckjake - it's very helpful!!

"task-based graph creation" would be implemented per-task.

Yep, I totally buy this. Let's call these things:

  • node selection
  • DAG creation

They are very related, but I agree, they're different tasks and we should not conflate the two!

The wrinkle in my eyes is ephemerals. I know this is pretty vague and hand-wavy, but we should either lazily compile them or compute them up-front and rip through compiling them all immediately before we even start running. Either way, ephemeral nodes shouldn't be special-cased in node selection.

I agree it would be great to not special case ephemerals in node selection. I'm open to either of the approaches you've outlined here.

Last, the run+test DAG creation diagram you shared makes sense to me. I'll get @jtcohen6 to read some books on graph theory so that we can pick an appropriate name for this construct :)

@drewbanin
Contributor Author

@beckjake to open a new issue to split out node selection from DAG creation

@drewbanin
Contributor Author

great discussion! Closing in favor of #2328
