Abstract manifest generation from tasks #6565
Overall looks good!
# if there are no args, the decorator was used without params @decorator
# otherwise, the decorator was called with params @decorator(arg)
if len(args0) == 0:
    return outer_wrapper
return outer_wrapper(args0[0])
Is there a better more pythonic way to have a decorator accept args?
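One common answer to this question is to give the decorator an optional first positional parameter and make every real option keyword-only. The sketch below is not the code from this PR: `requires_flag` and `label` are hypothetical names chosen for illustration, and the only behavior assumed is that the decorator must work both bare and with parameters.

```python
import functools


def requires_flag(func=None, *, label="default"):
    """Hypothetical decorator usable as @requires_flag or @requires_flag(label=...)."""

    def decorate(inner):
        @functools.wraps(inner)
        def wrapper(*args, **kwargs):
            # Record the label this decorator was configured with, then call through.
            wrapper.calls.append(label)
            return inner(*args, **kwargs)

        wrapper.calls = []
        return wrapper

    if func is None:
        # Called with params: @requires_flag(label="build")
        return decorate
    # Used bare: @requires_flag
    return decorate(func)


@requires_flag
def parse():
    return "parsed"


@requires_flag(label="build")
def build():
    return "built"
```

Because the options are keyword-only, there is no need to inspect the length of a positional argument tuple to tell the two call styles apart: `func` is either the decorated function or `None`.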
Overall looks good, I was able to confirm a run works as expected locally.
Great work on this @stu-k! I left some comments to contextualize a few of the interrelated pieces that this PR is touching.
I called out two required changes that I'd consider blocking to merge:
- Now that we've converted more manifest-using commands, we'll need @requires.manifest on those commands as well
- The build command/task needs different behavior for compile_manifest
Then, I think we should open up a separate ticket to take another look at abstracting manifest compilation / graph generation from tasks.
@@ -47,7 +42,7 @@ def _run_unsafe(self) -> agate.Table:

    def run(self) -> RunOperationResultsArtifact:
        start = datetime.utcnow()
        self._runtime_initialize()
        self.compile_manifest()
Current behavior: run-operation does not "compile" the manifest, and it does not support node selection, i.e. dbt run-operation --select ... --exclude ... (the former is a prerequisite for the latter).
There is a compelling proposal for why run-operation should support node selection: #5005. (This might also be a step toward eventually supporting macros that ref() ephemeral models in run-operation, though this change by itself doesn't get us there. That's a long, separate story.)
The only downside is that compilation (= resolving ephemeral models + constructing the DAG) takes time, especially in large projects. There is no caching/reuse/diffing between invocations, in the way that there is for project parsing, so it's a fixed cost every time.
So: I don't think this is a bad change, I just want to call out that it is a change.
def parse(ctx, **kwargs):
    """Parses the project and provides information on performance"""
    click.echo(f"`{inspect.stack()[0][3]}` called\n flags: {ctx.obj['flags']}")
    # manifest generation and writing happens in @requires.manifest
    return None, True
The changes in this PR mean that:
- dbt parse will always write the manifest (= overwrite manifest.json). IMO that's a good change! I'd opened an issue for it recently: [CT-1759] dbt parse should (over)write manifest.json by default #6534. So we can also remove --write-manifest as a parameter for this command, and in params.py.
- We no longer have the option of dbt parse --compile, which allows us to output detailed performance timing that includes the DAG construction step (= resolving ephemeral model references + linking nodes/edges).
For the second point: I'd be happy with kicking that out of scope for this PR, and opening a tech debt ticket to track it. There's a bigger idea here: "Abstract manifest compilation / graph generation from tasks." Idea being, move compile_manifest() out from task initialization, into either a conditional step within @requires.manifest, or as its own @requires.graph step. Then, tasks would accept the graph as an argument.
Considerations there:
- Manifest compilation / graph generation does require the adapter, and therefore RuntimeConfig, as an input
- It mutates the manifest and returns a Graph
- The build task produces + uses a different graph from all other tasks, with extra test edges
- If we stored the graph on ctx.obj, and passed it into each task explicitly, we would unlock the ability for programmatic invocations to cache + reuse the graph between invocations. That could make a difference for performance in very large projects. (With the important caveat that the external application would be responsible for distinguishing between standard and build-specific graphs.)
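To make the @requires.graph idea more concrete, here is a rough, self-contained sketch. Nothing in it is dbt code: `requires_graph`, `compile_graph`, the `build_graph` key, and the dict-shaped manifest are all hypothetical stand-ins, and the real compilation step would need the adapter and RuntimeConfig as noted above. It only illustrates storing the graph on ctx.obj, keeping the standard and build-specific graphs apart, and reusing a graph across invocations.

```python
import functools
from types import SimpleNamespace

COMPILE_CALLS = []  # records each (expensive) compilation, for illustration only


def compile_graph(manifest, runtime_config, for_build=False):
    """Toy stand-in for manifest compilation: returns a dict of node -> dependencies."""
    COMPILE_CALLS.append("build_graph" if for_build else "graph")
    graph = {name: list(deps) for name, deps in manifest["nodes"].items()}
    if for_build:
        # Assumption: the build graph also carries test nodes and their edges.
        for name, deps in manifest["tests"].items():
            graph[name] = list(deps)
    return graph


def requires_graph(func=None, *, for_build=False):
    """Hypothetical decorator: ensure a graph exists on ctx.obj before the command runs."""

    def decorate(inner):
        @functools.wraps(inner)
        def wrapper(ctx, *args, **kwargs):
            key = "build_graph" if for_build else "graph"
            if key not in ctx.obj:
                # Compile only once per context; later invocations reuse the result.
                ctx.obj[key] = compile_graph(
                    ctx.obj["manifest"], ctx.obj["runtime_config"], for_build
                )
            return inner(ctx, *args, **kwargs)

        return wrapper

    return decorate if func is None else decorate(func)


@requires_graph
def run(ctx):
    return sorted(ctx.obj["graph"])


@requires_graph(for_build=True)
def build(ctx):
    return sorted(ctx.obj["build_graph"])


ctx = SimpleNamespace(
    obj={
        "manifest": {
            "nodes": {"model_a": [], "model_b": ["model_a"]},
            "tests": {"test_a": ["model_a"]},
        },
        "runtime_config": None,
    }
)
run_nodes = run(ctx)
run_nodes_again = run(ctx)  # reuses the cached graph, no second compilation
build_nodes = build(ctx)  # compiles the separate, build-specific graph
```

Keeping the two graphs under different keys is one way to address the caveat in the last bullet: a caller that reuses the context cannot accidentally hand the standard graph to build.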
Finally, a separate proposal, out of scope for this PR but highly relevant: The parse task should return the Manifest here, instead of None (#6547). That could be as simple as changing this last line to:
def parse(ctx, **kwargs) -> Tuple[Manifest, bool]:
    """Parses the project and provides information on performance"""
    # manifest generation and writing happens in @requires.manifest
    return ctx.obj["manifest"], True
And then:
from dbt.cli.main import dbtRunner
dbt = dbtRunner()
manifest, _ = dbt.invoke(['parse'])
There's a bit more refinement we should do on that proposal first, to make sure it's an idea we're happy with.
Thank you for the context. I had spoken with Gerda about the function of the existing ParseTask class, and she had said I could take out the tracking. Adding it back in shouldn't be difficult, and could be done conditionally.
For your proposal of returning the manifest from the task directly, I think that's something we need to discuss on the exec team. I believe we should have a standard output for each of these functions, perhaps a dict with a result or message or whatever key should contain result, in case we want to add meta information to that result object later.
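As a talking point for that discussion, a standard output could look roughly like the following. This is only a sketch under assumptions: `CommandResult`, its field names, and `parse_command` are hypothetical, and a dataclass is used here instead of the plain dict suggested above so the fields are explicit.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class CommandResult:
    """Hypothetical uniform return type for every CLI command function."""

    result: Any = None  # command-specific payload, e.g. a manifest
    success: bool = True
    metadata: Dict[str, Any] = field(default_factory=dict)  # room for later additions


def parse_command(ctx_obj: Dict[str, Any]) -> CommandResult:
    # Assumes the manifest was already placed on the context by an earlier step.
    return CommandResult(result=ctx_obj["manifest"], metadata={"command": "parse"})


res = parse_command({"manifest": {"nodes": {}}})
```

The `metadata` field is where meta information could be added later without changing the shape that callers unpack.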
> I had spoken with Gerda about the function of the existing ParseTask class, and she had said I could take out the tracking. Adding it back in shouldn't be difficult, and could be done conditionally.
This is fine with me too. Gerda consolidated all the events in this task, anyway, to just ParseCmdOut (in the main branch). We'll just want to remove that event now that it's no longer being called anywhere. For simplicity, let's track that in a separate issue, rather than try to do branch merging shenanigans here.
The point I was making above was that dbt parse --compile would also conditionally include the manifest compilation / graph generation step, which can be very slow (as we've seen in recent reports/issues from folks with large projects). Let's track that in the new issue, "Abstract manifest compilation / graph generation from tasks."
> For your proposal of returning the manifest from the task directly, I think that's something we need to discuss on the exec team. I believe we should have a standard output for each of these functions, perhaps a dict with a result or message or whatever key should contain result, in case we want to add meta information to that result object later.
Agreed! Let's keep discussing in the separate linked issue. The biggest considerations are:
- How to keep the results relatively consistent for all commands (although different tasks already return differently typed result objects)
- Should we expose the full manifest, as part of our public API? Just a part of it? Should we call this "experimental" functionality (liable to change in future versions)? We have the power to decide these things!
We might want to call these changes out here:
https://docs.getdbt.com/guides/migration/versions/upgrading-to-v1.5
Did the proposed "Abstract manifest compilation / graph generation from tasks." issue ever get created?
In the meantime, I'm going to delete both of these from params.py as part of #6546 since they don't appear to do anything:
dbt-core/core/dbt/cli/params.py
Lines 44 to 49 in 3f76f82
compile_parse = click.option(
    "--compile/--no-compile",
    envvar=None,
    help="TODO: No help text currently available",
    default=True,
)
dbt-core/core/dbt/cli/params.py
Lines 491 to 496 in 3f76f82
write_manifest = click.option(
    "--write-manifest/--no-write-manifest",
    envvar=None,
    help="TODO: No help text currently available",
    default=True,
)
Nice work @stu-k!
Could I ask you to open two follow-up issues that came up in conversation?
- Remove ParseCmdOut, now that the parse command isn’t doing any of its own custom event firing / logging. (Or, conditionally add the logging back.)
- Abstracting graph generation into a decorator, rather than a step buried within task initialization. We can plan to tackle it in the next few months, whether as part of "API-ification" (phase 2) or "library-ification" (support for programmatic invocations / dbt-server).
def docs_generate(ctx, **kwargs):
    """Generate the documentation website for your project"""
    config = RuntimeConfig.from_parts(ctx.obj["project"], ctx.obj["profile"], ctx.obj["flags"])
    task = GenerateTask(ctx.obj["flags"], config)
    task = GenerateTask(ctx.obj["flags"], ctx.obj["runtime_config"])
Sorry that I'm catching this post-merge, but should GenerateTask be passed the manifest?
I think so. It inherits from CompileTask, and the reason it needs the manifest is that the generate task goes to the warehouse and fetches all info about the models in the current dbt project.
Good catch! docs generate also performs a full compile of all manifest nodes, unless --select is passed, in which case only those nodes; or if --no-compile is passed, in which case it doesn't.
Looks like the issue is being addressed by @aranke in the cli test work here.
resolves #6357
resolves #6534
Description
Abstract manifest generation from directly inside the task classes to their instantiation.
Checklist
- changie new to create a changelog entry