Define modular pipelines in config #3904
Conversation
…gaconf_config.py config_patterns Signed-off-by: Brandon Meek <brandon.p.meek@gmail.com>
Signed-off-by: Brandon Meek <brandon.p.meek@gmail.com>
Hi @bpmeek, thank you for this hard work. The topic of what can be declarative and what should be imperative is something that the team has wrestled with for a long time. My personal preference is that this should be a plug-in rather than in Kedro core, but I'm keen to get others' perspectives.
Hi @bpmeek, echoing @datajoely in thanking you for the work you've put into this PR! In the future it might be better to open an issue or discussion first to talk about fundamental changes like this. As Joel mentioned, this is a topic we've talked about a lot, and we even had an internal version of Kedro where you could define pipelines and nodes in YAML instead of Python. With the solution you're proposing, I'm wondering why you'd like to define modular pipelines in config but not regular pipelines. And what about nodes? I'd also like to understand more about the problem this will solve.
Really appreciate the recognition of my work, @datajoely and @merelcht.

@merelcht, with this solution I'm defining modular pipelines and pipeline chains because of a recent project I was working on that kept growing: my pipeline registry ended up becoming a huge file that was hard to manage. Had I been able to take a couple of the larger, more complex pipelines out and move them into a config file, I think it would've been much easier to work with. That said, I actually do really like the idea of also defining regular pipelines in a config file, as I think it could simplify the problem of "I already have this pipeline but I need to skip this node".

@datajoely, I am amenable to the idea of having this in a plug-in instead of Kedro core, especially considering I couldn't come up with a simpler way to access
Now this is interesting - the canonical Kedro way of doing this is to use pipeline algebra:

```python
from typing import Dict

from kedro.pipeline import Pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    return {
        "my_pipeline": pipeline_1,
        "my_pipeline2": pipeline_1 - nodes_to_exclude,
    }
```

It's not super clear from the PR what your YAML should look like - what do you imagine this looking like?
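As a rough illustration of the "pipeline algebra" idea, the subtraction above behaves like a set difference over nodes. Here's a minimal toy stand-in (not Kedro's actual `Pipeline` class, which tracks inputs/outputs as well) showing the shape of that behaviour:

```python
class ToyPipeline:
    """Toy stand-in for Kedro's Pipeline, to illustrate '-' as node difference."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __sub__(self, other):
        # Keep the nodes not present in the other pipeline, preserving order.
        removed = set(other.nodes)
        return ToyPipeline(n for n in self.nodes if n not in removed)


pipeline_1 = ToyPipeline(["load", "clean", "train", "report"])
nodes_to_exclude = ToyPipeline(["report"])

pipelines = {
    "my_pipeline": pipeline_1,
    "my_pipeline2": pipeline_1 - nodes_to_exclude,
}
print(pipelines["my_pipeline2"].nodes)  # ['load', 'clean', 'train']
```

This is the mechanism that lets you register a variant of an existing pipeline ("skip this node") without touching the original definition.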
@datajoely thanks, I wasn't aware it was that simple!
I have a really basic example in the docstring for

```python
modular_pipeline = pipeline(
    pipe=data_processing,
    inputs={"input_a": "input_b"},
    namespace="modular_example",
)
```

then your YAML would be:

```yaml
<name_doesn't_matter>:
  pipe:
    - data_processing
  inputs:
    input_a: input_b
  namespace: modular_example
```

Also, my additional two cents: I think it makes sense to add this functionality because, in my mind, changing the input/output of a modular pipeline is conceptually no different from changing an entry in the data catalog. In other words, code changes are changes to the underlying functionality, and config changes are changes to how the underlying pieces of functionality interact with each other.
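To make the proposed mapping concrete, here's a rough sketch of how one such YAML entry could be turned into keyword arguments for Kedro's `pipeline()` factory once loaded. Everything here is hypothetical illustration: the entry is shown as an already-parsed dict, `registered` stands in for whatever lookup of discovered pipelines a real implementation would use, and `entry_to_pipeline_kwargs` is an invented helper, not an existing Kedro API:

```python
# One entry from the proposed pipelines.yml, as a YAML loader would return it.
entry = {
    "pipe": ["data_processing"],
    "inputs": {"input_a": "input_b"},
    "namespace": "modular_example",
}

# Hypothetical registry of already-discovered pipeline objects
# (placeholder strings here instead of real Pipeline instances).
registered = {"data_processing": "<Pipeline: data_processing>"}


def entry_to_pipeline_kwargs(entry, registered):
    """Resolve pipeline names and map the config keys onto pipeline() kwargs."""
    return {
        "pipe": [registered[name] for name in entry["pipe"]],
        "inputs": entry.get("inputs"),
        "namespace": entry.get("namespace"),
    }


kwargs = entry_to_pipeline_kwargs(entry, registered)
print(kwargs["namespace"])  # modular_example
```

The point is that the YAML entry is a near-1:1 serialization of the `pipeline(...)` call's keyword arguments, which is both this proposal's appeal and, as discussed below, its main criticism.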
Okay, interesting - I think I have a better way that we could do this (@noklam @astrojuanlu would love your view too). You can actually run a 1:1 from the Kedro CLI with a YAML using

In my view, if we were to make the pipeline CLI slicing commands more expressive, then configuring through YAML would improve automatically through the existing mechanism.
Thanks @bpmeek for this contribution!

On the topic of pipelines as YAML, xref #650, #3872. On the topic of pipeline algebra, I must say this is something most of our users struggle with: #3233.

Without having looked at the code, I have two comments to make:
I think I don't get this part well. Would breaking

In my mind, the py -> YAML is mostly a 1:1 translation from a dictionary to YAML and does not simplify anything. It may make editing easier(?), but it's worse for reading (you lose all the IDE features that bring you to the definition/references).

This is more interesting to me and I think it does have some value. The slicing is already supported by the Kedro Pipeline API, but the CLI is limited to certain expressions.
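For context on the "run config as YAML" point: `kedro run` already accepts a YAML file via `--config`, where keys under `run:` mirror the CLI flags. A sketch of what that looks like (the node names are made up, and the exact schema and supported keys should be checked against the Kedro CLI docs):

```yaml
# run_config.yml — used as: kedro run --config run_config.yml
run:
  pipeline: data_processing
  from_nodes:
    - clean_data_node
  to_nodes:
    - train_model_node
```

Making the slicing expressions richer would therefore make this YAML surface richer for free, without a new config file format.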
Description
As a Kedro user I have always wanted to be able to define modular pipelines in a config file. I believe that doing so will reduce the likelihood of a user inadvertently impacting a pipeline other than the one intended.
Development notes
I have adjusted two files:
- `omegaconf_config.py` now has `"pipelines": ["pipelines.yml"],` added to its `config_patterns`
- `kedro/framework/project/__init__.py` has an additional function `from_config` that can read a config entry and build it into a modular pipeline, which is then returned. A portion of `find_pipelines` was abstracted into a helper function `_get_pipeline_obj` for use in both functions.

This was tested using `make test` and manually tested with the default example_pipeline created when creating a new Kedro project.

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
`RELEASE.md` file