
Define modular pipelines in config #3904

Closed
wants to merge 5 commits

Conversation

bpmeek

@bpmeek bpmeek commented May 30, 2024

Description

As a Kedro user I have always wanted to be able to define modular pipelines in a config file. I believe that doing so will reduce the likelihood of a user inadvertently impacting a pipeline other than the one intended.

Development notes

I have adjusted two files:
  • omegaconf_config.py now has "pipelines": ["pipelines.yml"], added to its config_patterns.
  • kedro/framework/project/__init__.py has an additional function, from_config, that reads a config entry and builds it into a modular pipeline, which is then returned (a rough sketch of such a helper follows below).
  • A portion of find_pipelines was abstracted into a helper function, _get_pipeline_obj, for use in both functions.
This was tested using make test and manually tested with the default example_pipeline created when creating a new Kedro project.
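
To give a feel for the approach, here is a minimal sketch of the kind of helper I mean; the signature, the pipelines argument, and the config keys shown are illustrative rather than the exact code in this branch:

from kedro.pipeline import Pipeline, pipeline


def from_config(entry: dict, pipelines: dict[str, Pipeline]) -> Pipeline:
    """Build a modular pipeline from one pipelines.yml entry.

    `entry` is a mapping such as:
    {"pipe": ["data_processing"], "inputs": {"input_a": "input_b"}, "namespace": "modular_example"}
    """
    # Resolve the referenced base pipeline(s) into Pipeline objects and combine them.
    base = sum((pipelines[name] for name in entry["pipe"]), Pipeline([]))
    return pipeline(
        pipe=base,
        inputs=entry.get("inputs"),
        outputs=entry.get("outputs"),
        parameters=entry.get("parameters"),
        namespace=entry.get("namespace"),
    )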

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

bpmeek added 5 commits May 30, 2024 13:42
…gaconf_config.py config_patterns

Signed-off-by: Brandon Meek <brandon.p.meek@gmail.com>
…gaconf_config.py config_patterns

Signed-off-by: Brandon Meek <brandon.p.meek@gmail.com>
Signed-off-by: Brandon Meek <brandon.p.meek@gmail.com>
Signed-off-by: Brandon Meek <brandon.p.meek@gmail.com>
@bpmeek bpmeek requested a review from merelcht as a code owner May 30, 2024 19:56
@datajoely
Contributor

Hi @bpmeek, thank you for this hard work. The topic of what can be declarative and what should be imperative is something the team has wrestled with for a long, long time.

I think my personal preference is that this should be a plug-in rather than in Kedro core, but I'm keen to get others' perspectives.

@merelcht
Member

Hi @bpmeek, echoing @datajoely in thanking you for the work you've put into this PR! I think in the future it might be better to open an issue or discussion first to talk about fundamental changes like this.

As Joel mentioned, this is a topic we've talked about a lot and we even had an internal version of Kedro where you could define pipelines and nodes in yml instead of python.

With the solution you're proposing, I'm wondering why you'd like to define modular pipelines in config, but not regular pipelines. And what about nodes? I'd also like to understand better what problem this would solve.

@bpmeek
Author

bpmeek commented May 31, 2024

I really appreciate the recognition of my work, @datajoely and @merelcht.
Off the bat, you're 100% right, @merelcht: normally I would have opened an issue first, as I have in the past. In this case I happened to have the time to do the work, and since I will most likely be using this either way, I went ahead and built it instead of starting with an issue.

@merelcht, with this solution I'm defining modular pipelines, and pipeline chains, because of a recent project that kept growing until my pipeline registry became a huge file that was hard to manage. Had I been able to take a couple of the larger, more complex pipelines out and move them into a config file, I think it would have been much easier to work with.

That being said, I actually do really like the idea of defining regular pipelines in a config file as well, as I think this could simplify the problem of "I already have this pipeline but I need to skip this node".

@datajoely I am amenable to the idea of having this in a plug-in instead of Kedro core, especially considering I couldn't come up with a simpler way to access pipelines.yml.

@datajoely
Contributor

That being said, I actually do really like the idea of defining regular pipelines in a config file as well, as I think this could simplify the problem of "I already have this pipeline but I need to skip this node".

Now this is interesting: the canonical Kedro way of doing this is to use pipeline algebra:

from typing import Dict

from kedro.pipeline import Pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    # pipeline_1 and nodes_to_exclude are Pipeline objects defined elsewhere.
    return {
        "my_pipeline": pipeline_1,
        "my_pipeline2": pipeline_1 - nodes_to_exclude,
    }

It's not super clear from the PR what your YAML should look like. What do you imagine it looking like?

@bpmeek
Author

bpmeek commented May 31, 2024

@datajoely thanks, I wasn't aware it was something that simple!

It's not super clear from the PR what your YAML should look like. What do you imagine it looking like?

I have a really basic example in the docstring for from_config but if your modular pipeline looked like:

from kedro.pipeline import pipeline

# data_processing is an existing Pipeline imported from its module.
modular_pipeline = pipeline(
    pipe=data_processing,
    inputs={"input_a": "input_b"},
    namespace="modular_example",
)

then your pipelines.yml would look like:

<name_doesn't_matter>:
  pipe:
    - data_processing
  inputs:
    input_a: input_b
  namespace: modular_example  

Also, my two cents: I think it makes sense to add this functionality because, in my mind, changing the inputs/outputs of a modular pipeline is conceptually no different from changing an entry in the data catalog. In other words, code changes alter the underlying functionality, while config changes alter how the pieces of that functionality interact with each other.

@datajoely
Contributor

Okay, interesting. I think I have a better way that we could do this (@noklam @astrojuanlu, I would love your view too).

So you can actually do a 1:1 run from the Kedro CLI with a YAML file using kedro run --config some_file.yml.
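
For reference, that config file mirrors the CLI options under a top-level run: key, so roughly something like this (the option values are purely illustrative):

run:
  pipeline: data_processing
  tags: tag1, tag2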

In my view, if we were to make the pipeline CLI slicing commands more expressive, then configuring through YAML would improve automatically through the existing mechanism:

  • Perhaps if we made the following possible: kedro run --pipeline="a,b,c" --exclude=d
  • It would solve the issue you're talking about (since it would be inherited by the CLI --config option), as well as the one raised by the person I mentioned on Slack above.

@astrojuanlu
Member

Thanks @bpmeek for this contribution!

On the topic of pipelines as YAML, xref #650, #3872

On the topic of pipeline algebra, I must say this is something most of our users struggle with #3233

Without having looked at the code, I have two comments to make:

  1. I don't think pipelines.yml should be in conf/ at all. Pipeline structure is hardly "configuration" in the way we usually think about it (so much so that we are debating whether even dataset types should be config, see #770, "Universal Kedro deployment (Part 1): Separate external and applicative configuration to make Kedro cloud native"). Instead, if we wanted users to define pipelines in YAML, I'd rather see those YAML files shipped with the source:
❯ tree
.
├── README.md
├── conf
│   ├── README.md
│   ├── base
│   ├── local
│   └── logging.yml
...
├── src
│   ├── package_name
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── pipelines
│   │   │   ├── __init__.py
│   │   │   ├── data_processing
│   │   │   │   ├── __init__.py
│   │   │   │   ├── nodes.py
│   │   │   │   └── pipeline.yaml  # <--------------
  2. I 💯% agree with @datajoely: this should start its life as a Kedro plugin. I'll drop some ideas in kedro-plugins#713 ("Define modular pipelines with config file").

@noklam
Contributor

noklam commented Jun 6, 2024

Had I been able to take a couple of the larger, more complex pipelines out and move them into a config file, I think it would have been much easier to work with.

I don't think I get this part. Would breaking pipeline_registry.py into smaller files and importing them from pipeline_registry.py already solve this problem? I am curious to see what your pipeline_registry.py looks like, as I haven't seen a crazily complex registry before; or do you mean pipeline.py instead?
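
For example, a split registry usually looks something like this (the package and pipeline names are just placeholders):

from kedro.pipeline import Pipeline

from package_name.pipelines import data_processing, data_science


def register_pipelines() -> dict[str, Pipeline]:
    # Each sub-package exposes a create_pipeline() that builds its own Pipeline.
    dp = data_processing.create_pipeline()
    ds = data_science.create_pipeline()
    return {
        "data_processing": dp,
        "data_science": ds,
        "__default__": dp + ds,
    }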

In my mind, the py -> YAML move is mostly a 1:1 translation from a dictionary to YAML and does not simplify anything. It may make editing easier(?), but it is worse for reading (you lose all the IDE features that bring you to the definitions/references).

Pipeline slicing

This is more interesting to me and I think it does have some value. Slicing is already supported by the Kedro Pipeline API, but the CLI is limited to certain expressions.
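
To illustrate what the Pipeline API can already do (the node, output, and tag names below are placeholders):

from kedro.pipeline import Pipeline


def slice_examples(full_pipeline: Pipeline) -> dict[str, Pipeline]:
    # A few of the existing slicing operations on a Pipeline object.
    return {
        "downstream_of_node": full_pipeline.from_nodes("clean_data_node"),
        "needed_for_output": full_pipeline.to_outputs("model_input_table"),
        "tagged_training": full_pipeline.only_nodes_with_tags("training"),
        "without_one_node": full_pipeline - full_pipeline.only_nodes("node_to_skip"),
    }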

@ankatiyar ankatiyar added the Community Issue/PR opened by the open-source community label Jun 10, 2024
@bpmeek bpmeek closed this Jun 20, 2024