Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement pipeline autodiscovery #1664

Closed
antonymilne opened this issue Jul 1, 2022 · 10 comments · Fixed by #1706
Closed

Implement pipeline autodiscovery #1664

antonymilne opened this issue Jul 1, 2022 · 10 comments · Fixed by #1706
Assignees

Comments

@antonymilne
Copy link
Contributor

antonymilne commented Jul 1, 2022

Following #1284 and a discussion with Ivan, here's what we're going to do:

1. Define a function find_pipelines

In kedro.framework.project.__init__.py, something like this:

def find_pipelines():
    registered_pipelines = {}
    pipeline in Path(f"{PACKAGE_NAME}.pipelines").iterdir():
        importlib.import_module(pipeline)
        if hasattr(pipeline, "create_pipeline"):
            registered_pipelines[pipeline] = pipeline.create_pipeline()
    registered_pipelines["__default__"] = pipeline([])
    return registered_pipelines

Name is TBD (note I changed from previously-suggested autoregister because it doesn't actually register any more, just finds them). It's also like setup.py's find_packages here.

2. Call this function from project template and all starters

In pipeline_registry.py:

from kedro.framework.project import find_pipelines

def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    MAKE THIS DOCSTRING MORE INFORMATIVE

    Returns:
        A mapping from a pipeline name to a ``Pipeline`` object.

    """
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines

3. Add documentation, tests, etc.

@datajoely
Copy link
Contributor

@imdoroshenko we should sync up on this and make sure our assumptions still hold

@deepyaman
Copy link
Member

Thought about this in general quite a bit over the past couple days. Overall, quite aligned. I like the fact that, by just automating discovery, and even so in an optional way, it does one thing well without forcing any workflow.

Name is TBD (note I changed from previously-suggested autoregister because it doesn't actually register any more, just finds them). It's also like setup.py's find_packages here.

I like the name. :)

    importlib.import_module(pipeline)
        if hasattr(pipeline, "create_pipeline"):
            registered_pipelines[pipeline] = pipeline.create_pipeline()

Just confirming--this maintains the behavior since 0.17.3 wherein the registry is lazily loaded, right? I assume, since it's in register_pipelines, and it's that function whose execution is delayed, this should be fine?

pipelines["__default__"] = sum(pipelines.values())

Nit: Need a case for no pipelines being defined; guess just wrap this in an if pipelines block.

@merelcht merelcht moved this to To Do in Kedro Framework Jul 4, 2022
@antonymilne
Copy link
Contributor Author

antonymilne commented Jul 4, 2022

@deepyaman thanks for the comments! I'm much happier about this solution too, and with the definition of __default__ being project-side it automatically handles the case that we talked about.

Good question about the lazy loading behaviour. This should indeed be maintained since no code in find_pipelines will be executed until register_pipelines is executed.

And very good point about the case of no pipelines being defined. What do you think of defining pipelines["__default__"] = pipeline([]) inside find_pipelines? This way we keep the "real" definition of __default__ in pipeline_registry.py but no need for an if there.

Either that or I would just explicitly do:

pipelines["__default__"] = sum(pipelines.values()) if pipelines else pipeline([])

@deepyaman
Copy link
Member

What do you think of defining pipelines["__default__"] = pipeline([]) inside find_pipelines? This way we keep the "real" definition of __default__ in pipeline_registry.py but no need for an if there.

Clever; I like it!

@deepyaman deepyaman self-assigned this Jul 8, 2022
@deepyaman deepyaman moved this from To Do to In Progress in Kedro Framework Jul 12, 2022
@deepyaman
Copy link
Member

@AntonyMilneQB I don't see significant documentation on the the pipeline registry. Do you think it makes sense to add a docs/source/nodes_and_pipelines/pipeline_registry.md, or it would be OK to just add a short section on autoregistration in docs/source/nodes_and_pipelines/modular_pipelines.md?

@antonymilne
Copy link
Contributor Author

antonymilne commented Jul 22, 2022

Personally I would do a whole new file docs/source/nodes_and_pipelines/pipeline_registry.md, but up to you really. I think the modular pipeline documentation needs some modifying as it stands, so I wouldn't be surprised if these things move around a bit anyway. So long as it's linked from and consistent with other relevant sections of the docs (e.g. linked from the instructions on "How to run a pipeline"; update spaceflights tutorial so there's no need to edit pipeline_registry.py any more) I don't mind much where it goes.

Also I know this is a fairly big documentation update, so do feel free to do bit by bit or in a separate PR if easier for you 🙂 🙏

@datajoely
Copy link
Contributor

I did the last modular pipeline update refresh so happy to help plan the changes

@deepyaman
Copy link
Member

Personally I would do a whole new file docs/source/nodes_and_pipelines/pipeline_registry.md, but up to you really. I think the modular pipeline documentation needs some modifying as it stands, so I wouldn't be surprised if these things move around a bit anyway. So long as it's linked from and consistent with other relevant sections of the docs (e.g. linked from the instructions on "How to run a pipeline"; update spaceflights tutorial so there's no need to edit pipeline_registry.py any more) I don't mind much where it goes.

Great! I pinged @MerelTheisenQB on Friday, since I thought you were on holiday, and she was aligned in terms of creating a new page for the pipeline registry.

Also I know this is a fairly big documentation update, so do feel free to do bit by bit or in a separate PR if easier for you 🙂 🙏

I think the functionality is good to go on my current PR; just need to add the documentation. I'll plan to:

  1. Write basic pipeline registry documentation, and merge that PR.
  2. Update it with a section on autodiscovery, and integrate into my open PR.

Alternatively, I'd also be happy to merge the current code (after review) and add the 1-2 documentation PRs on top of that.

@antonymilne
Copy link
Contributor Author

@deepyaman Either approach works for me - up to you. Just mark the PR(s) as ready for review whenever you want me to have a look through and I'll do so.

@deepyaman
Copy link
Member

@deepyaman Either approach works for me - up to you. Just mark the PR(s) as ready for review whenever you want me to have a look through and I'll do so.

Great! In that case, I've marked my current PR as ready for review. :)

@merelcht merelcht moved this from In Progress to In Review in Kedro Framework Aug 1, 2022
Repository owner moved this from In Review to Done in Kedro Framework Aug 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants