
Implement autodiscovery of project pipelines 🔍 #1706

Merged: 40 commits into kedro-org:main on Aug 26, 2022

Conversation

deepyaman
Member

@deepyaman deepyaman commented Jul 13, 2022

Description

Closes #1664

Development notes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
It needs to be discovered after the test without configure project.

This reverts commit 75f79eb.

This reverts commit 79d7342.
@deepyaman
Member Author

@AntonyMilneQB Would love your opinion on something in particular. Currently, the Behave tests are failing, because one checks for a literal empty pipeline in text:

    And the pipeline should contain no nodes                          # features/steps/cli_steps.py:431
      Traceback (most recent call last):
        File "C:\tools\miniconda3\envs\kedro_builder\lib\site-packages\behave\model.py", line 1329, in run
          match.run(runner.context)
        File "C:\tools\miniconda3\envs\kedro_builder\lib\site-packages\behave\matchers.py", line 98, in run
          self.func(context, *args, **kwargs)
        File "features\steps\cli_steps.py", line 442, in check_empty_pipeline_exists
          assert '"__default__": pipeline([])' in pipeline_file.read_text("utf-8")
      AssertionError

I think there are two options here:

  1. Actually try running the pipeline, and get the expected error message associated with running a pipeline with no nodes.
  2. Instead of testing that there's an empty pipeline defined, just look for the pipeline summing code in the template.

I'm inclined to think 1 is better, but want to make sure I'm understanding the purpose of this test properly.


For my reference and for anybody else who's interested, I've managed to write this in a way that only works(-ish?) on Python 3.10. In order to resolve this:

  1. Add an __init__.py file under the pipelines directory in the test fixture. importlib.resources.contents seems to pick up the subfolders without this only on Python 3.10+.
  2. Adding the __init__.py as is fails, because __pycache__ files get found and loaded. I think we need to use files instead of contents. However, since files was only introduced (and recommended) in Python 3.9, we need to install importlib-resources>=1.3.
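For reference, the files()-based approach can be sketched as below. This is a minimal sketch, not the final implementation: skipping __pycache__ is an assumption about the desired filtering, and Kedro uses the importlib_resources backport on older Pythons (the stdlib importlib.resources.files behaves the same on 3.9+).

```python
# Minimal sketch of subpackage discovery via the files() API.
# Skipping __pycache__ and loose files is an assumption, not the final logic.
from importlib.resources import files


def discover_subpackages(package: str) -> list[str]:
    """Return the subdirectory names of ``package``, skipping ``__pycache__``."""
    return sorted(
        entry.name
        for entry in files(package).iterdir()
        if entry.is_dir() and entry.name != "__pycache__"
    )
```

For example, discover_subpackages("email") lists the stdlib package's subpackages (such as "mime") without any __pycache__ entries.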

Need to take care of some other things, but I'll get back to this on the weekend if I can.

Contributor

@antonymilne antonymilne left a comment


This looks perfect so far ⭐ I'll respond to your questions in a separate post.

@antonymilne
Contributor

I think there are two options here:

  1. Actually try running the pipeline, and get the expected error message associated with running a pipeline with no nodes.
  2. Instead of testing that there's an empty pipeline defined, just look for the pipeline summing code in the template.

I'm inclined to think 1 is better, but want to make sure I'm understanding the purpose of this test properly.

Very good question. Normally I would agree with you and am also not a fan of explicitly trying to match the string in the pipeline_registry.py file. Here I'm not so sure though.

We already have tests in run.feature that do as you describe here:

  Scenario: Run default python entry point without example code
    Given I have prepared a config file
    And I have run a non-interactive kedro new without starter
    When I execute the kedro command "run"
    Then I should get an error exit code
    And I should get an error message including "Pipeline contains no nodes"

Hence I think the tests in new_project.feature should be interpreted as just checking the project template that was created, and so the string matching is correct. But you could reasonably argue that this is not necessary given the above run.feature tests. Note that the test_plugin_starter in that file is new (though again maybe we don't need to do the string matching).

The general pattern currently seems to be that every kedro xyz command has its own xyz.feature file that tests specifically the behaviour of that command. According to that, it makes sense to have run.feature tests that actually do the kedro run and look at the output and new_project.feature tests that do kedro new and look at the template created.

But I don't mind changing how new_project.feature works or even removing those tests altogether if they don't seem to add anything - up to you really. @noklam looked at some of these e2e tests recently so might have an opinion on it.

@noklam
Contributor

noklam commented Jul 19, 2022

Generally agree with @AntonyMilneQB about the pattern - one file per kedro xxx command, so run.feature should do a kedro run and catch the expected output, and new.feature should be checking on the files that are created.

In that spirit, if we are not changing this structure, I tend to favor just checking the pipeline object, as it feels more consistent. Similar to unit tests, which should really test one thing at a time, I would prefer we test the behavior of just one kedro xxx command at a time here.

@antonymilne
Contributor

Also, if we do maintain (or adopt) the pattern that kedro xyz corresponds to xyz.feature then let's rename the file new.feature (and similarly for any other misnamed files).

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@deepyaman
Member Author

Also, if we do maintain (or adopt) the pattern that kedro xyz corresponds to xyz.feature then let's rename the file new.feature (and similarly for any other misnamed files).

Renamed that one. ✅

@deepyaman
Member Author

But I don't mind changing how new_project.feature works or even removing those tests altogether if they don't seem to add anything - up to you really. @noklam looked at some of these e2e tests recently so might have an opinion on it.

I removed it, because the test is really validating the difference between a pipeline defined with nodes and one without, but the written code is equivalent if we use autodiscovery (and thus there's nothing to distinguish). As you mentioned, the actual execution with empty pipeline is already tested, and I've updated the pipeline accordingly.

This reverts commit 3297763.

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@antonymilne
Contributor

P.S. @deepyaman do you still need to fork kedro to make a PR?

@deepyaman
Member Author

P.S. @deepyaman do you still need to fork kedro to make a PR?

Nope! And I also see I don't need to explicitly add the DCO sign-off, if it's on the Kedro repo.

deepyaman and others added 2 commits August 5, 2022 11:00

Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com>
Comment on lines +279 to +282
pipelines_dict = {"__default__": pipeline([])}
for pipeline_dir in importlib_resources.files(
    f"{PACKAGE_NAME}.pipelines"
).iterdir():
Contributor


I don't know importlib_resources well enough, but this seems to actually iterate the filesystem.

importlib_resources.files(
        f"{PACKAGE_NAME}.pipelines"
    )

If this is the case, we need to break it down and do error handling for when importlib_resources.files(f"{PACKAGE_NAME}.pipelines") raises a ModuleNotFoundError.

Member Author


@noklam Sorry, I don't think I follow what you're suggesting here...

Contributor


@deepyaman Sorry for the late response, GH notifications don't work very well sometimes.

I think this needs to be fixed before this PR is merged. The code block below is outside of the try/except block, and it will stop the program from running with a ModuleNotFoundError if the pipelines folder doesn't exist.

for pipeline_dir in importlib_resources.files(
    f"{PACKAGE_NAME}.pipelines"
).iterdir():

This assumes importlib_resources.files(f"{PACKAGE_NAME}.pipelines") always returns a valid Path-like object, hence the subsequent iterdir() call. However, if f"{PACKAGE_NAME}.pipelines" doesn't exist, the files() call raises a ModuleNotFoundError instead, so iterdir() is never reached; the current logic doesn't handle this.

Some of our starters only have a package_name/pipeline.py file, not a package_name/pipelines/ package.
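The failure mode can be reproduced in isolation (a sketch; "no_such_package_xyz" is a made-up name used only to trigger the error):

```python
# Minimal repro of the failure mode: files() imports the module, so a missing
# <package>.pipelines raises ModuleNotFoundError before iterdir() is reached.
from importlib.resources import files
from typing import Optional


def safe_iterdir_names(package: str) -> Optional[list]:
    try:
        root = files(package)
    except ModuleNotFoundError:
        return None  # the package does not exist; nothing to iterate
    return [entry.name for entry in root.iterdir()]
```

Here safe_iterdir_names("no_such_package_xyz") returns None instead of crashing, while an importable package lists normally.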

To fix this, we probably need something like this:

    try:
        module = importlib_resources.files(f"{PACKAGE_NAME}.pipelines")
    except ModuleNotFoundError:  # Or maybe some more?
        logger.info(f"{PACKAGE_NAME}.pipelines does not exist")
    else:
        # Continue with the iteration...
        for pipeline_dir in module.iterdir():
            ...

Member Author


Based on #1706 (comment), I assumed that it would be OK not to support this for now:

However, since this single file (compared to multiple directories) is a bit of an unusual/special case, I don't mind too much if we just don't support it. Or we can come back and add support for it later. It does seem like a good case to support, though, since it's the "simplified Kedro project structure" that I think we should still support.

I don't think we should swallow the ModuleNotFoundError, since find_pipelines() (currently) does expect a pipelines directory it can iterate over. That being said, it's also an opt-in feature that's enabled by default when you do kedro new. Something with a simplified structure, without a pipelines directory, would simply not use this.

I personally am not even convinced that the simplified structure needs to be supported. If you just have a pipeline.py file, with the one create_pipeline() function in there, why do you need a magical find_pipelines()? The value of find_pipelines() is that you don't need to register each pipeline as you create it; in this case, just be explicit and define the one pipeline in the registry.

Contributor


@deepyaman Thanks for the comment! I think you are right that the error shouldn't be swallowed.

My view is slightly different:

  1. I have a stronger view that we should support the simple pipeline.py case.
  2. The error should still be handled and re-raised as a more explicit error; since the module is not imported by the user directly, it's not obvious why it is failing. It would also be better if we mentioned the expected structure in the docstring.

I mainly see this as a beginner-friendly feature, something that "just works" even if you don't understand how pipeline registration works, as long as you create a project with kedro new and kedro pipeline create (no need to edit pipeline_registry.py).

We also have pandas-iris and other iris starters that actually follow this simplified structure. It seems strange to me that a beginner-friendly feature is not available for a simple project.

def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.
    Returns:
        A mapping from a pipeline name to a ``Pipeline`` object.
    """

    return {
        "__default__": create_pipeline(),
    }

If we support case 1, then 99% of the cases are covered, and we won't see the error very often (point 2) unless users are doing something custom (not using standard starters, or they have custom structure/logic), in which case it's okay to throw an error.

Imagine a beginner who starts with the pandas-iris project and later discovers modular pipelines; how likely are they to find the find_pipelines function afterward? It would be a smoother experience if the pipeline were discoverable after they run kedro pipeline create. It's probably because I always view starters and Kedro's CLI as entry points for feature discovery.

I think a unified workflow is quite important, especially if we are focusing on expanding the user base with less software engineering background. We shouldn't try too hard to explain.

Contributor


@deepyaman Ideally they should be refactored into folders at the end, but I don't think it's unreasonable to combine both of them as __default__.

In that case, find_pipelines is always there and users don't have to update this block, and they can also do kedro run --pipeline=modular_pipeline.

  pipelines = find_pipelines()
  pipelines["__default__"] = sum(pipelines.values())
  return pipelines

Antony has a different view that it should stop as soon as pipeline.py is found.

Contributor

@antonymilne antonymilne Aug 26, 2022


Very good discussion here! I started 👍-ing the ones that I agreed with, and that ended up being all the points made. I'm happy to merge this without supporting the simple project structure (single pipeline.py file) case at all, and I'll make a new issue so we can discuss this further.

@deepyaman please could you update the starters which do have the multiple pipeline structure (probably just spaceflights from memory) to use the new find_pipelines functionality? 🙂

Contributor


See #1812.

Member Author


@AntonyMilneQB Thank you for chiming in. I'm in the process of updating to support single pipelines, but would also be more than happy to merge first and push it as a separate PR.

@deepyaman please could you update the starters which do have the multiple pipeline structure (probably just spaceflights from memory) to use the new find_pipelines functionality? 🙂

Doesn't this need to be released first?

Contributor


Yes indeed, this must be released first - just wanted to mention it while I remembered.

Contributor

@noklam noklam left a comment


I have tested the current commit on Gitpod by adding the find_pipelines function.

By default, find_pipelines looks for the package_name.pipelines module; however, these demo projects only have a package_name/pipeline.py file and no pipelines module. As a result, I get an error like this:

ModuleNotFoundError: No module named 'project.pipelines'

This can be reproduced by changing the pipeline_registry file; note that the Kedro project's package name is project.

"""Project pipelines."""
from typing import Dict
 
from kedro.pipeline import Pipeline
from project.pipeline import create_pipeline
from kedro.framework.project import find_pipelines  # <-- New added  

def register_pipelines() -> Dict[str, Pipeline]:
	pipelines = find_pipelines()
	return pipelines

This is caused by the lines:

for pipeline_dir in importlib_resources.files(
    f"{PACKAGE_NAME}.pipelines"
).iterdir():

@antonymilne
Contributor

antonymilne commented Aug 8, 2022

Very good point @noklam, thanks for the testing! I had thought of this case before but then forgot all about it... Ideally, I think a very simple project like pandas-iris, which doesn't have multiple pipelines but just a single pipeline.py file, would also "just work" with find_pipelines out of the box. I.e. the correct behaviour here would be:

  1. check if there's a pipeline.py file that exposes create_pipeline; if yes then use it. Just define __default__, no need for any other pipeline names
  2. otherwise look for pipelines directory and register them as done now

However, since this single file (compared to multiple directories) is a bit of an unusual/special case, I don't mind too much if we just don't support it. Or we can come back and add support for it later. It does seem like a good case to support, though, since it's the "simplified Kedro project structure" that I think we should still support.

Sorry I forgot to put this in the ticket! 🤦

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@deepyaman
Member Author

Thanks all for reviewing! I think this is more or less ready to go, with three open comments/confirmations. Now off to writing the associated docs PR!

Member

@merelcht merelcht left a comment


This is a great addition to Kedro! Thanks for implementing it @deepyaman 😄 ⭐

@noklam
Contributor

noklam commented Aug 26, 2022

Quick question, does kedro new (with no starters) generate a project with find_pipelines?

@antonymilne
Contributor

Quick question, does kedro new (with no starters) generate a project with find_pipelines?

Yes.

@deepyaman deepyaman merged commit e0ad56e into kedro-org:main Aug 26, 2022
@deepyaman deepyaman deleted the feat/find-pipelines branch August 26, 2022 16:03
deepyaman added a commit to deepyaman/kedro that referenced this pull request Sep 12, 2022
* Implement autodiscovery of project pipelines 🔍
* Trawl the `pipelines` directory and return as keys
* Rename "pipelines" to avoid confusion with globals
* Handle case where `create_pipeline` is not exposed
* Warn if `create_pipeline` does not return pipeline
* Handle cases where errors occur upon pipeline load
* Move discovery tests to subfolder to unbreak tests (it needs to be discovered after the test without configure project)
* Revert "Move discovery tests to subfolder to unbreak tests" (reverts commit 75f79eb)
* Move test without configure project to run earlier
* Revert "Move test without configure project to run earlier" (reverts commit 79d7342)
* Add `find_pipelines` call to the pipeline registry
* Leverage `files()` API instead of older `contents`
* Renamed new_project.feature to new for consistency
* Remove tests that validate the registry's contents
* Remove now-unnecessary ModuleNotFoundError handler
* Move test without configure project to run earlier
* Revert "Move test without configure project to run earlier" (reverts commit 3297763)
* Fix bug wherein fixture is not passed to unit test
* Debug whether and when configure_project is called
* Try and force a reimport of `pipelines` global var
* Ensure that `kedro.framework.project` was reloaded
* Change the way to unload library due to test error
* Clean up kedro.framework.project load and reformat
* Fix typo in requirements.txt
* Update e2e test starter to auto-discover pipelines
* Add space to find_pipelines to improve readability
* Add missing import to registry of the test starter
* Remove unused import from a test pipeline registry
* Make registry behave as before but using discovery
* Update pipeline name in end-to-end test CLI config
* Print underlying error when cannot load a pipeline
* Update RELEASE.md

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com>