-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement autodiscovery of project pipelines 🔍 #1706
Conversation
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
2197b46
to
5485c40
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
bbe9763
to
d7898a8
Compare
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
It needs to be discovered after the test without configure project. Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
This reverts commit 75f79eb. Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
This reverts commit 79d7342. Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@AntonyMilneQB Would love your opinion on something in particular. Currently, the Behave tests are failing, because one checks for a literal empty pipeline in text:
I think there's two options here:
I'm inclined to think 1 is better, but want to make sure I'm understanding the purpose of this test properly. For my reference and for anybody else who's interested, I've managed to write this in a way that only works(-ish?) on Python 3.10. In order to resolve this:
Need to take care of some other things, but I'll get back to this on the weekend if I can. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks perfect so far ⭐ I'll respond to your questions in a separate post.
...ject/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py
Show resolved
Hide resolved
Very good question. Normally I would agree with you and am also not a fan of explicitly trying to match the string in the pipeline_registry.py file. Here I'm not so sure though. We already have tests in run.feature that do as you describe here:
Hence I think the tests in The general pattern currently seems to be that every But I don't mind changing how |
Generally agree with @AntonyMilneQB about the pattern - one file per In that spirit, if we are not changing this structure, I tend to favor just checking the |
Also, if we do maintain (or adopt) the pattern that |
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Renamed that one. ✅ |
I removed it, because the test is really validating the difference between a pipeline defined with nodes and one without, but the written code is equivalent if we use autodiscovery (and thus there's nothing to distinguish). As you mentioned, the actual execution with empty pipeline is already tested, and I've updated the pipeline accordingly. |
This reverts commit 3297763. Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
P.S. @deepyaman do you still need to fork kedro to make a PR? |
...ject/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/pipeline_registry.py
Show resolved
Hide resolved
Nope! And I also see I don't need to explicitly at the DCO comment, if it's on the Kedro repo. |
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com>
pipelines_dict = {"__default__": pipeline([])} | ||
for pipeline_dir in importlib_resources.files( | ||
f"{PACKAGE_NAME}.pipelines" | ||
).iterdir(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know importlib_resource
well enough, but this seems actually iterating the filesystem.
importlib_resources.files(
f"{PACKAGE_NAME}.pipelines"
)
If this is the case we need to break it down to do error-handling when importlib_resources.files( f"{PACKAGE_NAME}.pipelines" )
return ModuleNotFoundError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@noklam Sorry, I don't think I follow what you're suggesting here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deepyaman Sorry for the late response, GH notifications don't work very well sometimes.
I think this needs to be fixed before this PR is merged. The code block below is outside of the try/catch
block and it will stop the program running if the pipelines
folder isn't exist due to ModuleNotFoundError
.
pipeline_dir in importlib_resources.files(
f"{PACKAGE_NAME}.pipelines"
).iterdir():
This assume importlib_resources.files( f"{PACKAGE_NAME}.pipelines" )
always return a vaild Path
object, thus the subsequent iterdir()
method call. However, if the f"{PACKAGE_NAME}.pipelines"
doesn't exist, you will get a ModuleNotFoundError
instead. ModuleNotFoundError.iterdir()
will be a bug and not handled correctly with the current logic.
Some of our starters only have a package_name/pipeline.py
file but not a package_name/pipelines/pipeline.py
To fix this, we probably need something like this
try:
module = importlib_resources.files(f"{PACKAGE_NAME}.pipelines")
except ModuleNotFoundError: # Or maybe some more?
logger.info(f"{PACKAGE_NAME}.pipelines does not exists")
else:
# Continue with the iterations...
for pipeline_dir in module.iterdir():
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on #1706 (comment), I assumed that it would be OK to not support for now:
However, since this single file (compared to multiple directories) is a bit of an unusual/special case I don't mind too much if we just don't support it. Or we can come back and add support for it later. it does seem like a good case to support though since it's the "simplified kedro project structure" that I think we should still support.
I don't think we should swallow the ModuleNotFoundError
, since find_pipelines()
(currently) does expect a pipelines
directory it can iterate over. That being said, it's also a choose-to-use feature, that's enabled by default when you do kedro new
. Something with a simplified structure without a pipelines
directory would simply not use this.
I personally am not even convinced that the simplified structure needs to be supported. If you just have a pipeline.py
file, with the one create_pipeline()
method in there, why do you need a magical find_pipelines()
? The value of find pipelines is that you don't need to register each pipeline when you create it; in this case, just be explicit and define the one pipeline in the registry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deepyaman Thanks for the comment! I think you are right that the error shouldn't be swallowed.
My view is slightly different:
- I have a stronger view that we should support the simple
pipeline.py
case. - The error should still be handled and throw a more explicit error, since the module is not imported by the user, it's not obvious why it is failing. It's also better if we mentioned this expected structure in the docstring
I mainly see this feature as a beginner-friendly feature, something that "just work" even if you don't understand how pipeline registration works, as long as you are creating a project with kedro new
and kedro create pipelines
(no need to edit the pipeline_registry.py
).
We also have pandas-iris and other iris
starters that actually follow this simplified structure. It is strange to me a beginner-friendly feature is not available for a simple project.
def register_pipelines() -> Dict[str, Pipeline]:
"""Register the project's pipelines.
Returns:
A mapping from a pipeline name to a ``Pipeline`` object.
"""
return {
"__default__": create_pipeline(),
}
If we support case 1, then 99% of the cases are covered and we won't see the error very often (Point 2) unless users are doing some custom thing (not using standard starters or they have custom structure/logic), which is okay to throw an error.
Imagine a beginner that starts with the pandas-iris
project, and he found modular_pipeline
later, how likely is the user going to find out the find_pipelines
function afterward? It would be a smoother experience if the pipeline is discoverable after he does kedro pipeline create
. It's probably because I always view starters
and Kedro's CLI as entry points for feature discovery.
I think a unified workflow is quite important, especially if we are focusing on expanding the user base with less software engineering background. We shouldn't try too hard to explain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deepyaman Ideally they should be refactored into folders at the end, but I don't think it's unreasonable to combine both of them as __default__
.
In that case, find_pipelines
is always there and users don't have to update this block, and they can also do kedro run --pipeline=modular_pipeline
.
pipelines = find_pipelines()
pipelines["__default__"] = sum(pipelines.values())
return pipelines
Antony has a different view that it should stop as soon as pipeline.py
is found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good discussion here! I started 👍 the ones that I agreed with and that ended up being all the points made. I'm happy to merge this without supporting the simple project structure (single pipeline.py
file) case at all and I'll make a new issue so we can discuss this further.
@deepyaman please could you update the starters which do have the multiple pipeline structure (probably just spaceflights
from memory) to use the new find_pipelines
functionality? 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1812.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AntonyMilneQB Thank you for chiming in. I'm in the process of updating to support single pipelines, but would also be more than happy to merge first and push it as a separate PR.
@deepyaman please could you update the starters which do have the multiple pipeline structure (probably just
spaceflights
from memory) to use the newfind_pipelines
functionality? 🙂
Doesn't this need to be released first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed, this must be released first - just wanted to mention it while I remembered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tested it with the current commit with GitPod by adding the find_packages
fuction.
By default, the find_packages
looks for package_name/pipelines
module, however these demo project only have a package_name.pipeline.py
file but not pipelines
module. As a result I get an error like this.
ModuleNotFoundError: No module named 'project.pipelines'
This can be reproduced by changing the pipeline_registry
file, note that the Kedro's project name is project
.
"""Project pipelines."""
from typing import Dict
from kedro.pipeline import Pipeline
from project.pipeline import create_pipeline
from kedro.framework.project import find_pipelines # <-- New added
def register_pipelines() -> Dict[str, Pipeline]:
pipelines = find_pipelines()
return pipelines
This is caused by the lines
│ ❱ 280 │ for pipeline_dir in importlib_resources.files( │
│ 281 │ │ f"{PACKAGE_NAME}.pipelines" │
│ 282 │ ).iterdir():
Very good point @noklam, thanks for the testing! I had thought of this case before but then forgotten all about it... Ideally I think a very simple project like
However, since this single file (compared to multiple directories) is a bit of an unusual/special case I don't mind too much if we just don't support it. Or we can come back and add support for it later. it does seem like a good case to support though since it's the "simplified kedro project structure" that I think we should still support. Sorry I forgot to put this in the ticket! 🤦 |
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Thanks all for reviewing! I think this is more-or-less ready to go, with 3 open comments/confirmations. Now off to writing the associated docs PR! |
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great addition to Kedro! Thanks for implementing it @deepyaman 😄 ⭐
Quick question, does |
Yes. |
* Implement autodiscovery of project pipelines 🔍 Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Trawl the `pipelines` directory and return as keys Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Rename "pipelines" to avoid confusion with globals Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Handle case where `create_pipeline` is not exposed Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Warn if `create_pipeline` does not return pipeline Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Handle cases where errors occur upon pipeline load Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Move discovery tests to subfolder to unbreak tests It needs to be discovered after the test without configure project. Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Revert "Move discovery tests to subfolder to unbreak tests" This reverts commit 75f79eb. Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Move test without configure project to run earlier Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Revert "Move test without configure project to run earlier" This reverts commit 79d7342. Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Add `find_pipelines` call to the pipeline registry Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Leverage `files()` API instead of older `contents` Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Renamed new_project.feature to new for consistency Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Remove tests that validate the registry's contents Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Remove now-unnecessary ModuleNotFoundError handler Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Move test without configure project to run earlier Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Revert "Move test without configure project to run earlier" This reverts commit 3297763. Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Fix bug wherein fixture is not passed to unit test Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Debug whether and when configure_project is called Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Try and force a reimport of `pipelines` global var Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Ensure that `kedro.framework.project` was reloaded Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Change the way to unload library due to test error Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Clean up kedro.framework.project load and reformat Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Fix typo in requirements.txt Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com> * Update e2e test starter to auto-discover pipelines Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Add space to find_pipelines to improve readability Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Add missing import to registry of the test starter Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Remove unused import from a test pipeline registry Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Make registry behave as before but using discovery Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Update pipeline name in end-to-end test CLI config Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Print underlying error when cannot load a pipeline Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Update RELEASE.md Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com> Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Description
Closes #1664
Development notes
Checklist
RELEASE.md
file