Revise modular pipelines docs #3948
Conversation
Rendered docs for anyone else reviewing: https://kedro--3948.org.readthedocs.build/en/3948/
Thanks so much for taking on this work @DimedS, you've clearly put a lot of thought into restructuring the docs and I really appreciate the effort. I've left comments in the specific pages, but I think on a higher level the flow isn't quite right.
I found that the new structure seems to disrupt the logical flow of information. Sections don't seamlessly transition into one another, making it challenging to follow along. As a result, I'm concerned that the overall clarity and effectiveness of the documentation may have been compromised.
To address this, I propose revisiting the organisation of the content to ensure that each section logically leads into the next, creating a smoother reading experience.
For example, the sections on how to create a pipeline and how to create a new blank pipeline with the `kedro pipeline create` command seem very separate, and then in the section on pipeline creation structure you jump back again to the example used in "how to create a simple pipeline".
Additionally, we may need to reconsider the level of detail provided in certain sections to strike the right balance between comprehensive coverage and concise readability.
I really like the rephrasing of "modular pipelines" to "reusable pipelines", but from just reading the doc, I don't feel like I'd be able to apply the information in practice. I have always hated that cooking pipeline example and if we change anything I think it has to be that example. In the ticket for this task Revise the modular pipelines documentation to improve clarity Jo mentioned "I'd like to see it more like a tutorial that is based on the spaceflights starter (extends it)." and I totally agree with that.
I'm aware I haven't been involved much in this task, so if any of this goes against what you and @astrojuanlu have discussed, feel free to dismiss it!
Thank you very much for the comprehensive review, @merelcht! I agree that the current flow is not ideal. Let's brainstorm how we can restructure it. My main idea was to split the content into two docs:
I also agree that the cooking example isn't great. I'd be happy to replace it with a more realistic scenario, like a spaceflights extension.
### Custom templates
Pipelines are shareable between Kedro codebases via [micro-packaging](micro_packaging.md), but you must follow a couple of rules to ensure portability:
Micropackaging is going to be deprecated so perhaps it's worth adding alternate instructions on how to do this?
@merelcht, @ankatiyar, thank you for your reviews! I have addressed most of your comments. I would appreciate any advice on micropackaging and providing a clearer explanation of namespaces. Additionally, I did not modify the cooking examples for modular pipelines; let's confirm any changes regarding that with @astrojuanlu and Jo.
Gave this a first pass. Fantastic job @DimedS 👏🏼
Some high level thoughts:
- I really like all the "how-tos" in https://kedro--3948.org.readthedocs.build/en/3948/nodes_and_pipelines/pipeline_introduction.html
- However, from https://kedro--3948.org.readthedocs.build/en/3948/nodes_and_pipelines/pipeline_introduction.html#how-to-create-a-new-blank-pipeline-using-the-kedro-pipeline-create-command onwards, we're talking about Modular Pipelines, because it describes the directory structure that wraps `Pipeline` objects (and here the use of "Modular" is correct!).
- And then, the document called "Reuse pipelines with namespaces" goes on to explain mostly namespaces, which I think is good 👍🏼 but it's still named `modular_pipelines.md`.

So I only have 1 major suggestion about the structure, which is having "Modular Pipelines" still have a page on its own. Something like:
I left some minor comments but I thought it's best to review the high-level structure first before going into minor details.
Even our own docs are confusing: we have a page titled "Reuse pipeline with namespace" while the URL is "modular_pipelines.html". They are really different things.
https://kedro--3948.org.readthedocs.build/en/3948/nodes_and_pipelines/modular_pipelines.html
Question:
- With micro-packaging deprecating, do we fade out the "modular pipeline" concept too? My answer is no, since `kedro pipeline create` will still use it, and in general it is a nice idea to keep your pipeline "modular" in terms of configuration, or even dependencies (so it's easier to run in an orchestrator).
With that in mind, I think we should explain clearly "What is a modular pipeline": it is mainly about the structure and organisation of config/dependencies, what the benefits are, etc.
"Modular pipeline" is a concept, while "Namespace" is a Kedro feature that helps you implement a pipeline in a modular way.
<details>
<summary><b>Click to expand</b></summary>

Kedro's dependency resolution ensures that each node runs only after its required inputs are available from the outputs of previous nodes. This way, the nodes are executed in the correct order automatically, based on the defined dependencies.
Maybe it helps to print `pipeline.nodes`, which is toposorted already?
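For anyone following along, here is a minimal sketch of what that suggestion looks like; the node functions and dataset names are made up for illustration:

```python
from kedro.pipeline import node, pipeline


def defrost(frozen_food):
    return "thawed food"


def eat(thawed_food):
    return "empty plate"


# The nodes are deliberately declared out of execution order.
demo = pipeline(
    [
        node(eat, inputs="thawed_food", outputs="plate", name="eat"),
        node(defrost, inputs="frozen_food", outputs="thawed_food", name="defrost"),
    ]
)

# .nodes returns the nodes topologically sorted, so "defrost" prints before "eat"
print(demo.nodes)
```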
It's the section below titled "How to receive information about the nodes in a pipeline" that shows that output, so I don't think it's a good idea to duplicate it on the same page.
Agreed, we should keep the term
Hey! I read through this a bit and, if I may, would leave a couple of comments.
I definitely don't have as deep knowledge on namespaces and modular pipelines as maintainers, but I'd say I leverage them comfortably, and e.g. making dynamic namespaced pipelines (like described here) and nesting namespaces 2-3 levels deep is something I operate with very frequently. I even found a bug in `kedro viz` at some point related to this :)
Overall I like the current docs a lot more. It is subjective and personal opinion indeed, but a few things it's based on:
- See multiple comments I left.
- Modular pipelines and namespaces are IMO very interconnected and at least for me it's easy to follow if they're on the same page.
- It doesn't describe the `pipeline()` wrapper like the current docs do, which to me is the key for users to understand how it works.
```python
def create_new_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [existing_pipeline],  # Name of the existing pipeline
```
Why is this `[existing_pipeline]` instead of `existing_pipeline`? Also the comment is a bit inaccurate: it's not `# Name of the existing pipeline` but rather a `Pipeline` object itself.
I think the example would be more realistic if `existing_pipeline` had a simple but realistic job and a clear name, e.g. like `cook_pipeline` in the old example.
I fixed the comment and removed the brackets, but I believe `existing_pipeline` is a sufficiently descriptive name to define reusability. A more detailed example is provided a bit further down, based on the spaceflights starter.
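For readers of this thread, a self-contained sketch of the pattern being discussed; `fit_model` and the dataset names here are illustrative stand-ins, not part of the actual docs example:

```python
from kedro.pipeline import Pipeline, node, pipeline


def fit_model(df):
    return "model"  # placeholder body for illustration


# A stand-in for the docs' existing_pipeline
existing_pipeline = pipeline(
    [node(fit_model, inputs="old_input_df_name", outputs="old_output_df_name", name="fit_model")]
)


def create_new_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        existing_pipeline,  # the Pipeline object itself, not its name
        inputs={"old_input_df_name": "new_input_df_name"},
        outputs={"old_output_df_name": "new_output_df_name"},
    )
```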
inputs = {"old_input_df_name" : "new_input_df_name"}, # Mapping old input to new input | ||
outputs = {"old_output_df_name" : "new_output_df_name"}, # Mapping old output to new output | ||
parameters = {"params: model_options": "params: new_model_options"}, # Updating parameters |
I think the `old` and `new` dataset names do not reflect what modular pipelines really do.
A mental model while working with modular pipelines and namespaces is not that you're replacing something old with something new. It is that you're using something specific instead of something abstract.
The previous example with cooking food reflects it better, I think, because it shows how `"grilled_veg"`, which is an output of `cook_pipeline`, is in one case `"breakfast_food"` and in the other `"lunch_food"`.
Also the block inside `pipeline()` should be indented, I believe.
I fixed the indentation and updated the comments. I believe the idea is now also clearly explained within the example, showing how to create different data science pipelines based on the `base_data_science` pipeline.
If you want to create a new pipeline that performs similar tasks with different inputs/outputs/parameters as your `existing_pipeline`, you can use the same `pipeline()` creation function as described in [How to structure your pipeline creation](modular_pipelines.md#how-to-structure-your-pipeline-creation). This function allows you to overwrite inputs, outputs, and parameters. Your new pipeline creation code should look like this:
```python
def create_new_pipeline(**kwargs) -> Pipeline:
```
I think this PR overall uses less specific names than the current docs, which at least to me is an impact on docs quality. Compare `create_new_pipeline` to:

```python
final_pipeline = (
    cook_breakfast_pipeline
    + eat_breakfast_pipeline
    + cook_lunch_pipeline
    + eat_lunch_pipeline
)
```
I updated it to `create_pipeline()`. If I understood correctly, the name `create_pipeline()` is important (although it is not specific) because otherwise the pipeline will not be autodiscovered and executed during `kedro run` (see the Kedro documentation). If you see any other examples with less specific names, please let me know and I would be happy to fix them as well.
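For context, the autodiscovery convention referred to above is that `find_pipelines()` looks for a `create_pipeline()` function in each pipeline package. A sketch of a hypothetical layout (the project name, function body, and dataset names are made up):

```python
# src/my_project/pipelines/data_science/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline


def train(model_input_table):
    return "model"  # placeholder body for illustration


def create_pipeline(**kwargs) -> Pipeline:  # the name create_pipeline is what gets discovered
    return pipeline(
        [node(train, inputs="model_input_table", outputs="regressor", name="train_node")]
    )
```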
```python
func=split_data,
inputs=["model_input_table", "params:model_options"],
outputs=["X_train", "X_test", "y_train", "y_test"],
name="split_data_node",
```
And similarly in other nodes, because in all the places this `name` is used, it is clear that it's a `node`.

```diff
- name="split_data_node",
+ name="split_data",
```
I would prefer to leave this as it is because it's currently how it's written in our starters, and I don't want to confuse users with different naming since our example is based on that starter.
## Example: Combining disconnected pipelines

Sometimes two pipelines must be connected, but do not share any catalog dependencies. In this example, there is a `lunch_pipeline`, which makes us lunch. The 'verbs', `defrost` and `eat`, are Python functions and the inputs/outputs are food at different points of the process (`frozen`, `thawed` and `food`).
`thawed` is not present in the example at all.
Decided to remove the cooking examples.
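For the record, what that section documents is plain pipeline addition, which works even when the pipelines share no datasets. A neutral sketch with made-up names, independent of the cooking example:

```python
from kedro.pipeline import node, pipeline


def ingest_orders(raw_orders):
    return raw_orders


def publish_report(kpis):
    return None


ingest = pipeline([node(ingest_orders, inputs="raw_orders", outputs="orders", name="ingest_orders")])
report = pipeline([node(publish_report, inputs="kpis", outputs=None, name="publish_report")])

# The two pipelines share no catalog entries, yet they can be combined with `+`;
# Kedro resolves execution order independently for each disconnected branch.
combined = ingest + report
print(combined.describe())
```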
```{note}
The set of overriding inputs and outputs must be a subset of the reused pipeline's "free" inputs and outputs, respectively. A free input is an input that isn't generated by a node in the pipeline, while a free output is an output that isn't consumed by a node in the pipeline. {py:meth}`Pipeline.inputs() <kedro.pipeline.Pipeline.inputs>` and {py:meth}`Pipeline.outputs() <kedro.pipeline.Pipeline.outputs>` can be used to list a pipeline's free inputs and outputs, respectively.
```
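A quick illustration of the "free" terminology in that note, using made-up functions and dataset names:

```python
from kedro.pipeline import node, pipeline


def clean(raw_data):
    return raw_data


def aggregate(cleaned_data):
    return cleaned_data


p = pipeline(
    [
        node(clean, inputs="raw_data", outputs="cleaned_data", name="clean"),
        node(aggregate, inputs="cleaned_data", outputs="summary", name="aggregate"),
    ]
)

print(p.inputs())   # {'raw_data'}: produced by no node, so it is a free input
print(p.outputs())  # {'summary'}: consumed by no node, so it is a free output
```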
Why do we need both the cooking and the data science example to demonstrate how namespaces work?
I agree with that. We want to move away from the cooking example and shift to something more practical, based on our spaceflights-pandas starter. I previously kept the cooking example, but let's go ahead and remove it.
Hi @yury-fedotov, thanks a lot for your review! 🙏🏼 I'd like to add clarification on two bits though:
I'll let @DimedS act on the rest of your review comments as he sees fit.
To be on the same page in terminology, by modular pipeline I mean a pipeline that is:
Yeah but I'd assume most use them in a non-modular fashion, i.e. not leveraging
I'm curious to understand why. IMO there are 2 reasons in Kedro to use modular pipelines:
In either case, can you give an example of when you'd leverage a modular pipeline without a
Sure, I didn't mean my definition is correct, that's just what I have in mind currently. Would be happy to change my mind if it's inaccurate. But overall, when you say that modular pipelines and namespaces aren't connected, is there an example of a pipeline that's modular but not using namespaces to act as modular?
According to my current understanding of what modular pipelines are:

**Modular with no namespaces**

That one is easy:
```python
...

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```

**Not modular with namespaces**

In:

```python
def register_pipelines() -> Dict[str, Pipeline]:
    base_pipeline = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=base_pipeline,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=base_pipeline,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )
    return {"__default__": ds_pipeline_1 + ds_pipeline_2}
```

So the pipeline is defined directly in the
Also note that our docs already say that the way to create & delete modular pipelines is with
@astrojuanlu thanks for providing those examples! Yeah I think we just define modular pipelines differently. To simplify context a bit, let's assume that
```python
base_pipeline = pipeline(
    [
        node(
            func=split_data,
            inputs=["model_input_table", "params:model_options"],
            outputs=["X_train", "X_test", "y_train", "y_test"],
            name="split_data_node",
        ),
        node(
            func=train_model,
            inputs=["X_train", "y_train"],
            outputs="regressor",
            name="train_model_node",
        ),
        node(
            func=evaluate_model,
            inputs=["regressor", "X_test", "y_test"],
            outputs=None,
            name="evaluate_model_node",
        ),
    ]
)

ds_pipeline_1 = pipeline(
    pipe=base_pipeline,
    inputs="model_input_table",
    namespace="active_modelling_pipeline",
)

ds_pipeline_2 = pipeline(
    pipe=base_pipeline,
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)
```

Move it to manually created
I'm happy to do a final read through when further requested changes are complete. Please just ping me @DimedS when you're ready!
Many thanks for your review, @yury-fedotov, and your comments, @astrojuanlu. I mostly agree with @astrojuanlu's understanding that modular pipelines are currently a more virtual concept, just a folder structure. For me, clarifying exactly what that means is not as important. I actually wanted to remove the term "modular" entirely so as not to confuse users. What's more important to me is to clarify for users how to effectively use Kedro to solve their different tasks. I like the current docs structure from that perspective because it's more task-oriented now:
Some small final suggestions, but nothing blocking.
Thanks for the many revisions of these docs @DimedS ⭐ From my point of view this is now a lot better than what we had before.
I would like us to put some effort into creating a more realistic example of re-using the pipeline, so instead of `data_science_1` and `data_science_2` we can refer to a proper use case that requires re-use, but let's leave that for another time.
```
- company_rating
```
> In Kedro, you cannot run pipelines with the same node names. In this example, both pipelines have nodes with the same names, so it's impossible to execute them together. However, `base_data_science` is not registered and will not be executed with the `kedro run` command. The `data_science` pipeline, on the other hand, will be executed during `kedro run` because it will be autodiscovered by Kedro, as it was created inside the `create_pipeline()` function.
I find this confusing because in the code snippet above:

```python
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [base_data_science],  # Creating a new data_science pipeline based on base_data_science pipeline
        parameters={"params:model_options": "params:model_options_1"},  # Using a new set of parameters to train model
    )
```

`base_data_science` is created inside `create_pipeline()`...
Other than in the explanation, there's no actual reference to `data_science` anywhere in the code examples on this page.
I fixed it by adding the comment `# data_science pipeline creation function` before the `create_pipeline()` function. However, as I understand it, I have to use the `create_pipeline()` name to ensure the pipeline will be autodiscovered. Do you have other ideas on how to highlight the `data_science` name inside the code?
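On the node-name clash described in the quoted note above, a small sketch may help; the pipeline and names here are illustrative, not the docs example:

```python
from kedro.pipeline import node, pipeline


def train(model_input_table):
    return "model"  # placeholder body for illustration


base = pipeline(
    [node(train, inputs="model_input_table", outputs="regressor", name="train_node")]
)

# `base + base` would fail: both copies share the node name "train_node" (and the
# output "regressor"). Namespacing renames nodes and datasets, so this works:
combined = pipeline(base, namespace="one", inputs="model_input_table") + pipeline(
    base, namespace="two", inputs="model_input_table"
)
print([n.name for n in combined.nodes])  # e.g. ['one.train_node', 'two.train_node']
```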
A namespace is a way to isolate nodes, inputs, outputs, and parameters inside your pipeline. If you pass the `namespace="namespace_name"` argument to the `pipeline()` creation function, it will add the `namespace_name.` prefix to all nodes, inputs, outputs, and parameters inside your new pipeline.

Let's extend our previous example and try to reuse the `base_data_science` pipeline one more time by creating another pipeline based on it. First, we should use the `kedro pipeline create` command to create a new blank pipeline named `data_science_2`:
From my point of view it doesn't matter how you create the pipelines; what's important is how you re-use them in the end. It's up to the user if they want the pipelines to be in the same folder or separated out (which is what happens when you do `kedro pipeline create`).
Oh, you merged it. I can still see quite a few Vale issues that I would have preferred to have been resolved, but I guess they can be addressed separately as needed.
Adding that as a follow-up task in #2723 (comment)
I'm sorry @stichbury, I forgot to ping you before merging. There were many comments in the PR, and we agreed to continue addressing part of them in another PR. Thanks for the follow-up task description, @astrojuanlu.
Description
Revised the pipeline and modular-pipelines documentation sections and introduced a namespaces docs section.
Main Changes:
- **Pipeline creation structure** section, which describes how to create a pipeline after using the `kedro pipeline create` command, what to include in `nodes.py` and `pipeline.py`, and why, as well as other available options.

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist

- Added a description of this change in the `RELEASE.md` file