Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise modular pipelines docs #3948

Merged
merged 33 commits into from
Jul 2, 2024
Merged

Conversation

DimedS
Copy link
Contributor

@DimedS DimedS commented Jun 11, 2024

Description

Revised the pipeline and modular-pipelines documentation sections, introduced namespaces docs section.

Main Changes:

  1. Sorted topics within these sections: simple usage is now in the pipeline introduction, while pipeline reusage is covered in the advanced-level second document.
  2. Added a table of contents with links at the top of the pipeline introduction.
  3. Also added a Pipeline creation structure section, which describes how to create a pipeline after using the kedro pipeline create command, what to include in nodes.py and pipeline.py, and why, as well as other available options.
  4. Grouped key pipeline methods and operations.
  5. Simplified the pipeline reuse documentation to focus more on practical implementation.
  6. Reduced the size of pictures by 60% as they were too large and not useful.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS linked an issue Jun 11, 2024 that may be closed by this pull request
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS changed the title Docs update Revise modular pipelines docs Jun 11, 2024
DimedS and others added 6 commits June 11, 2024 21:37
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS marked this pull request as ready for review June 11, 2024 22:05
@merelcht
Copy link
Member

Rendered docs for anyone else reviewing: https://kedro--3948.org.readthedocs.build/en/3948/

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for taking on this work @DimedS, you've clearly put a lot of thought in restructuring the docs and I really appreciate the effort. I've left comments in the specific pages but I think on a higher level the flow isn't quite right.

I found that the new structure seems to disrupt the logical flow of information. Sections don't seamlessly transition into one another, making it challenging to follow along. As a result, I'm concerned that the overall clarity and effectiveness of the documentation may have been compromised.

To address this, I propose revisiting the organisation of the content to ensure that each section logically leads into the next, creating a smoother reading experience.

For example, the sections on how to create a pipeline and create a new blank pipeline with the kedro pipeline create command seem very separate and then in the section on pipeline creation structure you jump back again to the example used in "how to create a simple pipeline".

Additionally, we may need to reconsider the level of detail provided in certain sections to strike the right balance between comprehensive coverage and concise readability.

I really like the rephrasing of "modular pipelines" to "reusable pipelines", but from just reading the doc, I don't feel like I'd be able to apply the information in practice. I have always hated that cooking pipeline example and if we change anything I think it has to be that example. In the ticket for this task Revise the modular pipelines documentation to improve clarity Jo mentioned "I'd like to see it more like a tutorial that is based on the spaceflights starter (extends it)." and I totally agree with that.

I'm aware I haven't been involved much in this task, so if any of this goes against what you and @astrojuanlu have discussed, feel free to dismiss it!

docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
@DimedS
Copy link
Contributor Author

DimedS commented Jun 12, 2024

Thank you very much for the comprehensive review, @merelcht! I agree that the current flow is not ideal. Let's brainstorm how we can restructure it. My main idea was to split the content into two docs:

  1. Basic topics that everyone should know (including how to use the Kedro pipeline create command and our recommended practices for writing pipeline code).
  2. Advanced topics on pipeline reuse, including namespacing and overwriting.

I also agree that the cooking example isn't great. I'd be happy to replace it with a more realistic scenario, like a spaceflights extension.
Let's wait for some more feedback and then think how to refine the structure!

docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved

### Custom templates
Pipelines are shareable between Kedro codebases via [micro-packaging](micro_packaging.md), but you must follow a couple of rules to ensure portability:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Micropackaging is going to be deprecated so perhaps it's worth adding alternate instructions on how to do this?

DimedS and others added 4 commits June 17, 2024 10:34
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS
Copy link
Contributor Author

DimedS commented Jun 18, 2024

@merelcht, @ankatiyar, thank you for your reviews! I have addressed most of your comments. I would appreciate any advice on micropackaging and providing a clearer explanation of namespaces. Additionally, I did not modify the cookie examples for modular pipelines; let's confirm any changes regarding that with @astrojuanlu and Jo.

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a first pass. Fantastic job @DimedS 👏🏼

Some high level thoughts:

So I only have 1 major suggestion about the structure, which is having "Modular Pipelines" still have a page on its own. Something like:

image

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor comments but I thought it's best to review the high-level structure first before going into minor details.

Even our own docs is confusing, we have a page title "Reuse pipeline with namespace" while having the URL as "modular_pipelines.html". They are really different thing.
https://kedro--3948.org.readthedocs.build/en/3948/nodes_and_pipelines/modular_pipelines.html

Question:

  • With micro-packging deprecating, do we fade out the "modular pipeline" concept too? My answer is no since kedro pipeline create will still use it and in general is a nice idea to keep your pipeline "modular" in terms of configuration, or even dependencies. (So it's easier to run in Orchestrator)

With that in mind, I think we should explains clearly "What is a modular pipeline", it is mainly about the structure, organisation of config/dependencies, what are the benefits etc.

"Modular pipeline" is a concept, then "Namespace" is a Kedro feature that helps you to implement pipeline in a modular way.

docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/modular_pipelines.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved

<details>
<summary><b>Click to expand</b></summary>
Kedro's dependency resolution ensures that each node runs only after its required inputs are available from the outputs of previous nodes. This way, the nodes are executed in the correct order automatically, based on the defined dependencies.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it helps to print pipline.nodes which is toposorted already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the section below titled "How to receive information about the nodes in a pipeline" that shows that output, so I don't think it's a good idea to duplicate it on the same page.

@astrojuanlu
Copy link
Member

With micro-packging deprecating, do we fade out the "modular pipeline" concept too? My answer is no since kedro pipeline create will still use it

Agreed, we should keep the term

Copy link
Contributor

@yury-fedotov yury-fedotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! I read through this a bit, if I may, would leave a couple of comments.

I definitely don't have as deep knowledge on namespaces and modular pipelines as maintainers, but I'd say I leverage them comfortably and e.g. making dynamic namespaced pipelines (like described here) and nesting namespaces 2-3 levels deep is something I operate with very frequently. I even found a bug in kedro viz at some point related to this :)

Overall I like the current docs a lot more. It is subjective and personal opinion indeed, but a few things it's based on:

  1. See multiple comments I left.
  2. Modular pipelines and namespaces are IMO very interconnected and at least for me it's easy to follow if they're on the same page.
  3. It doesn't describe pipeline() wrapper like current docs do, which to me is the key for users to understand how it works.

```python
def create_new_pipeline(**kwargs) -> Pipeline:
return pipeline(
[existing_pipeline], # Name of the existing pipeline
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this[existing_pipeline] instead of existing_pipeline? Also the comment is a bit inaccurate: it's not # Name of the existing pipeline but rather a Pipeline object itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the example would be more realistic if existing_pipeline had a simple but realistic job and clear name. E.g. like cook_pipeline in old example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the comment and removed the brackets, but I believe existing_pipeline is a sufficiently descriptive name to define reusability. A more detailed example is provided a bit further down, based on the spaceflights starter.

Comment on lines 11 to 13
inputs = {"old_input_df_name" : "new_input_df_name"}, # Mapping old input to new input
outputs = {"old_output_df_name" : "new_output_df_name"}, # Mapping old output to new output
parameters = {"params: model_options": "params: new_model_options"}, # Updating parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old and new dataset names do not reflect what modular pipelines really do.

A mental model while working with modular pipelines and namespaces is not that you're replacing something old with something new. It is that you're using something specific instead of something abstract.

Previous example with cooking food I think reflects it better because it shows how "grilled_veg" which is an output of cook_pipeline is in one case "breakfast_food" and in the other it is "lunch_food"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the block inside pipeline() should be indented I believe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the indentation and updated the comments. I believe the idea is now also clearly explained within the example—showing how to create different data science pipelines based on the base_data_science pipeline.

docs/source/nodes_and_pipelines/namespaces.md Show resolved Hide resolved
If you want to create a new pipeline that performs similar tasks with different inputs/outputs/parameters as your `existing_pipeline`, you can use the same `pipeline()` creation function as described in [How to structure your pipeline creation](modular_pipelines.md#how-to-structure-your-pipeline-creation). This function allows you to overwrite inputs, outputs, and parameters. Your new pipeline creation code should look like this:

```python
def create_new_pipeline(**kwargs) -> Pipeline:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR overall uses less specific names than the current docs, which at least to me is an impact on docs quality. Compare:

create_new_pipeline

to:

final_pipeline = (
    cook_breakfast_pipeline
    + eat_breakfast_pipeline
    + cook_lunch_pipeline
    + eat_lunch_pipeline
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it to create_pipeline(). If I understood correctly, the name create_pipeline() is important (although it is not specific) because otherwise, the pipeline will not be autodiscovered and executed during kedro new (see Kedro documentation). If you see any other examples with less specific names, please let me know and I would be happy to fix them as well.

func=split_data,
inputs=["model_input_table", "params:model_options"],
outputs=["X_train", "X_test", "y_train", "y_test"],
name="split_data_node",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And similarly in other nodes, because in all the places this name is used, it is clear that it's a node.

Suggested change
name="split_data_node",
name="split_data",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to leave this as it is because it's currently how it's written in our starters, and I don't want to confuse users with different naming since our example is based on that starter.

docs/source/nodes_and_pipelines/namespaces.md Outdated Show resolved Hide resolved

## Example: Combining disconnected pipelines

Sometimes two pipelines must be connected, but do not share any catalog dependencies. In this example, there is a `lunch_pipeline`, which makes us lunch. The 'verbs', `defrost` and `eat`, are Python functions and the inputs/outputs are food at different points of the process (`frozen`, `thawed` and `food`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thawed is not present in the example at all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to remove the cooking examples.

```{note}
The set of overriding inputs and outputs must be a subset of the reused pipeline's "free" inputs and outputs, respectively. A free input is an input that isn't generated by a node in the pipeline, while a free output is an output that isn't consumed by a node in the pipeline. {py:meth}`Pipeline.inputs() <kedro.pipeline.Pipeline.inputs>` and {py:meth}`Pipeline.outputs() <kedro.pipeline.Pipeline.outputs>` can be used to list a pipeline's free inputs and outputs, respectively.
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need both cooking and data science example to demonstrate how namespaces work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that. We want to move away from the cooking example and shift to something more practical, based on our spaceflights-pandas starter. I previously kept the cooking example, but let's go ahead and remove it.

docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/namespaces.md Outdated Show resolved Hide resolved
@astrojuanlu
Copy link
Member

Hi @yury-fedotov, thanks a lot for your review! 🙏🏼

I'd like to add clarification on two bits though:

  • The explanation on the wrapper was a lie: kedro.pipeline.modular_pipeline.pipeline and kedro.pipeline.pipeline are exactly the same, as I painfully discovered in Rectify "modular pipelines" terminology #2723 one year ago when I taught namespaces for the first time. I'd even advocate to make it private API, but that's for another day. Regardless, it's already time to remove it from our docs.
  • I don't think modular pipelines and namespaces are connected at all. In fact, most Kedro projects make use of modular pipelines (since that's the output of kedro pipeline create) but not all of them use namespaces, and it's fine. What's more: somebody could have non-modular pipelines (hence by not following the directory structure recommended by Kedro) and still use namespaces. The messy terminology made these concepts difficult to understand even for Kedro power users and the Kedro team itself (in fact, this PR was prompted by a discussion about kedro-airflow), so we considered it was urgent to clarify all these topics.

I'll let @DimedS act on the rest of your review comments as he sees fit.

@yury-fedotov
Copy link
Contributor

yury-fedotov commented Jun 30, 2024

@astrojuanlu

To be on the same page in terminology, by modular pipeline I mean a pipeline that is:

  1. Defined but not registered itself
  2. Leveraged in pipeline() wrapper, which overrides abstract inputs and outputs by specific catalog datasets, and this let's say "instance" is what's being registred.
  3. Now for simplicity let's assume we only do 1 level of namespacing.

most Kedro projects make use of modular pipelines

Yeah but I'd assume most use them in a non-modular fashion, i.e. not leveraging inputs, outputs, parameters and namespace arguments of pipeline. Actually because of this what I do is after kedro pipeline create I go to pipeline.py and change the pipeline constructor from pipeline to Pipeline if I do not use namespaces.

I don't think modular pipelines and namespaces are connected at all

I'm curious to understand why. IMO there are 2 reasons in Kedro to use modular pipelines:

  1. Simplify viz output and overall mental model by namespacing all internals so that other pipelines care only about what's on the edges.
  2. Reuse the same pipeline structure more than once.

In either case, namespace is what enables you do that. I don't understand why are they not connected.

Can you give an example of when you'd leverage a modular pipeline without a namespace?

@astrojuanlu
Copy link
Member

by modular pipeline I mean a pipeline that is:

I don't claim to be in possession of the right meaning for "modular pipeline" but we are talking about 2 different things. Maybe we need some arbitration :) @idanov @merelcht @noklam

@yury-fedotov
Copy link
Contributor

by modular pipeline I mean a pipeline that is:

I don't claim to be in possession of the right meaning for "modular pipeline" but we are talking about 2 different things. Maybe we need some arbitration :) @idanov @merelcht @noklam

Sure, I didn't mean my definition is correct, that's just what I have in mind currently. Would be happy to change my mind if it's inaccurate.

But overall, when you say that modular pipelines and namespaces aren't connected, is there an example of a pipeline that's modular but not using namespaces to act as modular?

@astrojuanlu
Copy link
Member

astrojuanlu commented Jun 30, 2024

is there an example of a pipeline that's modular but not using namespaces to act as modular?

According to my current understanding of what modular pipelines are,

Modular with no namespaces

That one is easy:

$ kedro pipeline create

...

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )

Not modular with namespaces

In pipeline_registry.py:

def register_pipelines() -> Dict[str, Pipeline]:
    base_pipeline = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=base_pipeline,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=base_pipeline,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )

    return {"__default__": ds_pipeline_1 + ds_pipeline_2}

So the pipeline is defined directly in the pipeline_registry.py, hence not "modular" (unusual but nonetheless allowed), and it makes use of namespaces.

@astrojuanlu
Copy link
Member

Also note that our docs already say that the way to create & delete modular pipelines is with kedro pipeline create/delete https://docs.kedro.org/en/0.19.6/development/commands_reference.html#modular-pipelines

@yury-fedotov
Copy link
Contributor

@astrojuanlu thanks for providing those examples!

Yeah I think we just define modular pipelines differently. To simplify context a bit, let's assume that kedro pipeline create/delete do not exist, and we create pipeline folder structure ourselves. It actually takes a few seconds to do, so I think this example is quite realistic.

  • Your first examples (with preprocessing companies and shuttles) - in what way is it modular? It points to specific items in the catalog, hence cannot be reused multiple times with different inputs and outputs. Well, in theory it can of course, but how is it different from the case where I'd just use a Pipeline() constructor as give it a list of nodes?
  • What you have as a base_pipeline is what I would call modular. It is reusable and not registred itself, but rather 2 instances of that are created. They are created using namespaces, which allow reusability. I get why you call it not modular, but this is just because it's defined directly in the pipeline_registry.py, which to me is a weird pattern. If we take all this:
base_pipeline = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=base_pipeline,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=base_pipeline,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )

Move it to manually created pipelines/data_science/pipeline.py, and then import ds_pipeline_1 and ds_pipeline_2 in registry, and registred. Would this make this pipeline modular?

@stichbury
Copy link
Contributor

I'm happy to do a final read through when further requested changes are complete. Please just ping me @DimedS when you're ready!

@DimedS
Copy link
Contributor Author

DimedS commented Jul 1, 2024

Many thanks for your review, @yury-fedotov, and your comments, @astrojuanlu. I mostly agree with @astrojuanlu's understanding that modular pipelines are currently a more virtual concept - just a folder structure. For me, clarifying exactly what that means is not as important. I actually wanted to remove the term "modular" entirely to not to confuse users. What's more important to me is to clarify for users how to effectively use Kedro to solve their different tasks.

I like the current docs structure from that perspective because it's more task-oriented now:

  1. In the intro, we explain that a pipeline is a combination of nodes with auto-execution order resolution.
  2. In the modular pipelines section, we discuss the benefits of storing pipelines in separate folders, teach how to use the pipeline create command, and describe the general pipeline structure.
  3. In the last and most complicated section, we explain how to reuse pipelines with namespaces. This is definitely an advanced-level section, and I hope it is now clearer for beginners.
    @yury-fedotov, I will address your other comments today.

DimedS and others added 2 commits July 1, 2024 13:32
Co-authored-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@merelcht merelcht dismissed their stale review July 2, 2024 09:07

Changes were made.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small final suggestions, but nothing blocking.

Thanks for the many revisions of these docs @DimedS ⭐ From my point of view this is now a lot better than what we had before.

I would like us to put some effort in creating a more realistic example of re-using the pipeline so instead of data_science_1 and data_science_2 we can refer to a proper use case that requires re-use, but let's leave that for another time.

docs/source/nodes_and_pipelines/namespaces.md Outdated Show resolved Hide resolved
- company_rating
```

> In Kedro, you cannot run pipelines with the same node names. In this example, both pipelines have nodes with the same names, so it's impossible to execute them together. However, `base_data_science` is not registered and will not be executed with the `kedro run` command. The `data_science` pipeline, on the other hand, will be executed during `kedro run` because it will be autodiscovered by Kedro, as it was created inside the `create_pipeline()` function.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this confusing because in the code snippet above :

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [base_data_science],  # Creating a new data_science pipeline based on base_data_science pipeline
        parameters={"params:model_options": "params:model_options_1"},  # Using a new set of parameters to train model
    )

base_data_science is created inside create_pipeline()...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than in the explanation there's no actual reference to data_science anywhere in the code examples on this page.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it by adding the comment # data_science pipeline creation function before the create_pipeline() function. However, as I understand it, I have to use create_pipeline() name to ensure the pipeline will be autodiscovered. Do you have other ideas on how to highlight the data_science name inside the code?


A namespace is a way to isolate nodes, inputs, outputs, and parameters inside your pipeline. If you put `namespace="namespace_name"` attribute inside the `pipeline()` creation function, it will add the `namespace_name.` prefix to all nodes, inputs, outputs, and parameters inside your new pipeline.

Let's extend our previous example and try to reuse the `base_data_science` pipeline one more time by creating another pipeline based on it. First, we should use the `kedro pipeline create` command to create a new blank pipeline named `data_science_2`:
Copy link
Member

@merelcht merelcht Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view it doesn't matter how you create the pipelines, what's important is how you re-use them in the end. It's up to the user if they want the pipelines to be in the same folder or separated out (which is what happens when you do kedro pipeline create).

docs/source/nodes_and_pipelines/namespaces.md Outdated Show resolved Hide resolved
docs/source/nodes_and_pipelines/pipeline_introduction.md Outdated Show resolved Hide resolved
DimedS and others added 3 commits July 2, 2024 11:07
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS merged commit c269cde into main Jul 2, 2024
10 checks passed
@DimedS DimedS deleted the 1998-revise-the-modular-pipelines-docs branch July 2, 2024 10:39
@stichbury
Copy link
Contributor

I'm happy to do a final read through when further requested changes are complete. Please just ping me @DimedS when you're ready!

Oh, you merged it. I can still see quite a few Vale issues that I would have preferred to have been resolved, but I guess they can be addressed separately as needed.

@astrojuanlu
Copy link
Member

Adding that as a follow-up task in #2723 (comment)

@DimedS
Copy link
Contributor Author

DimedS commented Jul 2, 2024

I'm happy to do a final read through when further requested changes are complete. Please just ping me @DimedS when you're ready!

Oh, you merged it. I can still see quite a few Vale issues that I would have preferred to have been resolved, but I guess they can be addressed separately as needed.

I'm sorry @stichbury, I forgot to ping you before merging. There were many comments in the PR, and we agreed to continue addressing part of them in another PR. Thanks for the follow-up task description, @astrojuanlu.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revise the modular pipelines documentation to improve clarity
7 participants