[KED-2419] Make pipeline and Pipeline consistent, take 2 #1147

antonymilne · 2022-01-10T17:21:18Z

Description

Currently we have node and Node, which are exactly equivalent. We encourage users to do node, which just instantiates Node.

We also have pipeline and Pipeline, which are quite different. Pipeline is the underlying class used to create a pipeline; pipeline is then used to transform it according to a namespace and a dataset input/output mapping. The current use would be:

initial_pipeline = Pipeline([node1, node2, node3], tags=...)
transformed_pipeline = pipeline(initial_pipeline, inputs=..., outputs=..., namespace=...)

(Of course you don't have to transform the pipeline if the initial_pipeline is all you need.)

To be more consistent with node/Node and reduce confusion, we want to make all the options available in Pipeline also available to pipeline. We would then encourage users to just use pipeline. This would mean no need for the pipeline(Pipeline()) construction, as users would instead just do pipeline([node1, node2, node3], tags=..., inputs=..., outputs=..., namespace=...).

Note that everything here is actually a non-breaking change, since the only arguments added to pipeline/Pipeline are keyword only and optional.
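
To make the proposed calling pattern concrete, here is a minimal sketch using toy stand-ins rather than Kedro's real classes. `ToyPipeline` and `toy_pipeline` are hypothetical names, nodes are reduced to plain strings, and the "transformation" is assumed to be nothing more than prefixing node names with the namespace.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple, Union


@dataclass(frozen=True)
class ToyPipeline:
    """Hypothetical stand-in for Pipeline: holds node names and tags only."""

    nodes: Tuple[str, ...]
    tags: frozenset = frozenset()


def toy_pipeline(
    pipe: Union[Iterable[str], ToyPipeline],
    *,
    tags: Union[str, Iterable[str], None] = None,
    namespace: Optional[str] = None,
) -> ToyPipeline:
    """Hypothetical stand-in for pipeline(): build first, then optionally transform."""
    if isinstance(tags, str):
        tags = [tags]
    new_tags = frozenset(tags or ())

    if isinstance(pipe, ToyPipeline):
        # Old style: an already-built pipeline is passed in.
        nodes, new_tags = pipe.nodes, pipe.tags | new_tags
    else:
        # New style: a plain iterable of nodes is accepted directly.
        nodes = tuple(pipe)

    if namespace is not None:
        nodes = tuple(f"{namespace}.{name}" for name in nodes)
    return ToyPipeline(nodes, new_tags)


# Old two-step construction versus the proposed single call.
old_style = toy_pipeline(ToyPipeline(("node1", "node2"), frozenset({"t"})), namespace="ns")
new_style = toy_pipeline(["node1", "node2"], tags="t", namespace="ns")
```

Both calls produce the same result in this toy model, which is the point of the proposal: the wrapper accepts either form, so the extra keyword-only arguments do not break existing callers.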

Question

Do you like this change? @lorenabalan was concerned that it conflates the notion of pipeline and modular pipeline, and the result will be more rather than less confusing to users.

Looking at the arguments that pipeline now takes, it does seem a bit confusing to me also. As a user, when I create my first pipeline it makes sense to me that I can specify the namespace and tags. But why do I have the option of specifying the inputs and outputs? Shouldn't those be taken from the structure of the DAG built up from the nodes provided? (Yes, they are: inputs/outputs/parameters are only needed if you want to transform those to something else. But somehow these being arguments to the same function that takes tags as an argument doesn't feel quite right to me, like we're mixing two things up.)

But also I do think it's nicer to be able to do pipeline() rather than pipeline(Pipeline()). Overall I think this is an improvement but I'm not totally convinced yet.

TODO

This PR

Update tests
Update release notes

This ticket, new PR(s)

Update starters

Separate tickets

Update our docs to do pipeline rather than Pipeline
Maybe ❓ Move pipeline from modular_pipeline.py to pipeline.py and delete modular_pipeline.py. This would break any imports from kedro.pipeline.modular_pipeline import pipeline but not from kedro.pipeline import pipeline

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

antonymilne · 2022-01-10T17:22:08Z

kedro/pipeline/node.py

@@ -599,7 +599,7 @@ def node(
    outputs: Union[None, str, List[str], Dict[str, str]],
    *,
    name: str = None,
-    tags: Iterable[str] = None,
+    tags: Union[str, Iterable[str]] = None,


This should always have been this way so that node and Node match identically.
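
As an illustration of why the wider `Union[str, Iterable[str]]` annotation matters, the sketch below normalises either form into a set. `normalize_tags` is a hypothetical helper written for this example, not a function taken from Kedro.

```python
from typing import Iterable, Set, Union


def normalize_tags(tags: Union[str, Iterable[str], None]) -> Set[str]:
    """Hypothetical helper: accept a single tag or an iterable of tags."""
    if tags is None:
        return set()
    if isinstance(tags, str):
        # A bare string must be checked first: set("train") would
        # otherwise split it into individual characters.
        return {tags}
    return set(tags)


single = normalize_tags("train")
several = normalize_tags(["train", "eval"])
```

The string check has to come before the generic iterable case, since a `str` is itself iterable.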

kedro/pipeline/pipeline.py

merelcht · 2022-01-12T16:35:41Z

I think this change makes sense, and after our chat I can say that I also like this change 🙂

I don't necessarily think that this change makes the notion of pipeline and modular pipeline, and how to create them, more confusing than the current state. It's not immediately clear from Pipeline and pipeline which one is used to create a modular pipeline. However, we should help users understand how to use pipeline to create a "regular" or "modular" pipeline, probably through docstrings and extra documentation, and on top of that explain when and why they need to provide the other arguments.

lorenabalan

LGTM, think the more I look at it the more it grows on me. Would be good to add some tests.

lorenabalan · 2022-01-13T15:50:21Z

kedro/pipeline/modular_pipeline.py

+    if not isinstance(pipe, Pipeline):
+        pipe = Pipeline(pipe, tags=tags)
+
+    if inputs is None and outputs is None and parameters is None and namespace is None:


Optional: we could maybe simplify this condition with an all, something like

all(x is None for x in (inputs, outputs...))
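
Spelled out in full, the suggestion might look like the sketch below. The function name `no_transformation_requested` is made up for this example; the four argument names come from the condition in the diff.

```python
def no_transformation_requested(
    inputs=None, outputs=None, parameters=None, namespace=None
) -> bool:
    """Hypothetical helper: True when none of the transformation arguments was given."""
    return all(x is None for x in (inputs, outputs, parameters, namespace))


nothing_given = no_transformation_requested()
namespace_given = no_transformation_requested(namespace="ns")
# An empty dict is not None, so it still counts as "provided",
# exactly as in the chained `is None and ...` version.
empty_dict_given = no_transformation_requested(inputs={})
```

The behaviour is the same as the chained `and` form; only the spelling is shorter.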

antonymilne

I've finished updating the docstrings, project template and added some tests, so please do re-review and let me know what you think. I'll update docs and starters in separate PRs.

I wasn't sure exactly how many tests to add for this or how many existing ones to change from Pipeline to pipeline so let me know if you think I should do some more there. I did manually do a find and replace for all instances of Pipeline() and replace them with pipeline() and all tests still passed, which is a good sign that people can now always use pipeline.

antonymilne · 2022-01-14T17:36:37Z

RELEASE.md

@@ -1,3 +1,14 @@
+# Release 0.17.7


Will this exist? 🤔

antonymilne · 2022-01-14T17:39:17Z

kedro/pipeline/modular_pipeline.py

    """
+    if isinstance(pipe, Pipeline):
+        # To ensure that we are always dealing with a *copy* of pipe.
+        pipe = Pipeline([pipe], tags=tags)


Pipeline only takes an iterable of pipelines/nodes, hence wrapping this in [].
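
The copy-by-wrapping idea can be sketched with a toy class. `ToyPipeline` is a hypothetical stand-in that, by assumption, flattens any nested pipelines it receives; it is not Kedro's implementation.

```python
from typing import Iterable, List, Union


class ToyPipeline:
    """Hypothetical stand-in: built from an iterable of node names or pipelines."""

    def __init__(self, items: Iterable[Union[str, "ToyPipeline"]]):
        self.nodes: List[str] = []
        for item in items:
            if isinstance(item, ToyPipeline):
                # Nested pipelines are flattened into the new one.
                self.nodes.extend(item.nodes)
            else:
                self.nodes.append(item)


original = ToyPipeline(["clean", "train"])
# Wrapping in [] yields a new object with the same nodes.
copied = ToyPipeline([original])

# Passing the pipeline itself fails here, because the toy class is not iterable.
try:
    ToyPipeline(original)
    direct_call_failed = False
except TypeError:
    direct_call_failed = True
```

In this toy model the wrapped call gives a distinct object with equal contents, so later changes to the copy leave the original untouched.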

kedro/pipeline/modular_pipeline.py

RELEASE.md

merelcht

Nice work! 👍 Just some small comments from my side.

merelcht · 2022-01-17T10:26:55Z

RELEASE.md

@@ -1,3 +1,14 @@
+# Release 0.17.7


Yes I think it makes sense to make a small release before 0.18.0

merelcht · 2022-01-17T10:28:35Z

kedro/pipeline/modular_pipeline.py

        inputs: A name or collection of input names to be exposed as connection points
-            to other pipelines upstream.
+            to other pipelines upstream. This is optional; if not provided, the


merelcht · 2022-01-17T10:30:17Z

kedro/pipeline/modular_pipeline.py

            When str or Set[str] is provided, the listed input names will stay
            the same as they are named in the provided pipeline.
            When Dict[str, str] is provided, current input names will be
            mapped to new names.
            Must only refer to the pipeline's free inputs.
        outputs: A name or collection of names to be exposed as connection points
-            to other pipelines downstream.
+            to other pipelines downstream. This is optional; if not provided, the
+            pipeline inputs are automatically inferred from the pipeline structure.
            When str or Set[str] is provided, the listed output names will stay
            the same as they are named in the provided pipeline.
            When Dict[str, str] is provided, current output names will be
            mapped to new names.
            Can refer to both the pipeline's free outputs, as well as
            intermediate results that need to be exposed.
        parameters: A map of existing parameter to the new one.


Should we also explicitly say that parameters are optional, now that it's mentioned for inputs, outputs and tags?

I think this is ok without because, unlike the others, parameters is not a required argument to node so there's no real possible confusion here 🙂
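
The str / Set[str] / Dict[str, str] behaviour described in the docstring above can be pictured as normalising every form into an old-name to new-name map. The helper below, `to_rename_map`, is a hypothetical sketch for illustration and is not taken from the Kedro source.

```python
from typing import Dict, Set, Union


def to_rename_map(
    names: Union[str, Set[str], Dict[str, str], None]
) -> Dict[str, str]:
    """Hypothetical helper: normalise inputs/outputs into {current_name: new_name}."""
    if names is None:
        # Nothing provided: no explicit mapping is requested.
        return {}
    if isinstance(names, str):
        # A single name keeps its current name.
        return {names: names}
    if isinstance(names, dict):
        # A dict maps current names to new names.
        return dict(names)
    # A set (or other collection) of names: each keeps its current name.
    return {name: name for name in names}


kept = to_rename_map({"raw_data", "model"})
renamed = to_rename_map({"raw_data": "new_raw_data"})
```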

tests/pipeline/test_pipeline_helper.py

merelcht

🌟

idanov

Awesome 🎉 I love the additional changes which clean up some docs/typing debt ❤️

antonymilne added 3 commits January 10, 2022 16:57

Make node signature consistent

e34ded2

Update docstring for Pipeline.tag

85fccd0

Update pipeline function

29085ad

antonymilne commented Jan 10, 2022

kedro/pipeline/pipeline.py (outdated)

Update kedro/pipeline/pipeline.py

38ac519

antonymilne mentioned this pull request Jan 10, 2022

[KED-2419] Make pipeline and Pipeline consistent #1140

Closed

11 tasks

lorenabalan approved these changes Jan 13, 2022

antonymilne added 6 commits January 14, 2022 16:24

Update docstrings

4612b59

Add tests

275a3b9

Update cookie cutter template and test starter

bd9de67

Merge remote-tracking branch 'origin/feature/consistent-pipeline-api#ked-2419-2' into feature/consistent-pipeline-api#ked-2419-2

76dbddc

Add import

0b93f18

Add release note

020799e

antonymilne commented Jan 14, 2022

kedro/pipeline/modular_pipeline.py (outdated)

Update kedro/pipeline/modular_pipeline.py

9cb85a2

antonymilne self-assigned this Jan 14, 2022

Merge branch 'main' into feature/consistent-pipeline-api#ked-2419-2

0bdeedf

antonymilne marked this pull request as ready for review January 14, 2022 17:45

antonymilne requested a review from idanov as a code owner January 14, 2022 17:45

antonymilne requested review from lorenabalan and merelcht January 14, 2022 17:45

antonymilne commented Jan 14, 2022

RELEASE.md (outdated)

Update RELEASE.md

a96d0b8

antonymilne mentioned this pull request Jan 17, 2022

[KED-2419] Make pipeline and Pipeline consistent kedro-org/kedro-starters#59

Merged

3 tasks

merelcht reviewed Jan 17, 2022

Update test

ea4e090

merelcht approved these changes Jan 18, 2022

idanov approved these changes Jan 18, 2022

Merge branch 'main' into feature/consistent-pipeline-api#ked-2419-2

9351cb9

antonymilne merged commit 230763a into main Jan 18, 2022

antonymilne deleted the feature/consistent-pipeline-api#ked-2419-2 branch January 18, 2022 14:37

datajoely pushed a commit that referenced this pull request Jan 18, 2022

[KED-2419] Make pipeline and Pipeline consistent, take 2 (#1147)

2a11342

Signed-off-by: datajoely <joel.schwarzmann@quantumblack.com>

merelcht mentioned this pull request Feb 28, 2022

[KED-3055] Update docs to use pipeline rather than Pipeline #1299

Closed

lvijnck pushed a commit to lvijnck/kedro that referenced this pull request Apr 7, 2022

[KED-2419] Make pipeline and Pipeline consistent, take 2 (kedro-org#1147)

0bde80e

Signed-off-by: Laurens Vijnck <laurens_vijnck@mckinsey.com>

astrojuanlu mentioned this pull request Jun 25, 2023

Rectify "modular pipelines" terminology #2723

Open
