[KED-2419] Make pipeline and Pipeline consistent, take 2 #1147

Merged
merged 15 commits into from
Jan 18, 2022

Contributor

@antonymilne antonymilne commented Jan 10, 2022

Description

Currently we have node and Node, which are exactly equivalent. We encourage users to do node, which just instantiates Node.

We also have pipeline and Pipeline, which are quite different. Pipeline is the underlying class used to create a pipeline; pipeline is then used to transform it according to a namespace and a dataset input/output mapping. The current usage would be:

initial_pipeline = Pipeline([node1, node2, node3], tags=...)
transformed_pipeline = pipeline(initial_pipeline, inputs=..., outputs=..., namespace=...)

(Of course you don't have to transform the pipeline if the initial_pipeline is all you need.)

To be more consistent with node/Node and reduce confusion, we want to make all the options available in Pipeline also available to pipeline. We would then encourage users to just use pipeline. There would be no need for the pipeline(Pipeline()) construction, as users would instead just do pipeline([node1, node2, node3], tags=..., inputs=..., outputs=..., namespace=...).

Note that everything here is actually a non-breaking change, since the only arguments added to pipeline/Pipeline are keyword-only and optional.
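
To make the proposed calling convention concrete, here is a minimal sketch built on toy stand-ins rather than Kedro's real classes. ToyNode, ToyPipeline and toy_pipeline are hypothetical names, only dict mappings are handled for inputs/outputs, and the namespacing rule (prefixing unmapped dataset names with "namespace.") is a simplification assumed for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Union


@dataclass(frozen=True)
class ToyNode:
    """Hypothetical stand-in for a node: a name plus dataset names and tags."""
    name: str
    inputs: tuple
    outputs: tuple
    tags: frozenset = frozenset()


class ToyPipeline:
    """Hypothetical stand-in for the underlying Pipeline class."""

    def __init__(self, nodes: Iterable[ToyNode], *,
                 tags: Union[str, Iterable[str], None] = None):
        # A single string is one tag, not an iterable of characters.
        extra = {tags} if isinstance(tags, str) else set(tags or ())
        self.nodes = [ToyNode(n.name, n.inputs, n.outputs, n.tags | extra)
                      for n in nodes]


def toy_pipeline(pipe, *, inputs: Optional[Dict[str, str]] = None,
                 outputs: Optional[Dict[str, str]] = None,
                 namespace: Optional[str] = None, tags=None) -> ToyPipeline:
    """Accept either a list of nodes or an existing ToyPipeline."""
    nodes = pipe.nodes if isinstance(pipe, ToyPipeline) else pipe
    built = ToyPipeline(nodes, tags=tags)
    if inputs is None and outputs is None and namespace is None:
        return built  # nothing to transform

    rename = {**(inputs or {}), **(outputs or {})}

    def new_name(dataset: str) -> str:
        # Explicitly mapped names win; everything else gets the namespace prefix.
        if dataset in rename:
            return rename[dataset]
        return f"{namespace}.{dataset}" if namespace else dataset

    return ToyPipeline([
        ToyNode(n.name, tuple(map(new_name, n.inputs)),
                tuple(map(new_name, n.outputs)), n.tags)
        for n in built.nodes
    ])


nodes = [ToyNode("clean", ("raw",), ("clean_data",)),
         ToyNode("train", ("clean_data",), ("model",))]

# Proposed style: one call does construction and transformation.
one_step = toy_pipeline(nodes, tags="demo", inputs={"raw": "shared_raw"}, namespace="ds")
# Existing style: construct first, then transform.
two_step = toy_pipeline(ToyPipeline(nodes, tags="demo"),
                        inputs={"raw": "shared_raw"}, namespace="ds")
```

In this sketch both styles produce the same nodes, and because every new argument is keyword-only with a default of None, a call that passes only the node list behaves as before.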

Question

Do you like this change? @lorenabalan was concerned that it conflates the notion of pipeline and modular pipeline, and the result will be more rather than less confusing to users.

Looking at the arguments that pipeline now takes, it does seem a bit confusing to me too. As a user, when I create my first pipeline it makes sense to me that I can specify the namespace and tags. But why do I have the option of specifying the inputs and outputs? Shouldn't those be taken from the structure of the DAG built up from the nodes provided? (Yes, they are: inputs/outputs/parameters are only needed if you want to transform those to something else. But somehow these being arguments to the same function that takes tags as an argument doesn't feel quite right to me, like we're mixing two things up.)

But also I do think it's nicer to be able to do pipeline() rather than pipeline(Pipeline()). Overall I think this is an improvement but I'm not totally convinced yet.

TODO

This PR

  • Update tests
  • Update release notes

This ticket, new PR(s)

  • Update starters

Separate tickets

  • Update our docs to do pipeline rather than Pipeline
  • Maybe ❓ Move pipeline from modular_pipeline.py to pipeline.py and delete modular_pipeline.py. This would break any imports of the form "from kedro.pipeline.modular_pipeline import pipeline" but not "from kedro.pipeline import pipeline"

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@@ -599,7 +599,7 @@ def node(
     outputs: Union[None, str, List[str], Dict[str, str]],
     *,
     name: str = None,
-    tags: Iterable[str] = None,
+    tags: Union[str, Iterable[str]] = None,
Contributor Author

This should always have been this way so that node and Node match identically.
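
One reason the str case deserves its own branch: a bare string is itself iterable, so treating it as Iterable[str] would split it into characters. Below is a minimal sketch of the normalization this annotation implies. normalize_tags is a hypothetical helper name, not necessarily how Kedro handles tags internally.

```python
from typing import Iterable, Set, Union


def normalize_tags(tags: Union[str, Iterable[str], None]) -> Set[str]:
    """Return tags as a set, treating a single string as one tag."""
    if tags is None:
        return set()
    if isinstance(tags, str):
        # Without this branch, set("ab") would yield {"a", "b"}.
        return {tags}
    return set(tags)


single = normalize_tags("preprocessing")
several = normalize_tags(["preprocessing", "training"])
```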

@merelcht
Member

I think this change makes sense, and after our chat I can say that I also like this change 🙂

I don't necessarily think that this change makes the notion of pipeline and modular pipeline, and how to create them, more confusing than the current state. It's not immediately clear from Pipeline and pipeline which one is used to create a modular pipeline. However, we should help users understand how to use pipeline to create a "regular" or "modular" pipeline, probably through the docstring and extra documentation, and on top of that explain when and why they need to provide the other arguments.

Contributor

@lorenabalan lorenabalan left a comment

LGTM, think the more I look at it the more it grows on me. Would be good to add some tests.

if not isinstance(pipe, Pipeline):
    pipe = Pipeline(pipe, tags=tags)

if inputs is None and outputs is None and parameters is None and namespace is None:
Contributor

Optional: we could maybe simplify this condition with all, something like:

all(x is None for x in (inputs, outputs...))
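
A small sketch of that suggestion, written as a standalone function so the two forms can be compared. The four argument names come from the condition under review; is_plain_pipeline is a hypothetical name used only for this illustration.

```python
def is_plain_pipeline(inputs=None, outputs=None, parameters=None, namespace=None) -> bool:
    """True when no transformation argument was supplied."""
    return all(x is None for x in (inputs, outputs, parameters, namespace))


def is_plain_pipeline_explicit(inputs=None, outputs=None, parameters=None, namespace=None) -> bool:
    """The original chained form, kept for comparison."""
    return inputs is None and outputs is None and parameters is None and namespace is None


# Note the identity check: an empty dict is falsy but is not None,
# so it still counts as "an argument was provided" in both forms.
nothing_given = is_plain_pipeline()
empty_mapping_given = is_plain_pipeline(inputs={})
```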

Contributor Author

@antonymilne antonymilne left a comment

I've finished updating the docstrings and project template and added some tests, so please do re-review and let me know what you think. I'll update docs and starters in separate PRs.

I wasn't sure exactly how many tests to add for this, or how many existing ones to change from Pipeline to pipeline, so let me know if you think I should do some more there. I did manually find and replace all instances of Pipeline() with pipeline() and all tests still passed, which is a good sign that people can now always use pipeline.

@@ -1,3 +1,14 @@
# Release 0.17.7
Contributor Author

Will this exist? 🤔

Member

Yes I think it makes sense to make a small release before 0.18.0

"""
if isinstance(pipe, Pipeline):
    # To ensure that we are always dealing with a *copy* of pipe.
    pipe = Pipeline([pipe], tags=tags)
Contributor Author

Pipeline only takes an iterable of pipelines/nodes, hence wrapping this in [].
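
The wrapping trick can be illustrated with a toy class whose constructor, like the one described here, accepts an iterable of nodes and/or pipelines and flattens it. MiniPipeline is a hypothetical stand-in; the real Pipeline does considerably more (validation, dependency ordering), so this only shows the copy semantics.

```python
class MiniPipeline:
    """Hypothetical stand-in: built from an iterable of nodes and/or pipelines."""

    def __init__(self, items, *, tags=None):
        self.nodes = []
        for item in items:
            if isinstance(item, MiniPipeline):
                # Nested pipelines are flattened into their nodes.
                self.nodes.extend(item.nodes)
            else:
                self.nodes.append(item)
        self.tags = set() if tags is None else set(tags)


original = MiniPipeline(["node1", "node2"])

# Wrapping in a list yields a new object holding the same nodes.
copy = MiniPipeline([original], tags=["extra"])

# Passing the pipeline bare fails in this toy, since it is not iterable.
try:
    MiniPipeline(original)
    bare_call_failed = False
except TypeError:
    bare_call_failed = True
```

Because the copy owns its own node list, later changes to it leave the pipeline the caller passed in untouched.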

@antonymilne antonymilne self-assigned this Jan 14, 2022
@antonymilne antonymilne marked this pull request as ready for review January 14, 2022 17:45
Member

@merelcht merelcht left a comment

Nice work! 👍 Just some small comments from my side.

 inputs: A name or collection of input names to be exposed as connection points
-    to other pipelines upstream.
+    to other pipelines upstream. This is optional; if not provided, the
Member

👍

     When str or Set[str] is provided, the listed input names will stay
     the same as they are named in the provided pipeline.
     When Dict[str, str] is provided, current input names will be
     mapped to new names.
     Must only refer to the pipeline's free inputs.
 outputs: A name or collection of names to be exposed as connection points
-    to other pipelines downstream.
+    to other pipelines downstream. This is optional; if not provided, the
+    pipeline inputs are automatically inferred from the pipeline structure.
     When str or Set[str] is provided, the listed output names will stay
     the same as they are named in the provided pipeline.
     When Dict[str, str] is provided, current output names will be
     mapped to new names.
     Can refer to both the pipeline's free outputs, as well as
     intermediate results that need to be exposed.
 parameters: A map of existing parameter to the new one.
Member

Should we also explicitly say that parameters are optional, now that it's mentioned for inputs, outputs and tags?

Contributor Author

I think this is ok without because, unlike the others, parameters is not a required argument to node so there's no real possible confusion here 🙂
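
The docstring under review describes three accepted shapes for inputs and outputs: a single name, a set of names, or a dict of old-to-new names. One way to picture it is that all three reduce to a rename mapping, sketched below. to_mapping is a hypothetical helper written for this illustration and is not claimed to match Kedro's internal implementation.

```python
from typing import Dict, Optional, Set, Union


def to_mapping(names: Union[str, Set[str], Dict[str, str], None]) -> Dict[str, str]:
    """Normalize the accepted argument shapes into an old-name -> new-name dict."""
    if names is None:
        return {}                      # nothing exposed explicitly
    if isinstance(names, str):
        return {names: names}          # single name, kept as-is
    if isinstance(names, dict):
        return dict(names)             # explicit renames
    return {name: name for name in names}  # collection of names, kept as-is


keep_one = to_mapping("raw_data")
keep_many = to_mapping({"raw_data", "lookup"})
rename = to_mapping({"raw_data": "shared_raw_data"})
```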

Member

@merelcht merelcht left a comment

🌟

Member

@idanov idanov left a comment

Awesome 🎉 I love the additional changes which clean up some docs/typing debt ❤️

@antonymilne antonymilne merged commit 230763a into main Jan 18, 2022
@antonymilne antonymilne deleted the feature/consistent-pipeline-api#ked-2419-2 branch January 18, 2022 14:37
datajoely pushed a commit that referenced this pull request Jan 18, 2022
Signed-off-by: datajoely <joel.schwarzmann@quantumblack.com>
lvijnck pushed a commit to lvijnck/kedro that referenced this pull request Apr 7, 2022
Signed-off-by: Laurens Vijnck <laurens_vijnck@mckinsey.com>