Pipeline builder #3865

mle-els · 2023-06-26T07:11:35Z

mle-els
Jun 26, 2023

Description

I doesn't like the way pipelines are specified. It's verbose and hard to read. When looking for workflow software, I almost skipped Kedro because the pipeline code is so ugly.

Context

A better way of building pipelines would make my code more pleasant to write and read. It should help other people too and help Kedro recruit more users.

Possible Implementation

UPDATE: THIS IS A NEW PROPOSAL:

@kedro_function
def func1(x) -> Any:
    ...

@kedro_function
def func2(y) -> Any:
    ...

def create_pipeline():
    with kedro_pipeline_builder() as pb:
        y = func1('input', output_names='dataset_y')
        out = func2(y, output_names='dataset_z')
        return pb.build()

END UPDATE

I have made a simple implementation of a pipeline builder, modeling after the spaceflights example. The pipeline can be built like this which works exactly the same way as the original pipeline:

pb = PipelineBuilder()
preprocessed_companies = pb.call(preprocess_companies, 'companies', name='preprocess_companies_node')
preprocessed_shuttles = pb.call(preprocess_shuttles, 'shuttles', name='preprocess_shuttles_node')
model_input_table = pb.call(create_model_input_table, preprocessed_shuttles, 
                            preprocessed_companies, "reviews", name='create_model_input_table_node')
X_train, X_test, y_train, y_test = pb.call(split_data, model_input_table, "params:model_options", 
                                           name="split_data_node")
regressor = pb.call(train_model, X_train, y_train, name="train_model_node")
pb.call(evaluate_model, regressor, X_test, y_test, name="evaluate_model_node")
pip = pb.build(names=locals())

Executable code can be found in https://github.com/mle-els/kedro-pipeline-builder

astrojuanlu · 2023-06-26T08:00:57Z

astrojuanlu
Jun 26, 2023
Maintainer

Hi @mle-els, thanks for your suggestion. It's cool to see that it only took you ~60 lines of code to customize Kedro to your needs!

However, I don't see how your PipelineBuilder interacts with the Kedro catalog. Looks like the first inputs of pb.call are datasets, but then intermediate results are stored in variables, am I right?

I noticed this in your README:

Because datasets are now variables that are returned and passed to another function, typos can be easily detected.

Could you describe this problem in a bit more detail?

0 replies

mle-els · 2023-06-26T11:54:58Z

mle-els
Jun 26, 2023
Author

Hi @astrojuanlu, thanks for your response. PipelineBuilder uses the name of the local variables as the name of the corresponding dataset (hence the use of locals()). If a dataset doesn't have a variable name or we don't pass locals() when invoking build(), a name will be automatically generated, such as dataset1.

The bit about typos: when dataset names are strings, IDEs can't check that they actually match. I ran into this problem before when some code changes lead to the wrong dataset being used. When dataset names are variable names, this kind of issues will be caught sooner.

0 replies

astrojuanlu · 2023-06-26T13:00:01Z

astrojuanlu
Jun 26, 2023
Maintainer

Thanks @mle-els. If I understand correctly there are two issues at hand:

Current syntax to specify the pipelines is "verbose", "hard to read", and "ugly". I tend to agree on the verbosity, but the ugliness is a matter of taste I'd say. Some things we could do to reduce the verbosity:
- Assign an appropriate node name if one is not given instead of setting it to None, so that most users don't need to specify the name (we save 1 line per node).
- Explore a decorator syntax [To discuss] @node decorator syntax #2471
It's difficult to detect mistakes and typos in dataset names. I think this is worth studying, I've heard of similar struggles but I don't think we have an issue capturing it. Things we could do:
- Create a VSCode extension (no need to change the syntax, unclear how difficult this could be)
- Propose an alternative API that makes introspection easier.

What do you think of this assessment?

0 replies

noklam · 2023-06-26T13:43:38Z

noklam
Jun 26, 2023
Collaborator

Linking this kedro-lsp for the VSCode Extension, I also vaguely remember a conversation that we may get a team to make a vscode extension for kedro.

0 replies

mle-els · 2023-06-26T13:55:38Z

mle-els
Jun 26, 2023
Author

Hi @astrojuanlu, I thought of the decorator syntax but didn't find it satisfactory because of 2 reasons:

It conflates functions and nodes. I have functions that are reused in different pipelines (and want to keep open the ability to reuse others in new pipelines). Tying functions to specific inputs and outputs makes reuse harder.
It reduces readability because the pipeline is now scattered in different places. To grasp a pipeline, one would need to match input/output dataset names and scroll up and down to collect the nodes in their head. Whereas with the proposed pipeline builder, everything is in one block and one can understand a pipeline with just a glance.

About VSCode extension, not everyone uses VSCode. I think an alternative API will apply more broadly. Plus, installing an extension is an extra step added to the learning curve.

I think the most important feature of the proposed builder is that it looks just like normal Python code. Everyone is familiar with how to write and comprehend a Python function. By being friendly and intuitive, you'll make it more attractive for new users.

0 replies

astrojuanlu · 2023-06-26T14:18:27Z

astrojuanlu
Jun 26, 2023
Maintainer

Clarification on "VSCode extension": the Language Server Protocol (LSP) is now understood by VSCode, vim, JupyterLab, and others. So kedro-lsp is not exclusive to VSCode. Looks like the only major editor missing LSP support is the IntelliJ family.

Having said that, I sympathize with not tying the Kedro DX to a specific family of editors, and exploring alternative APIs that make inspection easy. However, the PipelineBuilder API you propose is quite different from the way things are usually done in Kedro, so it's not something we'd start working on without giving it a lot of thought. I personally don't like having to use locals() to consolidate all inputs (could be error prone if there are other intermediate variables) and having to repeat pb.call for every node. Also, this creates an API in which some inputs are strings that get parsed from the catalog and some inputs are variables.

Just to set the expectations clear, although this touches adjacent issues that are already in our radar, we have limited resources and I don't think we're going to look into this particular API any time soon.

0 replies

mle-els · 2023-06-27T07:25:41Z

mle-els
Jun 27, 2023
Author

HI @astrojuanlu, here is an alternative design that I thought of but it's harder to pull off so I went for pb.call() instead. If you think it's worth considering then I can draft some code.

@kedro_function
def func1(x) -> Any:
    ...

@kedro_function
def func2(y) -> Any:
    ...

def create_pipeline():
    with kedro_pipeline_builder() as pb:
        y = func1('input', output_names='dataset_y')
        out = func2(y, output_names='dataset_z')
        return pb.build()

0 replies

mle-els · 2023-06-27T07:32:13Z

mle-els
Jun 27, 2023
Author

We could also wrap datasets and params into variables like pb.dataset('input') and pb.param('param') but I don't think it adds to readability. Using a mixture of strings and variables looks fine to me.

0 replies

dertilo · 2023-12-06T11:29:21Z

dertilo
Dec 6, 2023

actually the need for another tool (IDE-extension/kedro-lsp) makes it quite obvious that kedro's pipeline creation api has (very) low usability
the main issue lies within kedro's approach to construct DAGs (deeply nested things) via lists (flat things) of nodes that are stitched together via string-literals (quite "unpythonic")
here an example from the kedro-docs:

variance_pipeline = pipeline(
    [ # flat list makes it hard to "see" the DAG
        node(len, "xs", "n"), # stitched together by strings encoding names of input/output variables
        node(mean, ["xs", "n"], "m", name="mean_node"),
        node(mean_sos, ["xs", "n"], "m2", name="mean_sos"),
        node(variance, ["m", "m2"], "v", name="variance_node"),
    ]
)

why not "simply" use nested function-calls like dask does it?

# "flat" -> one node per statement
n = len(x)
m = mean(x, n)
m2 = mean_sos(x, n)
v = variance(m, m2)

# "nested" -> with some imagination one can already "see" the DAG
n = len(x)
v = variance(
    m=mean(xs=x, n=n),
    m2=mean_sos(xs=x, n=n)
)

if one does not like to imagine DAGs from python code one could simply use mermaid graphs

graph TD
    a[x]-->|*arg|len[len]

    len-->|n|mean[mean]
    a-->|xs|mean[mean]


    len-->|n|sos[mean_sos]
    a-->|xs|sos[mean_sos]
    
    sos-->|m2|var[variance]
    mean-->|m|var[variance]

personally I prefere nesting dataclasses instead of functions in order to construct DAGs

0 replies

astrojuanlu · 2023-12-06T16:16:43Z

astrojuanlu
Dec 6, 2023
Maintainer

Hi @dertilo , thanks for chiming in.

Your proposal suffers from the same problem as the original one:

n = len(x)
v = variance(
    m=mean(xs=x, n=n),
    m2=mean_sos(xs=x, n=n)
)

where does x come from?

0 replies

dertilo · 2023-12-08T09:29:26Z

dertilo
Dec 8, 2023

in this specific example x seems to be an inmemory variable
in general such "input-nodes" could load some data from some storage, example:

@dataclass
class SomeDatum:
    some_id:str
    data: WhateverData # text,image,video,array,tensor, who-cares
    more_data: FooBar

@dataclass
class SomeInputNode(Iterable[SomeDatum]):
    """
    this could be an input-node to some data-processing DAG
    """
    name:str

    def __iter__(self) -> Iterator[SomeDatum]:
        # some data-loading code here

x=SomeInputNode("foobar")

graph TD
    a[SomeInputNode]-->|x|b[some DAG here]

0 replies

astrojuanlu · 2023-12-08T16:45:22Z

astrojuanlu
Dec 8, 2023
Maintainer

Thanks @dertilo - then for the "free inputs" one would still need to use Python strings.

One of the nice things of the current design is that free inputs and intermediate results are all treated the same, which gives the pipeline definition a uniform aspect.

Another useful thing is that the "wrapping" of the nodes happens outside of the node definition. This was discussed in #2471 - by having decorators like other frameworks and SDKs, the nodes are coupled with the framework in question. In Kedro, nodes can be reused because they are regular, pure functions that do no I/O (see https://sans-io.readthedocs.io/, https://rhodesmill.org/brandon/slides/2015-05-pywaw/hoist/).

the main issue lies within kedro's approach to construct DAGs (deeply nested things) via lists (flat things) of nodes

Mathematically, a graph is a set of nodes $G = (V, E)$.

that are stitched together via string-literals (quite "unpythonic")

At some point the user needs to specify the names of the datasets, either for everything (current design) or for the free inputs (proposed designs). Dataset names are no different from dictionary keys that represent something else - with the caveat that, yes, such keys are not immediately available in the IDE.

0 replies

astrojuanlu · 2024-05-13T08:10:19Z

astrojuanlu
May 13, 2024
Maintainer

Hi! @mle-els in case it helps, we released a VSCode extension that makes navigating between the Python and YAML files from a Kedro project a breeze https://marketplace.visualstudio.com/items?itemName=kedro.Kedro

Turning this issue into a Discussion to continue the conversation there

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pipeline builder #3865

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 13 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Pipeline builder #3865

mle-els Jun 26, 2023

Description

Context

Possible Implementation

Replies: 13 comments

astrojuanlu Jun 26, 2023 Maintainer

mle-els Jun 26, 2023 Author

astrojuanlu Jun 26, 2023 Maintainer

noklam Jun 26, 2023 Collaborator

mle-els Jun 26, 2023 Author

astrojuanlu Jun 26, 2023 Maintainer

mle-els Jun 27, 2023 Author

mle-els Jun 27, 2023 Author

dertilo Dec 6, 2023

astrojuanlu Dec 6, 2023 Maintainer

dertilo Dec 8, 2023

astrojuanlu Dec 8, 2023 Maintainer

astrojuanlu May 13, 2024 Maintainer

mle-els
Jun 26, 2023

astrojuanlu
Jun 26, 2023
Maintainer

mle-els
Jun 26, 2023
Author

astrojuanlu
Jun 26, 2023
Maintainer

noklam
Jun 26, 2023
Collaborator

mle-els
Jun 26, 2023
Author

astrojuanlu
Jun 26, 2023
Maintainer

mle-els
Jun 27, 2023
Author

mle-els
Jun 27, 2023
Author

dertilo
Dec 6, 2023

astrojuanlu
Dec 6, 2023
Maintainer

dertilo
Dec 8, 2023

astrojuanlu
Dec 8, 2023
Maintainer

astrojuanlu
May 13, 2024
Maintainer