Allow multiple nodes to return same output #806

CapTen101 · 2021-06-24T07:53:35Z

Description

I want to visualize a particular pipeline in which multiple nodes point to same output. This gives me an error in the current version of kedro 0.17.4. Below is the error that I get:

Error: Output(s) ['target_1', 'target_2'] are returned by more than one nodes. Node outputs must be unique.

node(
        node_dummy_fun,
        inputs=["source_1"],
        outputs="target_1",
        name="job_id_x2",
    ),
node(
        node_dummy_fun,
        inputs=None,
        outputs="target_1",
        name="random",
    ),

Above nodes have different inputs and they point to same outputs.

Context

I just need to use Kedro viz for visualization of data-lineage and this scenario can benefit from this feature.

The text was updated successfully, but these errors were encountered:

antonymilne · 2021-06-24T08:15:03Z

Hello @CapTen101. This is a constraint which is forced by the nature of the DAG that kedro builds, which is supposed to provide a deterministic structure for a reproducible pipeline run. If two nodes have the same output then it's not clear what should happen once node 1 has written the output and then node 2 is run: does it overwrite the output? Append to it in some way?

What exactly are you trying to achieve here? Maybe there is another way it's possible while working within the constraints of kedro's graph structure.

If you're not interested in running the DAG through kedro run but just looking to visualise a graph using kedro viz then this might already be possible through manually editing the json file that kedro viz uses to generate the graph.

datajoely · 2021-06-24T08:24:36Z

As discussed on Discord here I think we can close this issue

CapTen101 · 2021-06-24T09:07:33Z

@AntonyMilneQB

Hello @CapTen101. This is a constraint which is forced by the nature of the DAG that kedro builds, which is supposed to provide a deterministic structure for a reproducible pipeline run. If two nodes have the same output then it's not clear what should happen once node 1 has written the output and then node 2 is run: does it overwrite the output? Append to it in some way?

What exactly are you trying to achieve here? Maybe there is another way it's possible while working within the constraints of kedro's graph structure.

If you're not interested in running the DAG through kedro run but just looking to visualise a graph using kedro viz then this might already be possible through manually editing the json file that kedro viz uses to generate the graph.

It is basically a table in database which is being read or written to by different SQL jobs.. and that is a real word scenario which doesn't indroduce any DAGs.

CapTen101 · 2021-06-24T10:22:44Z

@AntonyMilneQB It is achievable with a small modification in the Kedro source code! I commented line no. 178 in kedro/pipeline/pipeline.py and it worked.

datajoely · 2021-06-24T10:39:28Z

Well done! Since the constraint is on the Kedro core side rather than the Viz side, perhaps you want to look into generating the JSON yourself longer term?

You can see what Kedro generates using this command:

I'm going to close this issue since it's an intended constraint on the core side - but Viz doesn't really care!

datajoely · 2021-06-24T10:40:31Z

@CapTen101 one other suggestion by my colleague @limdauto was to monkey patch out _validate_unique_outputs() so that you don't need to fork the library, but it is still a bit hacky.

CapTen101 · 2021-06-24T11:09:17Z

@datajoely Actually, I went into my virtual environment directory and commented the line of code there itself. This way it'll work for all other visualization projects created inside that virtual environment.

benniedp · 2023-07-13T10:45:46Z

Sorry to dig up an old closed issue, but I'm also running into this problem.

This is a constraint which is forced by the nature of the DAG that kedro builds, which is supposed to provide a deterministic structure for a reproducible pipeline run. If two nodes have the same output then it's not clear what should happen once node 1 has written the output and then node 2 is run: does it overwrite the output? Append to it in some way?

I'm trying to understand the above constraint (from @antonymilne's comment above). Why does Kedro care what happens when a function's result is given to an output DataSet? Is it not up to the DataSet to decide how to manage this?

For example, given that node functions are expected to be pure functions, it might be useful to implement a write-only DataSet that simply writes a node's result to standard output. There's no reason such a class shouldn't be reusable as an output. Am I missing something?

astrojuanlu · 2023-07-13T15:10:59Z

Hi @benniedp, I'm re-reading this old conversation and it's not entirely clear to me how the DAG properties prevent that datasets can be connected to two incoming nodes. Also, I cannot open the Discord link @datajoely shared because I was never part of the server.

I understand though that, from a Kedro perspective (not from a mathematical/DAG perspective) it's problematic to have two nodes return the same output with certain types of datasets.

To your original question:

For example, given that node functions are expected to be pure functions, it might be useful to implement a write-only DataSet that simply writes a node's result to standard output. There's no reason such a class shouldn't be reusable as an output.

Let's call the nodes A and B. How should we decide whether the output should be AB or BA?

deepyaman · 2023-07-14T00:10:51Z

I'm trying to understand the above constraint (from @antonymilne's comment above). Why does Kedro care what happens when a function's result is given to an output DataSet? Is it not up to the DataSet to decide how to manage this?

@benniedp I think Kedro cares because it doesn't want users unknowingly/accidentally creating race conditions/unpredictable behavior. If Kedro quietly overwrote data, it could be extremely painful to debug. One can also argue, in 95+% of situations, you don't want to overwrite a dataset during a run--why would it be useful?

An appendable, (possibly) write-only dataset is one situation where this is useful, as you've raised. The pandas.ExcelDataSet is an example of this, when used with append mode. Kedro expects you to be explicit and create a dataset for each output, even though you're modifying the same file. Thinking about it, I'm not sure how it would play with a non-sequential runner. A drawback of this approach is, if you want to use the dataset as an input downstream, you have to include all appended output datasets as input to the consuming node (to ensure execution order).

Another situation where this is useful, from my experience, is with PartitionedDataSets. I think it's quite possible to want to generate some set of partition outputs using some logic in one node, and another set for different partitions and with different logic in another node, but to consume them together. (Or maybe use different modular pipelines to all produce partitions in the same dataset, and consume them in a final node without a fake no-op "merge" step.)

In both of these situations, Kedro stipulates additional restrictions--namely that you can't use versioning. This is because, as soon as you enable versioning, you're writing to a different filepath, and therefore a different dataset. This could even come up for a single intended run, in case of failure/having to resume; you'll get a new timestamped version on the run. There's no way that I'm aware of to solve this, without changing the way versioning in Kedro works.

My guess is that, for these reasons--increased likelihood of wrong behavior, the relative infrequency of this need, the fact that there is a way to achieve the goal by defining additional datasets, and potential limitations of enabling this--the additional restriction on the Kedro DAG exists.

stefano-brambilla-venchi · 2024-07-04T15:15:47Z

Hi everybody,
I came across this issue when facing a specific use case. I understand the reason behind these restrictions, but then I am not sure what should be the BEST practice, according to Kedro, in my use case.

I have a parquet table with columns ["date", "country", "store", "some_metrics"].
I have specific functions that compute those "some metrics" daily for each store in each country.

I initially designed the process creating a node for each store, all writing to a table in catalog defined as:

my_table:
    filepath: my_path
    type: spark.SparkDataset
    file_format: parquet
    save_args:
        mode: append
        partitionBy: ["date", "country", "store"]

node(
    compute_metric_country_store,
    inputs={},
    outputs="my_table"
    name = my_node
)

In that way, with a daily trigger, each node should run and add their rows in the correct partition. The "append" mode guarantees that whatever order Kedro choose for the nodes, is consistent.
Giving a node for each store (hundreds) allow us to check clearly the failed nodes.
Having only one parquet file allow us to keep clarity in the process and not having a monster catalog.

This architecture, though, is against the rule "each output only in one node"; in this ETL use case (that I believe is not that unique!), what would it be the BEST pratice for Kedro?

I am open to any suggestion, thank you.

CapTen101 added the Issue: Feature Request New feature or improvement to existing feature label Jun 24, 2021

datajoely closed this as completed Jun 24, 2021

Galileo-Galilei mentioned this issue Sep 20, 2021

Universal Kedro deployment (Part 2) - Offer a unified interface to external compute and storage backends #904

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow multiple nodes to return same output #806

Allow multiple nodes to return same output #806

CapTen101 commented Jun 24, 2021

antonymilne commented Jun 24, 2021

datajoely commented Jun 24, 2021

CapTen101 commented Jun 24, 2021 •

edited

Loading

CapTen101 commented Jun 24, 2021

datajoely commented Jun 24, 2021

datajoely commented Jun 24, 2021 •

edited

Loading

CapTen101 commented Jun 24, 2021

benniedp commented Jul 13, 2023 •

edited

Loading

astrojuanlu commented Jul 13, 2023

deepyaman commented Jul 14, 2023

stefano-brambilla-venchi commented Jul 4, 2024 •

edited

Loading

Allow multiple nodes to return same output #806

Allow multiple nodes to return same output #806

Comments

CapTen101 commented Jun 24, 2021

Description

Context

antonymilne commented Jun 24, 2021

datajoely commented Jun 24, 2021

CapTen101 commented Jun 24, 2021 • edited Loading

CapTen101 commented Jun 24, 2021

datajoely commented Jun 24, 2021

datajoely commented Jun 24, 2021 • edited Loading

CapTen101 commented Jun 24, 2021

benniedp commented Jul 13, 2023 • edited Loading

astrojuanlu commented Jul 13, 2023

deepyaman commented Jul 14, 2023

stefano-brambilla-venchi commented Jul 4, 2024 • edited Loading

CapTen101 commented Jun 24, 2021 •

edited

Loading

datajoely commented Jun 24, 2021 •

edited

Loading

benniedp commented Jul 13, 2023 •

edited

Loading

stefano-brambilla-venchi commented Jul 4, 2024 •

edited

Loading