Allow multiple nodes to return same output #806
Comments
Hello @CapTen101. This is a constraint which is forced by the nature of the DAG that Kedro builds, which is supposed to provide a deterministic structure for a reproducible pipeline run. If two nodes have the same output then it's not clear what should happen once node 1 has written the output and node 2 is then run: does it overwrite the output? Append to it in some way? What exactly are you trying to achieve here? Maybe there is another way it's possible while working within the constraints of Kedro's graph structure. If you're not interested in running the DAG through …
As discussed on Discord here, I think we can close this issue.
@AntonyMilneQB It is basically a table in a database which is being read from or written to by different SQL jobs, and that is a real-world scenario which doesn't introduce any cycles into the DAG.
@CapTen101 one other suggestion from my colleague @limdauto was to monkey-patch out …
@datajoely Actually, I went into my virtual environment directory and commented out the line of code there. This way it'll work for all other visualization projects created inside that virtual environment.
Sorry to dig up an old closed issue, but I'm also running into this problem.
I'm trying to understand the above constraint (from @antonymilne's comment above). Why does Kedro care what happens when a function's result is given to an output DataSet? Is it not up to the DataSet to decide how to manage this? For example, given that node functions are expected to be pure functions, it might be useful to implement a write-only DataSet that simply writes a node's result to standard output. There's no reason such a class shouldn't be reusable as an output. Am I missing something?
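A write-only sink along these lines is easy to sketch. In a real project it would subclass `kedro.io.AbstractDataSet` and implement `_save`, `_load`, and `_describe`; the stdlib-only stand-in below just mirrors that method shape so it runs without Kedro installed (the class name and behavior are illustrative assumptions, not Kedro's API):

```python
from typing import Any, Dict


class StdoutDataSet:
    """Illustrative write-only dataset: persists nothing, just prints.

    A real Kedro implementation would subclass kedro.io.AbstractDataSet;
    this stand-in keeps the same _save/_load/_describe shape without
    requiring Kedro.
    """

    def _save(self, data: Any) -> None:
        # Writing is always allowed: dump the node's result to stdout.
        print(data)

    def _load(self) -> Any:
        # Loading is meaningless for a pure sink, so refuse it loudly.
        raise NotImplementedError("StdoutDataSet is write-only")

    def _describe(self) -> Dict[str, Any]:
        return {"type": "stdout sink"}


if __name__ == "__main__":
    sink = StdoutDataSet()
    sink._save({"rows_written": 42})
```

Because `_save` is idempotent from the catalog's point of view (nothing is stored), letting several nodes share such a sink would indeed be harmless; the constraint discussed below is about datasets in general, not this special case.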
Hi @benniedp, I'm re-reading this old conversation and it's not entirely clear to me how the DAG properties prevent a dataset from being connected to two incoming nodes. Also, I cannot open the Discord link @datajoely shared because I was never part of the server. I understand though that, from a Kedro perspective (not from a mathematical/DAG perspective), it's problematic to have two nodes return the same output with certain types of datasets. To your original question:
Let's call the nodes A and B. How should we decide whether the output should be AB or BA?
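The ambiguity is easy to make concrete. In this stdlib-only sketch (names are illustrative), nodes A and B both append to the same shared output, and the final value depends entirely on which execution order the runner happens to pick:

```python
def node_a(store):
    # Pretend this node appends its result to the shared output dataset.
    store.append("A")


def node_b(store):
    store.append("B")


def run(order):
    """Run the two nodes in the given order against a fresh output."""
    shared_output = []
    for node in order:
        node(shared_output)
    return "".join(shared_output)


# Both orders are valid topological sorts of the same graph,
# yet they produce different data: "AB" vs "BA".
print(run([node_a, node_b]))  # -> AB
print(run([node_b, node_a]))  # -> BA
```

Since Kedro makes no ordering promise between two nodes that are not connected by a dependency, forbidding the shared output is what keeps the run deterministic.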
@benniedp I think Kedro cares because it doesn't want users unknowingly/accidentally creating race conditions or unpredictable behavior. If Kedro quietly overwrote data, it could be extremely painful to debug. One can also argue that, in 95+% of situations, you don't want to overwrite a dataset during a run--why would it be useful?

An appendable, (possibly) write-only dataset is one situation where this is useful, as you've raised. Another situation where this is useful, from my experience, is with …

In both of these situations, Kedro stipulates additional restrictions--namely, that you can't use versioning. This is because, as soon as you enable versioning, you're writing to a different filepath, and therefore a different dataset. This could even come up for a single intended run, in case of failure/having to resume; you'll get a new timestamped version on the resumed run. There's no way that I'm aware of to solve this without changing the way versioning in Kedro works.

My guess is that, for these reasons--increased likelihood of wrong behavior, the relative infrequency of this need, the fact that there is a way to achieve the goal by defining additional datasets, and potential limitations of enabling this--the additional restriction on the Kedro DAG exists.
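The workaround of defining additional datasets can be sketched without Kedro at all: each writer gets its own named output, and a downstream node merges them, so the combined result no longer depends on the order in which A and B ran (function and dataset names here are illustrative):

```python
def node_a():
    return ["row_from_a"]  # would be written to dataset "part_a"


def node_b():
    return ["row_from_b"]  # would be written to dataset "part_b"


def combine(part_a, part_b):
    # The merge order is fixed by this node's signature,
    # not by the runner's scheduling of A and B.
    return part_a + part_b


# Regardless of whether A or B executes first, combine() sees both
# named inputs and produces the same deterministic result.
catalog = {}
for name, node in [("part_b", node_b), ("part_a", node_a)]:  # "B first"
    catalog[name] = node()
result = combine(catalog["part_a"], catalog["part_b"])
print(result)  # -> ['row_from_a', 'row_from_b']
```

In a real pipeline the two intermediate datasets can be `MemoryDataSet`s, so the extra bookkeeping costs nothing on disk.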
Hi everybody, I have a parquet table with columns ["date", "country", "store", "some_metrics"]. I initially designed the process by creating a node for each store, all writing to a table defined in the catalog as:
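The original catalog snippet was lost from this thread, but an entry along these lines would express the append behavior described (the dataset name, filepath, and partition columns below are assumptions; `spark.SparkDataset` with `mode: append` in `save_args` is the usual way to get append semantics):

```yaml
# conf/base/catalog.yml (illustrative)
store_metrics:
  type: spark.SparkDataset
  filepath: data/03_primary/store_metrics
  file_format: parquet
  save_args:
    mode: append                           # each store's node appends its rows
    partitionBy: ["date", "country", "store"]
```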
In that way, with a daily trigger, each node runs and adds its rows to the correct partition. The "append" mode guarantees that whatever order Kedro chooses for the nodes, the result is consistent. This architecture, though, is against the rule "each output has only one producing node"; in this ETL use case (which I believe is not that unique!), what would be the best practice for Kedro? I am open to any suggestion, thank you.
Description
I want to visualize a particular pipeline in which multiple nodes point to the same output. This gives me an error in the current version of Kedro, 0.17.4. Below is the error that I get:
The above nodes have different inputs, and they point to the same output.
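The error comes from Kedro's pipeline validation, which requires every output name to be unique across nodes. The check is conceptually simple; the following is a stdlib-only sketch of that rule (not Kedro's actual code, and the node names are made up):

```python
from collections import Counter


def check_unique_outputs(nodes):
    """Raise if any output name is produced by more than one node.

    `nodes` is a list of (node_name, [output_names]) pairs; this mirrors
    the uniqueness rule Kedro enforces when a Pipeline is built.
    """
    counts = Counter(out for _, outputs in nodes for out in outputs)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            f"Output(s) {duplicates} are returned by more than one node"
        )


# Two nodes with different inputs but the same declared output:
nodes = [("preprocess_us", ["master_table"]),
         ("preprocess_eu", ["master_table"])]
try:
    check_unique_outputs(nodes)
except ValueError as err:
    print(err)  # -> Output(s) ['master_table'] are returned by more than one node
```

Renaming either output (e.g. to `master_table_us` / `master_table_eu`) and merging them in a third node satisfies the check.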
Context
I just need to use Kedro-Viz for visualization of data lineage, and this scenario would benefit from this feature.