Dataset missing from lineage graph #2543
Comments
Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!
We have the exact same issue while working with OpenLineage and Spark. It would be great if this gets fixed soon. Without this, it's almost unusable.
@rkrao89 I very much agree with you. Unfortunately, there's zero response from the maintainers, which is a shame because the project looked very promising to us.
@yonivy please do not be so judgmental toward open source project maintainers who deliver code for you without any expectations, especially since the solution is actively being worked on in the OpenLineage repo, as that is the source of the problem, not the UI/backend part.
No judgment here, just observing the state of my question. It's also fine if OpenLineage won't solve my case (even though I think it's a common one); I was just hoping for some response. In any case, I'll just say that I was very happy to find out that OpenLineage exists, and I appreciate the work that open-source maintainers do, so apologies if it came out wrong. As for the draft PR you linked, it seems new, so I did not see it when I asked my question a month ago. It also seems Spark-specific (isn't it?), so it probably won't solve my case. I'll subscribe to it, so I appreciate the link :)
Some issues unfortunately can fall through the cracks; fortunately, we were already aware of the issue when you created this one. Thanks for understanding, we hope to solve the problem soon 🙂
Hi @mobuchowski, I may be wrong, but it seems the repo/PR addresses OpenLineage/OpenLineage#1965, which the OP referenced as raised by his coworker in the past. The issue the OP describes on this page, lineage not showing users_address, seems to be inherent to dynamic lineage: the latest run shows what the latest run knows and has no memory of prior runs. Maybe static lineage can come to the rescue; I would be really curious to know whether a solution exists for the issue the OP reported here. Thank you again for the great work the community is doing.
@githubopenlineageissues we want to recognize those jobs as inherently different. Say you have a Spark job, a microservice, or even a CI task that copies data from A to B, but you provide A and B when running the job. In reality, the only thing they have in common is that they share code; they are different "instances" of the job. This is logically similar to tasks in Airflow: you can have multiple PostgresOperators in a DAG, but that does not mean they are the same OpenLineage job.
Hey, we are having the same issue of orphaned datasets, pretty similar to OpenLineage/OpenLineage#1965.
@yonivy: you raise a very good point (and I also apologize to those on the thread for not getting back until now). I agree that this is broader and not specific to an OL integration (like Spark or Airflow), but there are some challenges in ensuring the completeness of lineage graphs. But first, let me outline what Marquez supports for lineage:
The Marquez model captures lineage from run to run, and that run-level lineage metadata can be queried, but there isn't an API (yet!) that given
The challenge of lineage graph completeness is that we would have to assume (and this would be a big assumption) that if a dataset was present on run
I've added this issue to our roadmap and will link it when we start working on run-level lineage (which will be within the next month or so). I hope this helps to clarify things. It doesn't solve the issue now, but I hope you will find the run-level lineage API useful. I would love initial thoughts on what I've outlined here (but also in my proposal) from yourself and anyone who has run into this issue (@rkrao89, @yonivy, @AryamanMishra).
A colleague of mine opened an issue in the OpenLineage repo and has received no response so far, so perhaps this is the right place to post issues :)
The issue we are facing is that Marquez seems to break lineage if the same logical job produces different datasets on different runs. Our reality (and, I believe, that of others as well) is that our processes are dynamic in their output. I do not think this is an edge case.
The use case is this:
Example
The example below is super simplified but I believe it paints the right picture.
Job name: users_etl
Job input: The last modified file(s) found in the path template s3:///users/{yyyy}/{mm}/{dd}
Run no. 1
The input file contains nested user info (first_name, last_name, email, address: {city, state}), so the job will update the users table (which has the first_name, last_name and email columns) and the users_address table (which has the city and state columns).
Output: users table, users_address table
Run no. 2
The input file contains flat user info (first_name, last_name, email), so the job will update the users table (which has the first_name, last_name and email columns).
Output: users table
The Problem
In Marquez I can only see the users table in the lineage of the users_etl job. The users_address dataset gets orphaned.
The state after Run no. 1
Everything is as expected.
The state after Run no. 2
Only the latest output is displayed, and the previous output is now completely detached from the lineage graph!
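To make the two states concrete, here is a toy model of the behavior described above. It is only a sketch: the events are simplified dicts loosely shaped like OpenLineage run events, the namespaces, run IDs and timestamps are made-up placeholders, and the two helper functions are hypothetical illustrations, not Marquez's actual implementation or API.

```python
# Placeholder job identity; the namespace value is an assumption for illustration.
JOB = {"namespace": "example", "name": "users_etl"}


def run_event(run_id, event_time, outputs):
    """Build a minimal dict loosely shaped like an OpenLineage COMPLETE event."""
    return {
        "eventType": "COMPLETE",
        "eventTime": event_time,  # placeholder ISO-8601 timestamp
        "run": {"runId": run_id},  # real events use UUIDs; shortened here
        "job": JOB,
        "outputs": [{"namespace": "example_db", "name": n} for n in outputs],
    }


events = [
    # Run no. 1: nested input, so both tables are written.
    run_event("run-1", "2023-01-01T00:00:00Z", ["users", "users_address"]),
    # Run no. 2: flat input, so only the users table is written.
    run_event("run-2", "2023-01-02T00:00:00Z", ["users"]),
]


def latest_run_outputs(events):
    """Outputs as seen when only the most recent run defines the job's lineage."""
    latest = max(events, key=lambda e: e["eventTime"])
    return {d["name"] for d in latest["outputs"]}


def all_runs_outputs(events):
    """Outputs as seen when lineage is accumulated across every run."""
    return {d["name"] for e in events for d in e["outputs"]}


latest = latest_run_outputs(events)
cumulative = all_runs_outputs(events)
orphaned = cumulative - latest

print(sorted(latest))      # ['users']
print(sorted(cumulative))  # ['users', 'users_address']
print(sorted(orphaned))    # ['users_address']
```

The difference between the two sets is exactly the dataset that drops out of the graph after Run no. 2.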
The Expectation
I expected to continue seeing the users_address table in the lineage graph. Without it, all I'm getting is last-run lineage, and while that is useful for some cases, it presents a confusing picture that does not reflect the real relationships between jobs and datasets. I mean, what am I supposed to understand about the users_address table: that it simply popped into existence?