
Dataset missing from lineage graph #2543

Open
yonivy opened this issue Jul 16, 2023 · 10 comments

@yonivy commented Jul 16, 2023

A colleague of mine opened an issue in the OpenLineage repo and has received no response so far, so perhaps this is the right place to post issues in :)

The issue we are facing is that Marquez seems to break lineage if the same logical job produces different datasets on different runs. Our reality (and, I believe, that of others as well) is that our processes are dynamic in their output. I do not think this is an edge case.

The use case is this:

  • We have a logical ETL job which is scheduled to run a few times during the day.
  • The job pushes data into tables based on the contents of the input files (which are in S3).

Example

The example below is super simplified but I believe it paints the right picture.

Job name: users_etl
Job input: The last modified file(s) found in the path template s3:///users/{yyyy}/{mm}/{dd}

Run no. 1

The input file contains nested user info (first_name, last_name, email, address: {city, state}), so the job will update the users table (which has the first_name, last_name and email columns) and the users_address table (which has the city and state columns).

Output:

  • users table
  • users_address table

Run no. 2

The input file contains flat user info (first_name, last_name, email) so the job will update the users table (which has the first_name, last_name and email columns).

Output:

  • users table
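
For illustration only, here is a rough sketch (not our actual pipeline) of the OpenLineage events the two runs above would emit, using the openlineage-python client; the namespaces, bucket, producer URI, and Marquez URL are made up:

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumed local Marquez
PRODUCER = "https://example.com/users_etl"               # hypothetical producer URI
JOB = Job(namespace="example", name="users_etl")

def emit_complete(inputs, outputs):
    """Emit a COMPLETE event for users_etl with the given inputs/outputs."""
    client.emit(RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=JOB,
        producer=PRODUCER,
        inputs=inputs,
        outputs=outputs,
    ))

# Run no. 1: nested input -> two output tables.
emit_complete(
    inputs=[Dataset("s3://example-bucket", "users/2023/07/15")],  # hypothetical path
    outputs=[Dataset("warehouse", "users"), Dataset("warehouse", "users_address")],
)

# Run no. 2: flat input -> one output table. users_address is no longer reported,
# so the current graph in Marquez only shows the latest run's edges.
emit_complete(
    inputs=[Dataset("s3://example-bucket", "users/2023/07/16")],  # hypothetical path
    outputs=[Dataset("warehouse", "users")],
)
```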

The Problem

In Marquez I can only see the users table in the lineage of the users_etl job. The users_address dataset gets orphaned.

The state after Run no. 1

Everything is as expected.

[screenshot: lineage graph after Run no. 1]

The state after Run no. 2

Only the latest output is displayed.

[screenshot: lineage graph after Run no. 2]

and the previous output is now completely detached from the lineage graph!

[screenshot: users_address detached from the lineage graph]

The Expectation

I expected to continue to see the users_address table in the lineage graph. Without it, all I'm getting is last-run lineage, and while that is useful in some cases, it presents a confusing picture which does not reflect the reality of the relationships between jobs and datasets. I mean, what can I understand about the users_address table? That it simply popped into existence?

@boring-cyborg bot commented Jul 16, 2023

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

@rkrao89 commented Aug 9, 2023

We have the exact same issue while working with OpenLineage and Spark. It would be great if this gets fixed soon. Without this, it's almost unusable.

@yonivy (Author) commented Aug 10, 2023

@rkrao89 I very much agree with you. Unfortunately, there's been zero response from the maintainers, which is a shame, because the project looked very promising to us.

@mobuchowski (Contributor) commented Aug 10, 2023

@yonivy please do not be so judgmental of open-source project maintainers who deliver code for you without any expectations... especially since the solution is actively being worked on in the OpenLineage repo - as it's the source of the problem, not the UI/backend part.

@yonivy (Author) commented Aug 10, 2023

No judgment here, just observing the state of my question. It's also fine if OpenLineage won't solve my case (even though I think it's a common one); I was just hoping for some response. In any case, I'll just say that I was very happy to find out that OpenLineage exists, and I appreciate the work that open-source maintainers do, so apologies if it came out wrong.

As for the draft PR you linked, it seems new, so I did not see it when I asked my question a month ago, but it seems Spark-specific (isn't it?), so it probably won't solve my case. I'll subscribe to it anyway, and I appreciate the link :)

@mobuchowski (Contributor) commented:

Some issues can unfortunately fall through the cracks - fortunately, we were already aware of the issue when you created this one. Thanks for understanding; we hope to solve the problem soon 🙂

@githubopenlineageissues commented:

Hi @mobuchowski, I may be wrong, but it seems the repo/PR addresses OpenLineage/OpenLineage#1965, which the OP referenced as having been raised by his coworker in the past. The issue the OP is describing on this page, about lineage not showing users_address, seems to be inherent to dynamic lineage: the latest run shows only what the latest run knows, and it has no memory of prior runs. Maybe static lineage could come to the rescue; I would be really curious to know if a solution exists for the issue the OP reported on this page. Thank you again for the great work the community is doing.

@mobuchowski (Contributor) commented:

@githubopenlineageissues we want to recognize those jobs as inherently different - let's say you have a Spark job, a microservice, or even a CI task that copies data from A to B, but you provide those A and B when running the job. So, in reality, the only thing they have in common is the fact that they share code - they are different "instances" of those jobs. This is logically similar to tasks in Airflow: you can have multiple PostgresOperators in a DAG, but that does not mean they are the same OpenLineage job.
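
For illustration (a rough sketch with made-up names): in terms of the Python client, the two variants of the job in this issue would ideally map to two distinct OpenLineage jobs rather than one shared users_etl node:

```python
from openlineage.client.run import Job

# Two "instances" of the same code, parameterized differently, treated as
# distinct OpenLineage jobs (the names here are made up for illustration).
nested_variant = Job(namespace="example", name="users_etl.users_and_address")
flat_variant = Job(namespace="example", name="users_etl.users_only")
```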

@AryamanMishra commented Nov 29, 2023

Hey we are having the same issue of orphaned datasets, pretty similar to OpenLineage/OpenLineage#1965.
Any leads?

@wslulciuc modified the milestones: 0.46.0, Roadmap (Jan 30, 2024)
@wslulciuc (Member) commented Jan 30, 2024

@yonivy: you raise a very good point (and I also apologize to those on the thread for not getting back until now). I agree that this is broader and not specific to an OL integration (like Spark or Airflow), but there are some challenges in ensuring the lineage graph's completeness. But first, let me outline what Marquez supports for lineage:

  1. Static lineage: the current graph (i.e., built from the most recent OL events that have been collected on the backend -- this is what you are seeing now).
  2. Column-level lineage: given a runID, returns the column lineage at the time of job execution (relative to the run).
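
For reference, a rough sketch of querying the current graph and column lineage from the Marquez HTTP API (assuming a local Marquez API on port 5000; the node IDs reuse made-up names from the example in this issue):

```python
import requests

MARQUEZ_API = "http://localhost:5000/api/v1"  # assumed local Marquez API

# 1. Static lineage: the graph built from the most recent OL events.
static_graph = requests.get(
    f"{MARQUEZ_API}/lineage",
    params={"nodeId": "dataset:warehouse:users", "depth": 2},
).json()

# 2. Column-level lineage for a dataset node.
column_lineage = requests.get(
    f"{MARQUEZ_API}/column-lineage",
    params={"nodeId": "dataset:warehouse:users"},
).json()
```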

The Marquez model captures lineage from run to run, and that run-level lineage metadata can be queried, but there isn't an API (yet!) that, given a runID, will return a lineage snapshot at the time of job execution (similar to column lineage). We do have a proposal that would help resolve what you (and others) are seeing, and it is on our roadmap. The API will support run-level lineage: given a runID, it will return the edges that are no longer present in the static lineage graph.

The challenge of lineage graph completeness is that we would have to assume (and this would be a big assumption) that if a dataset was present on run 1 but is no longer present on run 2, either 1) it wasn't intended, or 2) it was, and we should merge the edges from run to run. We are making significant improvements to the UI (see the PR from @phixMe) that will make viewing static, column-level, and soon run-level lineage more intuitive, but also display the highly dimensional model of Marquez in a more exploratory way.

I've added this issue to our roadmap and will link it when we start working on run-level lineage (which will be within the next month or so). I hope this helps clarify things. It doesn't solve the issue now, but I hope you will find the run-level lineage API useful.

I would love initial thoughts on what I've outlined here (and also in my proposal) from you and anyone else who has run into this issue (@rkrao89, @yonivy, @AryamanMishra).
