Support for a cumulative lineage graph at both levels #2670

Open
davidjgoss opened this issue Oct 28, 2023 · 2 comments
davidjgoss commented Oct 28, 2023

Problem

Currently, there is a mismatch between how the dataset/job-level and column-level lineage endpoints behave:

  • /v1/lineage returns a graph of datasets and jobs only considering the latest job version (i.e. the last run) of involved jobs
  • /v1/column-lineage returns all column-level relations from all runs

With the first endpoint, this means that if a job's behaviour varies between runs (e.g. sometimes it touches a dataset, and sometimes it doesn't), datasets can drop off the graph even though the column-level relation is still reflected on the second endpoint. So if you want to combine the graphs to form a dataset- and column-level visualisation of lineage, things can be a bit weird, e.g. a column-to-column relationship could be shown without a corresponding dataset-job-dataset relation.
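To make the mismatch concrete, here is a minimal sketch of a check a client might run after fetching both graphs. The edge shapes and the `dataset:...` / `job:...` id strings are assumptions for illustration only; they are not the actual response formats of either endpoint.

```python
def orphaned_column_relations(dataset_edges, column_edges):
    """Return column relations with no dataset -> job -> dataset path behind them.

    dataset_edges: iterable of (source_node_id, target_node_id) pairs
    column_edges:  iterable of ((dataset_id, field), (dataset_id, field)) pairs
    Both shapes are hypothetical, chosen to keep the example small.
    """
    inputs_of_job = {}   # job id -> datasets it reads
    outputs_of_job = {}  # job id -> datasets it writes
    for source, target in dataset_edges:
        if target.startswith("job:"):
            inputs_of_job.setdefault(target, set()).add(source)
        elif source.startswith("job:"):
            outputs_of_job.setdefault(source, set()).add(target)

    # Every (input dataset, output dataset) pair joined by a single job.
    connected = {
        (src, dst)
        for job, srcs in inputs_of_job.items()
        for src in srcs
        for dst in outputs_of_job.get(job, ())
    }
    return [
        (upstream, downstream)
        for upstream, downstream in column_edges
        if (upstream[0], downstream[0]) not in connected
    ]
```

Any relation this returns is one that would render as a column-to-column link with no dataset-job-dataset link underneath it.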

More generally, seeing the cumulative lineage graph based on all historical runs would be desirable, where the user is trying to ascertain what has actually been happening over time rather than the current state of things.

Context

This has been reported as an issue a few other times:

As discussed with @wslulciuc and @collado-mike recently, the /v1/lineage endpoint used to have this cumulative behaviour, but it became too slow in practice so was refactored to just use the latest versions (maybe in #2472?).

Also, there is static lineage coming in #2624 which further leans into the "just current versions" thing but also is an opportunity to solidify two distinct flavours of lineage - static vs cumulative.

Finally, somewhat related is the idea of "time travel", referenced in:

Proposal

Support returning a cumulative graph based on all runs. This could be with an optional query parameter, something like ?cumulative=true, or (more likely?) a new endpoint.
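As a rough illustration of the two behaviours, the sketch below builds edges from an in-memory list of runs. The `Run` shape and the `cumulative` flag are hypothetical and only mirror the idea of the query parameter; this is not how Marquez stores runs or queries lineage.

```python
from typing import NamedTuple


class Run(NamedTuple):
    job: str
    inputs: tuple
    outputs: tuple


def lineage_edges(runs, cumulative=False):
    """Build (source, target) edges from runs ordered oldest to newest.

    cumulative=False keeps only each job's most recent run;
    cumulative=True takes the union of edges over every run.
    """
    if not cumulative:
        latest = {}
        for run in runs:
            latest[run.job] = run  # later runs overwrite earlier ones
        runs = latest.values()
    edges = set()
    for run in runs:
        edges.update((dataset, run.job) for dataset in run.inputs)
        edges.update((run.job, dataset) for dataset in run.outputs)
    return edges
```

With a job that read two datasets on its first run and only one on its second, the default call drops the second input while `cumulative=True` keeps it.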

This would involve a different query which we would need to try and make perform acceptably. The performance issues may be mitigated to some degree by the recent addition of retention and cleanup functionality.

Also, we discussed the idea of whether restricting the time period may affect query performance, and if so whether a default of e.g. 7 days with an optional parameter to look further back might allow for good user experiences to be built around this.
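A default lookback with an optional override could behave roughly like the sketch below. The 7-day default comes from the suggestion above; the function name, the `lookback` parameter, and the `(ended_at, name)` run shape are assumptions for illustration. In practice the filter would presumably live in the database query rather than in application code.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_LOOKBACK = timedelta(days=7)  # assumed default, per the discussion


def runs_in_window(runs, now, lookback=DEFAULT_LOOKBACK):
    """Keep only runs that ended within `lookback` of `now`.

    runs: iterable of (ended_at, name) tuples (hypothetical shape).
    """
    cutoff = now - lookback
    return [run for run in runs if run[0] >= cutoff]


# Example: one recent run and one from a month earlier.
now = datetime(2024, 1, 25, tzinfo=timezone.utc)
runs = [
    (now - timedelta(days=1), "recent-run"),
    (now - timedelta(days=30), "old-run"),
]
recent = runs_in_window(runs, now)
everything = runs_in_window(runs, now, lookback=timedelta(days=60))
```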

Finally, a lot of performance could be gained by making this endpoint just return the node ids and edges - the metadata for involved jobs and datasets can then be pulled async in separate requests.

Another consideration is column lineage - could this also be made to offer static vs cumulative flavours? This could be difficult given the column-level lineage relations exist directly between dataset versions and fields.

("Cumulative" is not necessarily the best word for this, but it's the one I keep thinking of.)

@wslulciuc wslulciuc added this to Marquez Dec 7, 2023
@wslulciuc wslulciuc added this to the Roadmap milestone Dec 7, 2023
@wslulciuc wslulciuc moved this to Todo in Marquez Dec 7, 2023
@davidjgoss davidjgoss changed the title Lineage endpoint should provide cumulative lineage Support for a cumulative lineage graph at both levels Dec 8, 2023
@wslulciuc wslulciuc modified the milestones: Roadmap, 0.46.0 Jan 16, 2024

davidjgoss commented Jan 25, 2024

IMO, for an API response here, we could go as simple as just a list of source/target pairs of node ids:

{
  "edges": [
    {
      "source": "dataset:my-namespace:first-input",
      "target": "job:my-namespace:fancy-job"
    },
    {
      "source": "dataset:my-namespace:second-input",
      "target": "job:my-namespace:fancy-job"
    },
    {
      "source": "job:my-namespace:fancy-job",
      "target": "dataset:my-namespace:output-table"
    }
  ]
}

The rest of what you need to render a useful graph can be grabbed async from other endpoints with those ids.
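On the client side, consuming a response like this might look roughly as follows. This is a sketch under stated assumptions: `fetch_metadata` is a hypothetical stand-in for whatever per-node requests the client would make, and splitting the id on ":" assumes namespaces contain no colons, which may not hold for real ones.

```python
import asyncio
import json


def node_ids(response_body):
    """Collect the distinct node ids from an edges-only response."""
    edges = json.loads(response_body)["edges"]
    ids = set()
    for edge in edges:
        ids.add(edge["source"])
        ids.add(edge["target"])
    return sorted(ids)


async def fetch_metadata(node_id):
    """Hypothetical stand-in for a per-node metadata request."""
    await asyncio.sleep(0)  # placeholder for real network I/O
    # Assumes "type:namespace:name" with no colon in the namespace.
    kind, namespace, name = node_id.split(":", 2)
    return {"id": node_id, "type": kind, "namespace": namespace, "name": name}


async def hydrate(response_body):
    """Fetch metadata for every node concurrently, keyed by node id."""
    ids = node_ids(response_body)
    results = await asyncio.gather(*(fetch_metadata(i) for i in ids))
    return dict(zip(ids, results))
```

The graph can be drawn from the edges straight away, with `asyncio.run(hydrate(body))` (or an equivalent in the UI) filling in node details as they arrive.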

@wslulciuc
Member

@davidjgoss I've responded partially to your proposal (and what we plan as a solution) in #2543 (see comments). First, you've done an amazing job at outlining the context. I'll touch upon your comments below.

> Finally, a lot of performance could be gained by making this endpoint just return the node ids and edges - the metadata for involved jobs and datasets can then be pulled async in separate requests.

I agree. We should have two modes here: light and heavy (I can't think of better names). Anyways, we should've introduced this sooner. A light lineage query would return only the nodeIDs (as you've outlined above) while heavy would return lineage with metadata for dataset and job nodes pre-fetched.

> Another consideration is column lineage - could this also be made to offer static vs cumulative flavours? This could be difficult given the column-level lineage relations exist directly between dataset versions and fields.

Yes, but it doesn't have to. I think we should evaluate the level of effort to make this an option! Let's discuss further and work together on a proposal.

@wslulciuc wslulciuc modified the milestones: 0.46.0, 0.52.0 Oct 23, 2024