Support for a cumulative lineage graph at both levels #2670

davidjgoss · 2023-10-28T15:22:41Z

Problem

Currently, there is a mismatch between how the dataset/job-level and column-level lineage endpoints behave:

/v1/lineage returns a graph of datasets and jobs only considering the latest job version (i.e. the last run) of involved jobs
/v1/column-lineage returns all column-level relations from all runs

With the first endpoint, this means that if a job's behaviour varies between runs (e.g. sometimes it touches a dataset and sometimes it doesn't), datasets can drop off the graph even though the column-level relation is still reflected on the second endpoint. So if you combine the two graphs to form a dataset- and column-level visualisation of lineage, the result can be inconsistent, e.g. a column-to-column relationship shown without a corresponding dataset-job-dataset relation.
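
To make the mismatch concrete, here is a minimal sketch that checks column-level relations against a dataset/job-level edge list and reports the ones with no matching dataset-job-dataset path. The `Edge` type, the tuple shape used for column relations, and the field names `id`, `amount` and `total` are assumptions for illustration only; they are not the actual response shapes of either endpoint.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str  # node id, e.g. "dataset:my-namespace:first-input"
    target: str  # node id, e.g. "job:my-namespace:fancy-job"


def dataset_pairs(edges):
    """Return (input_dataset, output_dataset) pairs connected through a job."""
    inputs_by_job = {}
    outputs_by_job = {}
    for edge in edges:
        if edge.target.startswith("job:"):
            inputs_by_job.setdefault(edge.target, set()).add(edge.source)
        elif edge.source.startswith("job:"):
            outputs_by_job.setdefault(edge.source, set()).add(edge.target)
    pairs = set()
    for job, inputs in inputs_by_job.items():
        for output in outputs_by_job.get(job, ()):
            for input_ in inputs:
                pairs.add((input_, output))
    return pairs


def orphaned_column_relations(dataset_edges, column_relations):
    """Column relations whose datasets are not linked in the dataset-level graph.

    Each column relation is ((in_dataset, in_field), (out_dataset, out_field)).
    This shape is hypothetical.
    """
    covered = dataset_pairs(dataset_edges)
    return [rel for rel in column_relations if (rel[0][0], rel[1][0]) not in covered]


# Latest-run graph: the job no longer reads second-input.
dataset_edges = [
    Edge("dataset:my-namespace:first-input", "job:my-namespace:fancy-job"),
    Edge("job:my-namespace:fancy-job", "dataset:my-namespace:output-table"),
]
# Column relations from all runs, including an older run that read second-input.
column_relations = [
    (("dataset:my-namespace:first-input", "id"),
     ("dataset:my-namespace:output-table", "id")),
    (("dataset:my-namespace:second-input", "amount"),
     ("dataset:my-namespace:output-table", "total")),
]
orphaned = orphaned_column_relations(dataset_edges, column_relations)
```

Here `orphaned` contains only the relation from `second-input`, which is exactly the kind of column edge that ends up with no dataset-level counterpart.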

More generally, a cumulative lineage graph based on all historical runs would be desirable when the user is trying to ascertain what has actually been happening over time rather than the current state of things.

Context

This has been reported as an issue a few other times:

As discussed with @wslulciuc and @collado-mike recently, the /v1/lineage endpoint used to have this cumulative behaviour, but it became too slow in practice so was refactored to just use the latest versions (maybe in #2472?).

Also, there is static lineage coming in #2624 which further leans into the "just current versions" thing but also is an opportunity to solidify two distinct flavours of lineage - static vs cumulative.

Finally, somewhat related is the idea of "time travel", referenced in:

Proposal

Support returning a cumulative graph based on all runs. This could be with an optional query parameter, something like ?cumulative=true, or (more likely?) a new endpoint.
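
As a rough illustration of the difference between the two flavours, the sketch below derives both a "latest run only" edge set and a cumulative edge set from the same list of runs. The `Run` record and the function names are hypothetical and are not meant to reflect the actual data model or query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    job: str
    ended_at: str   # ISO 8601 UTC timestamp; same format, so it sorts as text
    inputs: tuple   # dataset node ids read by this run
    outputs: tuple  # dataset node ids written by this run


def run_edges(run):
    """Edges (source, target) contributed by a single run."""
    edges = {(dataset, run.job) for dataset in run.inputs}
    edges |= {(run.job, dataset) for dataset in run.outputs}
    return edges


def latest_edges(runs):
    """Edges from only the most recent run of each job."""
    latest = {}
    for run in runs:
        if run.job not in latest or run.ended_at > latest[run.job].ended_at:
            latest[run.job] = run
    return set().union(*(run_edges(r) for r in latest.values()))


def cumulative_edges(runs):
    """Union of the edges from every run of every job."""
    return set().union(*(run_edges(r) for r in runs))


JOB = "job:my-namespace:fancy-job"
runs = [
    Run(JOB, "2023-10-01T00:00:00Z",
        ("dataset:my-namespace:first-input", "dataset:my-namespace:second-input"),
        ("dataset:my-namespace:output-table",)),
    Run(JOB, "2023-10-02T00:00:00Z",
        ("dataset:my-namespace:first-input",),
        ("dataset:my-namespace:output-table",)),
]
only_in_cumulative = cumulative_edges(runs) - latest_edges(runs)
```

With this data, `only_in_cumulative` holds the single edge from `second-input` to the job, which the latest-run view drops.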

This would involve a different query which we would need to try and make perform acceptably. The performance issues may be mitigated to some degree by the recent addition of retention and cleanup functionality.

We also discussed whether restricting the time period would improve query performance, and if so whether a default of e.g. 7 days, with an optional parameter to look further back, might allow good user experiences to be built around this.
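
The time-window idea could look something like the following sketch, which only accumulates edges from runs that ended inside a lookback period. The 7-day default comes from the suggestion above; the `lookback` parameter name and the tuple layout for runs are assumptions, and this is in-memory filtering rather than the database query that would actually be needed.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_LOOKBACK = timedelta(days=7)  # default suggested in the discussion


def edges_within_window(runs, now, lookback=DEFAULT_LOOKBACK):
    """Cumulative edges from runs that ended within `lookback` of `now`.

    Each run is (ended_at, job_id, input_ids, output_ids); hypothetical shape.
    """
    cutoff = now - lookback
    edges = set()
    for ended_at, job, inputs, outputs in runs:
        if ended_at < cutoff:
            continue  # too old for the requested window
        edges.update((dataset, job) for dataset in inputs)
        edges.update((job, dataset) for dataset in outputs)
    return edges


now = datetime(2024, 1, 30, tzinfo=timezone.utc)
job = "job:my-namespace:fancy-job"
runs = [
    (datetime(2024, 1, 10, tzinfo=timezone.utc), job,
     ["dataset:my-namespace:second-input"], ["dataset:my-namespace:output-table"]),
    (datetime(2024, 1, 29, tzinfo=timezone.utc), job,
     ["dataset:my-namespace:first-input"], ["dataset:my-namespace:output-table"]),
]
recent = edges_within_window(runs, now)                                # 7 days
extended = edges_within_window(runs, now, lookback=timedelta(days=30))  # 30 days
```

With the default window only the run from the previous day contributes, so `recent` has two edges; widening the window to 30 days brings the `second-input` edge back into `extended`.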

Finally, a lot of performance could be gained by making this endpoint just return the node ids and edges - the metadata for involved jobs and datasets can then be pulled async in separate requests.

Another consideration is column lineage - could this also be made to offer static vs cumulative flavours? This could be difficult given the column-level lineage relations exist directly between dataset versions and fields.

("Cumulative" is not necessarily the best word for this, but it's the one I keep thinking of.)

davidjgoss · 2024-01-25T17:12:13Z

IMO, for an API response here, we could go as simple as just a list of source/target pairs of node ids:

{
  "edges": [
    {
      "source": "dataset:my-namespace:first-input",
      "target": "job:my-namespace:fancy-job"
    },
    {
      "source": "dataset:my-namespace:second-input",
      "target": "job:my-namespace:fancy-job"
    },
    {
      "source": "job:my-namespace:fancy-job",
      "target": "dataset:my-namespace:output-table"
    }
  ]
}

The rest of what you need to render a useful graph can be grabbed async from other endpoints with those ids.
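
On the client side, consuming that response might look like the sketch below: collect the distinct node ids from the edge list, then split each id into its parts so the follow-up metadata requests can be grouped by node type. The `type:namespace:name` layout is inferred from the example ids above, and the sketch assumes the namespace itself contains no colon, which may not hold for real namespaces.

```python
def node_ids(response):
    """Distinct node ids in first-seen order from an edges-only response."""
    seen = set()
    ordered = []
    for edge in response["edges"]:
        for node_id in (edge["source"], edge["target"]):
            if node_id not in seen:
                seen.add(node_id)
                ordered.append(node_id)
    return ordered


def parse_node_id(node_id):
    """Split "type:namespace:name" into its parts.

    Assumption: the namespace has no colon; the name may contain colons.
    """
    kind, namespace, name = node_id.split(":", 2)
    return {"type": kind, "namespace": namespace, "name": name}


response = {
    "edges": [
        {"source": "dataset:my-namespace:first-input",
         "target": "job:my-namespace:fancy-job"},
        {"source": "dataset:my-namespace:second-input",
         "target": "job:my-namespace:fancy-job"},
        {"source": "job:my-namespace:fancy-job",
         "target": "dataset:my-namespace:output-table"},
    ]
}
ids = node_ids(response)
parsed = [parse_node_id(node_id) for node_id in ids]
datasets_to_fetch = [p["name"] for p in parsed if p["type"] == "dataset"]
```

For the sample payload this yields four distinct nodes, three of them datasets, each of which could then be fetched in its own request.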

wslulciuc · 2024-01-30T10:57:01Z

@davidjgoss I've responded partially to your proposal (and what we plan as a solution) in #2543 (see comments). First, you've done an amazing job at outlining the context. I'll touch upon your comments below.

Finally, a lot of performance could be gained by making this endpoint just return the node ids and edges - the metadata for involved jobs and datasets can then be pulled async in separate requests.

I agree. We should have two modes here: light and heavy (I can't think of better names). Anyways, we should've introduced this sooner. A light lineage query would return only the nodeIDs (as you've outlined above) while heavy would return lineage with metadata for dataset and job nodes pre-fetched.
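
One possible shape for the two modes, purely as a sketch: the same edge list is always returned, and a heavy request additionally attaches metadata for each node. The `mode` argument, the `nodes` key and the `load_metadata` callback are hypothetical names, not an agreed design.

```python
def build_lineage_response(edges, mode="light", load_metadata=None):
    """Build a light (ids and edges only) or heavy (with metadata) response.

    `edges` is an iterable of (source_id, target_id) pairs.
    `load_metadata` maps a node id to its metadata; required for heavy mode.
    """
    edges = list(edges)
    response = {"edges": [{"source": s, "target": t} for s, t in edges]}
    if mode == "light":
        return response
    if mode != "heavy":
        raise ValueError(f"unknown mode: {mode!r}")
    if load_metadata is None:
        raise ValueError("heavy mode needs a load_metadata callback")
    # Pre-fetch metadata once per distinct node.
    distinct_ids = sorted({node_id for edge in edges for node_id in edge})
    response["nodes"] = {node_id: load_metadata(node_id) for node_id in distinct_ids}
    return response


edges = [
    ("dataset:my-namespace:first-input", "job:my-namespace:fancy-job"),
    ("job:my-namespace:fancy-job", "dataset:my-namespace:output-table"),
]
light = build_lineage_response(edges)
heavy = build_lineage_response(
    edges, mode="heavy", load_metadata=lambda node_id: {"id": node_id}
)
```

The light response carries only the `edges` list, while the heavy one adds a `nodes` map keyed by node id, so a client can choose between a fast first render and fewer round trips.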

Another consideration is column lineage - could this also be made to offer static vs cumulative flavours? This could be difficult given the column-level lineage relations exist directly between dataset versions and fields.

Yes, but it doesn't have to. I think we should evaluate the level of effort to make this an option! Let's discuss further and work together on a proposal.

wslulciuc added this to Marquez Dec 7, 2023

wslulciuc added this to the Roadmap milestone Dec 7, 2023

wslulciuc moved this to Todo in Marquez Dec 7, 2023

davidjgoss changed the title ~~Lineage endpoint should provide cumulative lineage~~ Support for a cumulative lineage graph at both levels Dec 8, 2023

wslulciuc modified the milestones: Roadmap, 0.46.0 Jan 16, 2024

wslulciuc added the column-level-lineage label Jan 30, 2024

wslulciuc modified the milestones: 0.46.0, 0.52.0 Oct 23, 2024
