Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proposal for ol facet tables #2076

Merged
merged 12 commits into from
Dec 21, 2022

Conversation

wslulciuc
Copy link
Member

@wslulciuc wslulciuc commented Aug 18, 2022

This PR adds the proposal: Optimize query performance for OpenLineage facets

@codecov
Copy link

codecov bot commented Aug 18, 2022

Codecov Report

Merging #2076 (28a07e9) into main (1d28adf) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #2076   +/-   ##
=========================================
  Coverage     76.72%   76.72%           
  Complexity     1177     1177           
=========================================
  Files           222      222           
  Lines          5354     5354           
  Branches        429      429           
=========================================
  Hits           4108     4108           
  Misses          768      768           
  Partials        478      478           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@wslulciuc wslulciuc marked this pull request as ready for review August 19, 2022 07:30
Copy link
Collaborator

@collado-mike collado-mike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Can you add specifics on how conflicts will be dealt with? Are these facet tables append-only? E.g., if two facets with the same name (I assume the name column in the tables refers to the facet name) but different contents are received for the same dataset version (e.g., two different GreatExpectations suites on the same data), will both records be added to the tables? If not, which one wins? First received? Last received? If both records are inserted, will they be merged at query time? Or just appended one after the other? Are there indexes on these tables? In particular, the runs table can get very, very large. The same will be true of the runs_facets table eventually.

@wslulciuc
Copy link
Member Author

@collado-mike: All great questions that I'll elaborate on in the proposal!


OpenLineage's core model is extensible via _facets_. A `facet` is user-defined metadata and enables entity enrichment. Initially, returning dataset, job, and run facets via the REST API was not supported, but eventually added in release [`0.14.0`](https://github.com/MarquezProject/marquez/compare/0.13.1...0.14.0). The implementation was simple: when querying the `datasets`, `jobs`, or `runs` tables, also query the `lineage_events` table for facets.

We knew the initial implementation would have to eventually be revisited. That is, OpenLineage events can easily exceed **>** **`10MBs`** resulting in out-of-memory (OOM) errors as facet queries require loading the raw `event` in memory, then filtering for relevant facets. This proposal outlines how we can optimize query performance for OpenLineage facets.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, we load the events to memory in Marquez and do facet filtering there.
If so, the solution should be to offload filtering to database. The short term solution could be filtering json content in postgres as described here (please note indexing json content is also possible). In this particular example we could select output dataset facets from event json within postgresql, instead of selecting whole events.

But lineage_events will grow over time and querying for datasets' facets will slow down. Normalizing json facets, as described within this proposal, is a good way to go.

The only problem with separate tables is a backfill procedure which is a heavy operation. I am not sure whether lazy migration would work. The existence of some facets in new tables does not mean we collected all the existing facets, including those from lineage_events table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very reasonable point, @pawel-big-lebowski. But, to get ahead of the issue, like you stated, we'll want to normalize facets.

Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks great, I left some comments

proposals/2078-optimization-ol-facets.md Show resolved Hide resolved
proposals/2078-optimization-ol-facets.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a great proposal which can have huge impact on Marquez performance 🚀
I've put some comments which are worth considering (if not already considered).

proposals/2078-optimization-ol-facets.md Show resolved Hide resolved
proposals/2078-optimization-ol-facets.md Show resolved Hide resolved
proposals/2078-optimization-ol-facets.md Show resolved Hide resolved
Note, facet tables will be:

* Append only, mirroring the current insertion pattern of the `lineage_events` table; therefore, avoiding facet conflicts
* Merging facets will follow a _first-to-last_ received order; meaning, facet rows will be merged post query using [`MapperUtils.toFacetsOrNull()`](https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/mappers/MapperUtils.java#L50) mirroring the current logic (i.e. newer facets will be added or override older facet values based on when the OpenLineage event was received)
Copy link
Contributor

@mobuchowski mobuchowski Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases it might make sense to accumulate content inside the facets themselves.

For example, streaming Flink job might report how many records were processed per checkpoint - sometimes sending multiple results per OL event. The result facet should return list of those reports.

I don't think we need to address this within this proposal though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, certain facets can be handled differently (i.e. accumulated) based on some context. I'll make a note on the proposals scope / limitations.

Copy link
Contributor

@mobuchowski mobuchowski Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wslulciuc let's also create follow-up issue, will be good to go then on this problem.

@wslulciuc wslulciuc mentioned this pull request Sep 30, 2022
7 tasks
@boring-cyborg boring-cyborg bot added the docs label Nov 15, 2022
@pawel-big-lebowski pawel-big-lebowski self-assigned this Dec 12, 2022
@pawel-big-lebowski pawel-big-lebowski force-pushed the proposal/separate-facets-tables-for-ol-events branch 2 times, most recently from e7e553b to 5b95335 Compare December 13, 2022 08:56
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
@pawel-big-lebowski pawel-big-lebowski force-pushed the proposal/separate-facets-tables-for-ol-events branch from 5b95335 to d7f5f4b Compare December 13, 2022 08:59
2. Using the facet tables instead the `lineage_events` table to query for facets.
3. Lazy migration, the facet tables will be queried, and if no facets are returned, then the `lineage_events` table; this approach avoids a backfill, but one will still be needed.

## Migration procedure
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proposals/2078-optimization-ol-facets.md Outdated Show resolved Hide resolved
proposals/2078-optimization-ol-facets.md Show resolved Hide resolved
@pawel-big-lebowski pawel-big-lebowski force-pushed the proposal/separate-facets-tables-for-ol-events branch from d7f5f4b to 48bb84c Compare December 21, 2022 15:34
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski force-pushed the proposal/separate-facets-tables-for-ol-events branch from 48bb84c to 3c5ebed Compare December 21, 2022 15:43
@pawel-big-lebowski pawel-big-lebowski merged commit a854702 into main Dec 21, 2022
@pawel-big-lebowski pawel-big-lebowski deleted the proposal/separate-facets-tables-for-ol-events branch December 21, 2022 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants