Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add CLL to OpenLineage in BigQueryInsertJobOperator #44872

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kacpermuda
Copy link
Contributor

@kacpermuda kacpermuda commented Dec 12, 2024

BigQueryInsertJobOperator already support OpenLineage for QUERY type jobs, but it lacks Column Level Lineage (CLL).

This PR introduces CLL (Column-Level Lineage) to this operator based on SQL parsing, which can be useful in straightforward scenarios. However, since SQL parsing alone might not always provide all the details (e.g. in SQL query we can reference table only by table name, or dataset.table without the project_id), checks have been implemented to ensure accurate lineage. As a result CLL may not be included when there is uncertainty about its correctness.

There is another change not related to CLL: right now output table is duplicated into input tables. We are creating a list of input tables based on referencedTables property provided by Google and as it turns out, this also includes the destination table. So f.e. this query:
INSERT INTO a.b.c VALUES (1, "a", 23)
would return a.b.c as input table and output table.

This PR fixes it by removing output table from input tables. I am not sure if it's a correct approach as sometimes users may write a query that performs a process that moves data from one table to the same table but i think this is rare and also this kind of lineage information (from A to A) does not provide much value. Please let me know if you think I'm wrong.

I also refactored the mixin a bit to make it clearer and prepare for adding support for job types other than QUERY. I also change the class name - in the beginning it's supposed to be a general mixin, but BigQueryInsertJobOperator is so complex that this mixin will only be used with that class.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Dec 12, 2024
@kacpermuda kacpermuda force-pushed the feat-ol-bq-insert-job-cll branch 5 times, most recently from 79c6ad4 to b383b93 Compare December 13, 2024 15:19
@kacpermuda kacpermuda marked this pull request as ready for review December 13, 2024 15:19
@kacpermuda kacpermuda marked this pull request as draft December 13, 2024 15:28
@kacpermuda kacpermuda force-pushed the feat-ol-bq-insert-job-cll branch 2 times, most recently from 3dd80a3 to 97860a3 Compare December 16, 2024 10:09
@kacpermuda kacpermuda marked this pull request as ready for review December 16, 2024 10:10
@kacpermuda kacpermuda force-pushed the feat-ol-bq-insert-job-cll branch 4 times, most recently from d123fb0 to 0edb6d4 Compare December 16, 2024 16:41
Signed-off-by: Kacper Muda <mudakacper@gmail.com>
@kacpermuda kacpermuda force-pushed the feat-ol-bq-insert-job-cll branch from 0edb6d4 to 52be5f9 Compare December 16, 2024 17:29
Copy link
Contributor

@mobuchowski mobuchowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - one question that would be nice to answer.

@@ -167,6 +211,21 @@ def _deduplicate_outputs(self, outputs: list[OutputDataset | None]) -> list[Outp
# if the rowCount or size can be summed together.
if single_output.outputFacets:
single_output.outputFacets.pop("outputStatistics", None)

# If both outputs contain Column Level Lineage Facet - merge the facets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can this be true?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants