feat: Add CLL to OpenLineage in BigQueryInsertJobOperator #44872
+914
−161
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
BigQueryInsertJobOperator already support OpenLineage for QUERY type jobs, but it lacks Column Level Lineage (CLL).
This PR introduces CLL (Column-Level Lineage) to this operator based on SQL parsing, which can be useful in straightforward scenarios. However, since SQL parsing alone might not always provide all the details (e.g. in SQL query we can reference table only by table name, or dataset.table without the project_id), checks have been implemented to ensure accurate lineage. As a result CLL may not be included when there is uncertainty about its correctness.
There is another change not related to CLL: right now output table is duplicated into input tables. We are creating a list of input tables based on
referencedTables
property provided by Google and as it turns out, this also includes the destination table. So f.e. this query:INSERT INTO
a.b.cVALUES (1, "a", 23)
would return
a.b.c
as input table and output table.This PR fixes it by removing output table from input tables. I am not sure if it's a correct approach as sometimes users may write a query that performs a process that moves data from one table to the same table but i think this is rare and also this kind of lineage information (from A to A) does not provide much value. Please let me know if you think I'm wrong.
I also refactored the mixin a bit to make it clearer and prepare for adding support for job types other than QUERY. I also change the class name - in the beginning it's supposed to be a general mixin, but BigQueryInsertJobOperator is so complex that this mixin will only be used with that class.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.