Fix OpenLineage extraction for deferrable AthenaOperator #40545
Merged
OpenLineage (OL) extraction in deferrable mode does not use the output location from the API response. This PR fixes that, so the behaviour is the same whether deferrable mode is on or off.
This PR also adds ExternalQueryRunFacet for the Athena job and fixes a bug where the whole extraction failed if any column in Athena was missing a description.
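To illustrate the missing-description case, here is a minimal sketch, not the code from this PR. It assumes column metadata shaped like the `Columns` entries of an Athena `get_table_metadata` response, where `Comment` is optional. The helper name `columns_to_schema_fields` and the plain-dict output are hypothetical.

```python
from typing import Any


def columns_to_schema_fields(columns: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build schema fields, tolerating columns that have no description.

    Hypothetical helper: the input mimics Athena column metadata
    (keys "Name", "Type", and an optional "Comment").
    """
    fields = []
    for column in columns:
        fields.append(
            {
                "name": column["Name"],
                "type": column.get("Type"),
                # .get() returns None instead of raising KeyError
                # when the column has no "Comment" key.
                "description": column.get("Comment"),
            }
        )
    return fields


sample_columns = [
    {"Name": "id", "Type": "bigint", "Comment": "primary key"},
    {"Name": "payload", "Type": "string"},  # no description
]
fields = columns_to_schema_fields(sample_columns)
```

Reading the optional key with `.get()` keeps one undocumented column from aborting extraction for the whole table.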
There is also something I'd like to discuss. Right now, the output datasets differ between START and COMPLETE events. When passing
output_location="s3://<bucket>/dir/"
to AthenaOperator, in the START event we get one output:
Dataset(namespace="s3://<bucket>", name="dir")
In the COMPLETE event we get two outputs:
Dataset(namespace="awsathena://athena.eu-central-1.amazonaws.com", name="AwsDataCatalog.<db>.<table>")
and
Dataset(namespace="s3://<bucket>", name="dir/tables/2ee1bfeb-4c67-4f50-a49d-df5deeb5f034")
This is because in COMPLETE we use the output from the API, so we know the exact S3 location where the output was saved, and we replace whatever the user provided with what the API returns. I wonder whether this is expected and should stay as it is, or whether we should change it so that both events are consistent. I believe some people are using this first dataset, based on this issue.
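The difference between the two S3 datasets can be reproduced with a small sketch. This is an illustration only, not the operator's implementation: `s3_uri_to_dataset` is a hypothetical helper, `my-bucket` is a placeholder bucket name, and the second URI stands in for the exact location returned by the API.

```python
from urllib.parse import urlparse


def s3_uri_to_dataset(uri: str) -> tuple[str, str]:
    """Split an S3 URI into an OpenLineage-style (namespace, name) pair.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(uri)
    namespace = f"{parsed.scheme}://{parsed.netloc}"
    name = parsed.path.strip("/")
    return namespace, name


# What the user passed as output_location (used in the START event).
start_output = s3_uri_to_dataset("s3://my-bucket/dir/")

# The exact location reported by the API (used in the COMPLETE event).
complete_output = s3_uri_to_dataset(
    "s3://my-bucket/dir/tables/2ee1bfeb-4c67-4f50-a49d-df5deeb5f034"
)
```

Both results share the same namespace, but the names differ, so a consumer matching on the dataset name sees two distinct datasets across the two events.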