Fix OpenLineage extraction for deferrable AthenaOperator #40545

kacpermuda · 2024-07-02T11:35:48Z

OpenLineage (OL) extraction in deferrable mode does not use the output location from the API response. This PR fixes that, so the behaviour is the same whether deferrable mode is on or off.
This PR also adds ExternalQueryRunFacet for the Athena job and fixes a problem where the whole extraction failed if any column in Athena was missing a description.
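
The missing-description failure can be illustrated with a minimal sketch. It assumes the column metadata arrives as a list of dicts shaped like the `Columns` entries of Athena's `GetTableMetadata` response (`Name`, `Type`, and an optional `Comment`). The helper name `columns_to_fields` is hypothetical and is not the provider's actual code.

```python
from typing import Any


def columns_to_fields(columns: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert Athena-style column metadata into schema field dicts.

    The optional "Comment" key is read with .get(), so a column without a
    description yields description=None instead of raising KeyError and
    aborting the whole extraction.
    """
    return [
        {
            "name": column["Name"],
            "type": column["Type"],
            "description": column.get("Comment"),
        }
        for column in columns
    ]


# Example input (made up for illustration): the second column has no comment.
columns = [
    {"Name": "id", "Type": "int", "Comment": "primary key"},
    {"Name": "payload", "Type": "string"},
]
fields = columns_to_fields(columns)
```

Reading the key with `column["Comment"]` instead would raise `KeyError` on the second column, which is the kind of failure described above.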

There is also something I'd like to discuss. Right now, the output datasets differ between the START and COMPLETE events. When passing output_location="s3://<bucket>/dir/" to AthenaOperator:

In the START event we get one output: Dataset(namespace="s3://<bucket>", name="dir")

In the COMPLETE event we get two outputs: Dataset(namespace="awsathena://athena.eu-central-1.amazonaws.com", name="AwsDataCatalog.<db>.<table>") and Dataset(namespace="s3://<bucket>", name="dir/tables/2ee1bfeb-4c67-4f50-a49d-df5deeb5f034").

This happens because in COMPLETE we use the output from the API, so we know the exact S3 location where the output was saved, and we replace whatever the user provided with what the API returns. I wonder whether this is expected and should stay as is, or whether we should change it so that both events are consistent. I believe some people are using the first dataset, based on this issue.
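
To make the mismatch concrete, here is a minimal sketch of how an S3 URI maps to a namespace/name pair like the ones quoted above. The `Dataset` dataclass is a stand-in rather than the OpenLineage client class, `s3_uri_to_dataset` is a hypothetical helper, and `my-bucket` is a placeholder for the elided bucket name.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Dataset:
    """Stand-in for an OpenLineage dataset: a namespace plus a name."""

    namespace: str
    name: str


def s3_uri_to_dataset(uri: str) -> Dataset:
    """Split an S3 URI into namespace (scheme + bucket) and name (key path)."""
    parsed = urlparse(uri)
    return Dataset(
        namespace=f"{parsed.scheme}://{parsed.netloc}",
        # Drop leading/trailing slashes; fall back to "/" for a bare bucket.
        name=parsed.path.strip("/") or "/",
    )


# What the user passed as output_location (known at START).
start_output = s3_uri_to_dataset("s3://my-bucket/dir/")

# The exact result location reported by the API (known only at COMPLETE).
complete_output = s3_uri_to_dataset(
    "s3://my-bucket/dir/tables/2ee1bfeb-4c67-4f50-a49d-df5deeb5f034"
)
```

The two datasets share a namespace but have different names, so a lineage consumer that matches datasets by (namespace, name) would treat them as unrelated.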


kacpermuda · 2024-07-02T14:03:46Z

We are still discussing the mismatch between output datasets, so let's hold off on merging this one.

Signed-off-by: Kacper Muda <mudakacper@gmail.com>

kacpermuda · 2024-07-04T09:11:43Z

I've changed the OL method to _on_complete(), as most of the information requires a call to Athena anyway. I also removed the actual S3 location (previously sent in the COMPLETE event) and replaced it with the user-provided prefix (the one sent in the START event). I think it's ready now.

Signed-off-by: Kacper Muda <mudakacper@gmail.com>

kacpermuda requested review from eladkal and o-nikolas as code owners July 2, 2024 11:35

boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Jul 2, 2024

kacpermuda force-pushed the fix-ol-deferrable-athena branch 2 times, most recently from 69f4b59 to 24462cd Compare July 2, 2024 12:03

kacpermuda changed the title ~~Improve OpenLineage extraction for deferrable AthenaOperator~~ Fix OpenLineage extraction for deferrable AthenaOperator Jul 2, 2024

kacpermuda force-pushed the fix-ol-deferrable-athena branch from 24462cd to f023334 Compare July 2, 2024 13:28

potiuk approved these changes Jul 2, 2024

kacpermuda force-pushed the fix-ol-deferrable-athena branch from f023334 to b835a4f Compare July 2, 2024 13:52

eladkal approved these changes Jul 2, 2024

kacpermuda marked this pull request as draft July 2, 2024 14:03

fix OpenLineage extraction for AthenaOperator

30a609a

Signed-off-by: Kacper Muda <mudakacper@gmail.com>

kacpermuda force-pushed the fix-ol-deferrable-athena branch from b835a4f to 30a609a Compare July 4, 2024 08:46

kacpermuda marked this pull request as ready for review July 4, 2024 09:06

potiuk approved these changes Jul 4, 2024

potiuk merged commit b7d0bf9 into apache:main Jul 4, 2024
51 checks passed

kacpermuda deleted the fix-ol-deferrable-athena branch July 4, 2024 09:20

This was referenced Jul 9, 2024

Status of testing Providers that were prepared on July 09, 2024 #40661

Closed

Status of testing Providers that were prepared on July 12, 2024 #40752

Closed

romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024

fix OpenLineage extraction for AthenaOperator (apache#40545)

c547546

Signed-off-by: Kacper Muda <mudakacper@gmail.com>
