Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix OpenLineage extraction for deferrable AthenaOperator #40545

Merged
merged 1 commit into from
Jul 4, 2024

Conversation

kacpermuda
Copy link
Contributor

@kacpermuda kacpermuda commented Jul 2, 2024

Ol extraction in deferrable mode is not using output location from API response. This PR adjusts it, so that the behaviour is the same regardless of deferrable mode being on or off.
This PR also adds ExternalQueryRunFacet for Athena job and fixes the problem where the whole extraction failed if one of the columns in Athena was missing a description.

There is also something i'd like to discuss. Right now, the output datasets are different between START and COMPLETE events. When passing output_location="s3://<bucket>/dir/ to AthenaOperator,

in START event we get one output: Dataset(namespace="s3://<bucket>", name="dir")

in COMPLETE event we get two outputs: Dataset(namespace="awsathena://athena.eu-central-1.amazonaws.com", name="AwsDataCatalog.<db>.<table>") and Dataset(namespace="s3://<bucket>", name="dir/tables/2ee1bfeb-4c67-4f50-a49d-df5deeb5f034").

It's because in COMPLETE we are using the output from the API, so we know the exact location in S3 where the output has been saved so we replace whatever user has provided with what we get from the API. I wonder if it's something we expected and should stay like this or maybe we should somehow change that, to make it consistent in both events? I believe some people are using this first dataset based on this issue.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Jul 2, 2024
@kacpermuda kacpermuda force-pushed the fix-ol-deferrable-athena branch 2 times, most recently from 69f4b59 to 24462cd Compare July 2, 2024 12:03
@kacpermuda kacpermuda changed the title Improve OpenLineage extraction for deferrable AthenaOperator Fix OpenLineage extraction for deferrable AthenaOperator Jul 2, 2024
@kacpermuda kacpermuda marked this pull request as draft July 2, 2024 14:03
@kacpermuda kacpermuda marked this pull request as draft July 2, 2024 14:03
@kacpermuda
Copy link
Contributor Author

We are still discussing the mismatch between output datasets, let's hold with merging this one.

Signed-off-by: Kacper Muda <mudakacper@gmail.com>
@kacpermuda kacpermuda marked this pull request as ready for review July 4, 2024 09:06
@kacpermuda
Copy link
Contributor Author

I've changed the Ol method to _on_complete() as most of the information needs the call to Athena anyway. I also removed the actual S3 location (that was sent in complete event) and replaced it with user provided prefix (that was sent in start event). I think it's ready now.

@potiuk potiuk merged commit b7d0bf9 into apache:main Jul 4, 2024
51 checks passed
@kacpermuda kacpermuda deleted the fix-ol-deferrable-athena branch July 4, 2024 09:20
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
Signed-off-by: Kacper Muda <mudakacper@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants