Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIP-62] Translate AIP-60 URI to OpenLineage #40173

Merged
merged 2 commits into from
Jul 23, 2024

Conversation

kacpermuda
Copy link
Contributor

@kacpermuda kacpermuda commented Jun 11, 2024

closes: #38767

For important changes look at the first commit, then for example implementation look at the second commit.

For Airflow Dataset I've added:

  • _get_normalized_scheme() function that still does scheme.lower() underneath but now we can also use this in OL provider and be sure that we are using the same mechanism everywhere.
  • Dataset.normalized_uri property - so that we can retrieve a normalized and AIP-60 compliant uri or None in all other cases (not an uri, no scheme, no normalizer etc.). At first i thought that in Airflow 3 we could just use Dataset.uri, as it will raise an error when a normalizer fails, but there can still be schems without a normalizer defined so i felt like this is needed.

Also small adjustment to ProvidersManager: I felt like this dataset-uris part of provider.yml is getting complex, so i re-wrote the _discover_dataset_uri_handlers method to be more flexible for future expansions (f.e. OL to AIP-60 converters).

This Pr should be only merged AFTER #40335.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:core-operators Operators, Sensors and hooks within Core Airflow area:lineage area:providers provider:amazon-aws AWS/Amazon - related issues provider:openlineage AIP-53 provider:trino labels Jun 11, 2024
@kacpermuda kacpermuda changed the title WIP - translate AIP-60 URI to OpenLineage WIP [AIP-62] Translate AIP-60 URI to OpenLineage Jun 11, 2024
@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch 4 times, most recently from 05096aa to 918cc0d Compare June 19, 2024 08:49
@kacpermuda kacpermuda changed the title WIP [AIP-62] Translate AIP-60 URI to OpenLineage [AIP-62] Translate AIP-60 URI to OpenLineage Jun 19, 2024
@kacpermuda kacpermuda marked this pull request as ready for review June 19, 2024 09:19
@mobuchowski mobuchowski requested a review from uranusjr June 19, 2024 09:23
@mobuchowski mobuchowski added AIP-62 Tasks tracking implementation of AIP-62 Getting Lineage from Hook Instrumentation and removed provider:trino labels Jun 19, 2024
@mobuchowski
Copy link
Contributor

@uranusjr this is how we want to use AIP-60 datasets in OpenLineage

@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch from 918cc0d to 01756fd Compare June 19, 2024 10:30
@kacpermuda kacpermuda marked this pull request as draft July 8, 2024 17:50
@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch 3 times, most recently from e0fc1fa to 37490be Compare July 9, 2024 17:22
@kacpermuda
Copy link
Contributor Author

I believe this PR is ready to be reviewed, however it should probably NOT be merged before #40335 gets merged.

@kacpermuda kacpermuda marked this pull request as ready for review July 10, 2024 10:24
@kacpermuda kacpermuda requested a review from bolkedebruin as a code owner July 10, 2024 10:24
@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch from 37490be to a0b337f Compare July 10, 2024 10:45
@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch 6 times, most recently from 741c4b8 to 84c1b09 Compare July 11, 2024 13:47
airflow/datasets/__init__.py Outdated Show resolved Hide resolved
airflow/datasets/__init__.py Outdated Show resolved Hide resolved
airflow/datasets/__init__.py Outdated Show resolved Hide resolved
@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch 3 times, most recently from f894908 to 29fe565 Compare July 16, 2024 07:50
@kacpermuda kacpermuda requested a review from uranusjr July 16, 2024 09:30
@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch 4 times, most recently from f57c53e to 9dbc9ea Compare July 22, 2024 10:13
@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch 2 times, most recently from 3a17c9f to b2ba021 Compare July 22, 2024 17:27
Signed-off-by: Kacper Muda <mudakacper@gmail.com>
Signed-off-by: Kacper Muda <mudakacper@gmail.com>
@kacpermuda kacpermuda force-pushed the aip62-ol-dataset-mapping branch from b2ba021 to bd2ef8c Compare July 22, 2024 18:07
@mobuchowski mobuchowski merged commit 8a912f9 into apache:main Jul 23, 2024
53 checks passed
@kacpermuda kacpermuda deleted the aip62-ol-dataset-mapping branch July 23, 2024 09:04
@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Jul 24, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 16, 2024
Fix unit tests:
  - test_does_not_double_import_entrypoint_provider_plugins - in apache-airflow-providers-databricks==6.8.0 was added DatabricksWorkflowPlugin (apache/airflow#40724)
  - test_dataset - in apache-airflow-providers-amazon==8.27.0 changed
    Dataset URI format validation (apache/airflow#40173)

Change-Id: Iae902e544aae2086ea4495b0850c19f813aa7069
GitOrigin-RevId: 7d5b7a9ead32610f7e3864230e55bb3a17bf6da5
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 20, 2024
Changes:
- add suffix +composer to version
- remove http and sqlite from pre-installed providers as they are included to Composer dependencies already
- add pre-commit configuration file
- set Composer pypi dependencies
- adjust Airflow configs required for unit tests in order to prevent them from being cleaned up during testing
- fix test_dataset as in apache-airflow-providers-amazon==8.27.0 changed Dataset URI format validation (apache/airflow#40173)

Change-Id: Iac6842a49929d9e2c4b8ed29353312dbc450de8a
GitOrigin-RevId: 1680fdc22961fa22517b2fd21ca67e8240e1f16a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AIP-62 Tasks tracking implementation of AIP-62 Getting Lineage from Hook Instrumentation area:core-operators Operators, Sensors and hooks within Core Airflow area:lineage area:providers changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) provider:amazon-aws AWS/Amazon - related issues provider:openlineage AIP-53
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Define and implement core AIP-60 to OpenLineage dataset mapping
6 participants