Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/airflow): airflow plugin v2 #8853

Merged
merged 64 commits into from
Oct 4, 2023

Conversation

hsheth2
Copy link
Collaborator

@hsheth2 hsheth2 commented Sep 18, 2023

Stacked on top of #8861.

Also requires the changes from #8897.

Closes #8892.
Closes #6556.
Closes #7809.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Sep 18, 2023
Copy link
Collaborator

@asikowitz asikowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. I think this could be more closely integrated with OL, but I understand if that's not desired / could be restrictive in the future.

@@ -18,4 +18,18 @@ def get_provider_info():
"package-name": f"{__package_name__}",
"name": f"{__package_name__}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could probably pick a more human-friendly name here


@contextlib.contextmanager
def _patch_extractors(self):
with contextlib.ExitStack() as stack:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty cool, kinda reminds me of go defer

Comment on lines +232 to +258
datajob.inlets.extend(original_datajob.inlets)
datajob.outlets.extend(original_datajob.outlets)
datajob.fine_grained_lineages.extend(original_datajob.fine_grained_lineages)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come you build up local vars input_urns, output_urns, and fine_grained_lineages above, but then directly update the datajob here? Could we be consistent here?


# TODO: Add handling for Airflow mapped tasks using task_instance.map_index

datajob.emit(self.emitter, callback=self._make_emit_callback())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't matter for this PR, but I'd prefer something like self.emitter.emit_entity(datajob, callback) to more closely match our mcp emit call

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup I agree - I'm not super happy with the whole datajob/dataflow interfaces

Would previously see this from the plugin manager.
```
Exception: The package 'acryl-datahub' from setuptools and acryl-datahub-airflow-plugin do not match. Please make sure they are aligned
```

Also fixes + registers hook types
continue making tweaks

refactor ti + config

refactor get_lineage_config

refactoring

add in special case handling from backport

listener-based plugin

fix hooks + add log_level config

make logging work

more sql fixes

start working on combining lineage

work with old datahub lineage entities

support fine grained lineage

support enable_extractors

tweak stuff

preserve datajob info

fix bugs in emitter

support threaded exec

fix rendering

refactoring

more refactoring

tweak comment

review comments

mypy

fix mypy errors

drop python 3.7

fix lint

fix lint
Copy link
Collaborator

@asikowitz asikowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing looks good to me

Comment on lines +12 to +14
- Airflow DAG and tasks, including properties, ownership, and tags.
- Task run information, including task successes and failures.
- Manual lineage annotations using `inlets` and `outlets` on Airflow operators.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice these are most of the improvements we were looking for


The Airflow lineage plugin is only supported with Airflow version >= 2.0.2 or on MWAA with an Airflow version >= 2.0.2.
There's two actively supported implementations of the plugin, with different Airflow version support.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long do we intend to support both?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably for a while - there's no good reason to drop support yet

docs/lineage/airflow.md Outdated Show resolved Hide resolved
@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 4, 2023
@anshbansal anshbansal merged commit a300b39 into datahub-project:master Oct 4, 2023
58 checks passed
@hsheth2 hsheth2 deleted the airflow-plugin branch October 5, 2023 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Projects
None yet
3 participants