Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to add custom facet in OpenLineage events #38982

Merged
merged 22 commits into from
Jul 22, 2024

Conversation

anandhimurali
Copy link
Contributor

@anandhimurali anandhimurali commented Apr 12, 2024

Motivation
Currently, users need to write custom extractor inorder to add custom facets in the lineage events. Instead, providing a way to inject custom facets by accepting a config.

Modifications
The get_custom_facets function in the OpenLineage provider is enhanced to fetch the Airflow config AIRFLOW__OPENLINEAGE__CUSTOM_RUN_FACETS, execute those functions and append their return values to the run facet of the lineage event for all the operators.

Result
Enables users to add custom facets to the lineage events by defining functions instead of custom extractor.

Configs documentation for AIRFLOW__OPENLINEAGE__CUSTOM_RUN_FACETS

image

User Docs

image image image

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Copy link

boring-cyborg bot commented Apr 12, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@anandhimurali anandhimurali marked this pull request as ready for review April 13, 2024 00:47
@XD-DENG
Copy link
Member

XD-DENG commented Apr 18, 2024

Hi @eladkal , @anandhimurali is my colleague from the same team and I'm really excited about her 1st PR to Airflow!

Do you have an idea who among the committers may be the best to help review this openlineage-related change?

Any suggestion/input would be greatly appreciated. Thanks a lot!

@eladkal
Copy link
Contributor

eladkal commented Apr 18, 2024

Ping @mobuchowski for review :)

@mobuchowski
Copy link
Contributor

Hey @anandhimurali @XD-DENG I think ability to add custom facets to arbitrary operators is a great idea.

I have two points I think it's worth addressing:

  1. Whether we pass all the necessary info. I think given task_instance we can call get_template_context and get all the data, so this might not be an issue.
  2. Whether this acts "per DAG/Task" or it's a mechanism that works for all the tasks equally - this might be defined inside the function so it still might not be an issue other than complicating the function a bit more.

Overall I like the solution, but might be good to specify the goals regarding 1) and 2) - because for sure, later, someone will come and ask "how to make this custom facet work only on some particular tasks" and "how do I pass some custom info to my function".

Copy link
Contributor

@kacpermuda kacpermuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea 🚀 Apart from what @mobuchowski mentioned above, I added some small comments about the code itself.

airflow/providers/openlineage/conf.py Outdated Show resolved Hide resolved
airflow/providers/openlineage/utils/utils.py Outdated Show resolved Hide resolved
airflow/providers/openlineage/provider.yaml Outdated Show resolved Hide resolved
airflow/providers/openlineage/utils/utils.py Outdated Show resolved Hide resolved
@kacpermuda
Copy link
Contributor

Hey @anandhimurali @XD-DENG, is there anything i could help you with here? I really like the idea and would like to see it happen, let me know 😄

@anandhimurali
Copy link
Contributor Author

Hey @anandhimurali @XD-DENG, is there anything i could help you with here? I really like the idea and would like to see it happen, let me know 😄

Hi @kacpermuda, Thanks for the interest and I'm excited to hear that. Sorry I've been busy for the last few weeks. I'll look into the comments this week.

@anandhimurali
Copy link
Contributor Author

I have two points I think it's worth addressing:

  1. Whether we pass all the necessary info. I think given task_instance we can call get_template_context and get all the data, so this might not be an issue.
  2. Whether this acts "per DAG/Task" or it's a mechanism that works for all the tasks equally - this might be defined inside the function so it still might not be an issue other than complicating the function a bit more.

Overall I like the solution, but might be good to specify the goals regarding 1) and 2) - because for sure, later, someone will come and ask "how to make this custom facet work only on some particular tasks" and "how do I pass some custom info to my function".

Hi @mobuchowski,

  1. Yes, I think task_instance is sufficient to fetch necessary info about the run context.
  2. The custom_facet_functions defined works equally for all tasks. Yes, user can choose to return None for scenarios where they won't to add the custom facet.

Let me know if this needs to be documented anywhere. Thanks.

Copy link
Contributor

@kacpermuda kacpermuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you already mentioned, i think this should be documented for the users. We should include the function signature with type hints, so that users know how to properly implement it + some examples from our side.

Probably here and here

airflow/providers/openlineage/utils/utils.py Outdated Show resolved Hide resolved
airflow/providers/openlineage/utils/utils.py Outdated Show resolved Hide resolved
@mobuchowski
Copy link
Contributor

@anandhimurali any interest in getting this PR into shape and approved? I think this is good, really useful feature.

@anandhimurali
Copy link
Contributor Author

@anandhimurali any interest in getting this PR into shape and approved? I think this is good, really useful feature.

@mobuchowski Yes, definitely. I just have to add some documentation and will get this done this week. Thanks and sorry for the delay.

@anandhimurali anandhimurali force-pushed the ol-custom-facet branch 6 times, most recently from 4f193aa to 0d63f59 Compare July 3, 2024 05:02
@ahidalgob
Copy link
Contributor

ahidalgob commented Jul 3, 2024

I think it would be good to rename this functionality to custom_run_facets, we might need to inject other facets too, specifically Job.facets.

This is a great feature:)

Copy link
Contributor

@kacpermuda kacpermuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@JDarDagran JDarDagran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one small thing, otherwise great job! 🚀

@mobuchowski mobuchowski merged commit e30f810 into apache:main Jul 22, 2024
52 checks passed
Copy link

boring-cyborg bot commented Jul 22, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@mobuchowski
Copy link
Contributor

Thanks @anandhimurali for implementing this feature 🚀

romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
* Ability to add custom facet in OpenLineage events

* Update airflow/providers/openlineage/provider.yaml

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>

* Update airflow/providers/openlineage/provider.yaml

Co-authored-by: Kacper Muda <mudakacper@gmail.com>

* Adding None type hint for the custom facet function

* Fix a test after rebase

* Removed the legacy OPENLINEAGE_ configs format for OPENLINEAGE_CUSTOM_FACET_FUNCTIONS

* Duplicate facet key check

* Update airflow/providers/openlineage/utils/utils.py

Co-authored-by: Kacper Muda <mudakacper@gmail.com>

* Update airflow/providers/openlineage/utils/utils.py

Co-authored-by: Kacper Muda <mudakacper@gmail.com>

* Fixes after rebase

* Adding user docs for custom_facet_functions

* Rename custom_facet_functions as custom_run_facets

* Increment version for custom_run_facets feature

* Enrich example with access to operator and return value as None.

* Add try-except for custom facet function execution

* Fix the typing for the custom facet fucntion return type

* Documentation: funcs are executed only for START events

* Fix the typing for the custom facet function return type

* Fixes after pre-commit hook checks

* Adding start_date to test DAGs for 2.7 compatibility tests

* Removing a out of scope __init__ file added by pre-commit check

---------

Co-authored-by: Anandhi <anandhi_murali@apple.com>
Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Kacper Muda <mudakacper@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants