Move lineage from airflow core to task sdk #60968
That looks good to me, thanks @amoghrajesh!
Couple of tests to fix though.
providers/openlineage/src/airflow/providers/openlineage/plugins/openlineage.py
potiuk
left a comment
One thing about # use next version
jason810496
left a comment
Nice! LGTM once CI passes.
CI's been a challenge on this one, getting there
Backport failed to create: v3-1-test. View the failure log. Run details.
You can attempt to backport this manually by running:
`cherry_picker e871af6 v3-1-test`
This should apply the commit to the v3-1-test branch and leave the commit in a conflict state. After you have resolved the conflicts, you can continue the backport process by running:
`cherry_picker --continue`
If you don't have cherry-picker installed, see the installation guide.
No need to backport.
@amoghrajesh I am terribly sorry, but the canary build failed and this needed to be reverted. Can you re-apply it and add the "all versions" and "full tests" labels? It seems the test selection did not catch everything.
Lineage collection happens exclusively during task execution and is only used by worker processes. Server components such as the scheduler and API server do not use it. Move the lineage module from airflow-core to the task SDK to better align with the ongoing client–server separation.
…pache#61151) This reverts commit e871af6.
Was generative AI tooling used to co-author this PR?
Why?
Lineage collection is a task execution concern. On checking, it only runs in worker (Task SDK consumer) processes, not in any server component (scheduler, API server). I intend to move the lineage module from airflow-core to task-sdk as part of the ongoing client-server separation work.
Some more context:
- `io/path.py` intercepts file I/O during task execution

What is done?
- `sdk/lineage.py`, `airflow.lineage.hook`
- `ProvidersManager` → `ProvidersManagerTaskRuntime`
- `airflow.utils.log.logging_mixin` → `airflow.sdk.definitions._internal.logging_mixin`
- `get_hook_lineage_readers_plugins()` using SDK's plugin discovery

Backward Compatibility
For core -
- `from airflow.lineage.hook import X` and `from airflow.lineage import hook`

Provider compatibility has been handled with
- `providers/common/compat/src/airflow/providers/common/compat/sdk.py`
- `DatasetLineageInfo` → `AssetLineageInfo` rename (AF2 → AF3)

Removed from core -
- `airflow-core/src/airflow/plugins_manager.py`: `get_hook_lineage_readers_plugins()` function

For provider developers, it is recommended to use imports from `airflow.providers.common.compat.sdk`

Testing
To gain confidence, I tested a manual e2e scenario for this. Ran breeze with the OL integration:
`breeze start-airflow --integration openlineage`

DAG:
It's a simple DAG that does this:
- `file:///input/data.csv`
- `file:///output/result.csv`
- `get_hook_lineage_collector()` to register assets

DAG run:

Marquez:
