Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for labelling DAG edges #15142

Merged
merged 1 commit into from
Apr 9, 2021

Conversation

andrewgodwin
Copy link
Contributor

@andrewgodwin andrewgodwin commented Apr 1, 2021

This adds support for putting human-readable labels on edges in the DAG between Tasks, as well as making the underlying framework for that generic enough that future metadata could be added if desired.

What's left to do:

  • Add support for XComArgs
  • Add either support or a nice error for TaskGroups
  • Add tests
  • Add documentation
  • Add custom serialisation

It modifies both the GraphViz and the D3 renderers - example:

image

from airflow.models import Label
...
data = fetch_data()
mean_height = calculate_mean_height(data)
mean_height >> Label("Mean") >> print_result
max_height = calculate_max_height(data)
max_height.set_downstream(print_result_again, Label("Max"))

closes: #15140

@boring-cyborg boring-cyborg bot added area:serialization area:UI Related to UI/UX. For Frontend Developers. area:webserver Webserver related Issues labels Apr 1, 2021
@ashb
Copy link
Member

ashb commented Apr 1, 2021

from airflow.models import Label

I'm not a fan of this import for in user dags (they generally don't import from models currently) but I don't have an immediate other suggestion.

(Just a comment on "user" import not code location)

@andrewgodwin
Copy link
Contributor Author

from airflow.models import Label

I'm not a fan of this import for in user dags (they generally don't import from models currently) but I don't have an immediate other suggestion

Yeah, I went trawling around some example DAGs to find what path might fit, and I saw a couple with models imports for Variable and so forth, so I went with that initially. I'm not sure what airflow.task is for - seems pretty empty - but that was something else that felt feasible.

@ashb
Copy link
Member

ashb commented Apr 1, 2021

To add to your Todo list: update the SerialisedDag representations to include this

@andrewgodwin
Copy link
Contributor Author

To add to your Todo list: update the SerialisedDag representations to include this

That's actually done already, or it wouldn't make it across to the UI. Unless there's extra work above and beyond just making it appear in the schema & json?

@ashb
Copy link
Member

ashb commented Apr 1, 2021

To add to your Todo list: update the SerialisedDag representations to include this

That's actually done already, or it wouldn't make it across to the UI. Unless there's extra work above and beyond just making it appear in the schema & json?

Nope, I just missed it then! GitHub mobile is not the best at displaying PRs

@@ -102,7 +102,8 @@
"_task_group": {"anyOf": [
{ "type": "null" },
{ "$ref": "#/definitions/task_group" }
]}
]},
"edge_info": { "$ref": "#/definitions/dict" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this could maybe be specified a bit tighter. WDYT @kaxil ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea similar to how we only store _downstream_task_ids (not the _upstream_task_ids) for a task and we can set upstream for the other when de-serializing. We do that to keep the size of serialized blog as small as possible.

if k == "_downstream_task_ids":
v = set(v)

for k, v in encoded_dag.items():
if k == "_downstream_task_ids":
v = set(v)

for task_id in serializable_task.downstream_task_ids:
# Bypass set_upstream etc here - it does more than we want
# noqa: E501 # pylint: disable=protected-access
dag.task_dict[task_id]._upstream_task_ids.add(serializable_task.task_id)

cls.__serialized_fields = frozenset(
vars(BaseOperator(task_id='test')).keys()
- {
'inlets',
'outlets',
'_upstream_task_ids',

We pay the price of hardcoding it but it is worth it as we can save MBs (which would be transmitted to and from the database) when number of tasks is huge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll add a slightly more compact representation. I can't save any actual data, since source-target-label is all that's stored, but we can save the outside wrapping with _type etc. at least?

@uranusjr
Copy link
Member

uranusjr commented Apr 1, 2021

from airflow.models import Label

How about hiding the Label class from the user entirely? Something like task1 << "Label" << task2 and create the label automatically in _set_relatives().

@andrewgodwin
Copy link
Contributor Author

from airflow.models import Label

How about hiding the Label class from the user entirely? Something like task1 << "Label" << task2 and create the label automatically in _set_relatives().

I'd not be a huge fan of that design for two reasons:

  • If we ever want to add edge information beyond just a label (e.g. add a longer description), it means we have to go back to this pattern
  • It's relatively unobvious to new people reading the code what the meaning of a plain string is, whereas I think wrapping it in Label gives it immediate meaning to the casual reader

@xinbinhuang
Copy link
Contributor

xinbinhuang commented Apr 2, 2021

If we ever want to add edge information beyond just a label (e.g. add a longer description), it means we have to go back to this pattern

Can you give an example of a long description that has to use Label but not otherwise?

How about task1 >> "this is a label" | task2? This is the exact opposite of the Beam Python SDK example here.

Though I am a bit hesitant to introduce another special operator for Airflow, I think it's still better rather than introducing a Label inheriting from TaskMixin for these 2 reasons:

  • Conceptually, label and task should not be treated the same. The label is metadata but the task is a task. They are different in both UI and execution.
  • In general, I don't like the explicit import of UI components in DAG code. I think DAG code should focus on execution/dependency logic

@andrewgodwin
Copy link
Contributor Author

andrewgodwin commented Apr 2, 2021

I would argue that this fulfills a similar role as task groups - it's a primarily UI-focused feature, but it's also DAG metadata and should live in it. If it's not in the DAG, where would it go where it can appear both in the webserver view and in the output from airflow dags show?

I'm open to changing the user interface here - this is just a rough idea me and @ashb discussed - but I really don't like moving to just a plain str as the thing we use with >> as it's going to make the code significantly more complex to track where everything is supposed to go, especially when combined with XComArgs (which are the model I based this on). Not to mention all the typing for >> expects something inheriting from TaskMixin right now.

I'm also not sure you could make this work:

[t1, t2, t3] >> "label" >> [t4, t5, t6]

@uranusjr
Copy link
Member

uranusjr commented Apr 2, 2021

I'm also not sure you could make this work:

[t1, t2, t3] >> "label" >> [t4, t5, t6]

I don’t think [t1, t2, t3] >> [t4, t5, t6] works either, so that’s not a big problem IMO. But I am currently in camp explicit Label now although I proposed the str syntax; there are things a plain str can’t handle, so a class will be needed sooner or later.

@turbaszek
Copy link
Member

turbaszek commented Apr 2, 2021

When doing operators for adding labels it would be good to check if they work with lineage operators (> and < afaik).

@andrewgodwin would you mind elaborating more about the use case for the labels? In my understanding all the actions are performed by tasks and edges in DAGs represent only order (and sometimes data) relation. Adding a label to edge would suggest that some action is performed between tasks. Is it done by XCom or something?

@andrewgodwin
Copy link
Contributor Author

andrewgodwin commented Apr 2, 2021

The idea for labels is merely to give the user a visual indication of what each edge means - this can be especially important, IMO, when using the branching operators or similar, on larger dags. It has no runtime effect.

I don't think I'd even want to add anything with a runtime effect on edges; Airflow is designed around all the runtime info (priority, etc.) being at the Task level and I like that. The only things I could forsee adding is other informational data, such as a longer "description" field.

The other option to achieve this is to take this, task groups, and any other informational-only parts of airflow and spin them off into a separate presentation layer somehow, and while I do quite like separating presentation and logic, I think that would be too unwieldy to actually be useful.

airflow/models/edgemodifier.py Outdated Show resolved Hide resolved
airflow/utils/types.py Outdated Show resolved Hide resolved
@@ -102,7 +102,8 @@
"_task_group": {"anyOf": [
{ "type": "null" },
{ "$ref": "#/definitions/task_group" }
]}
]},
"edge_info": { "$ref": "#/definitions/dict" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea similar to how we only store _downstream_task_ids (not the _upstream_task_ids) for a task and we can set upstream for the other when de-serializing. We do that to keep the size of serialized blog as small as possible.

if k == "_downstream_task_ids":
v = set(v)

for k, v in encoded_dag.items():
if k == "_downstream_task_ids":
v = set(v)

for task_id in serializable_task.downstream_task_ids:
# Bypass set_upstream etc here - it does more than we want
# noqa: E501 # pylint: disable=protected-access
dag.task_dict[task_id]._upstream_task_ids.add(serializable_task.task_id)

cls.__serialized_fields = frozenset(
vars(BaseOperator(task_id='test')).keys()
- {
'inlets',
'outlets',
'_upstream_task_ids',

We pay the price of hardcoding it but it is worth it as we can save MBs (which would be transmitted to and from the database) when number of tasks is huge.

@dimberman dimberman self-requested a review April 5, 2021 17:59
@andrewgodwin andrewgodwin marked this pull request as ready for review April 8, 2021 17:16
@andrewgodwin andrewgodwin changed the title Add support for labelling DAG edges (WIP) Add support for labelling DAG edges Apr 8, 2021
@andrewgodwin
Copy link
Contributor Author

(ignore it being ready for review, I clicked that before remembering the serialisation needs fixing)

@github-actions
Copy link

github-actions bot commented Apr 8, 2021

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@andrewgodwin andrewgodwin force-pushed the edge-labels branch 2 times, most recently from bbbdaff to b80d879 Compare April 8, 2021 19:33
@ryanahamilton
Copy link
Contributor

There are a couple of visual tweaks we should add when hovering tasks or statuses (e.g. fading out labels ). I'll provide you a solution as I was just modifying these interactions in #15257.

As well as grouping tasks into groups, you can also label the edges between
different tasks in the Graph View - this can be especially useful for branching
areas of your DAG, so you can label the conditions under which certain branches
might run.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Probably a screenshot in this doc might be helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know how I didn't think about that one given it's an entirely visual feature! I'll get it in tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna merge this PR so we don't end up in CI hell, but @andrewgodwin please make another PR with this screenshot if you think that will help

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewgodwin can you also add some Edge Labels to a few of the DAGs in airflow/example_dags/ in the same PR? It would be useful for this feature to be exposed during local development.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Apr 9, 2021
@github-actions
Copy link

github-actions bot commented Apr 9, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

This adds support for putting human-readable labels on edges in the DAG
between Tasks, as well as making the underlying framework for that
generic enough that future metadata could be added if desired.
@ryanahamilton
Copy link
Contributor

I opened #15298 to address the aforementioned visual enhancements needed for this feature.

@dimberman dimberman merged commit 19b74fd into apache:master Apr 9, 2021
kaxil pushed a commit that referenced this pull request Apr 9, 2021
Part of the resolution of #15140 (paired with #15142)

"Edge labels" are an existing concept in the library that powers the Graph view. This feature will be employed in Airflow the Airflow Graph view with #15142. This PR ensures that the labels are displayed properly when interacting with the Task/path/status highlighting features of the Graph. Primarily, it fades out the labels when not relevant.
@andrewgodwin andrewgodwin deleted the edge-labels branch April 9, 2021 17:18
@andrewgodwin
Copy link
Contributor Author

I've opened #15310 to get the extra examples in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:serialization area:UI Related to UI/UX. For Frontend Developers. area:webserver Webserver related Issues full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow labelling of task dependencies
8 participants