Scheduling based on dataset aliases #40693

Merged: 31 commits into apache:main, Jul 22, 2024

Conversation

Lee-W
Member

@Lee-W Lee-W commented Jul 10, 2024

Related: #40039

What

#40478 introduced a new class, DatasetAlias, which allows emitting DatasetEvents or creating Datasets in a task. This PR makes it possible to schedule DAG runs based on a DatasetAlias.

Example

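from airflow import DAG
from airflow.datasets import Dataset, DatasetAlias
from airflow.decorators import task

with DAG(dag_id="dataset-producer"):
    # Emits events for the concrete dataset directly.
    @task(outlets=[Dataset("s3://bucket/my-task")])
    def produce_dataset_events():
        pass

    produce_dataset_events()

with DAG(dag_id="dataset-alias-producer"):
    # Emits events through the alias, which is resolved when the task runs.
    @task(outlets=[DatasetAlias("example-alias")])
    def produce_dataset_events(*, outlet_events):
        outlet_events["example-alias"].add(Dataset("s3://bucket/my-task"))

    produce_dataset_events()

with DAG(dag_id="dataset-consumer", schedule=Dataset("s3://bucket/my-task")):
    ...

with DAG(dag_id="dataset-alias-consumer", schedule=DatasetAlias("example-alias")):
    ...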

In the example above, DatasetAlias("example-alias") is not resolved to Dataset("s3://bucket/my-task") until the DAG "dataset-alias-producer" has run. Before that, completing the DAG "dataset-producer" triggers only the DAG "dataset-consumer", not the DAG "dataset-alias-consumer". Running "dataset-alias-producer" resolves DatasetAlias("example-alias") to Dataset("s3://bucket/my-task") and produces a dataset event that triggers "dataset-consumer". Once the alias is resolved, completing either "dataset-producer" or "dataset-alias-producer" triggers both "dataset-consumer" and "dataset-alias-consumer".
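The resolution behaviour described above can be illustrated with a small, self-contained toy model. This is only a sketch and not Airflow's implementation: the class ToyScheduler and its methods are hypothetical names introduced for illustration. It also assumes the alias counts as resolved from the moment the first event is emitted through it, so that first event already reaches the alias consumer; the description does not state this explicitly.

```python
from collections import defaultdict


class ToyScheduler:
    """Hypothetical toy model of alias-based scheduling (not Airflow code)."""

    def __init__(self):
        self.dataset_consumers = defaultdict(set)  # dataset URI -> DAG ids
        self.alias_consumers = defaultdict(set)    # alias name -> DAG ids
        self.alias_to_datasets = defaultdict(set)  # alias name -> resolved URIs

    def schedule_on_dataset(self, dag_id, uri):
        self.dataset_consumers[uri].add(dag_id)

    def schedule_on_alias(self, dag_id, alias):
        self.alias_consumers[alias].add(dag_id)

    def emit(self, uri, via_alias=None):
        """Record a dataset event and return the DAG ids it would trigger."""
        if via_alias is not None:
            # Assumption: emitting through an alias resolves it immediately.
            self.alias_to_datasets[via_alias].add(uri)
        triggered = set(self.dataset_consumers.get(uri, ()))
        for alias, uris in self.alias_to_datasets.items():
            if uri in uris:
                triggered |= self.alias_consumers.get(alias, set())
        return triggered


sched = ToyScheduler()
sched.schedule_on_dataset("dataset-consumer", "s3://bucket/my-task")
sched.schedule_on_alias("dataset-alias-consumer", "example-alias")

# "dataset-producer" completes while the alias is still unresolved.
before = sched.emit("s3://bucket/my-task")
# "dataset-alias-producer" completes and resolves the alias.
through_alias = sched.emit("s3://bucket/my-task", via_alias="example-alias")
# "dataset-producer" completes again, now that the alias is resolved.
after = sched.emit("s3://bucket/my-task")

print(sorted(before))  # ['dataset-consumer']
print(sorted(after))   # ['dataset-alias-consumer', 'dataset-consumer']
```

The point of the model is the difference between before and after: the same event from "dataset-producer" reaches "dataset-alias-consumer" only once the alias has been resolved to the dataset.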



@Lee-W Lee-W force-pushed the schedule-on-dataset-alias branch 10 times, most recently from 341281d to 253c5b0, July 15, 2024 01:47
@Lee-W Lee-W force-pushed the schedule-on-dataset-alias branch 9 times, most recently from ffccb4b to 22d772b, July 16, 2024 04:39
@Lee-W Lee-W changed the title from "Schedule on dataset alias" to "Scheduling based on dataset aliases" Jul 16, 2024
@Lee-W Lee-W marked this pull request as ready for review July 16, 2024 07:54
@Lee-W Lee-W requested review from potiuk, kaxil and XD-DENG as code owners July 16, 2024 07:54
Lee-W and others added 20 commits July 22, 2024 19:38
…iter to take dataset_alias"

This reverts commit 22d772b06be7cbfde67ccab6a87569112dec136e.
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
@Lee-W Lee-W force-pushed the schedule-on-dataset-alias branch from e332b66 to 6b1fd6c, July 22, 2024 11:42
@Lee-W
Member Author

Lee-W commented Jul 22, 2024

All the comments were addressed. Please let me know if anyone wants to take a deeper look. I'm planning on merging this one later today.

@phanikumv phanikumv merged commit 8dff8ae into apache:main Jul 22, 2024
48 checks passed
@phanikumv phanikumv deleted the schedule-on-dataset-alias branch July 22, 2024 13:41
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Jul 22, 2024
@ephraimbuddy ephraimbuddy added this to the Airflow 2.10.0 milestone Jul 23, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024

5 participants