Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing Logical Operators for dataset conditional logic #37101

Merged
merged 13 commits into from
Feb 26, 2024

Conversation

sunank200
Copy link
Collaborator

@sunank200 sunank200 commented Jan 30, 2024

We've expanded the dataset dependency handling in Airflow, building on PR #37016. This PR introduces the use of logical operators (| for OR and & for AND) to link Dataset instances, simplifying the expression of complex dependencies.

Key Enhancements:

  • Logical operators | and & for combining Dataset objects.
  • Enhanced URI validation within the Dataset class.
  • New DatasetAny and DatasetAll classes for OR/AND conditions.
  • DatasetsExpression class for building dataset condition trees.
  • extract_datasets function to interpret these expression trees.

Example:

dataset1 = Dataset(uri="s3://bucket1/data1")
dataset2 = Dataset(uri="s3://bucket2/data2")
dataset3 = Dataset(uri="s3://bucket3/data3")
dataset4 = Dataset(uri="s3://bucket4/data4")
dataset5 = Dataset(uri="s3://bucket5/data5")

expr = dataset1 | (dataset2 & dataset3)
# expr translates to DatasetAny(dataset1, DatasetAll(dataset2, dataset3)))

expr1 =  ((dataset1 & dataset2) | dataset3) & (dataset4 | dataset5)

# expr1 translates to DatasetAll(DatasetAny(DatasetAll(dataset1, dataset2), dataset3), DatasetAny(dataset4, dataset5))

This update offers a more intuitive way of expressing dataset dependencies in Airflow workflows.

Depends on the merge of PR #37016.

Dependency Checklist



^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@sunank200 sunank200 changed the title Add dataset conditional syntax Introducing Logical Operators for Dataset Dependencies in Airflow Jan 30, 2024
@sunank200 sunank200 changed the title Introducing Logical Operators for Dataset Dependencies in Airflow Introducing Logical Operators for dataset conditional logic Jan 30, 2024
@sunank200 sunank200 force-pushed the add-dataset-conditional-syntax branch from b4ca8da to ee9de54 Compare January 30, 2024 18:52
airflow/models/dag.py Outdated Show resolved Hide resolved
@dstandish dstandish force-pushed the add-dataset-conditional-syntax branch from ee9de54 to 5189828 Compare January 31, 2024 23:16
@kaxil kaxil added this to the Airflow 2.9.0 milestone Feb 7, 2024
@sunank200 sunank200 force-pushed the add-dataset-conditional-syntax branch 5 times, most recently from c6ce4c7 to caca0f0 Compare February 8, 2024 11:04
@sunank200 sunank200 force-pushed the add-dataset-conditional-syntax branch 3 times, most recently from e95321c to e64450f Compare February 14, 2024 05:52
@sunank200 sunank200 force-pushed the add-dataset-conditional-syntax branch 4 times, most recently from ef19e0f to fedc169 Compare February 21, 2024 13:07
@Lee-W
Copy link
Member

Lee-W commented Feb 22, 2024

As #37016 has been merged, I think we might need to rebase from the main branch and resolve the conflicts before we can review it.

@sunank200 sunank200 force-pushed the add-dataset-conditional-syntax branch 3 times, most recently from 2440c8c to 428087e Compare February 22, 2024 05:49
sunank200 and others added 2 commits February 23, 2024 10:53
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
airflow/datasets/__init__.py Outdated Show resolved Hide resolved
tests/datasets/test_dataset.py Outdated Show resolved Hide resolved
tests/datasets/test_dataset.py Outdated Show resolved Hide resolved
@sunank200 sunank200 force-pushed the add-dataset-conditional-syntax branch from f441e90 to 6b2fdab Compare February 23, 2024 08:49
Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a commit to put __and__ and __or__ in the base class instead, and add comment on the optimized __and__ and __or__ for clarity to avoid future readers needing to ask #37101 (comment) again.

@phanikumv phanikumv merged commit 6c7d2c9 into apache:main Feb 26, 2024
59 checks passed
@phanikumv phanikumv deleted the add-dataset-conditional-syntax branch February 26, 2024 08:13
alejorodriguez96 pushed a commit to alejorodriguez96/airflow that referenced this pull request Feb 26, 2024
…7101)

* Implement | and & operators so that they can be used instead of DatasetAll and DatasetAny

* Refactor dataset class inheritance (apache#37590)

* Refactor DatasetAll and DatasetAny inheritance

They are moved from airflow.models.datasets to airflow.datasets since
the intention is to use them with Dataset, not DatasetModel. It is more
natural for users to import from the latter module instead.

A new (abstract) base class is added for the two classes, plus the OG
Dataset class, to inherit from. This allows us to replace a few
isinstance checks with simple molymorphism and make the logic a bit
simpler.

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: Wei Lee <weilee.rx@gmail.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
alejorodriguez96 pushed a commit to alejorodriguez96/airflow that referenced this pull request Feb 26, 2024
…7101)

* Implement | and & operators so that they can be used instead of DatasetAll and DatasetAny

* Refactor dataset class inheritance (apache#37590)

* Refactor DatasetAll and DatasetAny inheritance

They are moved from airflow.models.datasets to airflow.datasets since
the intention is to use them with Dataset, not DatasetModel. It is more
natural for users to import from the latter module instead.

A new (abstract) base class is added for the two classes, plus the OG
Dataset class, to inherit from. This allows us to replace a few
isinstance checks with simple molymorphism and make the logic a bit
simpler.

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: Wei Lee <weilee.rx@gmail.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants