Optimize DAG run scheduling based on dataset triggers and batching #37707

sunank200 · 2024-02-26T13:31:58Z

This PR optimizes the dags_needing_dagruns method of the DagModel class by adding batch processing for large sets of DAG IDs. The motivation is that processing a large number of DAGs at once can lead to significant memory usage and slow down the scheduler.

Changes:

Batch Processing: The method now processes DatasetDagRunQueue records in batches instead of all at once, with the aim of reducing memory usage and the overhead of loading and processing large numbers of DAGs simultaneously.
Error Handling: Improves the handling of cases where a DAG's serialization version is outdated or where no dataset trigger records are found for a given DAG.
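
As a rough illustration of the batching idea (not the code in this PR), the sketch below splits a collection of DAG IDs into fixed-size batches and looks up queue records one batch at a time. The names iter_batches, dag_ids_with_queue_records, and fetch_queued, as well as the default batch size of 500, are assumptions made for this example and are not part of Airflow's API; fetch_queued stands in for whatever query returns the DAG IDs that have DatasetDagRunQueue records.

```python
from __future__ import annotations

from itertools import islice
from typing import Callable, Iterable, Iterator


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def dag_ids_with_queue_records(
    dag_ids: Iterable[str],
    fetch_queued: Callable[[list[str]], Iterable[str]],
    batch_size: int = 500,  # hypothetical default, chosen only for this sketch
) -> set[str]:
    """Collect the DAG IDs that have queued records, one batch at a time.

    fetch_queued is a hypothetical callback: given a batch of DAG IDs, it
    returns the subset that has queue records.
    """
    found: set[str] = set()
    for batch in iter_batches(dag_ids, batch_size):
        # Only one batch of IDs is handed to the lookup at any time.
        found.update(fetch_queued(batch))
    return found
```

With a lookup that does its own filtering, for example `lambda batch: [d for d in batch if d in queued]`, the function returns the same set as a single unbatched lookup would; only the number of IDs handled per call changes.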

Depends on the merge of PRs #37016 and #37101.

Dependency Checklist

PR Add conditional logic for dataset triggering #37016 should be merged before this PR.
PR Introducing Logical Operators for dataset conditional logic #37101 should be merged before this PR.


sunank200 · 2024-02-28T17:30:25Z

Based on a conversation with @dstandish, we can park this PR and work on more pressing issues first.

sunank200 · 2024-03-20T09:36:04Z

Based on a conversation with @vatsrahul1001, this approach did not improve performance, so we are not going ahead with it.

sunank200 mentioned this pull request Feb 26, 2024

Optimize dags_needing_dagruns Method with Batch Processing #37368

Closed

2 tasks

sunank200 force-pushed the optimize_dags_needing_dagruns_with_batch branch from d685fd5 to 479d77a February 26, 2024 13:33

Optimize DAG run scheduling based on dataset triggers and batching

493a934

sunank200 force-pushed the optimize_dags_needing_dagruns_with_batch branch from 479d77a to 493a934 February 27, 2024 05:01

phanikumv requested review from dstandish, uranusjr and jedcunningham February 28, 2024 09:42

sunank200 closed this Mar 20, 2024
