Add preloading of Airflow modules in dag-processor using runtime diff #58890
Motivation
While investigating fork-related behavior, I found that loading modules that were not imported in the parent process—and therefore had to be loaded each time a child process was spawned—caused significant performance degradation.
Although some optimizations have already been applied to the dag-processor, I would like to propose an additional improvement.
The existing optimization performs static analysis on DAG files using the AST and preloads the modules declared in their import statements. This yields good results for a single DAG, but it cannot detect modules that are imported indirectly through nested calls.
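To illustrate the limitation, here is a minimal sketch, not the example from the original description. It assumes a hypothetical DAG source in which the module name is only known at runtime, and a simplified `static_imports` scanner standing in for the AST-based check. The stdlib module `colorsys` is used only as a stand-in for a heavy dependency.

```python
import ast
import sys

# Hypothetical DAG-file source: the imported module name only exists at runtime.
DAG_SOURCE = '''
import importlib

def load_plugin(name):
    return importlib.import_module(name)

def build():
    # Nested call: the import happens two calls below the top level.
    return load_plugin("color" + "sys")

build()
'''


def static_imports(source: str) -> set[str]:
    """Collect module names from import statements, as an AST scan would."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


# The static scan only sees the literal import statement.
seen_statically = static_imports(DAG_SOURCE)

# Executing the source triggers the indirect import.
exec(compile(DAG_SOURCE, "<dag>", "exec"), {"__name__": "dag_example"})
loaded_at_runtime = "colorsys" in sys.modules
```

The scan reports only `importlib`, while `colorsys` ends up in `sys.modules` once the code runs.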
As the use of dynamic DAGs has become more common, we need an approach that can handle these cases, which are difficult to detect through static analysis alone. For this reason, I believe comparing which modules are actually imported at runtime is a more effective method, and this PR implements that approach.
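The runtime comparison can be sketched as a snapshot of `sys.modules` taken before and after some work. This is an illustration under assumptions, not the code in this PR: `modules_loaded_by` and `parse_once` are hypothetical names, and `parse_once` merely stands in for parsing a DAG file.

```python
import importlib
import sys


def modules_loaded_by(action, prefix: str = "") -> set[str]:
    """Run `action` and return the names of modules it newly imported.

    `prefix` optionally restricts the result, e.g. to names starting with "airflow".
    """
    before = set(sys.modules)
    action()
    return {name for name in set(sys.modules) - before if name.startswith(prefix)}


def parse_once() -> None:
    # Hypothetical stand-in for parsing a DAG file; imports a stdlib module.
    importlib.import_module("wave")


def preload(names) -> None:
    """Import the given modules in the parent so forked children inherit them."""
    for name in sorted(names):
        importlib.import_module(name)


new_modules = modules_loaded_by(parse_once)
preload(new_modules)
```

Because the diff is computed from what was actually imported, it does not matter how deeply nested the import was or whether the module name was built dynamically.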
Question
As mentioned in the previous PR, it seemed that preloading heavy modules (e.g. numpy, k8s, pandas) was also being considered. However, based on the current code, it does not appear that these modules are actually being preloaded.
What are your thoughts on allowing users to define modules they want to preload via configuration, and enabling preloading specifically for those modules?
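One possible shape for such a setting, as a rough sketch only: the comma-separated value and the `preload_modules` helper below are hypothetical, and no existing Airflow configuration option is implied. Stdlib modules stand in for heavy ones such as numpy or pandas.

```python
import importlib


def preload_modules(names):
    """Try to import each configured module; report which ones failed."""
    loaded, failed = [], []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # A missing optional module should not stop the processor.
            failed.append(name)
        else:
            loaded.append(name)
    return loaded, failed


# Hypothetical configuration value supplied by the user.
configured = "json, decimal, not_a_real_module_xyz"
names = [n.strip() for n in configured.split(",") if n.strip()]
loaded, failed = preload_modules(names)
```

Collecting failures instead of raising keeps a typo in the list from breaking DAG parsing. Whether to log or surface those failures is a separate design choice.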
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.