
Conversation

Contributor

@wjddn279 wjddn279 commented Dec 1, 2025

Motivation

While investigating fork-related behavior, I found that loading modules that were not imported in the parent process—and therefore had to be loaded each time a child process was spawned—caused significant performance degradation.

Although some optimizations have already been applied to the dag-processor, I would like to propose an additional improvement.

The existing optimization performs static AST analysis on DAG files and preloads the modules declared in their import statements. This yields good results for a single DAG file, but it cannot detect cases where modules are imported indirectly through nested calls, as in the following example.

# dags/dag_files.py
from plugins.generate_dag import create_dag

for name in names:
    create_dag(name)

# plugins/generate_dag.py

# heavy Airflow modules are loaded here, invisible to static
# analysis of dags/dag_files.py
from airflow.module import ~
from airflow.load import ~
from airflow.dags import DAG

def create_dag(name):
    with DAG(
        dag_id=name,
    ) as dag:
        task1 >> task2 >> task3

    return dag

As the use of dynamic DAGs has become more common, we need an approach that can handle these cases, which are difficult to detect through static analysis alone. For this reason, I believe comparing which modules are actually imported at runtime is a more effective method, and this PR implements that approach.

Question

As mentioned in the previous PR, it seemed that preloading heavy modules (e.g. numpy, k8s, pandas) was also being considered. However, based on the current code, it does not appear that these modules are actually being preloaded.

What are your thoughts on allowing users to define modules they want to preload via configuration, and enabling preloading specifically for those modules?



Member

@jason810496 jason810496 left a comment


Thanks for the investigation! Current re-import approach LGTM.

If I understand correctly, the DagFileProcessorManager process will import not_loaded_airflow_modules, and CoW should prevent the DagFileProcessorProcess child processes from importing those modules again. Is there any way to verify that CoW does prevent the importing in child processes (e.g. benchmarking, debug logging, or further unit tests)? Thanks!
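One way to unit-test the re-import side of this question is to fork after a parent-side import and check that the child already finds the module in `sys.modules`, so it never has to import it again. This is a hedged sketch, not the PR's actual test, and the helper name is made up; note it verifies inheritance of loaded modules across fork, while memory sharing itself would need RSS/smaps measurements.

```python
import os
import sys

def child_sees_module(module_name: str) -> bool:
    # After the parent imports a module, a forked child should see it in
    # sys.modules already, i.e. it inherits the parent's loaded modules.
    __import__(module_name)            # parent-side pre-import
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                       # child process
        os.close(r)
        os.write(w, b"1" if module_name in sys.modules else b"0")
        os._exit(0)
    os.close(w)                        # parent process
    seen = os.read(r, 1) == b"1"
    os.waitpid(pid, 0)
    return seen
```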

As mentioned in the #50371 (comment), it seemed that preloading heavy modules (e.g. numpy, k8s, pandas) was also being considered. However, based on the current code, it does not appear that these modules are actually being preloaded.

What are your thoughts on allowing users to define modules they want to preload via configuration, and enabling preloading specifically for those modules?

Yes, #50371 didn't support pre-importing heavy modules from outside Airflow. Maybe we should introduce a new config for users to specify custom modules to import.

Or change the [dag_processor] parsing_pre_import_modules config type from boolean to string (which also aligns better with the name of the config itself, IMO), but that's a breaking change, so I'm not sure it's a good idea.
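A string-valued config could be as simple as splitting a comma-separated value and importing each name, skipping anything not installed. This is a sketch only; the option's name, value format, and skip-on-ImportError behavior are assumptions, not the actual Airflow config.

```python
import importlib

def pre_import_modules(raw: str) -> list[str]:
    # Hypothetical handling of a string-valued option such as
    # parsing_pre_import_modules = "numpy,pandas,kubernetes".
    imported = []
    for name in (m.strip() for m in raw.split(",")):
        if not name:
            continue
        try:
            importlib.import_module(name)
            imported.append(name)
        except ImportError:
            pass  # a configured module may not be installed; skip it
    return imported

print(pre_import_modules("json, decimal, no_such_module"))
```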

@jason810496 jason810496 requested review from ashb and potiuk December 1, 2025 13:40
@potiuk
Member

potiuk commented Dec 1, 2025

I do not think we should invest in it.

This feature will stop being needed pretty much completely after the full isolation of task-sdk from airflow-core.

Once this is completed, the only module that Dags will be able to import from `airflow.` will be `airflow.sdk` - which in essence can always be pre-imported beforehand in DagFileProcessorManager. Even if we load some other `airflow.` stuff, it will actually be loaded via deprecated redirections to `airflow.sdk`.

I would say it's more worthwhile to introduce a completely new feature where you can indeed specify a string list of which modules should be preloaded in DagFileProcessor. But it should be a new config, new behaviour, new feature.

Actually it would also be great if it can be paired with the CoW improvement. All this pre-loading of specified "heavy" modules could be done at the very beginning of DagFileProcessor startup, with the `gc` dance to load all the specified modules and freeze()/unfreeze() every time a new DagFileProcessor parser is started.
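The "gc dance" described above could look roughly like this in the manager before parsers are spawned. A sketch under the assumption that the heavy module list comes from config (names here are illustrative): `gc.freeze` moves surviving objects into the permanent generation, so a child's collector never scans them and dirties their CoW-shared pages.

```python
import gc
import importlib

def preload_and_freeze(module_names):
    # Import the configured heavy modules once in the parent, collect
    # garbage, then freeze survivors into the permanent generation so
    # child-process GC runs leave these long-lived objects untouched.
    for name in module_names:
        importlib.import_module(name)
    gc.collect()  # drop garbage so only live objects get frozen
    gc.freeze()   # move all tracked objects to the permanent generation

preload_and_freeze(["json", "decimal"])
print(gc.get_freeze_count() > 0)
```

After the fork, a child could call `gc.unfreeze()` if it ever needs those objects back in the collectable generations.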

@wjddn279
Contributor Author

wjddn279 commented Dec 2, 2025

I hadn't fully considered the direction of the task-sdk separation — thank you for pointing that out!

As you mentioned, a user-defined module preload feature seems much more appropriate and aligned with the future architecture. Before that, we’ll need to discuss and implement the gc.freeze / unfreeze cycle for the dag-processor, so I’ll revisit this idea after that work has progressed.

I’ll go ahead and close this PR for now.
Thanks again for the thoughtful review and guidance! @potiuk @jason810496

@wjddn279 wjddn279 closed this Dec 2, 2025