Root Cause Investigation: Memory Growth in LocalExecutor Workers (Scheduler Subprocesses) #56641

@wjddn279

Description

Apache Airflow version

3.1.0

If "Other Airflow 2/3 version" selected, which one?

No response

What happened?

Related to #55768 (comment).
Memory usage in the scheduler container continues to rise over time.

What you think should happen instead?

Memory usage should remain stable in an Airflow deployment running under Docker Compose.

How to reproduce

Test environment:

  • Airflow 3.1.0 official Docker image
  • deployed with Docker Compose (api-server, scheduler, dag-processor)
  • 100 DAGs, each with 5 PythonOperator tasks, running every minute

Operating System

Docker containers running on macOS

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

x-airflow-common:
  &airflow-common
  image: apache/airflow:3.1.0
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__AUTH_MANAGER: airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__CORE__EXECUTION_API_SERVER_URL: 'http://airflow-apiserver:8080/execution/'
    AIRFLOW__API__SECRET_KEY: 'abc'
    AIRFLOW__API_AUTH__JWT_SECRET: 'asdasd'
    AIRFLOW__SCHEDULER__ENABLE_TRACEMALLOC: 'false'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
  depends_on:
    &airflow-common-depends-on
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        echo "Creating missing opt dirs if missing:"
        mkdir -v -p /opt/airflow/{logs,dags,plugins,config}
        echo "Airflow version:"
        /entrypoint airflow version
        echo "Running airflow config list to create default config file if missing."
        /entrypoint airflow config list >/dev/null
        echo "Change ownership of files in /opt/airflow to ${AIRFLOW_UID}:0"
        chown -R "${AIRFLOW_UID}:0" /opt/airflow/
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    depends_on:
      <<: *airflow-common-depends-on

  airflow-apiserver:
    <<: *airflow-common
    command: api-server
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/api/v2/version"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "airflow", "jobs", "check", "--job-type", "SchedulerJob"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully


  airflow-dag-processor:
    <<: *airflow-common
    command: dag-processor
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type DagProcessorJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

Anything else?

As mentioned earlier in #55768 (comment), the observed memory increase originates from both the scheduler process and its subprocesses, the LocalExecutor workers.
The scheduler's own memory growth has already been analyzed and discussed by @kaxil in #55768 (comment), so I will not cover it here.

When running with the LocalExecutor, the default number of worker processes is 32.
Since any memory increase per worker is multiplied across all 32 workers, even small leaks can have a critical impact on overall memory usage.

I used Memray to analyze the worker processes (which are child processes of the scheduler) and identified three main causes of excessive memory allocation within them.
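
For reference, Memray can capture a worker's allocations through its Python API as well as its CLI. Below is a minimal sketch of the API route; run_worker is a hypothetical stand-in for the worker body, and the resulting .bin capture is rendered with memray flamegraph:

import memray

def run_worker() -> None:
    # Hypothetical stand-in for the LocalExecutor worker body.
    data = [bytes(1024) for _ in range(1000)]
    print(f"allocated {len(data)} KiB")

# Record every allocation made while the worker body runs; rendering the
# capture afterwards produces HTML flame graphs like the ones attached below.
with memray.Tracker("worker-111.bin"):
    run_worker()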

1. Importing the k8s client object

First, here is the result of analyzing a single worker process:
memray-flamegraph-output-111.html

In this flame graph, I confirmed that approximately 32 MB of memory is allocated per worker by the Kubernetes client import.
Although the code appears only to reference the object's type, evaluating that reference triggers imports of all the underlying submodules.
Since each worker performs these imports independently, this adds roughly 1 GB of memory allocation across the 32 workers.
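
The usual remedy, assuming the reference is needed only for type annotations, is to guard the import behind typing.TYPE_CHECKING. The sketch below shows the pattern; describe_pod_override is a hypothetical helper, not the actual Airflow code:

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so each
    # forked worker skips the ~32 MB kubernetes.client import entirely.
    from kubernetes.client import models as k8s

def describe_pod_override(pod: k8s.V1Pod | None) -> str:
    # The annotation still names the k8s type, but thanks to
    # `from __future__ import annotations` it stays an unevaluated string.
    return "no pod override" if pod is None else str(pod.metadata)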

2. Increasing memory from client SSL objects

After modifying the problematic code in (1) to prevent the import, I ran memory profiling again.
While the initial memory footprint per worker was significantly reduced, I still observed gradual memory growth over time. (The filename suffix 0928 means the snapshot was taken at 09:28, and likewise for 1001 and 1035.)
remove-k8s-0928.html
remove-k8s-1001.html
remove-k8s-1035.html

In the flame graph section below, SSL initialization appears not to release memory properly.
Within about 30 minutes, a single worker's memory grew from 8 MB to 23 MB, later exceeded 50 MB, and continued to climb steadily thereafter.
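
If the growth really does come from building a fresh SSL context for every connection, one common mitigation is to cache a single context per process. This is only a sketch of the pattern under that assumption, not the confirmed fix for this code path:

import ssl
from functools import lru_cache

@lru_cache(maxsize=1)
def shared_ssl_context() -> ssl.SSLContext:
    # ssl.create_default_context() loads the full CA bundle on every call;
    # caching one context per process avoids re-allocating it for each
    # connection the worker opens.
    return ssl.create_default_context()

# Usage: pass shared_ssl_context() wherever a new context was being built.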

3. Memory inheritance from the parent process due to lazy forking

After addressing issues (1) and (2), the heap memory reported for each worker remained stable and showed no continuous growth.
However, I noticed that while each worker's initial PSS was low, it gradually increased to relatively high levels over time.
memory_smem.txt

Tracking the exact distribution with Memray was difficult because most of the memory was shared; very little heap memory remained in the workers themselves.
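
PSS is the right metric here because the kernel divides each shared page among its sharers, so pages that CoW silently privatizes show up as PSS growth even while the worker's own heap (what Memray sees) stays small. A small sketch that reads it directly on Linux, which is what smem summarizes in memory_smem.txt:

def pss_kib(pid: int) -> int:
    # Proportional set size for one process, in KiB.
    # /proc/<pid>/smaps_rollup is the kernel's pre-aggregated view.
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("Pss:"):
                return int(line.split()[1])  # reported in kB
    raise RuntimeError(f"no Pss entry for pid {pid}")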

My hypothesis is as follows:
Unlike Airflow 2.x, version 3.x introduced lazy worker initialization, so workers are forked only after the scheduler has been running for a while.
When the scheduler (by then holding significant memory) forks a new worker, the child initially shares the parent's pages; as either process writes to them, Copy-on-Write (CoW) duplicates those pages into each worker, steadily increasing per-process memory consumption.

Conclusion

To verify this hypothesis, I modified the code to eagerly spawn worker processes before the scheduler enters its scheduling loop — effectively disabling lazy forking.
The experiment showed that worker memory usage remained stable and no longer exhibited the previous pattern of gradual growth.

memory_smem_2.txt
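
For illustration, the shape of that change (forking the whole pool before the parent heap grows) looks roughly like the sketch below; worker_loop is a hypothetical stand-in for the real LocalExecutor worker body:

import multiprocessing as mp

def worker_loop(queue):
    # Hypothetical stand-in for the LocalExecutor worker: block on the
    # queue so the process stays alive from the moment it is forked.
    while (item := queue.get()) is not None:
        print("processing", item)

def eager_prefork(pool_size=32):
    # Fork every worker up front, while the parent's heap is still small;
    # pages the scheduler allocates later are then never shared with the
    # children, so CoW has nothing to duplicate into them.
    queue = mp.Queue()
    workers = [mp.Process(target=worker_loop, args=(queue,)) for _ in range(pool_size)]
    for p in workers:
        p.start()
    return queue, workers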

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • affected_version:3.0 (Issues Reported for 3.0)
  • affected_version:3.1 (Issues Reported for 3.1)
  • area:Scheduler (including HA scheduler)
  • area:core
  • kind:bug (This is clearly a bug)
  • needs-triage (new issues not yet triaged)
  • priority:high (should be patched quickly but does not require an immediate new release)
