Root Cause Investigation: Memory Growth in LocalExecutor Workers (Scheduler Subprocesses) #56641

@wjddn279

Description

Apache Airflow version

3.1.0

If "Other Airflow 2/3 version" selected, which one?

No response

What happened?

Related to #55768 (comment).
Memory usage in the scheduler container continues to rise over time.

What you think should happen instead?

Memory usage should remain stable in an Airflow deployment running under Docker Compose.

How to reproduce

Test environment:

  • Airflow 3.1.0 official Docker image
  • deployed with Docker Compose (api-server, scheduler, dag-processor)
  • 100 DAGs, each with 5 PythonOperator tasks, running every minute

Operating System

Docker containers running on macOS

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

x-airflow-common:
  &airflow-common
  image: apache/airflow:3.1.0
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__AUTH_MANAGER: airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__CORE__EXECUTION_API_SERVER_URL: 'http://airflow-apiserver:8080/execution/'
    AIRFLOW__API__SECRET_KEY: 'abc'
    AIRFLOW__API_AUTH__JWT_SECRET: 'asdasd'
    AIRFLOW__SCHEDULER__ENABLE_TRACEMALLOC: 'false'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
  depends_on:
    &airflow-common-depends-on
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        echo "Creating missing opt dirs if missing:"
        mkdir -v -p /opt/airflow/{logs,dags,plugins,config}
        echo "Airflow version:"
        /entrypoint airflow version
        echo "Running airflow config list to create default config file if missing."
        /entrypoint airflow config list >/dev/null
        echo "Change ownership of files in /opt/airflow to ${AIRFLOW_UID}:0"
        chown -R "${AIRFLOW_UID}:0" /opt/airflow/
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    depends_on:
      <<: *airflow-common-depends-on

  airflow-apiserver:
    <<: *airflow-common
    command: api-server
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/api/v2/version"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "airflow", "jobs", "check", "--job-type", "SchedulerJob"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully


  airflow-dag-processor:
    <<: *airflow-common
    command: dag-processor
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type DagProcessorJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

Anything else?

As mentioned earlier in #55768 (comment), the observed memory increase originates from both the scheduler process and its subprocesses, the LocalExecutor workers.
The scheduler's own memory growth has already been analyzed and discussed by @kaxil in #55768 (comment), so I will not cover it here.

When running with the LocalExecutor, the default number of worker processes is 32.
Since any memory increase per worker is multiplied across all 32 workers, even small leaks can have a critical impact on overall memory usage.

I used Memray to analyze the worker processes (which are child processes of the scheduler) and identified three main causes of excessive memory allocation within them.
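
For reference, Memray can capture a worker's allocations through its Python API as well as its CLI. Below is a minimal sketch of the API route; run_worker is a hypothetical stand-in for the worker body, and the resulting .bin capture is rendered with memray flamegraph:

import memray

def run_worker() -> None:
    # Hypothetical stand-in for the LocalExecutor worker body.
    data = [bytes(1024) for _ in range(1000)]
    print(f"allocated {len(data)} KiB")

# Record every allocation made while the worker body runs; rendering the
# capture afterwards produces HTML flame graphs like the ones attached below.
with memray.Tracker("worker-111.bin"):
    run_worker()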

1. Importing the k8s client object

First, here is the result of analyzing a single worker process:
memray-flamegraph-output-111.html

In this flame graph, I confirmed that approximately 32 MB of memory is allocated per worker by the Kubernetes client import.
Although the code appears only to reference the object's type, evaluating that reference triggers imports of all the underlying submodules.
Since each worker performs these imports independently, this adds roughly 1 GB of memory allocation across the 32 workers.
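
The usual remedy, assuming the reference is needed only for type annotations, is to guard the import behind typing.TYPE_CHECKING. The sketch below shows the pattern; describe_pod_override is a hypothetical helper, not the actual Airflow code:

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so each
    # forked worker skips the ~32 MB kubernetes.client import entirely.
    from kubernetes.client import models as k8s

def describe_pod_override(pod: k8s.V1Pod | None) -> str:
    # The annotation still names the k8s type, but thanks to
    # `from __future__ import annotations` it stays an unevaluated string.
    return "no pod override" if pod is None else str(pod.metadata)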

2. Increasing memory from client SSL objects

After modifying the problematic code in (1) to prevent the import, I ran memory profiling again.
While the initial memory footprint per worker was significantly reduced, I still observed gradual memory growth over time. (The filename suffix 0928 means the snapshot was taken at 09:28, and likewise for 1001 and 1035.)
remove-k8s-0928.html
remove-k8s-1001.html
remove-k8s-1035.html

In the flame graph section below, SSL initialization appears not to release memory properly.
Within about 30 minutes, a single worker's memory grew from 8 MB to 23 MB, later exceeded 50 MB, and continued to climb steadily thereafter.
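
If the growth really does come from building a fresh SSL context for every connection, one common mitigation is to cache a single context per process. This is only a sketch of the pattern under that assumption, not the confirmed fix for this code path:

import ssl
from functools import lru_cache

@lru_cache(maxsize=1)
def shared_ssl_context() -> ssl.SSLContext:
    # ssl.create_default_context() loads the full CA bundle on every call;
    # caching one context per process avoids re-allocating it for each
    # connection the worker opens.
    return ssl.create_default_context()

# Usage: pass shared_ssl_context() wherever a new context was being built.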

3. Memory inheritance from the parent process due to lazy forking

After addressing issues (1) and (2), the heap memory reported for each worker remained stable and showed no continuous growth.
However, I noticed that while each worker's initial PSS was low, it gradually increased to relatively high levels over time.
memory_smem.txt

Tracking the exact distribution with Memray was difficult because most of the memory was shared; very little heap memory remained in the workers themselves.
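
PSS is the right metric here because the kernel divides each shared page among its sharers, so pages that CoW silently privatizes show up as PSS growth even while the worker's own heap (what Memray sees) stays small. A small sketch that reads it directly on Linux, which is what smem summarizes in memory_smem.txt:

def pss_kib(pid: int) -> int:
    # Proportional set size for one process, in KiB.
    # /proc/<pid>/smaps_rollup is the kernel's pre-aggregated view.
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("Pss:"):
                return int(line.split()[1])  # reported in kB
    raise RuntimeError(f"no Pss entry for pid {pid}")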

My hypothesis is as follows:
Unlike Airflow 2.x, version 3.x introduced lazy worker initialization, so workers are forked only after the scheduler has been running for a while.
When the scheduler (by then holding significant memory) forks a new worker, the child initially shares the parent's pages; as either process writes to them, Copy-on-Write (CoW) duplicates those pages into each worker, steadily increasing per-process memory consumption.

Conclusion

To verify this hypothesis, I modified the code to eagerly spawn worker processes before the scheduler enters its scheduling loop — effectively disabling lazy forking.
The experiment showed that worker memory usage remained stable and no longer exhibited the previous pattern of gradual growth.

memory_smem_2.txt
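
For illustration, the shape of that change (forking the whole pool before the parent heap grows) looks roughly like the sketch below; worker_loop is a hypothetical stand-in for the real LocalExecutor worker body:

import multiprocessing as mp

def worker_loop(queue):
    # Hypothetical stand-in for the LocalExecutor worker: block on the
    # queue so the process stays alive from the moment it is forked.
    while (item := queue.get()) is not None:
        print("processing", item)

def eager_prefork(pool_size=32):
    # Fork every worker up front, while the parent's heap is still small;
    # pages the scheduler allocates later are then never shared with the
    # children, so CoW has nothing to duplicate into them.
    queue = mp.Queue()
    workers = [mp.Process(target=worker_loop, args=(queue,)) for _ in range(pool_size)]
    for p in workers:
        p.start()
    return queue, workers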

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • affected_version:3.0 (Issues Reported for 3.0)
  • affected_version:3.1 (Issues Reported for 3.1)
  • area:Scheduler (including HA scheduler)
  • area:core
  • kind:bug (This is clearly a bug)
  • needs-triage (new issues not yet triaged)
  • priority:high (should be patched quickly but does not require an immediate new release)
