Description
Apache Airflow version
3.1.0
If "Other Airflow 2/3 version" selected, which one?
No response
What happened?
related to: #55768 (comment)
Memory usage continues to rise in the scheduler container.
What you think should happen instead?
Memory usage should not continuously rise in Airflow deployed with Docker Compose.
How to reproduce
Test environment:
- Airflow 3.1.0 official Docker image
- deployed with Docker Compose (api-server, scheduler, dag-processor)
- 100 DAGs, each with 5 PythonOperator tasks, running every minute (a sketch of the DAG shape is below)
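For reference, this is a minimal sketch of the shape of the DAGs used; the dag_id, schedule, and task bodies are illustrative, not the exact files.

```python
# Illustrative only: 100 modules of this shape, each with 5 trivial
# PythonOperator tasks on a one-minute schedule. Assumes the Airflow 3 task
# SDK and the standard provider's PythonOperator.
from datetime import datetime

from airflow.sdk import DAG
from airflow.providers.standard.operators.python import PythonOperator


def do_nothing() -> None:
    pass


with DAG(
    dag_id="memory_test_dag_001",  # 001..100 in the actual setup
    schedule="* * * * *",          # every minute
    start_date=datetime(2025, 1, 1),
    catchup=False,
):
    for i in range(5):
        PythonOperator(task_id=f"task_{i}", python_callable=do_nothing)
```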
Operating System
Docker containers running on macOS
Versions of Apache Airflow Providers
No response
Deployment
Docker-Compose
Deployment details
```yaml
x-airflow-common:
  &airflow-common
  image: apache/airflow:3.1.0
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__AUTH_MANAGER: airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__CORE__EXECUTION_API_SERVER_URL: 'http://airflow-apiserver:8080/execution/'
    AIRFLOW__API__SECRET_KEY: 'abc'
    AIRFLOW__API_AUTH__JWT_SECRET: 'asdasd'
    AIRFLOW__SCHEDULER__ENABLE_TRACEMALLOC: 'false'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
  depends_on:
    &airflow-common-depends-on
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        echo "Creating missing opt dirs if missing:"
        mkdir -v -p /opt/airflow/{logs,dags,plugins,config}
        echo "Airflow version:"
        /entrypoint airflow version
        echo "Running airflow config list to create default config file if missing."
        /entrypoint airflow config list >/dev/null
        echo "Change ownership of files in /opt/airflow to ${AIRFLOW_UID}:0"
        chown -R "${AIRFLOW_UID}:0" /opt/airflow/
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    depends_on:
      <<: *airflow-common-depends-on

  airflow-apiserver:
    <<: *airflow-common
    command: api-server
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/api/v2/version"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "airflow", "jobs", "check", "--job-type", "SchedulerJob"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully

  airflow-dag-processor:
    <<: *airflow-common
    command: dag-processor
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type DagProcessorJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully
```
Anything else?
As mentioned earlier #55768 (comment), the observed memory increase originates from both the scheduler process and its subprocesses — the LocalExecutor workers.
The scheduler’s own memory growth has already been analyzed and discussed by @kaxil in #55768 (comment), so I will not cover it here.
When running with the LocalExecutor, the default number of worker processes is 32.
Since any memory increase per worker is multiplied across all 32 workers, even small leaks can have a critical impact on overall memory usage.
I used Memray to analyze the worker processes (which are child processes of the scheduler) and identified three main causes of excessive memory allocation within them.
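As a tooling aside, the captures behind the HTML files below can be produced either by attaching memray to a running PID (`memray attach <pid>`) or by wrapping the code of interest in its Tracker API; the sketch below shows the API variant with placeholder names, not my exact invocation.

```python
# Sketch of memray's programmatic API; `run_worker_loop` is a placeholder for
# whatever code is being profiled. Render the capture afterwards with:
#   memray flamegraph worker.bin
from memray import Tracker


def run_worker_loop() -> None:
    ...  # placeholder for the profiled code


if __name__ == "__main__":
    with Tracker("worker.bin"):
        run_worker_loop()
```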
1. Importing the k8s client object
First, here is the result of analyzing a single worker process:
memray-flamegraph-output-111.html
In this part of the flamegraph, I confirmed that approximately 32 MB of memory is allocated per worker.
Although the code only appears to reference the object’s type, it actually triggers imports of all underlying submodules.
Since each worker imports these modules independently, this results in an additional ~1 GB of total memory allocation across all workers.
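For illustration, the usual way to keep a type-only reference from importing the whole client at runtime is a TYPE_CHECKING guard; this is a generic sketch, not the actual Airflow call site.

```python
# Generic sketch: the kubernetes import is paid only by type checkers, so a
# forked worker that never handles pod overrides avoids the ~32 MB import.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from kubernetes.client.models import V1Pod  # no runtime import


def describe_pod_override(pod_override: V1Pod | None) -> str:
    # With `from __future__ import annotations` the annotation above is never
    # evaluated at runtime; code that truly needs client objects should import
    # them lazily inside the function instead.
    if pod_override is None:
        return "no pod override"
    return type(pod_override).__name__
```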
2. Increasing memory from client SSL objects
After modifying the problematic code in (1) to prevent the import, I ran memory profiling again.
While the initial memory footprint per worker was significantly reduced, I still observed gradual memory growth over time. (The 0928 in the file name means the snapshot was taken at 09:28.)
remove-k8s-0928.html
remove-k8s-1001.html
remove-k8s-1035.html
In the following section of the profile, SSL initialization does not appear to release memory properly.
Within about 30 minutes, a single worker’s memory grew from 8 MB → 23 MB, later exceeding 50 MB, and continued to increase steadily thereafter.
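Independently of memray, the same growth can be watched from outside the workers by polling the scheduler's children; a rough sketch, assuming psutil is available in the container and the scheduler PID is known.

```python
# Poll RSS of the LocalExecutor workers (children of the scheduler) once a
# minute. SCHEDULER_PID is a placeholder; psutil is assumed to be installed.
import time

import psutil


def worker_rss_mib(scheduler_pid: int) -> dict[int, float]:
    scheduler = psutil.Process(scheduler_pid)
    return {
        child.pid: child.memory_info().rss / 1024**2
        for child in scheduler.children(recursive=False)
    }


if __name__ == "__main__":
    SCHEDULER_PID = 1  # replace with the real scheduler PID
    while True:
        print(time.strftime("%H:%M:%S"), worker_rss_mib(SCHEDULER_PID))
        time.sleep(60)
```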
3. Memory inheritance from the parent process due to lazy forking
After addressing issues (1) and (2), I verified that the overall memory consumption remained stable and did not exhibit continuous growth.
However, I noticed that while initial PSS values were low, they gradually increased to relatively high levels over time.
memory_smem.txt
It was difficult to track the exact distribution using Memray due to extensive shared memory usage — very little heap memory remained in the workers themselves.
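One way to see that split is to read the kernel's per-process rollup directly (Linux only; the PID below is a placeholder), which separates a worker's RSS into proportional, shared, and private parts:

```python
# Read /proc/<pid>/smaps_rollup (Linux, kernel >= 4.14) to break a worker's
# resident memory into Rss / Pss / shared / private components, in MiB.
def memory_breakdown_mib(pid: int) -> dict[str, float]:
    wanted = {"Rss", "Pss", "Shared_Clean", "Shared_Dirty", "Private_Clean", "Private_Dirty"}
    result: dict[str, float] = {}
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            key = line.split(":")[0]
            if key in wanted:
                result[key] = int(line.split()[1]) / 1024  # values are in kB
    return result


if __name__ == "__main__":
    print(memory_breakdown_mib(1234))  # placeholder worker PID
```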
My hypothesis is as follows:
Unlike Airflow 2.x, version 3.x introduced lazy worker initialization.
As a result, when the scheduler (already holding significant memory) forks a new worker, Copy-on-Write (CoW) causes shared pages to be duplicated across workers, leading to increased per-process memory consumption.
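A toy, non-Airflow demonstration of the effect I have in mind (Linux only, numbers will vary): an idle worker forked before the parent grows its heap shows a much smaller proportional footprint than one forked afterwards.

```python
# Toy CoW experiment: compare the Pss of an idle child forked before vs. after
# the parent allocates a large heap. Not Airflow code; fork timing is the point.
import multiprocessing as mp
import time


def pss_mib(pid: int) -> float:
    with open(f"/proc/{pid}/smaps_rollup") as f:  # Linux only
        for line in f:
            if line.startswith("Pss:"):
                return int(line.split()[1]) / 1024
    return 0.0


def idle_worker() -> None:
    time.sleep(10)  # does nothing; any footprint is inherited from the parent


if __name__ == "__main__":
    ctx = mp.get_context("fork")

    early = ctx.Process(target=idle_worker)
    early.start()  # forked while the parent heap is still small

    ballast = [bytearray(1024) for _ in range(200_000)]  # parent grows ~200 MiB

    late = ctx.Process(target=idle_worker)
    late.start()  # forked after the parent heap has grown

    time.sleep(1)
    print(f"early-forked worker Pss: {pss_mib(early.pid):.1f} MiB")
    print(f"late-forked  worker Pss: {pss_mib(late.pid):.1f} MiB")
    early.join()
    late.join()
```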
Conclusion
To verify this hypothesis, I modified the code to eagerly spawn worker processes before the scheduler enters its scheduling loop — effectively disabling lazy forking.
The experiment showed that worker memory usage remained stable and no longer exhibited the previous pattern of gradual growth.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct