Conversation

@ashb (Member) commented Mar 12, 2025

This change seems innocuous, and possibly even wrong, but it is the correct
behaviour since #47320 landed. We do not want to call dispose_orm, as that
ends up reconnecting, and sometimes this results in the wrong connection
being shared between the parent and the child. I don't love the "sometimes"
nature of this bug, but the fix seems sound.

Prior to this, running one or two runs concurrently would result in the
scheduler hanging (stuck in SQLAlchemy code trying to roll back) or an error from
psycopg about "error with status PGRES_TUPLES_OK and no message from the libpq".

With this change we were able to repeatedly run 10 runs concurrently.

The reason we don't want this is that we have already registered an at_fork handler
that closes/discards the socket object (without closing the DB-level session), so
calling dispose can, perversely, resurrect that object and try to reuse it!
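
For anyone unfamiliar with the mechanism, here is a rough sketch of the general pattern this relies on. It is not the actual Airflow code: the engine URL and helper name are illustrative, and it assumes SQLAlchemy 1.4.33+ (for `dispose(close=False)`) and Python 3.7+ on POSIX.

```python
import os

from sqlalchemy import create_engine

# Illustrative engine; Airflow builds its real engine from configuration.
engine = create_engine("postgresql+psycopg2://user:pass@localhost/airflow")


def _discard_inherited_connections():
    # Drop this process's references to pooled connections WITHOUT closing
    # them at the DB level -- the parent process still owns those sockets.
    engine.dispose(close=False)


# Runs automatically in every forked child (e.g. LocalExecutor workers).
os.register_at_fork(after_in_child=_discard_inherited_connections)

# Crucially, the child must not follow this with a full dispose/reconnect
# (which is what calling dispose_orm ends up doing); that can resurrect the
# discarded connection object and share the parent's socket with the child.
```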

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>



@boring-cyborg bot added the area:Executors-core (LocalExecutor & SequentialExecutor), area:Scheduler (including HA (high availability) scheduler), and area:task-sdk labels on Mar 12, 2025
@jedcunningham (Member) left a comment

(phew)

@ashb (Member, Author) commented Mar 12, 2025

Task SDK tests are failing. Fix is here: #47679

@ashb force-pushed the fix-localexec-high-concurrency branch from 2783431 to ea66c92 on March 12, 2025 at 15:24
@ashb force-pushed the fix-localexec-high-concurrency branch from ea66c92 to c48f02a on March 12, 2025 at 17:02
@jedcunningham (Member)

Merging. Failure is unrelated.

@jedcunningham merged commit e1f9151 into main on Mar 12, 2025
61 of 62 checks passed
@jedcunningham deleted the fix-localexec-high-concurrency branch on March 12, 2025 at 19:28
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025