Conversation

@ashb (Member) commented Mar 12, 2025

This change seems innocuous, and possibly even wrong, but it is the correct
behaviour since #47320 landed. We do not want to call dispose_orm, as that
ends up reconnecting, and sometimes this results in the wrong connection
being shared between the parent and the child. I don't love the "sometimes"
nature of this bug, but the fix seems sound.

Prior to this, running one or two runs concurrently would result in the
scheduler hanging (stuck in SQLAlchemy code trying to roll back) or an error from
psycopg about "error with status PGRES_TUPLES_OK and no message from the libpq".

With this change we were able to repeatedly run 10 runs concurrently.

The reason we don't want this is that we have already registered an at_fork handler
that closes/discards the socket object (without closing the DB-level session), so
calling dispose can, perversely, resurrect that object and try to reuse it!
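
For anyone unfamiliar with the mechanism, here is a rough sketch of the general pattern this relies on. It is not the actual Airflow code: the engine URL and helper name are illustrative, and it assumes SQLAlchemy 1.4.33+ (for `dispose(close=False)`) and Python 3.7+ on POSIX.

```python
import os

from sqlalchemy import create_engine

# Illustrative engine; Airflow builds its real engine from configuration.
engine = create_engine("postgresql+psycopg2://user:pass@localhost/airflow")


def _discard_inherited_connections():
    # Drop this process's references to pooled connections WITHOUT closing
    # them at the DB level -- the parent process still owns those sockets.
    engine.dispose(close=False)


# Runs automatically in every forked child (e.g. LocalExecutor workers).
os.register_at_fork(after_in_child=_discard_inherited_connections)

# Crucially, the child must not follow this with a full dispose/reconnect
# (which is what calling dispose_orm ends up doing); that can resurrect the
# discarded connection object and share the parent's socket with the child.
```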

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>



@boring-cyborg bot added the area:Executors-core (LocalExecutor & SequentialExecutor), area:Scheduler (including HA (high availability) scheduler), and area:task-sdk labels on Mar 12, 2025
@jedcunningham (Member) left a comment

(phew)

@ashb (Member, Author) commented Mar 12, 2025

Task SDK tests are failing. Fix is here: #47679

@ashb force-pushed the fix-localexec-high-concurrency branch from 2783431 to ea66c92 on March 12, 2025 at 15:24
@ashb force-pushed the fix-localexec-high-concurrency branch from ea66c92 to c48f02a on March 12, 2025 at 17:02
@jedcunningham (Member)

Merging. Failure is unrelated.

@jedcunningham merged commit e1f9151 into main on Mar 12, 2025
61 of 62 checks passed
@jedcunningham deleted the fix-localexec-high-concurrency branch on March 12, 2025 at 19:28
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025