
Conversation

@wjddn279
Contributor

@wjddn279 wjddn279 commented Sep 24, 2025

related: #56879

Solutions

There are three possible approaches:

  1. Keep a strong reference to connections in a dictionary to prevent GC.
  2. Disable GC in subprocesses.
  3. Avoid calling dispose; instead, create a new engine in the subprocess.

Among these, (3) is the cleanest and aligns with SQLAlchemy's official recommendation: create new engine and session objects when forking processes.

By replacing dispose() with logic to create a new engine in subprocesses, I confirmed that existing connections were no longer finalized and the bug disappeared.
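
For reference, a minimal sketch of approach (3), assuming a fork-based subprocess; the URL and function name below are illustrative, not the PR's actual code:

import os

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

SQL_ALCHEMY_CONN = "mysql://user:pass@host/dbname"  # placeholder URL

def create_fresh_orm():
    # Build a brand-new engine and session factory instead of calling
    # dispose() on the engine inherited from the parent process.
    engine = create_engine(SQL_ALCHEMY_CONN)
    return engine, sessionmaker(bind=engine)

pid = os.fork()
if pid == 0:
    # Child: leave the inherited engine untouched and use a fresh one,
    # so the parent's pooled connections are never finalized here.
    engine, Session = create_fresh_orm()
    with Session() as session:
        session.execute(text("SELECT 1"))
    os._exit(0)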



@shahar1 shahar1 changed the title Fix: Prevent premature MySQL disconnect caused by GC after dispose Prevent premature MySQL disconnect caused by GC after dispose Sep 24, 2025
@shahar1 shahar1 added area:core type:bug-fix Changelog: Bug Fixes area:MetaDB Meta Database related issues. labels Sep 24, 2025
@wjddn279
Contributor Author

wjddn279 commented Oct 1, 2025

@shahar1
(I hope you don’t mind me mentioning you directly here, as you were tagged on this PR.)

Hello! Could I get a review of this PR? This problem breaks our system, so I hope the next version will include a fix for it.

@shahar1
Contributor

shahar1 commented Oct 1, 2025

@shahar1 (I hope you don’t mind me mentioning you directly here, as you were tagged on this PR.)

Hello! Could I get a review of this PR? This problem breaks our system, so I hope the next version will include a fix for it.

I don't mind being mentioned directly in general, but in this case I'd rather have someone else review it, as it is not my area of expertise (yet) :)
Please note that currently two CI checks are failing.

@jroachgolf84
Collaborator

@wjddn279, were you able to write any unit tests to validate this functionality?

@wjddn279
Contributor Author

@jroachgolf84

Of course. I’ll add unit tests to verify the following behaviors in this code:

  • GC triggered by dispose on the existing connection
  • Unintended MySQL disconnection caused by GC

I'll think about a few more cases and mention you again after writing them.

@wjddn279
Contributor Author

@jroachgolf84
I have added some tests!


connect_args = _get_connect_args("sync")
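# sqlite forbids using a connection outside its creating thread by default; relax that here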
if SQL_ALCHEMY_CONN.startswith("sqlite"):
connect_args["check_same_thread"] = False

Would be nice to keep the comment.

@potiuk
Member

potiuk commented Oct 21, 2025

I am kind of hesitant to make such a change, which goes against (or rather, does not align with) the recommendations of the sqlalchemy docs.

Looking at the docs (and following what @tirkarthi wrote in #56879 (comment)) -> maybe a better solution here would be to (also) follow the third option from https://docs.sqlalchemy.org/en/20/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork ?

  3. Call Engine.dispose() directly before the child process is created. This will also cause the child process to start with a new connection pool, while ensuring the parent connections are not transferred to the child process:
engine = create_engine("mysql://user:pass@host/dbname")

def run_in_process():
    with engine.connect() as conn:
        conn.execute(text("..."))

# before process starts, ensure engine.dispose() is called
engine.dispose()
p = Process(target=run_in_process)
p.start()

This will dispose (with the default close=True) all the pooled connections before the fork, and this will pretty much automatically make sure that there is no garbage-collection-induced event sending QUIT (because it will happen synchronously in the parent's dispose command). That means the parent process will have to re-create the connections, yes, but that's an acceptable overhead. And we could do it only for the mysql dialect and still use dispose(close=False) in the forks (no harm in closing not-opened connections).

This would keep it more aligned with the recommendations from the sqlalchemy docs, while targeting the apparent misbehaviour of the mysql driver (which, BTW, we should likely create an issue for in the mysql driver repo if not done already - because it seems that the buggy behaviour is in the driver itself). It would also avoid custom behaviours - both dispose(close=True) in the parent and dispose(close=False) in the forks are recommended, "standard" behaviours that sqlalchemy expects and recommends.
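
For illustration, a rough sketch of that suggestion; the MySQL-only gate and the bare fork flow here are assumptions for the example, not code from this PR:

import os

from sqlalchemy import create_engine

engine = create_engine("mysql://user:pass@host/dbname")  # placeholder URL

if engine.dialect.name == "mysql":
    # Parent: synchronously close all pooled connections before forking,
    # so no live MySQL socket (and no GC-driven QUIT) crosses the fork.
    engine.dispose(close=True)

pid = os.fork()
if pid == 0:
    # Child: drop the inherited pool bookkeeping without touching sockets.
    engine.dispose(close=False)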

@wjddn279
Contributor Author

wjddn279 commented Oct 21, 2025

@potiuk
Thank you for your response.

I agree with your approach that, specifically for MySQL, disconnecting all pooled connections before forking can ensure connection safety in forked processes. This is a solution that hadn't occurred to me.

However, I still have the following concerns:

Disposing of the engine means temporarily disconnecting all connections in the pool, which could potentially trigger race conditions that are difficult to predict. I believe that operations with significant side effects, such as connection closing, should be performed at precisely controlled and intentional points (such as explicit connection closure upon program termination or explicit blocking of DB connections in workers) rather than being executed non-deterministically. While this is recommended in the official documentation, I'm not sure if it was designed with highly multi-process environments like Airflow in mind.

The official documentation states the following:
[screenshot: the four guarded approaches from the SQLAlchemy pooling documentation]

These four approaches are described as guard logic that should be implemented when using an Engine object as-is in child processes. The approach I've implemented in this PR completely creates a new Engine object in the child process, and by creating a new object, I understood that the situation has changed such that we no longer need to follow the documentation's recommendations. Additionally, by creating a new Session object, we can eliminate the risk factors mentioned below.

[screenshot: SQLAlchemy documentation note on Session objects and forked processes]

As you mentioned, the approach I've adopted doesn't appear to be explicitly mentioned in the official documentation, and I believe your suggested solution would also clearly resolve the issue. It's certainly possible to re-modify the code by adopting the approach you and @tirkarthi suggested. However, I would like to hear more of your thoughts on my reasoning for the solution.

@potiuk
Member

potiuk commented Oct 21, 2025

I think we should also understand the context better. We are not talking about a "generic" utility but about the specific ways Airflow uses forking, where forking interacts with SQLAlchemy DB initialization.

There are several places where we fork in Airflow (and in Airflow 3 this is quite a bit different than in Airflow 2):

  1. Workers. Here, the thing is that the worker does not even have a connection configured. It's not supposed to access the DB at all. So we have no issue here.

  2. Local Executor -> here executors are only forked at the start of the scheduler, in a single process that forks/spawns executor processes, so closing the DB connection before that is quite OK - because it is long before anything else happens and there are (I think) no side effects.

  3. Dag Processor -> again, the DB there is only initialized in the internal API server. The Processor Manager itself and the Dag Processor parsing processes should not even attempt to initialize the database.

  4. Triggerer - same as DagProcessor. It should not access the DB in the process that forks sub-processes.

  5. API server -> there I am not 100% sure what mechanism is used by uvicorn now, but I would not see the need for the main process to use the database at all; the database should only be initialized and used in the forks (which are worker processes that handle the async loops of starlette - I guess).

Now, assumptions 3), 4) and 5) should be verified to confirm that's the case for sure (and fixed if not) - which leaves only the Scheduler case, where both the forking and the forked process need a database. I think once we verify those assumptions, your concern about

Disposing of the engine means temporarily disconnecting all connections in the pool, which could potentially trigger race conditions that are difficult to predict.

might not be valid - and then the only thing we really need to do is dispose connections before we fork Local Executor processes and re-create the pools.

@tirkarthi
Contributor

Just to add: I think this will be simplified or solved once DB access from the dag-processor is also moved to the task-sdk. I also want to add that passing echo_pool=True displays verbose logs about the lifecycle of the pool, which might be useful here.

#51552

I was also checking on other options

  1. Use NullPool when mysqlclient is used, though this has a performance impact that the sqlalchemy docs describe as modest (see the sketch below).
  2. Use pymysql, which is a pure-Python implementation and might not have this behaviour, though again it is not as performant as mysqlclient, which is implemented in C.
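
A quick sketch of option 1, combined with the echo_pool flag mentioned above (the URL is a placeholder):

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# NullPool opens and closes a real DB connection on every checkout, so no
# pooled connection can survive into (or be finalized by) a forked child.
engine = create_engine(
    "mysql://user:pass@host/dbname",  # placeholder URL
    poolclass=NullPool,
    echo_pool=True,  # verbose logging of the pool lifecycle
)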

@tirkarthi
Contributor

In Airflow 2.11 the ORM was configured per DAG file processor, which is similar to the implementation proposed here; a comment there references issues with multiple processes.

settings.configure_orm()

@wjddn279
Contributor Author

@potiuk
Thank you for your response. I should have based my thinking on actual use cases rather than abstract concerns—I was lacking in that regard. Thank you for the detailed explanation.

To summarize what you've mentioned (for the benefit of others who may review this later), we can classify the cases as follows:

  1. Parent process does not use DB → No connections are created in the first place, so no issue arises
  2. Parent uses DB but child process does not use DB → Since the child doesn't use it, there's no need to dispose() on fork
  3. Both parent and child processes use DB → Race conditions 'could potentially occur' (not if query execution is synchronous)

For workers, dag-processor, and triggerer, cases 1 or 2 clearly apply, and for the api-server, case 1 also clearly applies since it spawns workers.

Therefore, only the scheduler when using Local Executor, which falls under case 3, requires verification. However, since all queries executed in the main process (scheduler) are performed synchronously, it's evident that no race condition exists. Ultimately, I understand you're saying that concerns about race conditions in this case are not warranted within the Airflow system.

The solution will be to not dispose on fork for MySQL, but instead apply engine.dispose(close=True) for Local Executor. If you had a different intention in mind, please let me know. Thank you.

@potiuk
Member

potiuk commented Oct 21, 2025

@potiuk Thank you for your response. I should have based my thinking on actual use cases rather than abstract concerns—I was lacking in that regard. Thank you for the detailed explanation.

To summarize what you've mentioned (for the benefit of others who may review this later), we can classify the cases as follows:

  1. Parent process does not use DB → No connections are created in the first place, so no issue arises
  2. Parent uses DB but child process does not use DB → Since the child doesn't use it, there's no need to dispose() on fork
  3. Both parent and child processes use DB → Race conditions 'could potentially occur' (not if query execution is synchronous)

For workers, dag-processor, and triggerer, cases 1 or 2 clearly apply, and for the api-server, case 1 also clearly applies since it spawns workers.

Therefore, only the scheduler when using Local Executor, which falls under case 3, requires verification. However, since all queries executed in the main process (scheduler) are performed synchronously, it's evident that no race condition exists. Ultimately, I understand you're saying that concerns about race conditions in this case are not warranted within the Airflow system.

The solution will be to not dispose on fork for MySQL, but instead apply engine.dispose(close=True) for Local Executor. If you had a different intention in mind, please let me know. Thank you.

Question (just curious) - are you using some AI to generate those responses? They seem very repetitive and seem to echo back what has been written (which is very much what AI/LLMs do). It is quite good when a human tries to paraphrase things in their own words - to make sure they understand things - but those kinds of repetitions don't seem to add much value - especially if they are automatically generated.

Or am I wrong?

@wjddn279
Contributor Author

@potiuk
No, I only use AI for translating my responses into English, nothing else. I understand the response may seem repetitive, but given the complexity of the content, I simply summarized it to confirm at once whether there's any confusion in what I've understood.

@potiuk
Member

potiuk commented Oct 21, 2025

@potiuk No, I only use AI for translating my responses into English, nothing else. I understand the response may seem repetitive, but given the complexity of the content, I simply summarized it to confirm at once whether there's any confusion in what I've understood.

Then cool! Yeah. I agree paraphrasing things by humans is a good idea !

@potiuk
Member

potiuk commented Oct 21, 2025

Therefore, only the scheduler when using Local Executor, which falls under case 3, requires verification. However, since all queries executed in the main process (scheduler) are performed synchronously, it's evident that no race condition exists. Ultimately, I understand you're saying that concerns about race conditions in this case are not warranted within the Airflow system.

Yep. Exactly. Also, I think we should verify all the above assumptions - I am not 100% sure everything I wrote is correct. There are some things added recently - the internal API server in the Dag processor and triggerer - and Airflow ORM initialization might happen in the Dag Processor and triggerer by importing stuff, somewhat accidentally. So I think some more checking should be done to verify that the scenarios I described are 100% correct.

@wjddn279
Contributor Author

wjddn279 commented Oct 21, 2025

@potiuk
Sounds good. I'll think about how to actually verify this. In addition to ORM initialization, I'll also check for any unintended query operations in child processes.

Before that, I have a question. Does the DB usage pattern we've classified for each component (whether parent/child processes use DB) reflect a design philosophy? For example, in the dag-processor, child processes don't connect to the DB but instead parse files and pass objects to the parent process. I'd like to know if this is based on a design principle of "dag processor child processes should never directly connect to the DB," or if it's just how it's currently implemented and could change in the future.

The reason I'm asking is that our current verification might become invalid later and could potentially lead to new bugs. If this DB usage pattern is not absolute, then the uniform but absolute prevention approaches mentioned by @tirkarthi (NullPool, change driver, recreate engine) could be better long-term solutions.

@potiuk
Member

potiuk commented Oct 21, 2025

The reason I'm asking is that our current verification might become invalid later and could potentially lead to new bugs. If this DB usage pattern is not absolute, then the uniform but absolute prevention approaches mentioned by @tirkarthi (NullPool, change driver, recreate engine) could be better long-term solutions.

Yes - that's the current design philosophy. If it changes, it will go further into isolation - i.e. both the Triggerer and the Dag processor eventually should not access the DB at all; all communication should go through the api-server. Currently it's a bit of a balance between performance and security, but in the future complete isolation will become more important.

@EugeneChung

We have been trying out Airflow 3.1.0 recently and are facing the same issue with MySQL 8 (AWS Aurora 3). How is the progress here?

@wjddn279
Contributor Author

wjddn279 commented Nov 5, 2025

@EugeneChung

I'm working on addressing the issue with a focus on the long-term direction, so it may take some time before the solution is fully reflected. However, if you know how to run Airflow with some code changes, I can suggest a simple modification to address the issue.

@EugeneChung

@wjddn279 Sure. I can apply a diff file. Thanks in advance!

@wjddn279
Contributor Author

wjddn279 commented Nov 5, 2025

@EugeneChung

The root cause is that a connection object that has lost its references in a forked process gets garbage collected, which explicitly quits the connection (only in the case of MySQL). A hacky but simple fix is to keep the object as a key in a dictionary to explicitly maintain a reference to it.

wjddn279@c522ce1
The object is added to the dictionary when it connects and removed when it closes.

In my case, it works well. Please let me know if it doesn't resolve the issue for you.
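
For anyone who can't apply the commit directly, the core of the workaround looks roughly like this (the URL and names are illustrative, not the exact code from the commit):

from sqlalchemy import create_engine, event

engine = create_engine("mysql://user:pass@host/dbname")  # placeholder URL

# Pin every DBAPI connection in a module-level dict so a forked child's
# garbage collector cannot finalize it and send COM_QUIT over the socket
# it shares with the parent.
_held_connections = {}

@event.listens_for(engine, "connect")
def _hold(dbapi_connection, connection_record):
    _held_connections[id(dbapi_connection)] = dbapi_connection

@event.listens_for(engine, "close")
def _release(dbapi_connection, connection_record):
    # Drop the reference once the connection is closed deliberately.
    _held_connections.pop(id(dbapi_connection), None)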

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 21, 2025
@AmosG

AmosG commented Dec 21, 2025

@wjddn279 @potiuk

This PR seems critical for anyone running Airflow 3 on MySQL. We have also been tracking systemic OperationalError (Lost connection) and PendingRollbackError (500) issues in production on 3.1.5, and this fix addresses the primary 'trigger' for those failures.

The idea, based on #57815, #57065, and #57981:

In Airflow 3, it seems to me that the FastAPI server and the legacy FAB UI share the same process and thread-local scoped_session, so the 'COM_QUIT' sent by child processes during fork/GC (which this PR fixes) doesn't just cause a one-off error. Instead, it poisons the entire worker thread.

And since FastAPI currently lacks a global session teardown (like Flask's @app.teardown_appcontext), once a connection is lost via the mechanism you've identified here, that thread's settings.Session remains in an invalid state. This leads to persistent 500 errors on completely unrelated requests (like /login/ or /api/v2/version) until the process is restarted.

Impact of this PR:
By preventing the premature COM_QUIT on child process disposal, this PR removes the most frequent source of session poisoning for MySQL users.

Recommendation:
We should definitely merge this to stop the 'bleeding' for MySQL deployments.

BUT
Additionally, we recommend a systemic fix: add a global SessionMiddleware to FastAPI that calls settings.Session.remove() after every request. This would provide a second layer of defense, ensuring that even if a connection is lost for other reasons (network hiccup, timeout), the worker thread can 'self-heal' instead of entering a 500 loop.
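
A rough sketch of what that middleware could look like (the class name is hypothetical, and it assumes settings.Session is Airflow's scoped_session):

from starlette.middleware.base import BaseHTTPMiddleware

from airflow import settings  # assumes settings.Session is a scoped_session

class SessionCleanupMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        try:
            return await call_next(request)
        finally:
            # Discard the thread-local session so a poisoned session
            # cannot leak into subsequent requests on this thread.
            settings.Session.remove()

# registration would be something like: app.add_middleware(SessionCleanupMiddleware)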

I can provide an initial PR to help explain. Please ACK.

@wjddn279
Contributor Author

wjddn279 commented Dec 21, 2025

@AmosG

Yes, regarding this issue, we are currently discussing an approach where, in the DAG processor, garbage collection is disabled for objects created before spawning subprocesses, in order to prevent GC from collecting the database connection objects.

However, the API server is unlikely to be affected by this issue. As summarized in #56044 (comment), although the API server does spawn subprocesses, the parent process does not use the corresponding DB connections, so it is not impacted by this behavior. The examples you mentioned also all occur on PostgreSQL, so they appear to be a separate issue.

We will try to move forward with a patch in this direction as quickly as possible. In the meantime, you may refer to and apply the changes I previously suggested as a workaround: #56044 (comment)

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 22, 2025
@msumit
Contributor

msumit commented Jan 6, 2026

@wjddn279, do you know if there is any final decision on the approach here? We are also facing the same issue, and by applying your PR changes, it gets fixed. The workaround doesn't seem that intuitive to me, so it's better to have a proper fix.

@potiuk
Member

potiuk commented Jan 6, 2026

@wjddn279, do you know if there is any final decision on the approach here? We are also facing the same issue, and by applying your PR changes, it gets fixed. The workaround doesn't seem that intuitive to me, so it's better to have a proper fix.

I am all for merging it once it is rebased and green - we know the root cause, and we know that gc and forks do not work well together because of a) COW and b) this race condition. A similar approach was used in the local executor, and benchmarks showed that it's a good idea to deliberately handle gc on forking (and it follows the recommendations that were posted when gc.freeze() was implemented in Python 3.7).

So I see no issue in following this one up - as long as it's green, rebased and tested.
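
For context, the gc.freeze() pattern referred to above looks roughly like this (the bare fork is a simplification of what Airflow actually does):

import gc
import os

# Move every currently tracked object into the permanent generation so the
# child's collector never visits (or finalizes) the parent's pooled
# MySQL connections after the fork.
gc.freeze()
pid = os.fork()
if pid == 0:
    # Child: inherited connection objects are frozen, so no GC-driven
    # COM_QUIT is sent over the shared socket.
    ...
    os._exit(0)
gc.unfreeze()  # parent resumes normal collection of those objects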

@potiuk
Member

potiuk commented Jan 6, 2026

We are also facing the same issue, and by applying your PR changes, it gets fixed.

And hearing that from someone who tested it in their own installation makes it even stronger.

@wjddn279
Contributor Author

wjddn279 commented Jan 7, 2026

Yep - since this issue is expected to be resolved in a future PR (applying gc.freeze), there won't be any additional work on this PR.

@potiuk
Member

potiuk commented Jan 7, 2026

So... let's close it :). And work on the "complete" fix.

@potiuk potiuk closed this Jan 7, 2026
@john-rodriguez-mgni

@potiuk is the plan to include the session.rollback() in the upcoming 3.1.6 release?

@potiuk
Member

potiuk commented Jan 8, 2026

@potiuk is the plan to include the session.rollback() in the upcoming 3.1.6 release?

3.1.6rc1 is out for testing, but if someone provides a fix for it, it might be included in 3.1.7, for example.

@potiuk
Member

potiuk commented Jan 8, 2026

But you can check whether 3.1.6rc1 fixes it - there were a number of fixes that could be related. If it still happens there, reporting it here would be helpful.

@john-rodriguez-mgni

@potiuk is the plan to include the session.rollback() in the upcoming 3.1.6 release?

3.1.6rc1 is out for testing, but if someone provides a fix for it, then it might be included in 3.1.7 for example.

I guess that's why I am confused: we had this PR (#56044), and from this comment (#56044 (comment)) it looks like someone applied that PR to their installation and it fixed the problem.
