Skip to content

[Bug] Deadlock in PopenPoolExecutor During Garbage Collection #18157

@w1049

Description

@w1049

Expected behavior

The PopenPoolExecutor should shut down gracefully without causing deadlocks, especially when __del__ is invoked during garbage collection.

Actual behavior

Calling _threadpool.shutdown() within PopenPoolExecutor.__del__() can lead to a deadlock. The shutdown() function attemps calls join() on all threads, which in turn attempts to acquire _shutdown_locks_lock. Due to the unpredictable timing of garbage collection, the __del__() function might try to acquire _shutdown_locks_lock while the current thread already holds it (e.g., if that thread is creating new threads), leading to a deadlock.

This issue arises because the LocalBuilder relies on Python's del statement for pool cleanup, rather than explicitly calling a shutdown() method.
According to Python's documentation:

Note del x doesn’t directly call x.__del__() — the former decrements the reference count for x by one, and the latter is only called when x’s reference count reaches zero.

__del__() can be invoked when arbitrary code is being executed, including from any arbitrary thread. If __del__() needs to take a lock or invoke any other blocking resource, it may deadlock as the resource may already be taken by the code that gets interrupted to execute __del__().

Environment

  • Operating System: 22.04.1-Ubuntu
  • Python v3.12.9
  • TVM version: commit 5e12a5c

Steps to reproduce

Run a builder in loop:

import tvm
from tvm.meta_schedule.builder import LocalBuilder, BuilderInput
from tvm.script import relax as R
import datetime


# A simple Relax function
@R.function
def relax_matmul(
    data: R.Tensor(("n", 784), dtype="float32"),
    w0: R.Tensor((784, 128), dtype="float32"),
) -> R.Tensor(("n", 128), dtype="float32"):
    with R.dataflow():
        lv0 = R.matmul(data, w0)
        R.output(lv0)
    return lv0


mod = tvm.IRModule.from_expr(relax_matmul)
target = tvm.target.Target("nvidia/geforce-rtx-4090")
builder = LocalBuilder()

while True:
    builder_inputs = [BuilderInput(mod, target) for _ in range(builder.max_workers)]
    builder_results = builder.build(builder_inputs)

    timestamp = datetime.datetime.now().strftime("%H:%M:%S")
    print(f"[{timestamp}] Build Successful")

Adding logging statements before and after del in local_builder.py, and within __del__, demonstrates the behavior: (the first number on each line is threading.get_ident())

139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>...
139623815455936: After del.
[12:36:51] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>...
139623815455936: After del.
[12:37:06] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b99c170>...
139613146445568: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>'s thread pool...
139613146445568: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>'s thread pool...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b99c170>...
139623815455936: After del.
[12:37:48] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>...
139623815455936: After del.
[12:38:03] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>...
139623815455936: After del.
[12:38:18] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9821b0>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9821b0>...
139623815455936: After del.
[12:38:33] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9762a0>...
139611577763584: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>'s thread pool...
139611577763584: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>'s thread pool...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9762a0>...
139623815455936: After del.

The deadlock may occur randomly after a long time (it took me several hours to identify the cause).
To immediately trigger the deadlock, modify threading.py by adding gc.collect() within the _maintain_shutdown_locks() function:

def _maintain_shutdown_locks():
    """
    Drop any shutdown locks that don't correspond to running threads anymore.

    Calling this from time to time avoids an ever-growing _shutdown_locks
    set when Thread objects are not joined explicitly. See bpo-37788.

    This must be called with _shutdown_locks_lock acquired.
    """
    # If a lock was released, the corresponding thread has exited
    import gc
    gc.collect()
    to_remove = [lock for lock in _shutdown_locks if not lock.locked()]
    _shutdown_locks.difference_update(to_remove)

The call stack at the time of the deadlock (captured via py-spy dump) is as follows:

Thread 744194 (idle): "MainThread"
    wait (threading.py:355)
    wait (threading.py:655)
    start (threading.py:1001) <- Start a new thread (Dummy-1)
    _adjust_thread_count (concurrent/futures/thread.py:203)
    submit (concurrent/futures/thread.py:180)
    map (concurrent/futures/_base.py:608)
    map_with_error_catching (tvm/contrib/popen_pool.py:452)
    build (tvm/meta_schedule/builder/local_builder.py:164)
    method (tvm/meta_schedule/utils.py:76)
    build (tvm/meta_schedule/builder/builder.py:129)
    <module> (pool_bug.py:25)
Thread 746704 (idle): "Dummy-1"
    _stop (threading.py:1106) <- try to acquire the lock again
    _wait_for_tstate_lock (threading.py:1173)
    join (threading.py:1151)
    shutdown (concurrent/futures/thread.py:239)
    __del__ (tvm/contrib/popen_pool.py:381) <- GC interrupts
    _maintain_shutdown_locks (threading.py:868)
    _set_tstate_lock (threading.py:1057) <- acquire the lock
    _bootstrap_inner (threading.py:1063)
    _bootstrap (threading.py:1034) <- Starting...

Temporary Solution

Add a shutdown() function to PopenPoolExecutor, and explicitly call it in LocalBuilder.

class PopenPoolExecutor:
    def __init__(
        self,
        max_workers=None,
        timeout=None,
        initializer=None,
        initargs=(),
        maximum_process_uses=None,
        stdout=None,
        stderr=None,
    ):
        # ...
        self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        # ...
        self._shutdown = False

    def __del__(self):
        if not self._shutdown:
            self.shutdown()
    
    def shutdown(self):
        # original __del__ 
        self._shutdown = True

Triage

  • needs-triage

Metadata

Metadata

Assignees

No one assigned

    Labels

    needs-triagePRs or issues that need to be investigated by maintainers to find the right assignees to address ittype: bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions