-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Description
Expected behavior
The PopenPoolExecutor should shut down gracefully without causing deadlocks, especially when __del__ is invoked during garbage collection.
Actual behavior
Calling _threadpool.shutdown() within PopenPoolExecutor.__del__() can lead to a deadlock. The shutdown() function attemps calls join() on all threads, which in turn attempts to acquire _shutdown_locks_lock. Due to the unpredictable timing of garbage collection, the __del__() function might try to acquire _shutdown_locks_lock while the current thread already holds it (e.g., if that thread is creating new threads), leading to a deadlock.
This issue arises because the LocalBuilder relies on Python's del statement for pool cleanup, rather than explicitly calling a shutdown() method.
According to Python's documentation:
Note
del xdoesn’t directly callx.__del__()— the former decrements the reference count forxby one, and the latter is only called whenx’s reference count reaches zero.
__del__()can be invoked when arbitrary code is being executed, including from any arbitrary thread. If__del__()needs to take a lock or invoke any other blocking resource, it may deadlock as the resource may already be taken by the code that gets interrupted to execute__del__().
Environment
- Operating System: 22.04.1-Ubuntu
- Python v3.12.9
- TVM version: commit 5e12a5c
Steps to reproduce
Run a builder in loop:
import tvm
from tvm.meta_schedule.builder import LocalBuilder, BuilderInput
from tvm.script import relax as R
import datetime
# A simple Relax function
@R.function
def relax_matmul(
data: R.Tensor(("n", 784), dtype="float32"),
w0: R.Tensor((784, 128), dtype="float32"),
) -> R.Tensor(("n", 128), dtype="float32"):
with R.dataflow():
lv0 = R.matmul(data, w0)
R.output(lv0)
return lv0
mod = tvm.IRModule.from_expr(relax_matmul)
target = tvm.target.Target("nvidia/geforce-rtx-4090")
builder = LocalBuilder()
while True:
builder_inputs = [BuilderInput(mod, target) for _ in range(builder.max_workers)]
builder_results = builder.build(builder_inputs)
timestamp = datetime.datetime.now().strftime("%H:%M:%S")
print(f"[{timestamp}] Build Successful")Adding logging statements before and after del in local_builder.py, and within __del__, demonstrates the behavior: (the first number on each line is threading.get_ident())
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>...
139623815455936: After del.
[12:36:51] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>...
139623815455936: After del.
[12:37:06] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b99c170>...
139613146445568: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c535130>'s thread pool...
139613146445568: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>'s thread pool...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b99c170>...
139623815455936: After del.
[12:37:48] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>...
139623815455936: After del.
[12:38:03] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>...
139623815455936: After del.
[12:38:18] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9821b0>...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9821b0>...
139623815455936: After del.
[12:38:33] Build Successful
139623815455936: Creating pool <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9762a0>...
139611577763584: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1c037ce0>'s thread pool...
139611577763584: Shutting down <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1bb82150>'s thread pool...
139623815455936: Try to del <tvm.contrib.popen_pool.PopenPoolExecutor object at 0x7efb1b9762a0>...
139623815455936: After del.
The deadlock may occur randomly after a long time (it took me several hours to identify the cause).
To immediately trigger the deadlock, modify threading.py by adding gc.collect() within the _maintain_shutdown_locks() function:
def _maintain_shutdown_locks():
"""
Drop any shutdown locks that don't correspond to running threads anymore.
Calling this from time to time avoids an ever-growing _shutdown_locks
set when Thread objects are not joined explicitly. See bpo-37788.
This must be called with _shutdown_locks_lock acquired.
"""
# If a lock was released, the corresponding thread has exited
import gc
gc.collect()
to_remove = [lock for lock in _shutdown_locks if not lock.locked()]
_shutdown_locks.difference_update(to_remove)The call stack at the time of the deadlock (captured via py-spy dump) is as follows:
Thread 744194 (idle): "MainThread"
wait (threading.py:355)
wait (threading.py:655)
start (threading.py:1001) <- Start a new thread (Dummy-1)
_adjust_thread_count (concurrent/futures/thread.py:203)
submit (concurrent/futures/thread.py:180)
map (concurrent/futures/_base.py:608)
map_with_error_catching (tvm/contrib/popen_pool.py:452)
build (tvm/meta_schedule/builder/local_builder.py:164)
method (tvm/meta_schedule/utils.py:76)
build (tvm/meta_schedule/builder/builder.py:129)
<module> (pool_bug.py:25)
Thread 746704 (idle): "Dummy-1"
_stop (threading.py:1106) <- try to acquire the lock again
_wait_for_tstate_lock (threading.py:1173)
join (threading.py:1151)
shutdown (concurrent/futures/thread.py:239)
__del__ (tvm/contrib/popen_pool.py:381) <- GC interrupts
_maintain_shutdown_locks (threading.py:868)
_set_tstate_lock (threading.py:1057) <- acquire the lock
_bootstrap_inner (threading.py:1063)
_bootstrap (threading.py:1034) <- Starting...
Temporary Solution
Add a shutdown() function to PopenPoolExecutor, and explicitly call it in LocalBuilder.
class PopenPoolExecutor:
def __init__(
self,
max_workers=None,
timeout=None,
initializer=None,
initargs=(),
maximum_process_uses=None,
stdout=None,
stderr=None,
):
# ...
self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
# ...
self._shutdown = False
def __del__(self):
if not self._shutdown:
self.shutdown()
def shutdown(self):
# original __del__
self._shutdown = TrueTriage
- needs-triage