[Serve] Resolve deadlock when replica result is accessed from two event loops #59385
Conversation
Code Review
This pull request effectively resolves a critical deadlock issue that occurred when awaiting chained DeploymentResponse objects from different event loops. The root cause was correctly identified as the non-thread-safe nature of asyncio.Lock across event loops. The fix, which replaces it with a non-blocking acquire pattern on a threading.Lock, is a robust and well-implemented solution for cross-loop synchronization. The new logic in to_object_ref_async is sound, and the comprehensive new test test_chained_deployment_response_await_order thoroughly validates the fix by checking various await orders. Overall, this is an excellent change that significantly improves the stability of deployment composition in Ray Serve. I have one suggestion for a minor code cleanup.
In the repro, why does blocking the user event loop prevent the router loop from making progress to resolve b?
Discussed offline; here is a small toy example to demo the behavior:

```python
import asyncio
import threading
import time


def demo_the_actual_error():
    """
    Show the actual error that occurs.
    """
    print()
    print("=" * 70)
    print("DEMO 3: The actual error when using asyncio.Lock across loops")
    print("=" * 70)
    print()

    lock = asyncio.Lock()
    lock_acquired = threading.Event()

    def thread_a():
        async def hold_lock():
            async with lock:
                print("[Thread A] Acquired lock, holding for 1 second...")
                lock_acquired.set()
                await asyncio.sleep(1)
                print("[Thread A] Releasing lock...")

        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(hold_lock())
        loop.close()

    def thread_b():
        lock_acquired.wait()  # Wait for Thread A to acquire the lock
        time.sleep(0.1)  # Small delay to ensure we try to acquire after

        async def try_acquire():
            print("[Thread B] Trying to acquire lock...")
            async with lock:  # This will fail!
                print("[Thread B] Got the lock!")

        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        try:
            loop.run_until_complete(try_acquire())
        except RuntimeError as e:
            print(f"[Thread B] ERROR: {e}")
        finally:
            loop.close()

    t1 = threading.Thread(target=thread_a)
    t2 = threading.Thread(target=thread_b)
    t1.start()
    t2.start()
    t1.join()
    t2.join()


if __name__ == "__main__":
    demo_the_actual_error()
```
### Summary

Fixes a deadlock that occurs when awaiting intermediate `DeploymentResponse` objects in a chain of deployment calls.

### Problem

The following code would hang indefinitely:

```python
a = self.deployment_a.remote(number)
b = self.deployment_b.remote(a)
c = self.deployment_c.remote(b)
true_b = await b  # HANGS
true_c = await c
```

Interestingly, reversing the await order (`await c` then `await b`) worked fine.

### Root Cause

When chained `DeploymentResponse` objects are used, the same `ReplicaResult` can be accessed from two different event loops concurrently:

1. **User's event loop (replica)**: when user code calls `await b`
2. **Router's event loop (singleton thread)**: when resolving `b` as an argument to `c`

Both code paths call `ReplicaResult.to_object_ref_async()`, which used an `asyncio.Lock` for synchronization. However, `asyncio.Lock` is **not thread-safe** and **not designed for cross-event-loop usage**, causing a deadlock.

### Fix

Replace the `asyncio.Lock` with a non-blocking acquire pattern using the existing thread-safe `threading.Lock`.
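The sketch below illustrates the general shape of this non-blocking acquire pattern. It is a minimal, hypothetical example: the class name `ReplicaResultSketch`, the helper `_resolve_to_object_ref`, and the 10 ms poll interval are illustrative assumptions, not the actual Ray Serve implementation.

```python
import asyncio
import threading


class ReplicaResultSketch:
    """Simplified stand-in for a result that may be resolved from several event loops."""

    def __init__(self):
        # threading.Lock can be safely shared across threads and event loops.
        self._lock = threading.Lock()
        self._object_ref = None

    async def to_object_ref_async(self):
        # Instead of `async with asyncio.Lock()`, poll a thread-safe lock without
        # blocking, yielding to the current event loop between attempts so that
        # neither loop can block the other's progress.
        while not self._lock.acquire(blocking=False):
            await asyncio.sleep(0.01)
        try:
            if self._object_ref is None:
                self._object_ref = await self._resolve_to_object_ref()
            return self._object_ref
        finally:
            self._lock.release()

    async def _resolve_to_object_ref(self):
        # Placeholder for the real resolution logic (e.g. awaiting the replica's reply).
        await asyncio.sleep(0)
        return object()
```

Because the lock is only held for a single resolution attempt and waiters yield via `asyncio.sleep`, a caller on the user's loop and a caller on the router's loop can both make progress regardless of which one acquires the lock first.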
### Testing

Added `test_chained_deployment_response_await_order`, which tests both await orders (`b_first` and `c_first`) to ensure they complete without hanging. Ran this test with all combinations of the `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP` env vars.

Tested against the repro in #54201.
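For illustration, here is a hypothetical sketch of what an await-order test of this kind could look like. The deployment classes, the single shared downstream deployment, and the timeout value are assumptions for the example; this is not the actual test added in this PR.

```python
import pytest
from ray import serve


@serve.deployment
class AddOne:
    async def __call__(self, x: int) -> int:
        return x + 1


@serve.deployment
class Chainer:
    def __init__(self, downstream):
        self._downstream = downstream

    async def run(self, number: int, order: str):
        # Chain three calls; each intermediate DeploymentResponse is passed
        # directly as an argument to the next call.
        a = self._downstream.remote(number)
        b = self._downstream.remote(a)
        c = self._downstream.remote(b)
        if order == "b_first":
            # This await order previously deadlocked.
            return (await b, await c)
        c_val = await c
        b_val = await b
        return (b_val, c_val)


@pytest.mark.parametrize("order", ["b_first", "c_first"])
def test_chained_response_await_order(order):
    handle = serve.run(Chainer.bind(AddOne.bind()))
    # A timeout turns a regression back into a test failure instead of a hang.
    assert handle.run.remote(1, order).result(timeout_s=60) == (3, 4)
    serve.shutdown()
```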
### Performance

Used the same repro script as given in the GH issue, except with the await order switched (`await c` then `await b`) so that it can also run on master.

`locust --headless -u 100 -r 10 --host http://localhost:8000`

| Metric | Branch | Master | Notes |
| -- | -- | -- | -- |
| Requests (#reqs) | 4030 | 4465 | Master handled 10.8% more requests in the same test window |
| Requests/sec | 80.39 | 80.83 | Nearly identical throughput |
| Avg latency (ms) | 1109 | 1118 | Essentially the same |
| Median (p50) latency | 650 ms | 700 ms | Branch slightly faster at p50 |
| p90 latency | 2200 ms | 2200 ms | Identical |
| p95 latency | 2200 ms | 2300 ms | Master slightly slower |
| p99 latency | 2400 ms | 2500 ms | Master slightly slower |
| Max latency | 2519 ms | 2760 ms | Master has higher tail |
| Failures | 0 | 0 | Both perfect |