Conversation

Contributor

@abrarsheikh abrarsheikh commented Dec 11, 2025

Summary

Fixes a deadlock that occurs when awaiting intermediate DeploymentResponse objects in a chain of deployment calls.

Problem

The following code would hang indefinitely:

```python
a = self.deployment_a.remote(number)
b = self.deployment_b.remote(a)
c = self.deployment_c.remote(b)

true_b = await b  # HANGS
true_c = await c
```

Interestingly, reversing the await order (await c then await b) worked fine.

Root Cause

When chained DeploymentResponse objects are used, the same ReplicaResult can be accessed from two different event loops concurrently:

  1. User's event loop (replica): When user code calls await b
  2. Router's event loop (singleton thread): When resolving b as an argument to c

Both code paths call ReplicaResult.to_object_ref_async(), which used an asyncio.Lock for synchronization. However, asyncio.Lock is not thread-safe and not designed for cross-event-loop usage, causing a deadlock.

Fix

Replace the asyncio.Lock with a non-blocking acquire pattern using the existing thread-safe threading.Lock:
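
A minimal sketch of the pattern (illustrative only; the attribute names and the `_resolve()` helper below are placeholders, not the actual `ReplicaResult` internals):

```python
# Sketch of the non-blocking acquire pattern described above. This is NOT the
# real ReplicaResult implementation; _resolve() stands in for the actual
# "convert to ObjectRef" work.
import asyncio
import threading


class ReplicaResultSketch:
    def __init__(self):
        self._lock = threading.Lock()  # thread-safe, shared across event loops
        self._object_ref = None

    async def _resolve(self):
        await asyncio.sleep(0.1)  # placeholder for the real resolution work
        return object()

    async def to_object_ref_async(self):
        # acquire(blocking=False) returns immediately, so the calling event
        # loop is never blocked. If the other loop holds the lock, yield with
        # a short async sleep and retry.
        while not self._lock.acquire(blocking=False):
            await asyncio.sleep(0.01)
        try:
            if self._object_ref is None:
                self._object_ref = await self._resolve()
            return self._object_ref
        finally:
            self._lock.release()
```

Because neither event loop ever blocks its thread while waiting on the lock, the user loop and the router loop can both poll for the result without deadlocking each other.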

Testing

Added test_chained_deployment_response_await_order, which tests both await orders (b_first and c_first) to ensure they complete without hanging. Ran this test with all combinations of the RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD and RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP env vars.

Tested against the repro in #54201
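
For reference, a rough sketch of the kind of composition the test exercises (this is not the actual test code; the deployment names, the `order` parameter, and the arithmetic are illustrative, and the env var parametrization is omitted):

```python
# Illustrative sketch of the chained-deployment scenario covered by the test,
# not the real test_chained_deployment_response_await_order.
from ray import serve


@serve.deployment
class Adder:
    def __call__(self, x: int) -> int:
        return x + 1


@serve.deployment
class Chainer:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    async def __call__(self, number: int, order: str):
        a = self.a.remote(number)
        b = self.b.remote(a)
        c = self.c.remote(b)
        if order == "b_first":
            true_b = await b  # previously hung here
            true_c = await c
        else:
            true_c = await c
            true_b = await b
        return true_b, true_c


app = Chainer.bind(Adder.bind(), Adder.bind(), Adder.bind())
handle = serve.run(app)
assert handle.remote(1, "b_first").result() == (3, 4)
assert handle.remote(1, "c_first").result() == (3, 4)
```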

Performance

Used the same repro script as given in the GH issue, except with the await order switched (i.e., await c then await b) so that it can run on master.

`locust --headless -u 100 -r 10 --host http://localhost:8000`

| Metric | Branch | Master | Notes |
| --- | --- | --- | --- |
| Requests (#reqs) | 4030 | 4465 | Master handled 10.8% more requests in same test window |
| Requests/sec | 80.39 | 80.83 | Nearly identical throughput |
| Avg latency (ms) | 1109 | 1118 | Essentially the same |
| Median (p50) latency | 650 ms | 700 ms | Branch slightly faster at p50 |
| p90 latency | 2200 ms | 2200 ms | Identical |
| p95 latency | 2200 ms | 2300 ms | Master slightly slower |
| p99 latency | 2400 ms | 2500 ms | Master slightly slower |
| Max latency | 2519 ms | 2760 ms | Master has higher tail |
| Failures | 0 | 0 | Both perfect |

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves a critical deadlock issue that occurred when awaiting chained DeploymentResponse objects from different event loops. The root cause was correctly identified as the non-thread-safe nature of asyncio.Lock across event loops. The fix, which replaces it with a non-blocking acquire pattern on a threading.Lock, is a robust and well-implemented solution for cross-loop synchronization. The new logic in to_object_ref_async is sound, and the comprehensive new test test_chained_deployment_response_await_order thoroughly validates the fix by checking various await orders. Overall, this is an excellent change that significantly improves the stability of deployment composition in Ray Serve. I have one suggestion for a minor code cleanup.

@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Dec 11, 2025
@abrarsheikh abrarsheikh marked this pull request as ready for review December 11, 2025 13:49
@abrarsheikh abrarsheikh requested a review from a team as a code owner December 11, 2025 13:49
@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Dec 11, 2025
@akyang-anyscale
Contributor

in the repro, why does the user event loop being blocked prevent the router loop from making progress to resolve b?

@abrarsheikh
Contributor Author

> in the repro, why does the user event loop being blocked prevent the router loop from making progress to resolve b?

Discussed offline; here is a small toy example to demonstrate the behavior:

```python
import asyncio
import threading
import time


def demo_the_actual_error():
    """
    Show the actual error that occurs.
    """
    print()
    print("=" * 70)
    print("DEMO 3: The actual error when using asyncio.Lock across loops")
    print("=" * 70)
    print()
    
    lock = asyncio.Lock()
    lock_acquired = threading.Event()
    
    def thread_a():
        async def hold_lock():
            async with lock:
                print("[Thread A] Acquired lock, holding for 1 second...")
                lock_acquired.set()
                await asyncio.sleep(1)
                print("[Thread A] Releasing lock...")
        
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(hold_lock())
        loop.close()
    
    def thread_b():
        lock_acquired.wait()  # Wait for Thread A to acquire the lock
        time.sleep(0.1)       # Small delay to ensure we try to acquire after
        
        async def try_acquire():
            print("[Thread B] Trying to acquire lock...")
            async with lock:  # This will fail!
                print("[Thread B] Got the lock!")
        
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        try:
            loop.run_until_complete(try_acquire())
        except RuntimeError as e:
            print(f"[Thread B] ERROR: {e}")
        finally:
            loop.close()
    
    t1 = threading.Thread(target=thread_a)
    t2 = threading.Thread(target=thread_b)

    t1.start()
    t2.start()
    t1.join()
    t2.join()


if __name__ == "__main__":
    demo_the_actual_error()
```

@abrarsheikh abrarsheikh merged commit 28f2cf8 into master Dec 13, 2025
7 checks passed
@abrarsheikh abrarsheikh deleted the 54201-abrar-dead branch December 13, 2025 01:08
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025