[Serve.llm][P/D] Fix health check in prefill disagg #53937

kouroshHakha · 2025-06-18T22:49:23Z

Also needs vllm-project/vllm#19821 to be merged.

Needs testing: https://buildkite.com/ray-project/release/builds/46098

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha · 2025-06-18T22:50:34Z

python/ray/llm/_internal/serve/deployments/llm/llm_server.py


        self.response_postprocessor = ResponsePostprocessor()

-    @property


This property is unused. I don't know why we had it, so I'm removing it.

kouroshHakha · 2025-06-18T22:53:19Z

python/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py


        try:
-            return await asyncio.wait_for(self.engine.check_health(), timeout=15)
+            await asyncio.wait_for(self.engine.check_health(), timeout=15)


Removed the timeout since Ray Serve already has an adjustable health check timeout per deployment anyway.
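For context, a minimal sketch of the contract the dropped `return` relies on, assuming (as Ray Serve documents for user-defined health checks) that a health check reports failure by raising and that its return value is ignored. `FakeEngine`, `EngineWrapper`, `probe`, and `run_probe` are hypothetical stand-ins for illustration, not the real vLLM or Ray classes.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine exposing an async check_health()."""

    def __init__(self, healthy: bool = True, delay_s: float = 0.0):
        self.healthy = healthy
        self.delay_s = delay_s

    async def check_health(self) -> None:
        await asyncio.sleep(self.delay_s)
        if not self.healthy:
            raise RuntimeError("engine is dead")


class EngineWrapper:
    """Hypothetical wrapper: failure is signalled by raising, not by a return value."""

    def __init__(self, engine: FakeEngine, timeout_s: float = 15.0):
        self.engine = engine
        self.timeout_s = timeout_s

    async def check_health(self) -> None:
        # No `return`: the caller only cares whether this raises.
        await asyncio.wait_for(self.engine.check_health(), timeout=self.timeout_s)


async def probe(wrapper: EngineWrapper) -> str:
    """Toy caller that maps exceptions to a status string."""
    try:
        result = await wrapper.check_health()
    except asyncio.TimeoutError:
        return "timeout"
    except RuntimeError:
        return "unhealthy"
    assert result is None  # nothing meaningful is returned on success
    return "healthy"


def run_probe(engine: FakeEngine, timeout_s: float = 15.0) -> str:
    return asyncio.run(probe(EngineWrapper(engine, timeout_s)))
```

Whether the inner timeout is kept or left to the serving layer's own per-deployment setting, the caller sees the same thing: an exception means unhealthy, and a normal return means healthy.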

kouroshHakha · 2025-06-18T22:53:59Z

python/ray/llm/_internal/serve/deployments/prefill_decode_disagg/prefill_decode_disagg.py

        ):
            yield chunk

-    async def check_health(self) -> None:


These must be removed. In general, the health check of a deployment is not bound to the health checks of its child deployments.

cc @lk-chen fyi
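To illustrate the point, here is a rough sketch, assuming the serving layer probes every deployment on its own so that a parent never needs to fan out to its children. All names (`ChildServer`, `ParentOrchestrator`, `controller_sweep`, `demo`) are hypothetical and only model the idea; they are not the classes touched in this PR.

```python
import asyncio


class ChildServer:
    """Hypothetical child deployment that owns its own health check."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    async def check_health(self) -> None:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is unhealthy")


class ParentOrchestrator:
    """Hypothetical parent holding handles to a prefill and a decode child."""

    def __init__(self, prefill: ChildServer, decode: ChildServer):
        self.prefill = prefill
        self.decode = decode

    async def check_health(self) -> None:
        # Reports only on the parent itself; it does not call the children.
        return None


async def controller_sweep(deployments: dict) -> dict:
    """Toy stand-in for a controller that probes each deployment separately."""
    status = {}
    for name, deployment in deployments.items():
        try:
            await deployment.check_health()
            status[name] = "HEALTHY"
        except Exception:
            status[name] = "UNHEALTHY"
    return status


def demo() -> dict:
    prefill = ChildServer("prefill")
    decode = ChildServer("decode", healthy=False)
    parent = ParentOrchestrator(prefill, decode)
    return asyncio.run(
        controller_sweep({"parent": parent, "prefill": prefill, "decode": decode})
    )
```

In this toy model an unhealthy decode child is flagged on its own, while the parent stays healthy and does not get restarted along with it.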

kouroshHakha · 2025-06-18T22:54:08Z

python/ray/llm/_internal/serve/deployments/routers/router.py


    async def check_health(self):
        await self._init_completed.wait()
-        await asyncio.gather(


The same thing applies here: the router's health check should not call into the servers' health checks.
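A small sketch of what remains after the `asyncio.gather` fan-out is dropped, assuming the router only needs to report that its own initialization finished. `ToyRouter`, its `setup` method, and `main` are hypothetical; only the `_init_completed` event name is taken from the diff above.

```python
import asyncio


class ToyRouter:
    """Hypothetical router whose health depends only on its own initialization."""

    def __init__(self):
        self._init_completed = asyncio.Event()
        self.models = {}

    async def setup(self) -> None:
        # Stand-in for whatever asynchronous initialization the router does.
        await asyncio.sleep(0.01)
        self.models = {"demo-model": object()}
        self._init_completed.set()

    async def check_health(self) -> None:
        # Blocks until initialization is done, then returns without
        # touching any downstream server.
        await self._init_completed.wait()


async def main():
    router = ToyRouter()
    health = asyncio.create_task(router.check_health())
    await asyncio.sleep(0)  # let the health task start and block on the event
    pending_before_setup = not health.done()
    await router.setup()
    await health
    return pending_before_setup, health.done()
```

Running `asyncio.run(main())` shows the health check pending before `setup()` and completed afterwards.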

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha · 2025-06-19T02:49:50Z

release/llm_tests/serve/test_llm_serve_integration.py

        model="Qwen/Qwen2.5-0.5B-Instruct",
        dtype="auto",
        disable_log_stats=False,
+        enforce_eager=True,


Making the tests faster: enforce_eager=True skips CUDA graph capture at engine startup.

python/ray/llm/_internal/serve/deployments/llm/llm_server.py

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha · 2025-06-20T21:24:40Z

python/ray/llm/tests/serve/cpu/deployments/routers/test_router.py


        await router.check_health()

-        assert server.check_health.remote.call_count == 1


Testing the router's health check has nothing to do with the server's health check.
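As a hedged sketch of what the updated test can still verify, the snippet below checks that the router's health check completes and, as one option, that the server's health check is not invoked. `ToyRouter` and `run_router_health_test` are hypothetical; the mocked `server.check_health.remote` attribute mirrors the name used in the removed assertion.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


class ToyRouter:
    """Hypothetical router that holds a server handle but never health-checks it."""

    def __init__(self, server_handle):
        self._server = server_handle
        self._init_completed = asyncio.Event()
        self._init_completed.set()  # pretend initialization already finished

    async def check_health(self) -> None:
        await self._init_completed.wait()


async def run_router_health_test() -> int:
    server = MagicMock()
    server.check_health.remote = AsyncMock()

    router = ToyRouter(server)
    await router.check_health()  # must not raise

    # The router's check is independent of the server's check.
    server.check_health.remote.assert_not_called()
    return server.check_health.remote.call_count
```

Asserting on the absence of the call is optional; dropping the assertion entirely, as the diff does, keeps the test focused on the router alone.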

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

Original PR #53937 by kouroshHakha Original: ray-project/ray#53937

wip

ece1265

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha commented Jun 18, 2025

kouroshHakha added 3 commits June 18, 2025 15:54

wip

ba3eeb3

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

wip

efe6f92

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

wip

7c1a5f0

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha commented Jun 19, 2025

kouroshHakha marked this pull request as ready for review June 19, 2025 02:50

kouroshHakha requested a review from a team as a code owner June 19, 2025 02:50

kouroshHakha requested a review from eicherseiji June 19, 2025 02:52

eicherseiji approved these changes Jun 19, 2025

python/ray/llm/_internal/serve/deployments/llm/llm_server.py

eicherseiji added the "go" label (add ONLY when ready to merge, run all tests) Jun 19, 2025

kouroshHakha added 3 commits June 19, 2025 22:13

wip

a58e780

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

wip

d5576a5

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

wip

40ab6af

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha commented Jun 20, 2025

kouroshHakha mentioned this pull request Jun 20, 2025

[Serve.llm] Remove ImageRetriever class and related tests from the LLM deployment module. #53980

Merged

kouroshHakha merged commit 55b8ce9 into ray-project:master Jun 22, 2025
5 checks passed

minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025

[Serve.llm][P/D] Fix health check in prefill disagg (ray-project#53937)

cb20f16

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

elliot-barn pushed a commit that referenced this pull request Jul 2, 2025

[Serve.llm][P/D] Fix health check in prefill disagg (#53937)

40a3cc2

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

snorkelopstesting3-bot mentioned this pull request Oct 22, 2025

[Serve.llm][P/D] Fix health check in prefill disagg snorkel-marlin-repos/ray-project_ray_pr_53937_454bb92f-28e4-4f99-921b-de7d319af18f#1

Merged

1 task
