Conversation

@kouroshHakha kouroshHakha commented Jun 28, 2025

Addresses #53533

Marked ready for review to trigger the Copilot review; not ready for human review yet.

The corresponding vLLM changes need to be merged as well:

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha changed the title [Serve.llm][Prototype] Simplify LLMServer and inherit OpenAIServingChat behavior [Serve.llm][Prototype][WIP] Simplify LLMServer and inherit OpenAIServingChat behavior Jun 28, 2025
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…t testing

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha marked this pull request as ready for review July 8, 2025 19:12
Copilot AI review requested due to automatic review settings July 8, 2025 19:12
@kouroshHakha kouroshHakha requested review from a team as code owners July 8, 2025 19:12

Copilot AI left a comment

Pull Request Overview

This PR refactors the Serve LLM stack to simplify the LLMServer by inheriting directly from vLLM's OpenAI-serving components, removes legacy processing layers, and unifies the chat, completion, and embedding flows.

  • Consolidates engine APIs to use vLLM’s OpenAIServingChat, Completion, and Embedding endpoints
  • Refactors and shrinks LLMServer, removing ResponsePostprocessor and redundant code
  • Updates mocks and tests to align with new request/response types and metadata fields
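
To make the direction concrete, here is a rough, hypothetical sketch of what delegating the chat path to vLLM's OpenAIServingChat could look like. The class and attribute names are illustrative, and the real constructor wiring (engine client, model config, served model names) is omitted; this is not the PR's actual code.

```python
from collections.abc import AsyncGenerator


class SimplifiedLLMServer:
    """Illustrative only: not the PR's actual LLMServer implementation."""

    def __init__(self, oai_serving_chat):
        # Assumed to be a fully constructed
        # vllm.entrypoints.openai.serving_chat.OpenAIServingChat instance.
        self._oai_serving_chat = oai_serving_chat

    async def chat(self, request):
        """Forward an OpenAI-style ChatCompletionRequest to vLLM's frontend.

        Streaming requests yield SSE strings; non-streaming requests yield a
        single response object, with no Ray-side post-processing layer.
        """
        result = await self._oai_serving_chat.create_chat_completion(request)
        if isinstance(result, AsyncGenerator):
            # Streaming path: pass SSE chunks through as-is.
            async for chunk in result:
                yield chunk
        else:
            # Non-streaming path: a single ChatCompletionResponse (or error).
            yield result
```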

Reviewed Changes

Copilot reviewed 35 out of 36 changed files in this pull request and generated no comments.

File | Description
release/llm_tests/serve/probes/test_basic.py | Adjusted max-token length and updated invalid logprobs test cases
python/ray/llm/tests/serve/mocks/mock_vllm_engine.py | Overhauled the mock engine to yield SSE strings and OpenAI-style responses
python/ray/llm/_internal/serve/deployments/llm/llm_server.py | Simplified server logic; unified the request→engine→batching pipeline
python/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py | Integrated vLLM's OpenAIServingChat/Completion/Embedding APIs
python/ray/llm/tests/serve/utils/testing_utils.py | Added reusable response-validation helpers
Comments suppressed due to low confidence (4)

release/llm_tests/serve/probes/test_basic.py:319

  • The upper‐bound case for top_logprobs (> maximum allowed) was removed. To fully validate error handling, reintroduce a test value above the allowed maximum.
    invalid_num_logprobs = [-1]
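
A sketch of how the upper-bound case could be restored, assuming the server caps top_logprobs at 20 (replace with the actual limit if it differs):

```python
# Both an under-range and an over-range value; 20 is an assumed maximum
# for top_logprobs and should be replaced with the server's real cap.
MAX_TOP_LOGPROBS = 20

invalid_num_logprobs = [-1, MAX_TOP_LOGPROBS + 1]
```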

python/ray/llm/tests/serve/mocks/mock_vllm_engine.py:5

  • The mock implementation uses await asyncio.sleep(...) but asyncio is not imported. Add import asyncio at the top.
from typing import AsyncGenerator, Dict, Optional, Any, List, Union
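
The suggested fix is a one-line addition at the top of the mock module; a minimal sketch, with the existing import line taken from the snippet above:

```python
# mock_vllm_engine.py (top of file): add the missing asyncio import so that
# `await asyncio.sleep(...)` inside the mock's async generators resolves.
import asyncio
from typing import AsyncGenerator, Dict, Optional, Any, List, Union
```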

python/ray/llm/tests/serve/conftest.py:17

  • Fixture imports EmbeddingCompletionRequest, but the actual class is named EmbeddingRequest. Update the import and fixture accordingly.
from ray.llm._internal.serve.configs.openai_api_models import (
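
Assuming the class is indeed exported as EmbeddingRequest, the corrected fixture import would look roughly like this:

```python
# Hypothetical corrected import: EmbeddingRequest instead of the
# non-existent EmbeddingCompletionRequest.
from ray.llm._internal.serve.configs.openai_api_models import (
    EmbeddingRequest,
)
```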

release/llm_tests/serve/probes/test_basic.py:163

  • [nitpick] Using a raw literal 200000 can obscure its meaning. Consider replacing it with a named constant or adding a comment explaining its origin.
    length = 200000
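
One way to address the nitpick is a named constant; the name and comment below are illustrative, not taken from the PR:

```python
# Assumed intent: large enough to exceed any supported context window, so the
# request is expected to hit the max-token validation error path.
OVER_CONTEXT_WINDOW_PROMPT_LENGTH = 200_000

length = OVER_CONTEXT_WINDOW_PROMPT_LENGTH
```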


kouroshHakha commented Jul 8, 2025

Here is the tentative plan to break these changes up into smaller PRs:

  1. Disable vLLM v0 completely.
  2. Upstream changes on the node init code path: [Serve.llm] Make llm serve endpoints compatible with vLLM serve frontend (3/N): Remove indirection layers of node initialization #54481
     • serve_models
       • change to _model_architecture: str = PrivateAttr("UNSPECIFIED")
       • change to if hasattr(hf_config, "architectures") and hf_config.architectures:
     • llm_server
       • change to if self._llm_config.model_architecture:
  3. Clean up the llm_server abstraction: [Serve.llm] Make llm serve endpoints compatible with vLLM serve frontend (4/N): Refactor LLMServer #54484
     • remove llm_config from the mandatory init on the server base
     • define _init_multiplex_loader
     • define _maybe_add_request_id_to_request
     • define _maybe_resolve_lora_from_multiplex(request)
     • define _batch_output_stream (see the sketch after this list)
  4. Change the engine abstraction and the LLM server's usage of that abstraction:
     • at this point everything should work end to end
     • use vLLM's OpenAI models instead of vendoring our own copy
  5. (optional) Bump vLLM to 0.10.0 (to get the FrontendArgs and LoRA model checker changes).
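
As a reference for step 3, here is a rough, hypothetical sketch of what a `_batch_output_stream` helper could look like: it coalesces streamed chunks over a short time window before yielding them, which is one common way to reduce per-chunk overhead between the engine and the HTTP proxy. The names and the interval value are assumptions, not the PR's actual implementation.

```python
import asyncio
from typing import AsyncGenerator, List, TypeVar

T = TypeVar("T")
_SENTINEL = object()


async def _batch_output_stream(
    stream: AsyncGenerator[T, None],
    interval_s: float = 0.05,  # assumed flush window, not the PR's value
) -> AsyncGenerator[List[T], None]:
    """Group items from `stream` into lists, flushing every `interval_s` seconds."""
    queue: asyncio.Queue = asyncio.Queue()

    async def _pump() -> None:
        # Drain the upstream generator into the queue, then signal completion.
        async for item in stream:
            await queue.put(item)
        await queue.put(_SENTINEL)

    pump_task = asyncio.create_task(_pump())
    try:
        done = False
        while not done:
            await asyncio.sleep(interval_s)
            batch: List[T] = []
            # Collect everything that arrived during this window.
            while not queue.empty():
                item = queue.get_nowait()
                if item is _SENTINEL:
                    done = True
                    break
                batch.append(item)
            if batch:
                yield batch
    finally:
        pump_task.cancel()
```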

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>