Conversation

@kouroshHakha kouroshHakha commented Jun 28, 2025

Addresses #53533

Marked ready for review to trigger the Copilot review; not ready for human review yet.

The corresponding vLLM changes need to be merged as well:

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha changed the title [Serve.llm][Prototype] Simplify LLMServer and inherit OpenAIServingChat behavior [Serve.llm][Prototype][WIP] Simplify LLMServer and inherit OpenAIServingChat behavior Jun 28, 2025
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…t testing

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha marked this pull request as ready for review July 8, 2025 19:12
Copilot AI review requested due to automatic review settings July 8, 2025 19:12
@kouroshHakha kouroshHakha requested review from a team as code owners July 8, 2025 19:12

Copilot AI left a comment

Pull Request Overview

This PR refactors the Serve LLM stack to simplify the LLMServer by inheriting directly from vLLM's OpenAI-serving components, removes legacy processing layers, and unifies the chat, completion, and embedding flows.

  • Consolidates engine APIs to use vLLM’s OpenAIServingChat, Completion, and Embedding endpoints
  • Refactors and shrinks LLMServer, removing ResponsePostprocessor and redundant code
  • Updates mocks and tests to align with new request/response types and metadata fields
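
To make the direction concrete, here is a rough, hypothetical sketch of what delegating the chat path to vLLM's OpenAIServingChat could look like. The class and attribute names are illustrative, and the real constructor wiring (engine client, model config, served model names) is omitted; this is not the PR's actual code.

```python
from collections.abc import AsyncGenerator


class SimplifiedLLMServer:
    """Illustrative only: not the PR's actual LLMServer implementation."""

    def __init__(self, oai_serving_chat):
        # Assumed to be a fully constructed
        # vllm.entrypoints.openai.serving_chat.OpenAIServingChat instance.
        self._oai_serving_chat = oai_serving_chat

    async def chat(self, request):
        """Forward an OpenAI-style ChatCompletionRequest to vLLM's frontend.

        Streaming requests yield SSE strings; non-streaming requests yield a
        single response object, with no Ray-side post-processing layer.
        """
        result = await self._oai_serving_chat.create_chat_completion(request)
        if isinstance(result, AsyncGenerator):
            # Streaming path: pass SSE chunks through as-is.
            async for chunk in result:
                yield chunk
        else:
            # Non-streaming path: a single ChatCompletionResponse (or error).
            yield result
```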

Reviewed Changes

Copilot reviewed 35 out of 36 changed files in this pull request and generated no comments.

File | Description
release/llm_tests/serve/probes/test_basic.py | Adjusted max-token length and updated invalid logprobs test cases
python/ray/llm/tests/serve/mocks/mock_vllm_engine.py | Overhauled the mock engine to yield SSE strings and OpenAI-style responses
python/ray/llm/_internal/serve/deployments/llm/llm_server.py | Simplified server logic; unified the request→engine→batching pipeline
python/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py | Integrated vLLM's OpenAIServingChat/Completion/Embedding APIs
python/ray/llm/tests/serve/utils/testing_utils.py | Added reusable response-validation helpers
Comments suppressed due to low confidence (4)

release/llm_tests/serve/probes/test_basic.py:319

  • The upper‐bound case for top_logprobs (> maximum allowed) was removed. To fully validate error handling, reintroduce a test value above the allowed maximum.
    invalid_num_logprobs = [-1]
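
A sketch of how the upper-bound case could be restored, assuming the server caps top_logprobs at 20 (replace with the actual limit if it differs):

```python
# Both an under-range and an over-range value; 20 is an assumed maximum
# for top_logprobs and should be replaced with the server's real cap.
MAX_TOP_LOGPROBS = 20

invalid_num_logprobs = [-1, MAX_TOP_LOGPROBS + 1]
```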

python/ray/llm/tests/serve/mocks/mock_vllm_engine.py:5

  • The mock implementation uses await asyncio.sleep(...) but asyncio is not imported. Add import asyncio at the top.
from typing import AsyncGenerator, Dict, Optional, Any, List, Union
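
The suggested fix is a one-line addition at the top of the mock module; a minimal sketch, with the existing import line taken from the snippet above:

```python
# mock_vllm_engine.py (top of file): add the missing asyncio import so that
# `await asyncio.sleep(...)` inside the mock's async generators resolves.
import asyncio
from typing import AsyncGenerator, Dict, Optional, Any, List, Union
```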

python/ray/llm/tests/serve/conftest.py:17

  • Fixture imports EmbeddingCompletionRequest, but the actual class is named EmbeddingRequest. Update the import and fixture accordingly.
from ray.llm._internal.serve.configs.openai_api_models import (
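
Assuming the class is indeed exported as EmbeddingRequest, the corrected fixture import would look roughly like this:

```python
# Hypothetical corrected import: EmbeddingRequest instead of the
# non-existent EmbeddingCompletionRequest.
from ray.llm._internal.serve.configs.openai_api_models import (
    EmbeddingRequest,
)
```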

release/llm_tests/serve/probes/test_basic.py:163

  • [nitpick] Using a raw literal 200000 can obscure its meaning. Consider replacing it with a named constant or adding a comment explaining its origin.
    length = 200000
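
One way to address the nitpick is a named constant; the name and comment below are illustrative, not taken from the PR:

```python
# Assumed intent: large enough to exceed any supported context window, so the
# request is expected to hit the max-token validation error path.
OVER_CONTEXT_WINDOW_PROMPT_LENGTH = 200_000

length = OVER_CONTEXT_WINDOW_PROMPT_LENGTH
```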


kouroshHakha commented Jul 8, 2025

Here is the tentative plan to break these changes up into smaller PRs:

  1. Disable vLLM v0 completely.
  2. Upstream changes on the node init code path: [Serve.llm] Make llm serve endpoints compatible with vLLM serve frontend (3/N): Remove indirection layers of node initialization #54481
     • serve_models
       • change to _model_architecture: str = PrivateAttr("UNSPECIFIED")
       • change to if hasattr(hf_config, "architectures") and hf_config.architectures:
     • llm_server
       • change to if self._llm_config.model_architecture:
  3. Clean up the llm_server abstraction: [Serve.llm] Make llm serve endpoints compatible with vLLM serve frontend (4/N): Refactor LLMServer #54484
     • remove llm_config from the mandatory init on the server base
     • define _init_multiplex_loader
     • define _maybe_add_request_id_to_request
     • define _maybe_resolve_lora_from_multiplex(request)
     • define _batch_output_stream (see the sketch after this list)
  4. Change the engine abstraction and the LLM server's usage of that abstraction:
     • at this point everything should work end to end
     • use vLLM's OpenAI models instead of vendoring our own copy
  5. (optional) Bump vLLM to 0.10.0 (to get the FrontendArgs and LoRA model checker changes).
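
As a reference for step 3, here is a rough, hypothetical sketch of what a `_batch_output_stream` helper could look like: it coalesces streamed chunks over a short time window before yielding them, which is one common way to reduce per-chunk overhead between the engine and the HTTP proxy. The names and the interval value are assumptions, not the PR's actual implementation.

```python
import asyncio
from typing import AsyncGenerator, List, TypeVar

T = TypeVar("T")
_SENTINEL = object()


async def _batch_output_stream(
    stream: AsyncGenerator[T, None],
    interval_s: float = 0.05,  # assumed flush window, not the PR's value
) -> AsyncGenerator[List[T], None]:
    """Group items from `stream` into lists, flushing every `interval_s` seconds."""
    queue: asyncio.Queue = asyncio.Queue()

    async def _pump() -> None:
        # Drain the upstream generator into the queue, then signal completion.
        async for item in stream:
            await queue.put(item)
        await queue.put(_SENTINEL)

    pump_task = asyncio.create_task(_pump())
    try:
        done = False
        while not done:
            await asyncio.sleep(interval_s)
            batch: List[T] = []
            # Collect everything that arrived during this window.
            while not queue.empty():
                item = queue.get_nowait()
                if item is _SENTINEL:
                    done = True
                    break
                batch.append(item)
            if batch:
                yield batch
    finally:
        pump_task.cancel()
```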

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>