Conversation

simon-mo (Collaborator)

Summary

  • stream token outputs in vllm chat CLI
  • stream token outputs in vllm complete CLI
  • factor out streaming loops into reusable helper functions (a minimal sketch follows below)
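
For reference, here is a minimal sketch of what the two streaming helpers could look like. The actual implementation lives in vllm/entrypoints/cli/openai.py; the helper names and return types come from the diff excerpts reviewed below, while the per-chunk fields (choices[0].delta.content for chat, choices[0].text for completions) are the standard openai SDK streaming shapes, so the exact merged code may differ:

def _print_chat_stream(stream) -> str:
    # Echo each chat delta as it arrives and accumulate the full reply.
    output = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            output += delta
            print(delta, end="", flush=True)
    print()
    return output


def _print_completion_stream(stream) -> str:
    # Same idea for plain completions, which stream text rather than deltas.
    output = ""
    for chunk in stream:
        text = chunk.choices[0].text
        if text:
            output += text
            print(text, end="", flush=True)
    print()
    return output

Returning the accumulated text lets the chat loop append the assistant turn back onto the conversation before prompting for the next user message.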

Testing

  • ruff check vllm/entrypoints/cli/openai.py
  • python -m py_compile vllm/entrypoints/cli/openai.py
  • pre-commit run --files vllm/entrypoints/cli/openai.py (fails: command not found)
  • pytest tests/v1/entrypoints/openai/test_completion.py::test_completion_streaming -q (fails: ModuleNotFoundError: torch)

https://chatgpt.com/codex/tasks/task_e_68af54ff5ee88329b50c13bf46c0da0d

@gemini-code-assist (gemini-code-assist bot, Contributor) left a comment

Code Review

This pull request introduces streaming support for the chat and complete CLI commands, which is a great enhancement for user experience. The implementation is straightforward, using helper functions to handle the streaming logic. My review includes suggestions to improve robustness by adding error handling around the streaming API calls and to enhance code clarity by adding type hints to the new helper functions. These changes will make the CLI tool more resilient and the code easier to maintain.

return model_name, openai_client


def _print_chat_stream(stream) -> str:

Severity: high

To improve type safety and code readability, please add a type hint for the stream parameter. The openai client returns a Stream of ChatCompletionChunk objects. Using a string forward reference for the type hint is a good practice here.

You'll need to ensure the necessary types are imported within a TYPE_CHECKING block:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from openai import Stream
    from openai.types.chat import ChatCompletionChunk
Suggested change:
- def _print_chat_stream(stream) -> str:
+ def _print_chat_stream(stream: "Stream[ChatCompletionChunk]") -> str:

Collaborator

+1

return output


def _print_completion_stream(stream) -> str:

Severity: high

For consistency and to improve type safety, please add a type hint for the stream parameter. The openai client returns a Stream of Completion objects for completion requests.

You'll need to ensure the necessary types are imported within a TYPE_CHECKING block:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from openai import Stream
    from openai.types import Completion
Suggested change:
- def _print_completion_stream(stream) -> str:
+ def _print_completion_stream(stream: "Stream[Completion]") -> str:

Comment on lines 83 to 86
        stream = client.chat.completions.create(
            model=model_name, messages=conversation, stream=True)
        output = _print_chat_stream(stream)
        conversation.append({"role": "assistant", "content": output})

Severity: high

The streaming API call can raise exceptions (e.g., openai.APIError) if an issue occurs during generation. To prevent the CLI from crashing and to provide a better user experience, it's best to wrap the streaming logic in a try...except block to gracefully handle any potential errors.

        try:
            stream = client.chat.completions.create(
                model=model_name, messages=conversation, stream=True)
            output = _print_chat_stream(stream)
            conversation.append({"role": "assistant", "content": output})
        except Exception as e:
            print(f"\nAn error occurred: {e}")

Comment on lines 129 to 132
            stream = client.chat.completions.create(
                model=model_name, messages=conversation, stream=True)
            output = _print_chat_stream(stream)
            conversation.append({"role": "assistant", "content": output})

Severity: high

The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

            try:
                stream = client.chat.completions.create(
                    model=model_name, messages=conversation, stream=True)
                output = _print_chat_stream(stream)
                conversation.append({"role": "assistant", "content": output})
            except Exception as e:
                print(f"\nAn error occurred: {e}")

Comment on lines 143 to 146
            stream = client.chat.completions.create(
                model=model_name, messages=conversation, stream=True)
            output = _print_chat_stream(stream)
            conversation.append({"role": "assistant", "content": output})

Severity: high

The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

            try:
                stream = client.chat.completions.create(
                    model=model_name, messages=conversation, stream=True)
                output = _print_chat_stream(stream)
                conversation.append({"role": "assistant", "content": output})
            except Exception as e:
                print(f"\nAn error occurred: {e}")

Comment on lines +186 to +189
            stream = client.completions.create(model=model_name,
                                               prompt=args.quick,
                                               stream=True)
            _print_completion_stream(stream)

Severity: high

The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

            try:
                stream = client.completions.create(model=model_name,
                                                   prompt=args.quick,
                                                   stream=True)
                _print_completion_stream(stream)
            except Exception as e:
                print(f"\nAn error occurred: {e}")

Comment on lines +198 to +201
            stream = client.completions.create(model=model_name,
                                               prompt=input_prompt,
                                               stream=True)
            _print_completion_stream(stream)

Severity: high

The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

            try:
                stream = client.completions.create(model=model_name,
                                                   prompt=input_prompt,
                                                   stream=True)
                _print_completion_stream(stream)
            except Exception as e:
                print(f"\nAn error occurred: {e}")

Signed-off-by: simon-mo <simon.mo@hey.com>
@simon-mo simon-mo changed the title Enable streaming in CLI chat and completion commands [CLI] Use streaming in CLI chat and completion commands Sep 18, 2025
@simon-mo simon-mo merged commit e111d5b into main Sep 18, 2025
18 checks passed
@simon-mo simon-mo deleted the codex/enable-streaming-for-vllm-completechat branch September 18, 2025 05:30
@simon-mo (Collaborator, Author)

@chaunceyjiang sorry, just saw the comments. I was mostly looking at the CLI file, which doesn't really have many type hints and is designed to be simple demo code.

845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 18, 2025
…litPR into model_register

* 'model_register' of https://github.com/dsxsteven/vllm_splitPR: (138 commits)
  Retrieve `sliding_window` from text config in Gemma3 MM (vllm-project#25085)
  [Docs] Fix API Reference (vllm-project#25140)
  [Kernel] Better inf handling for grouped topk cu (vllm-project#24886)
  [CLI] Use streaming in CLI chat and completion commands (vllm-project#23769)
  [benchmark] add peak throughput metrics and plot (vllm-project#23867)
  [Spec Decode] Efficient padded speculation (vllm-project#24539)
  [V0 Deprecation] Remove more V0 tests (vllm-project#25117)
  [EPLB] Add EPLB support for hunyuan_v1 (vllm-project#23078)
  [XPU] Whisper model support on XPU Platform (vllm-project#25123)
  Mark prompt logprobs as incompatible with prompt embeds at API level (vllm-project#25077)
  [Model] enable data parallel for InternVL vision encoder (vllm-project#23909)
  [Kernels] Overlap shared experts with combine instead of dispatch (vllm-project#24254)
  [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
  [Core][MM] Cleanup `MultiModalCache` (vllm-project#25006)
  [Docs] Clean up the contributing README (vllm-project#25099)
  [MM Encoder] Apply DP ViT for Qwen3-VL model series (vllm-project#24955)
  [Kernels] Enable DeepGEMM by default (vllm-project#24462)
  [V0 Deprecation] Skip PP test (vllm-project#25128)
  [V0 Deprecation] Remove misc V0 tests (vllm-project#25118)
  [V0 Deprecation] Remove V0 Tracing & Metrics tests (vllm-project#25115)
  ...
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
…#23769)

Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: charlifu <charlifu@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…#23769)

Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025