Conversation

simon-mo (Collaborator)

Summary

  • stream token outputs in vllm chat CLI
  • stream token outputs in vllm complete CLI
  • factor out streaming loops into reusable helper functions (a minimal sketch follows below)
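
For reference, here is a minimal sketch of what the two streaming helpers could look like. The actual implementation lives in vllm/entrypoints/cli/openai.py; the helper names and return types come from the diff excerpts reviewed below, while the per-chunk fields (choices[0].delta.content for chat, choices[0].text for completions) are the standard openai SDK streaming shapes, so the exact merged code may differ:

def _print_chat_stream(stream) -> str:
    # Echo each chat delta as it arrives and accumulate the full reply.
    output = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            output += delta
            print(delta, end="", flush=True)
    print()
    return output


def _print_completion_stream(stream) -> str:
    # Same idea for plain completions, which stream text rather than deltas.
    output = ""
    for chunk in stream:
        text = chunk.choices[0].text
        if text:
            output += text
            print(text, end="", flush=True)
    print()
    return output

Returning the accumulated text lets the chat loop append the assistant turn back onto the conversation before prompting for the next user message.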

Testing

  • ruff check vllm/entrypoints/cli/openai.py
  • python -m py_compile vllm/entrypoints/cli/openai.py
  • pre-commit run --files vllm/entrypoints/cli/openai.py (fails: command not found)
  • pytest tests/v1/entrypoints/openai/test_completion.py::test_completion_streaming -q (fails: ModuleNotFoundError: torch)

https://chatgpt.com/codex/tasks/task_e_68af54ff5ee88329b50c13bf46c0da0d

@gemini-code-assist (gemini-code-assist bot, Contributor) left a comment

Code Review

This pull request introduces streaming support for the chat and complete CLI commands, which is a great enhancement for user experience. The implementation is straightforward, using helper functions to handle the streaming logic. My review includes suggestions to improve robustness by adding error handling around the streaming API calls and to enhance code clarity by adding type hints to the new helper functions. These changes will make the CLI tool more resilient and the code easier to maintain.

return model_name, openai_client


def _print_chat_stream(stream) -> str:

Severity: high

To improve type safety and code readability, please add a type hint for the stream parameter. The openai client returns a Stream of ChatCompletionChunk objects. Using a string forward reference for the type hint is a good practice here.

You'll need to ensure the necessary types are imported within a TYPE_CHECKING block:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from openai import Stream
    from openai.types.chat import ChatCompletionChunk
Suggested change:
- def _print_chat_stream(stream) -> str:
+ def _print_chat_stream(stream: "Stream[ChatCompletionChunk]") -> str:

Collaborator

+1

return output


def _print_completion_stream(stream) -> str:

Severity: high

For consistency and to improve type safety, please add a type hint for the stream parameter. The openai client returns a Stream of Completion objects for completion requests.

You'll need to ensure the necessary types are imported within a TYPE_CHECKING block:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from openai import Stream
    from openai.types import Completion
Suggested change:
- def _print_completion_stream(stream) -> str:
+ def _print_completion_stream(stream: "Stream[Completion]") -> str:

Comment on lines 83 to 86
        stream = client.chat.completions.create(
            model=model_name, messages=conversation, stream=True)
        output = _print_chat_stream(stream)
        conversation.append({"role": "assistant", "content": output})

Severity: high

The streaming API call can raise exceptions (e.g., openai.APIError) if an issue occurs during generation. To prevent the CLI from crashing and to provide a better user experience, it's best to wrap the streaming logic in a try...except block to gracefully handle any potential errors.

        try:
            stream = client.chat.completions.create(
                model=model_name, messages=conversation, stream=True)
            output = _print_chat_stream(stream)
            conversation.append({"role": "assistant", "content": output})
        except Exception as e:
            print(f"\nAn error occurred: {e}")

Comment on lines 129 to 132
            stream = client.chat.completions.create(
                model=model_name, messages=conversation, stream=True)
            output = _print_chat_stream(stream)
            conversation.append({"role": "assistant", "content": output})

Severity: high

The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

            try:
                stream = client.chat.completions.create(
                    model=model_name, messages=conversation, stream=True)
                output = _print_chat_stream(stream)
                conversation.append({"role": "assistant", "content": output})
            except Exception as e:
                print(f"\nAn error occurred: {e}")

Comment on lines 143 to 146
            stream = client.chat.completions.create(
                model=model_name, messages=conversation, stream=True)
            output = _print_chat_stream(stream)
            conversation.append({"role": "assistant", "content": output})

Severity: high

The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

            try:
                stream = client.chat.completions.create(
                    model=model_name, messages=conversation, stream=True)
                output = _print_chat_stream(stream)
                conversation.append({"role": "assistant", "content": output})
            except Exception as e:
                print(f"\nAn error occurred: {e}")

Comment on lines +186 to +189
            stream = client.completions.create(model=model_name,
                                               prompt=args.quick,
                                               stream=True)
            _print_completion_stream(stream)

Severity: high

The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

            try:
                stream = client.completions.create(model=model_name,
                                                   prompt=args.quick,
                                                   stream=True)
                _print_completion_stream(stream)
            except Exception as e:
                print(f"\nAn error occurred: {e}")

Comment on lines +198 to +201
            stream = client.completions.create(model=model_name,
                                               prompt=input_prompt,
                                               stream=True)
            _print_completion_stream(stream)

Severity: high

The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a try...except block to handle potential errors gracefully.

            try:
                stream = client.completions.create(model=model_name,
                                                   prompt=input_prompt,
                                                   stream=True)
                _print_completion_stream(stream)
            except Exception as e:
                print(f"\nAn error occurred: {e}")

Signed-off-by: simon-mo <simon.mo@hey.com>
@simon-mo simon-mo changed the title Enable streaming in CLI chat and completion commands [CLI] Use streaming in CLI chat and completion commands Sep 18, 2025
@simon-mo simon-mo merged commit e111d5b into main Sep 18, 2025
18 checks passed
@simon-mo simon-mo deleted the codex/enable-streaming-for-vllm-completechat branch September 18, 2025 05:30
@simon-mo (Collaborator, Author)

@chaunceyjiang sorry, just saw the comments. I was mostly looking at the CLI file, which doesn't really have many type hints and is designed to be simple demo code.

845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 18, 2025
…litPR into model_register

* 'model_register' of https://github.com/dsxsteven/vllm_splitPR: (138 commits)
  Retrieve `sliding_window` from text config in Gemma3 MM (vllm-project#25085)
  [Docs] Fix API Reference (vllm-project#25140)
  [Kernel] Better inf handling for grouped topk cu (vllm-project#24886)
  [CLI] Use streaming in CLI chat and completion commands (vllm-project#23769)
  [benchmark] add peak throughput metrics and plot (vllm-project#23867)
  [Spec Decode] Efficient padded speculation (vllm-project#24539)
  [V0 Deprecation] Remove more V0 tests (vllm-project#25117)
  [EPLB] Add EPLB support for hunyuan_v1 (vllm-project#23078)
  [XPU] Whisper model support on XPU Platform (vllm-project#25123)
  Mark prompt logprobs as incompatible with prompt embeds at API level (vllm-project#25077)
  [Model] enable data parallel for InternVL vision encoder (vllm-project#23909)
  [Kernels] Overlap shared experts with combine instead of dispatch (vllm-project#24254)
  [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
  [Core][MM] Cleanup `MultiModalCache` (vllm-project#25006)
  [Docs] Clean up the contributing README (vllm-project#25099)
  [MM Encoder] Apply DP ViT for Qwen3-VL model series (vllm-project#24955)
  [Kernels] Enable DeepGEMM by default (vllm-project#24462)
  [V0 Deprecation] Skip PP test (vllm-project#25128)
  [V0 Deprecation] Remove misc V0 tests (vllm-project#25118)
  [V0 Deprecation] Remove V0 Tracing & Metrics tests (vllm-project#25115)
  ...
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
…#23769)

Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: charlifu <charlifu@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…#23769)

Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025