[CLI] Use streaming in CLI chat and completion commands #23769
Conversation
Code Review
This pull request introduces streaming support for the `chat` and `complete` CLI commands, which is a great enhancement for user experience. The implementation is straightforward, using helper functions to handle the streaming logic. My review includes suggestions to improve robustness by adding error handling around the streaming API calls and to enhance code clarity by adding type hints to the new helper functions. These changes will make the CLI tool more resilient and the code easier to maintain.
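For orientation, here is a minimal sketch of what such a streaming helper might look like, assuming the OpenAI Python client's chat chunk shape (`choices[0].delta.content`); the PR's actual helper may differ in detail:

```python
def _print_chat_stream(stream) -> str:
    """Echo streamed chat deltas as they arrive and return the full reply."""
    output = ""
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            output += delta.content
            print(delta.content, end="", flush=True)
    print()
    return output
```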
```python
    return model_name, openai_client


def _print_chat_stream(stream) -> str:
```
To improve type safety and code readability, please add a type hint for the `stream` parameter. The `openai` client returns a `Stream` of `ChatCompletionChunk` objects. Using a string forward reference for the type hint is a good practice here.

You'll need to ensure the necessary types are imported within a `TYPE_CHECKING` block:
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from openai import Stream
    from openai.types.chat import ChatCompletionChunk
```
```diff
-def _print_chat_stream(stream) -> str:
+def _print_chat_stream(stream: "Stream[ChatCompletionChunk]") -> str:
```
+1
```python
    return output


def _print_completion_stream(stream) -> str:
```
For consistency and to improve type safety, please add a type hint for the `stream` parameter. The `openai` client returns a `Stream` of `Completion` objects for completion requests.

You'll need to ensure the necessary types are imported within a `TYPE_CHECKING` block:
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from openai import Stream
    from openai.types import Completion
```
```diff
-def _print_completion_stream(stream) -> str:
+def _print_completion_stream(stream: "Stream[Completion]") -> str:
```
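For comparison, a hedged sketch of how a completion-stream helper could consume these chunks, assuming `Completion` chunks expose `choices[0].text` as in the OpenAI client; this is illustrative, not necessarily the PR's exact code:

```python
def _print_completion_stream(stream: "Stream[Completion]") -> str:
    # Echo each streamed text fragment and return the concatenated output.
    output = ""
    for chunk in stream:
        text = chunk.choices[0].text
        if text:
            output += text
            print(text, end="", flush=True)
    print()
    return output
```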
vllm/entrypoints/cli/openai.py (Outdated)
```python
stream = client.chat.completions.create(
    model=model_name, messages=conversation, stream=True)
output = _print_chat_stream(stream)
conversation.append({"role": "assistant", "content": output})
```
The streaming API call can raise exceptions (e.g., `openai.APIError`) if an issue occurs during generation. To prevent the CLI from crashing and to provide a better user experience, it's best to wrap the streaming logic in a `try...except` block to gracefully handle any potential errors.
```python
try:
    stream = client.chat.completions.create(
        model=model_name, messages=conversation, stream=True)
    output = _print_chat_stream(stream)
    conversation.append({"role": "assistant", "content": output})
except Exception as e:
    print(f"\nAn error occurred: {e}")
```
vllm/entrypoints/cli/openai.py (Outdated)
```python
stream = client.chat.completions.create(
    model=model_name, messages=conversation, stream=True)
output = _print_chat_stream(stream)
conversation.append({"role": "assistant", "content": output})
```
The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a `try...except` block to handle potential errors gracefully.
```python
try:
    stream = client.chat.completions.create(
        model=model_name, messages=conversation, stream=True)
    output = _print_chat_stream(stream)
    conversation.append({"role": "assistant", "content": output})
except Exception as e:
    print(f"\nAn error occurred: {e}")
```
vllm/entrypoints/cli/openai.py (Outdated)
```python
stream = client.chat.completions.create(
    model=model_name, messages=conversation, stream=True)
output = _print_chat_stream(stream)
conversation.append({"role": "assistant", "content": output})
```
The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a `try...except` block to handle potential errors gracefully.
```python
try:
    stream = client.chat.completions.create(
        model=model_name, messages=conversation, stream=True)
    output = _print_chat_stream(stream)
    conversation.append({"role": "assistant", "content": output})
except Exception as e:
    print(f"\nAn error occurred: {e}")
```
```python
stream = client.completions.create(model=model_name,
                                   prompt=args.quick,
                                   stream=True)
_print_completion_stream(stream)
```
The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a `try...except` block to handle potential errors gracefully.
```python
try:
    stream = client.completions.create(model=model_name,
                                       prompt=args.quick,
                                       stream=True)
    _print_completion_stream(stream)
except Exception as e:
    print(f"\nAn error occurred: {e}")
```
```python
stream = client.completions.create(model=model_name,
                                   prompt=input_prompt,
                                   stream=True)
_print_completion_stream(stream)
```
The streaming API call can raise exceptions. To make the CLI more robust, please wrap this call in a `try...except` block to handle potential errors gracefully.
```python
try:
    stream = client.completions.create(model=model_name,
                                       prompt=input_prompt,
                                       stream=True)
    _print_completion_stream(stream)
except Exception as e:
    print(f"\nAn error occurred: {e}")
```
…-streaming-for-vllm-completechat
@chaunceyjiang Sorry, I just saw the comments. I was mostly looking at the CLI file, which doesn't really have many type hints and is designed to be simple demo code.
…#23769) Signed-off-by: simon-mo <simon.mo@hey.com>
…#23769) Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: charlifu <charlifu@amd.com>
…#23769) Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Summary
- Use streaming in the `vllm chat` CLI
- Use streaming in the `vllm complete` CLI

Testing
- `ruff check vllm/entrypoints/cli/openai.py`
- `python -m py_compile vllm/entrypoints/cli/openai.py`
- `pre-commit run --files vllm/entrypoints/cli/openai.py` (fails: command not found)
- `pytest tests/v1/entrypoints/openai/test_completion.py::test_completion_streaming -q` (fails: ModuleNotFoundError: torch)

https://chatgpt.com/codex/tasks/task_e_68af54ff5ee88329b50c13bf46c0da0d