fix(llm): support Responses streaming via on_token #1761
xingyaoww merged 6 commits into feat/openai-subscription-auth from
Conversation
Enable streaming for LLM.responses() when requested (stream or LLM.stream). When LiteLLM returns a Responses streaming iterator, drain it, forward best-effort text deltas to on_token (as ModelResponseStream chunks), and return the completed ResponsesAPIResponse.
Avoid getattr() by relying on LiteLLM's ResponsesAPIStreamingResponse event types (e.g., OutputTextDeltaEvent) and ResponseCompletedEvent.
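For reference, a minimal sketch of that drain loop (not the PR's actual code: the import paths, the event field names delta / response, and the helper name drain_responses_stream are assumptions based on the OpenAI Responses event shapes; the typed event classes are the ones named above):

```python
from typing import Callable

# Import paths are assumptions; the class names are the ones the PR refers to.
from litellm.types.llms.openai import (
    OutputTextDeltaEvent,
    ResponseCompletedEvent,
    ResponsesAPIResponse,
)
from litellm.types.utils import Delta, ModelResponseStream, StreamingChoices


def drain_responses_stream(
    stream_iter,
    on_token: Callable[[ModelResponseStream], None],
) -> ResponsesAPIResponse:
    """Drain a LiteLLM Responses streaming iterator, forwarding text deltas.

    Each OutputTextDeltaEvent is re-wrapped as a ModelResponseStream chunk so
    existing on_token consumers keep working; the ResponseCompletedEvent
    carries the final ResponsesAPIResponse that the caller returns.
    """
    completed: ResponsesAPIResponse | None = None
    for event in stream_iter:
        if isinstance(event, OutputTextDeltaEvent):
            on_token(
                ModelResponseStream(
                    choices=[StreamingChoices(delta=Delta(content=event.delta))]
                )
            )
        elif isinstance(event, ResponseCompletedEvent):
            completed = event.response
    if completed is None:
        raise RuntimeError("Responses stream ended without a completed event")
    return completed
```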
all-hands-bot
left a comment
The streaming implementation looks solid overall! I found three areas worth discussing: a redundant condition check, handling of non-streaming responses when streaming is requested, and error handling for the callback. Details in inline comments below.
🟡 Suggestion: The on_token callback is invoked without error handling. If the callback raises an exception, the entire streaming response would fail and the completed response would be lost.
Consider wrapping the on_token call in a try-except block to make the system more robust:
```python
try:
    on_token(
        ModelResponseStream(
            choices=[
                StreamingChoices(
                    delta=Delta(content=delta)
                )
            ]
        )
    )
except Exception as e:
    # Log the error but don't fail the entire request
    logger.warning(f"on_token callback failed: {e}")
```
This way, a faulty callback won't prevent the agent from receiving the completed response.
Good point. For now I’m intentionally not swallowing on_token callback exceptions so behavior matches the existing Chat Completions streaming path (where callback exceptions currently propagate).
If we want “best-effort streaming” (callback failures don’t abort the request), I’d prefer to implement that consistently for both LLM.completion() and LLM.responses() (and log a warning / rate-limit), rather than making Responses special-cased.
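If we do go that route later, a shared helper roughly like this (hypothetical, not part of this PR) could be reused by both paths so the behavior stays consistent:

```python
import logging
from typing import Callable

from litellm.types.utils import ModelResponseStream

logger = logging.getLogger(__name__)


def call_on_token_safely(
    on_token: Callable[[ModelResponseStream], None],
    chunk: ModelResponseStream,
) -> None:
    """Invoke the user callback best-effort: log and continue on failure."""
    try:
        on_token(chunk)
    except Exception:
        logger.warning("on_token callback raised; continuing the stream", exc_info=True)
```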
Avoid per-call imports in the Responses streaming path; rely on LiteLLM's typed event classes.
In subscription mode, Responses stream=True can be forced by options even when the caller didn’t request streaming, so keep gating on user_enable_streaming but avoid redundant checks inside the loop.
If streaming was requested but LiteLLM returns a non-streaming ResponsesAPIResponse, emit a warning so behavior is explicit.
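Putting those three notes together, the gating could look roughly like this (a sketch only: call_responses and user_enable_streaming are illustrative names, the import path is assumed, and it reuses the hypothetical drain_responses_stream helper from the earlier sketch):

```python
import logging

import litellm
from litellm.types.llms.openai import ResponsesAPIResponse  # import path assumed

logger = logging.getLogger(__name__)


def call_responses(user_enable_streaming, on_token, **request_kwargs):
    """Subscription mode may force stream=True in request_kwargs even when the
    caller did not ask for streaming, so delta forwarding is keyed off
    user_enable_streaming rather than the wire-level stream flag."""
    raw = litellm.responses(**request_kwargs)

    # Streaming was requested but LiteLLM handed back a plain response:
    # warn so the behavior is explicit, then return it unchanged.
    if isinstance(raw, ResponsesAPIResponse):
        if user_enable_streaming:
            logger.warning(
                "stream=True was requested but LiteLLM returned a "
                "non-streaming ResponsesAPIResponse"
            )
        return raw

    # Streaming iterator: drain it, forwarding deltas only if the user opted in.
    callback = on_token if user_enable_streaming else (lambda _chunk: None)
    return drain_responses_stream(raw, callback)
```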
Enable LLM stream mode and wire a simple token callback so users can see incremental output when using ChatGPT subscription login.
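The token callback for the example can be as small as a print loop over the chunk deltas (sketch only; how the callback is actually registered with the LLM in the example is not shown here):

```python
from litellm.types.utils import ModelResponseStream


def print_token(chunk: ModelResponseStream) -> None:
    """Print streamed text deltas as they arrive, without newlines."""
    for choice in chunk.choices:
        if choice.delta is not None and choice.delta.content:
            print(choice.delta.content, end="", flush=True)
```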
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
Merged commit fef7d02 into feat/openai-subscription-auth

Follow-up to #1682.
Allow LLM.responses() to run with stream=true when requested.
Forward text deltas to on_token and return the completed ResponsesAPIResponse so agent execution can continue.
Tested locally by running examples/01_standalone_sdk/34_subscription_login.py in subscription mode.