Docker: nvcr.io/nvidia/tritonserver:23.04-py3
GPU: A100
How can I stop bi-directional streaming (decoupled mode)?
- I want to stop model inference (the streaming response) when the user disconnects, or based on certain conditions, but I don't know how to do that at the moment.

References:
- https://github.com/triton-inference-server/server/issues/4344
- https://github.com/triton-inference-server/server/issues/5833#issuecomment-1561318646
Reproduced Steps
-
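As a sketch of the pattern being asked for (this is illustrative Python, not the Triton API: `generate_stream`, `cancel_event`, and `send` are hypothetical names), the server-side generation loop of a decoupled model would need to check a cancellation flag on every step and stop sending responses once the client disconnects or some stop condition is met:

```python
# Hypothetical sketch of early-stopping a decoupled (bi-directional
# streaming) response: the generation loop polls a cancellation flag
# each step and exits as soon as it is set, instead of running to
# max_new_tokens. Names here are illustrative, not Triton API.
import threading


def generate_stream(max_new_tokens, cancel_event, send):
    """Emit one token per step until max_new_tokens or cancellation."""
    for step in range(max_new_tokens):
        if cancel_event.is_set():   # user disconnected / stop condition
            send("[cancelled]")     # final response closing the stream
            return step             # number of tokens actually produced
        send(f"token-{step}")
    return max_new_tokens


# Usage: the response callback requests cancellation after 3 tokens,
# simulating a client that goes away mid-stream.
received = []
cancel = threading.Event()


def on_response(tok):
    received.append(tok)
    if len(received) == 3:          # e.g. connection dropped
        cancel.set()


produced = generate_stream(max_new_tokens=10,
                           cancel_event=cancel,
                           send=on_response)
print(produced)  # → 3 (stopped well before max_new_tokens)
```

The key design point is that cancellation must be observed inside the model's own send loop; in the 23.04 release referenced above there is no built-in way for the client to propagate such a signal into an in-flight decoupled request.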
I am hitting a similar problem.
If the FT (FasterTransformer) server encounters a stop token during generation while the number of tokens generated so far is still shorter than max_new_tokens, it keeps streaming the same result instead of stopping.
client.stop_stream() is called, but it blocks until the result length reaches max_new_tokens.
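The behavior being reported can be sketched in a few lines of illustrative Python (not FasterTransformer code; `generate` and its parameters are hypothetical): generation should break as soon as the stop token is produced, rather than continuing to stream until max_new_tokens is reached.

```python
# Minimal sketch of the expected stop-token handling: the loop ends
# the stream immediately when the stop token appears, instead of
# padding/repeating output until max_new_tokens.
def generate(tokens, stop_token, max_new_tokens):
    out = []
    for tok in tokens[:max_new_tokens]:
        out.append(tok)
        if tok == stop_token:   # break immediately; do not keep streaming
            break
    return out


# The stream ends at the stop token (id 0), before max_new_tokens=8.
print(generate([5, 7, 0, 7, 7, 7], stop_token=0, max_new_tokens=8))
# → [5, 7, 0]
```

In the blocking case described above, the server instead behaves as if the `break` were absent, so the client-side stop_stream() cannot return until max_new_tokens responses have been flushed.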