
Can I stop execution? (w/ decoupled mode) #162

Open
Yeom opened this issue Aug 21, 2023 · 1 comment
Labels
bug Something isn't working

Comments

Yeom commented Aug 21, 2023

Description

Docker: nvcr.io/nvidia/tritonserver:23.04-py3
Gpu: A100

How can I stop bi-directional streaming (decoupled mode)?
- I want to stop model inference (the streaming response) when the user disconnects or when certain conditions are met, but I don't know how to do that at the moment.
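One client-side piece of this is detecting the stop condition inside the decoupled stream callback. The sketch below is a hypothetical pattern, not Triton's own API: the actual `tritonclient.grpc` calls (`start_stream(callback)`, `async_stream_infer(...)`, `stop_stream()`) are only referenced in comments, and `should_stop` is an assumed user-supplied predicate. Note that stopping the client stream does not necessarily stop server-side generation on older Triton releases.

```python
# Hypothetical sketch: a callback object that stops consuming a
# decoupled stream once a user-defined condition is met (e.g. a
# disconnect flag or a stop token). Plain stdlib Python; the real
# Triton client calls appear only as comments.
import threading

class StreamStopper:
    """Collects streamed responses and flips an Event when a
    user-supplied condition is seen in a response."""

    def __init__(self, should_stop):
        self.should_stop = should_stop      # predicate over one response
        self.stopped = threading.Event()
        self.responses = []

    def __call__(self, result, error=None):
        # Triton's gRPC stream invokes the callback as callback(result, error).
        if error is not None or self.stopped.is_set():
            self.stopped.set()
            return
        self.responses.append(result)
        if self.should_stop(result):
            # Signal the main thread; it would then call
            # client.stop_stream() to tear the stream down.
            self.stopped.set()

# Demo with fake string tokens standing in for server responses:
stopper = StreamStopper(should_stop=lambda r: r == "<eos>")
for token in ["hel", "lo", "<eos>", "ignored"]:
    if stopper.stopped.is_set():
        break
    stopper(token)
print(stopper.responses)  # ['hel', 'lo', '<eos>']
```

The main thread polls (or waits on) `stopper.stopped` and, once set, calls `client.stop_stream()`; any responses arriving after that are dropped by the early-return guard in the callback.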


Reference
- https://github.com/triton-inference-server/server/issues/4344
- https://github.com/triton-inference-server/server/issues/5833#issuecomment-1561318646

Reproduced Steps

-
Yeom added the bug label Aug 21, 2023
@shanekong

I am hitting a similar problem. If the FT server encounters a stop token during generation while the number of already-generated tokens is still shorter than max_new_tokens, the FT server keeps replying with the same result and does not stop the streaming.

client.stop_stream() is called, but it blocks until the result's length equals max_new_tokens.

Is there any way to get out of this?
