
Can I stop execution? (w/ decoupled mode) #162

Open
Yeom opened this issue Aug 21, 2023 · 1 comment
Labels
bug Something isn't working

Comments

Yeom commented Aug 21, 2023

Description

Docker: nvcr.io/nvidia/tritonserver:23.04-py3
Gpu: A100

How can I stop bi-directional streaming (decoupled mode)?
- I want to stop model inference (the streaming response) when the user disconnects or when certain conditions are met, but I don't know how to do that at the moment.
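One client-side piece of this is detecting the stop condition inside the decoupled stream callback. The sketch below is a hypothetical pattern, not Triton's own API: the actual `tritonclient.grpc` calls (`start_stream(callback)`, `async_stream_infer(...)`, `stop_stream()`) are only referenced in comments, and `should_stop` is an assumed user-supplied predicate. Note that stopping the client stream does not necessarily stop server-side generation on older Triton releases.

```python
# Hypothetical sketch: a callback object that stops consuming a
# decoupled stream once a user-defined condition is met (e.g. a
# disconnect flag or a stop token). Plain stdlib Python; the real
# Triton client calls appear only as comments.
import threading

class StreamStopper:
    """Collects streamed responses and flips an Event when a
    user-supplied condition is seen in a response."""

    def __init__(self, should_stop):
        self.should_stop = should_stop      # predicate over one response
        self.stopped = threading.Event()
        self.responses = []

    def __call__(self, result, error=None):
        # Triton's gRPC stream invokes the callback as callback(result, error).
        if error is not None or self.stopped.is_set():
            self.stopped.set()
            return
        self.responses.append(result)
        if self.should_stop(result):
            # Signal the main thread; it would then call
            # client.stop_stream() to tear the stream down.
            self.stopped.set()

# Demo with fake string tokens standing in for server responses:
stopper = StreamStopper(should_stop=lambda r: r == "<eos>")
for token in ["hel", "lo", "<eos>", "ignored"]:
    if stopper.stopped.is_set():
        break
    stopper(token)
print(stopper.responses)  # ['hel', 'lo', '<eos>']
```

The main thread polls (or waits on) `stopper.stopped` and, once set, calls `client.stop_stream()`; any responses arriving after that are dropped by the early-return guard in the callback.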


Reference
- https://github.com/triton-inference-server/server/issues/4344
- https://github.com/triton-inference-server/server/issues/5833#issuecomment-1561318646

Reproduced Steps

-
Yeom added the bug label Aug 21, 2023
@shanekong

I am hitting a similar problem. If the FT server encounters a stop token during generation while the number of already-generated tokens is still shorter than max_new_tokens, the FT server keeps replying with the same result and does not stop the streaming.

client.stop_stream() is called, but it blocks until the result's length equals max_new_tokens.

Is there any way to get out of this?
