Inconsistent Batch Index Order in Decoupled Mode with trt-llm #2777

Open
Oldpan opened this issue Feb 12, 2025 · 8 comments
Labels
bug Something isn't working

Comments

@Oldpan

Oldpan commented Feb 12, 2025

System Info

Question:

While using trt-llm (tensorrt_llm 0.17.0.dev2024121700 + Triton) with decoupled mode enabled and a batch size greater than 1, I observed an issue where the batch_index in the returned data does not always match the expected order of inputs.

For example, if I send [A, B, C] as one batch (batch size = 3), I expect the model to return [a, b, c] in the same order. However, in some cases the returned batch indices are shuffled:

data: {"batch_index":2, "model_name":"pinyin_nougat", "model_version":"1", "sequence_end":false, "sequence_id":0, "sequence_start":false, "text_output":"b"}
data: {"batch_index":1, "model_name":"pinyin_nougat", "model_version":"1", "sequence_end":false, "sequence_id":0, "sequence_start":false, "text_output":"c"}
data: {"batch_index":0, "model_name":"pinyin_nougat", "model_version":"1", "sequence_end":false, "sequence_id":0, "sequence_start":false, "text_output":"a"}

Here, the batch_index is incorrect, leading to an unexpected order of results.

Is this behavior expected in decoupled mode, or is there a way to ensure the output follows the correct sequence order?

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

none

Expected behavior

The returned batch_index values match the order of the inputs in the batch.

Actual behavior

The returned batch_index values are shuffled relative to the inputs.

Additional notes

none

Oldpan added the bug label on Feb 12, 2025
@MahmoudAshraf97
Contributor

I guess results are returned in the order they complete (first completed, first returned); nothing states that the return order is guaranteed to match the input order.

@Oldpan
Author

Oldpan commented Feb 12, 2025

I guess results are returned in the order they complete (first completed, first returned); nothing states that the return order is guaranteed to match the input order.

Does this mean we can only send one request at a time from the client, and must also batch the requests on the trt-llm side?

@MahmoudAshraf97
Contributor

You can send requests as usual; just make sure you either sort the results or check that the result you pick out of all the results is the one you want. Don't assume they are ordered.
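
For illustration, a minimal sketch of that approach against the HTTP generate_stream endpoint shown in the question, keyed by batch_index instead of arrival order. The URL, payload fields, and batch size below are placeholders, not the exact request your pipeline sends:

```python
import json
import requests

# Placeholder endpoint/payload; use your model's actual input names and values.
URL = "http://localhost:8000/v2/models/pinyin_nougat/generate_stream"
payload = {"text_input": ["A", "B", "C"], "stream": True, "max_tokens": 64}
BATCH_SIZE = 3

# Accumulate streamed chunks per batch_index instead of trusting arrival order.
chunks_by_index = {}
with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        event = json.loads(line[len(b"data: "):])
        chunks_by_index.setdefault(event["batch_index"], []).append(event["text_output"])

# Reassemble the outputs in the original input order.
ordered_outputs = ["".join(chunks_by_index.get(i, [])) for i in range(BATCH_SIZE)]
print(ordered_outputs)
```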

@Oldpan
Author

Oldpan commented Feb 12, 2025

You can send requests as usual; just make sure you either sort the results or check that the result you pick out of all the results is the one you want. Don't assume they are ordered.

Thanks for the reply.
Do I have to send the requests one by one, in order? My scenario involves a VLM model where an encoder is used, and batching is most efficient there. After the encoder, a batch of features and the corresponding input_ids are passed into tensorrt_llm together, so the batch of requests is sent simultaneously and I need the batch results from trt_llm to match the inputs.
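
For illustration, a rough sketch of one way to keep the input-to-output mapping explicit: send each item as its own streaming request tagged with a request_id over the gRPC client, and key the responses by that id rather than by batch_index or arrival order. The tensor names, shapes, and model name below are made-up placeholders, not the backend's actual contract:

```python
import numpy as np
import tritonclient.grpc as grpcclient

collected = {}  # request_id -> list of streamed text chunks

def callback(result, error):
    if error is not None:
        print("stream error:", error)
        return
    # Match each streamed response to its request via the request_id set below.
    rid = result.get_response().id
    text = result.as_numpy("text_output")
    if text is not None:
        piece = text.flatten()[0]
        collected.setdefault(rid, []).append(
            piece.decode() if isinstance(piece, bytes) else str(piece)
        )

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

# Placeholder encoder features and input_ids for a batch of 3.
features = np.random.rand(3, 16, 1024).astype(np.float32)
input_ids = np.ones((3, 8), dtype=np.int32)

for i in range(features.shape[0]):
    # Hypothetical input tensor names; use the ones from your model config.
    feat_in = grpcclient.InferInput("encoder_input_features", [1, 16, 1024], "FP32")
    feat_in.set_data_from_numpy(features[i:i + 1])
    ids_in = grpcclient.InferInput("input_ids", [1, 8], "INT32")
    ids_in.set_data_from_numpy(input_ids[i:i + 1])
    client.async_stream_infer(
        model_name="tensorrt_llm",
        inputs=[feat_in, ids_in],
        outputs=[grpcclient.InferRequestedOutput("text_output")],
        request_id=str(i),  # ties every streamed response back to input i
    )

# Real code should wait for each request's final response before closing the stream.
client.stop_stream()
ordered = ["".join(collected.get(str(i), [])) for i in range(features.shape[0])]
```

Sending the items as separate requests still lets the backend batch them in flight, while the request_id keeps the correspondence to the original inputs unambiguous.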

@MahmoudAshraf97
Contributor

MahmoudAshraf97 commented Feb 12, 2025

I guess TRT-LLM already supports VLMs in the executor API, so you don't need to have two separate models and handle the batching yourself.

@Oldpan
Author

Oldpan commented Feb 12, 2025

@MahmoudAshraf97
@MahmoudAshraf97
To my knowledge, VLM (multimodal) support has not yet been integrated into the executor. In the triton-tensorrt-llm-backend, the encoder model runs here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/multimodal/multimodal_encoders/1/model.py. What you mentioned about being integrated into the executor might refer to the enc-dec models.
If I’m wrong, please correct me...

@MahmoudAshraf97
Contributor

I did try the executor API with enc-dec speech models, and it accepts encoder input, encoder input ids, and decoder input ids, so I don't see why VLMs wouldn't be supported. I don't know about Triton, tbh.

@Oldpan
Author

Oldpan commented Feb 13, 2025

I'll try to find a way around this issue. If I make any progress, I'll update here.
@byshiue Can you help me check this issue? Thanks.
