
[Question] Why read generation config in every decode step? #2150

Closed
gesanqiu opened this issue Apr 17, 2024 · 4 comments
Labels
question Question about the usage

Comments

@gesanqiu
Contributor

gesanqiu commented Apr 17, 2024

❓ General Questions

In every DecodeStep(), SampleTokenFromLogits() is called to sample from the logits, and it reads the generation config each time, which can become a bottleneck on devices with poor CPU performance. On my Jetson AGX Orin 64GB, reading out just 6 variables takes about 20 ms, while the forward pass of a 3B model takes only 6 ms. Since this happens in every decode step, it becomes a fixed per-token cost.
My question is: in which case would we actually need different generation parameters across decode steps? I always use the same generation parameters for a single request.
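For illustration, here is a minimal sketch of the pattern being described; this is hypothetical code, not the actual mlc-llm implementation, and the field names are assumptions:

```cpp
// Hypothetical sketch of the per-step cost described above: every decode
// step re-reads the sampling parameters out of the generation-config JSON,
// so the read cost (~20 ms here) is paid per generated token on top of the
// ~6 ms model forward pass.
#include <string>
#include "picojson.h"

void DecodeStep(const std::string& generation_config_str) {
  picojson::value config;
  std::string err = picojson::parse(config, generation_config_str);
  if (!err.empty()) return;  // malformed config
  // These lookups run on every single decode step -- the reported bottleneck.
  double temperature = config.get("temperature").get<double>();
  double top_p = config.get("top_p").get<double>();
  // ... four more fields read the same way ...
  // SampleTokenFromLogits(logits, temperature, top_p, /* ... */);
}
```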

BTW, according to nativejson-benchmark, picojson is not the state-of-the-art C++ JSON library, and arguably not even a first-class one.

@gesanqiu gesanqiu added the question Question about the usage label Apr 17, 2024
@MasterJH5574
Member

Thank you @gesanqiu for reporting this finding. My first impression is that this is likely a bug that needs a fix. We will discuss and see how we can address it. We do not need to read the config every time.

@gesanqiu
Contributor Author

gesanqiu commented Apr 21, 2024

> Thank you @gesanqiu for reporting this finding. My first impression is that this is likely a bug that needs a fix. We will discuss and see how we can address it. We do not need to read the config every time.

Update: My MLC-LLM checkout is still at the February 21 revision. My colleague told me that MLC-LLM now supports concurrent generation requests in the server, so I will spend some time on that module.

Thanks for your reply @MasterJH5574. After further investigation, I found that mlc-llm doesn't have a scheduler to manage concurrent generation requests (assuming these requests have different generation configs), so to apply the right sampling parameters to each separate generation request, the llm_chat object has to read the generation config in every forward procedure.

I have already moved all the generation-parameter reads into ReadGenerationConfig() and call it only once in PrefillStep() to set them up.
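A hedged sketch of that change; the surrounding class and the six parameter names are assumptions, and the real llm_chat members may differ:

```cpp
// Sketch of the workaround: read the generation config once during prefill,
// cache the values as members, and let every decode step use the cache.
// ReadGenerationConfig, PrefillStep, and DecodeStep follow the comment above;
// the class itself is hypothetical.
class LLMChat {
 public:
  void PrefillStep(/* prompt, ... */) {
    ReadGenerationConfig();  // parse the six sampling fields exactly once
    // ... run the prefill forward pass ...
  }

  void DecodeStep(/* logits, ... */) {
    // Sampling reuses the members cached during prefill; no JSON reads here.
    SampleTokenFromLogits(temperature_, top_p_, repetition_penalty_,
                          presence_penalty_, frequency_penalty_, seed_);
  }

 private:
  void ReadGenerationConfig();  // fills the cached members from the config
  void SampleTokenFromLogits(double temperature, double top_p,
                             double repetition_penalty,
                             double presence_penalty,
                             double frequency_penalty, int seed);

  double temperature_ = 1.0, top_p_ = 0.95, repetition_penalty_ = 1.0;
  double presence_penalty_ = 0.0, frequency_penalty_ = 0.0;
  int seed_ = -1;
};
```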

@gesanqiu
Contributor Author

@MasterJH5574 Any progress on this issue? I'd love to fix it, but I think we need a more robust design: like other frameworks, we probably need a SamplingParams class. If you have any ideas, please let me know.
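For reference, a hypothetical sketch of what such a per-request SamplingParams design could look like (modeled on frameworks such as vLLM; all names here are assumptions):

```cpp
// Hypothetical SamplingParams design: each request carries its own sampling
// parameters, parsed once at admission, so a scheduler can interleave
// requests with different configs without any per-step config reads.
#include <cstdint>
#include <string>

struct SamplingParams {
  double temperature = 1.0;
  double top_p = 0.95;
  double repetition_penalty = 1.0;
  double presence_penalty = 0.0;
  double frequency_penalty = 0.0;
  int64_t seed = -1;
};

struct Request {
  int64_t request_id;
  std::string prompt;
  SamplingParams params;  // fixed for the lifetime of the request
};

// The sampler then takes the per-request params directly, e.g.:
//   int32_t token = SampleTokenFromLogits(logits, request.params);
```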

@tqchen
Contributor

tqchen commented May 11, 2024

The latest MLCEngine should support concurrent generation and read the config only once; see #2217.

@tqchen tqchen closed this as completed May 11, 2024