
[Question] Why read generation config in every decode step? #2150

Closed
gesanqiu opened this issue Apr 17, 2024 · 4 comments
Labels
question Question about the usage

Comments

@gesanqiu
Contributor

gesanqiu commented Apr 17, 2024

❓ General Questions

In every DecodeStep(), SampleTokenFromLogits() is called to sample from the logits, and it reads the generation config each time, which can become a bottleneck on devices with poor CPU performance. On my Jetson AGX Orin 64GB, reading out just 6 variables takes about 20 ms, while the forward pass of a 3B model takes only 6 ms. Since this happens in every decode step, it becomes a fixed per-token cost.
My question is: in which case would we actually need different generation parameters across decode steps? I always use the same generation parameters for a single request.
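For illustration, here is a minimal sketch of the pattern being described; this is hypothetical code, not the actual mlc-llm implementation, and the field names are assumptions:

```cpp
// Hypothetical sketch of the per-step cost described above: every decode
// step re-reads the sampling parameters out of the generation-config JSON,
// so the read cost (~20 ms here) is paid per generated token on top of the
// ~6 ms model forward pass.
#include <string>
#include "picojson.h"

void DecodeStep(const std::string& generation_config_str) {
  picojson::value config;
  std::string err = picojson::parse(config, generation_config_str);
  if (!err.empty()) return;  // malformed config
  // These lookups run on every single decode step -- the reported bottleneck.
  double temperature = config.get("temperature").get<double>();
  double top_p = config.get("top_p").get<double>();
  // ... four more fields read the same way ...
  // SampleTokenFromLogits(logits, temperature, top_p, /* ... */);
}
```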

BTW, according to nativejson-benchmark, picojson is not the state-of-the-art C++ JSON library, and arguably not even a first-class one.

@gesanqiu gesanqiu added the question Question about the usage label Apr 17, 2024
@MasterJH5574
Member

Thank you @gesanqiu for reporting this finding. My first impression is that this is likely a bug that needs a fix. We will discuss and see how we can address it. We do not need to read the config every time.

@gesanqiu
Contributor Author

gesanqiu commented Apr 21, 2024

> Thank you @gesanqiu for reporting this finding. My first impression is that this is likely a bug that needs a fix. We will discuss and see how we can address it. We do not need to read the config every time.

Update: My MLC-LLM checkout is still at the February 21 revision. My colleague told me that MLC-LLM now supports concurrent generation requests in the server, so I will spend some time on that module.

Thanks for your reply @MasterJH5574. After further investigation, I found that mlc-llm doesn't have a scheduler to manage concurrent generation requests (assuming these requests have different generation configs), so to apply the right sampling parameters to each separate generation request, the llm_chat object has to read the generation config in every forward procedure.

I have already moved all the generation-parameter reads into ReadGenerationConfig() and call it only once in PrefillStep() to set them up.
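A hedged sketch of that change; the surrounding class and the six parameter names are assumptions, and the real llm_chat members may differ:

```cpp
// Sketch of the workaround: read the generation config once during prefill,
// cache the values as members, and let every decode step use the cache.
// ReadGenerationConfig, PrefillStep, and DecodeStep follow the comment above;
// the class itself is hypothetical.
class LLMChat {
 public:
  void PrefillStep(/* prompt, ... */) {
    ReadGenerationConfig();  // parse the six sampling fields exactly once
    // ... run the prefill forward pass ...
  }

  void DecodeStep(/* logits, ... */) {
    // Sampling reuses the members cached during prefill; no JSON reads here.
    SampleTokenFromLogits(temperature_, top_p_, repetition_penalty_,
                          presence_penalty_, frequency_penalty_, seed_);
  }

 private:
  void ReadGenerationConfig();  // fills the cached members from the config
  void SampleTokenFromLogits(double temperature, double top_p,
                             double repetition_penalty,
                             double presence_penalty,
                             double frequency_penalty, int seed);

  double temperature_ = 1.0, top_p_ = 0.95, repetition_penalty_ = 1.0;
  double presence_penalty_ = 0.0, frequency_penalty_ = 0.0;
  int seed_ = -1;
};
```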

@gesanqiu
Contributor Author

@MasterJH5574 Any progress on this issue? I'd love to fix it, but I think we need a more robust design: like other frameworks, we probably need a SamplingParams class. If you have any ideas, please let me know.
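For reference, a hypothetical sketch of what such a per-request SamplingParams design could look like (modeled on frameworks such as vLLM; all names here are assumptions):

```cpp
// Hypothetical SamplingParams design: each request carries its own sampling
// parameters, parsed once at admission, so a scheduler can interleave
// requests with different configs without any per-step config reads.
#include <cstdint>
#include <string>

struct SamplingParams {
  double temperature = 1.0;
  double top_p = 0.95;
  double repetition_penalty = 1.0;
  double presence_penalty = 0.0;
  double frequency_penalty = 0.0;
  int64_t seed = -1;
};

struct Request {
  int64_t request_id;
  std::string prompt;
  SamplingParams params;  // fixed for the lifetime of the request
};

// The sampler then takes the per-request params directly, e.g.:
//   int32_t token = SampleTokenFromLogits(logits, request.params);
```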

@tqchen
Contributor

tqchen commented May 11, 2024

The latest MLCEngine should support concurrent generation and read the config only once; see #2217.

@tqchen tqchen closed this as completed May 11, 2024