[Question] Why read generation config in every decode step? #2150
Comments
Thank you @gesanqiu for reporting this finding. My first impression is that this is likely a bug and needs a fix. We will discuss and see how we can address it. We do not need to read the config every time.
Update: my MLC-LLM checkout is still from February 21st. My colleague told me that MLC-LLM already supports concurrent generation requests in the server now, so I will spend some time on this module. Thanks for your reply @MasterJH5574. After further investigation, I found that mlc-llm doesn't have a scheduler to manage concurrent generation requests (assuming these requests have different generation configs), so if we want the right sampling parameters for every separate generation request, the llm_chat object has to read the generation config in every forward procedure. I already put all the generation parameters into …
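A hypothetical sketch of the per-request handling described in this comment, assuming a cache keyed by request id so each concurrent request keeps its own parsed config without re-reading JSON on every forward step; RequestConfigCache, GenerationConfig, and ParseGenerationConfig are illustrative names, not existing MLC-LLM classes:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct GenerationConfig {
  double temperature = 0.7;
  double top_p = 0.95;
  // ... the handful of sampling variables currently read on every decode step
};

// Placeholder for the expensive JSON read (~20 ms on Jetson AGX Orin in the
// report above); a real implementation would parse the request's config string.
GenerationConfig ParseGenerationConfig(const std::string& /*config_json*/) {
  return GenerationConfig();
}

class RequestConfigCache {
 public:
  // Parses the config on the first call for a request id and reuses it after,
  // so every forward step of that request sees the same parameters for free.
  const GenerationConfig& Get(int64_t request_id, const std::string& config_json) {
    auto it = cache_.find(request_id);
    if (it == cache_.end()) {
      it = cache_.emplace(request_id, ParseGenerationConfig(config_json)).first;
    }
    return it->second;
  }

  // Drop the entry once the request finishes generating.
  void Remove(int64_t request_id) { cache_.erase(request_id); }

 private:
  std::unordered_map<int64_t, GenerationConfig> cache_;
};
```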
@MasterJH5574 Any progress on this issue? I'd love to fix it, but I think we need a more robust design; like other frameworks, we probably need a SamplingParams class. If you have any ideas, please let me know.
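A rough sketch of what such a SamplingParams class could look like, assuming picojson (the library the codebase currently uses) for parsing; the field set, defaults, and function names are illustrative rather than a proposed final API. The point is that the JSON is parsed exactly once into a plain struct, and only the struct travels with the request afterwards:

```cpp
#include <picojson.h>
#include <string>

struct SamplingParams {
  double temperature = 0.7;
  double top_p = 0.95;
  double repetition_penalty = 1.0;
  int max_gen_len = 512;
};

// Parse the request's generation config once; later decode steps only touch
// the resulting struct, never the JSON text.
SamplingParams ParseSamplingParams(const std::string& config_json) {
  SamplingParams params;
  picojson::value v;
  std::string err = picojson::parse(v, config_json);
  if (!err.empty() || !v.is<picojson::object>()) {
    return params;  // fall back to defaults on malformed input
  }
  const picojson::object& obj = v.get<picojson::object>();
  auto read_double = [&obj](const char* key, double& out) {
    auto it = obj.find(key);
    if (it != obj.end() && it->second.is<double>()) out = it->second.get<double>();
  };
  read_double("temperature", params.temperature);
  read_double("top_p", params.top_p);
  read_double("repetition_penalty", params.repetition_penalty);
  auto it = obj.find("max_gen_len");
  if (it != obj.end() && it->second.is<double>()) {
    params.max_gen_len = static_cast<int>(it->second.get<double>());
  }
  return params;
}
```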
The latest MLCEngine should support concurrent generation and read the config only once, see #2217.
❓ General Questions
In every DecodeStep(), SampleTokenFromLogits() is called to sample from the logits, and it reads the generation config, which may become a bottleneck on devices with poor CPU performance. On my Jetson AGX Orin 64GB, it takes about 20 ms just to read out 6 variables, while a forward pass of a 3B model takes only 6 ms. Since this happens in every decode step, it becomes a fixed per-token cost.
My question is: in which cases would we need different generation parameters within one request? I always use the same generation parameters for a single request.
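For illustration, here is a minimal, self-contained sketch of the pattern implied by the question: parse the generation config once per request and pass the resulting struct into the sampler, so no JSON read sits on the per-token hot path. All names here (SamplingConfig, DecodeStep, SampleTokenFromLogits) are stand-in stubs for illustration, not the actual llm_chat.cc code:

```cpp
#include <cstdint>
#include <random>
#include <vector>

struct SamplingConfig {
  double temperature = 0.7;
  double top_p = 0.95;
  double repetition_penalty = 1.0;
  // ... remaining generation parameters, filled in once from the request JSON
};

// Stub standing in for one forward pass of the model (~6 ms for a 3B model).
std::vector<float> DecodeStep(int32_t /*last_token*/) {
  return std::vector<float>(32000, 0.0f);  // fake logits over the vocabulary
}

// The sampler takes the already-parsed config by const reference, so no JSON
// parsing (~20 ms on Jetson AGX Orin) happens inside the decode loop.
int32_t SampleTokenFromLogits(const std::vector<float>& logits,
                              const SamplingConfig& config, std::mt19937& rng) {
  (void)config;  // a real sampler would apply temperature / top_p here
  std::uniform_int_distribution<int32_t> dist(
      0, static_cast<int32_t>(logits.size()) - 1);
  return dist(rng);  // placeholder: uniform sampling just to keep the sketch runnable
}

int main() {
  SamplingConfig config;  // parsed once per request, before the decode loop
  std::mt19937 rng(0);
  int32_t token = 1;      // e.g. the BOS token
  for (int step = 0; step < 16; ++step) {
    std::vector<float> logits = DecodeStep(token);
    token = SampleTokenFromLogits(logits, config, rng);
  }
  return 0;
}
```

With the ~20 ms config read paid once per request instead of once per token, the per-step cost would be dominated by the ~6 ms forward pass.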
BTW, according to nativejson-benchmark, picojson is not the state-of-the-art C++ JSON library, and not even a first-class one.