Subsequent prompts are around 10x to 12x slower than in the llama.cpp "main" example. #181
Comments
@Firstbober so you're using the low-level API to implement a Python version of the main chat example? Do you mind sharing a snippet? Also, I'm not sure if you've tried @SagsMug's example; is it also slower?
I am using the high-level API, but tomorrow I will check whether the example from @SagsMug is also slower. Thanks for letting me know!
@Firstbober check out this discussion: #49 (comment). py-spy should be able to tell you where the slowdown is coming from; please post the SVG if possible.
From this test, I can tell there really is no difference in token generation speed. In my project, I call the high-level API every time a new message comes in. After reading and inserting prints in the code of https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/low_level_api_llama_cpp.py, I found that the initial prompt evaluation is where most of the time goes. I am not sure if this is entirely a problem of the high-level API or just intended behavior, but it seems that it resets most of the things in memory on every call (see the attached recording: twitter-10-05-2023.webm).
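To illustrate the usage pattern described above, here is a minimal sketch (the model path, prompt format, and parameters are placeholders, not the actual project code):

```python
from llama_cpp import Llama

# Sketch of the pattern described above: the full conversation is re-sent
# as the prompt on every turn, so the whole context gets re-evaluated on
# each call unless previously computed state can be reused.
llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=512)

history = ""

def reply(user_message: str) -> str:
    global history
    history += f"\nUser: {user_message}\nAssistant:"
    out = llm(history, max_tokens=256, stop=["User:"])
    text = out["choices"][0]["text"]
    history += text
    return text
```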
Hello, how can I run it without the prefix, just in API mode?
I achieved my goal. Now subsequent prompts within the same, previously given context don't slow generation down nearly as much; I can't really tell the difference from the llama.cpp “main” example now! For starters, I call my own function instead of going through the high-level API. I still have to test some things to make sure that I am not leaving any performance on the table, but I am pretty content with the result. Also, I just checked whether the high-level API can do the same thing without passing the entire context as a prompt. It turns out it can't, which makes me sad and grateful at the same time.
@Firstbober did you make it exactly the same as main.cpp? Can you please share your Python code?
I will upload it to GitHub in the upcoming days, once I integrate everything into my project.
@Firstbober can you indicate which part you changed?
I just call llama_eval on all the context tokens and then cache everything else in my class. That is the core of my changes.
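(Not @Firstbober's actual code.) A rough sketch of that prefix-caching idea, with hypothetical helper callables standing in for the low-level llama_tokenize/llama_eval bindings, whose exact signatures vary between versions:

```python
class CachedContext:
    """Evaluate only the tokens that were not already fed to the model."""

    def __init__(self, tokenize, evaluate):
        self.tokenize = tokenize    # tokenize(text) -> list[int]
        self.evaluate = evaluate    # evaluate(tokens, n_past) -> None, wraps llama_eval
        self.cached_tokens: list[int] = []

    def feed(self, full_context: str) -> None:
        tokens = self.tokenize(full_context)
        # Length of the common prefix between what was already evaluated
        # and the new context.
        n_match = 0
        for old, new in zip(self.cached_tokens, tokens):
            if old != new:
                break
            n_match += 1
        # The KV cache already covers the first n_match tokens, so only
        # the new suffix needs another evaluation pass.
        suffix = tokens[n_match:]
        if suffix:
            self.evaluate(suffix, n_match)
        self.cached_tokens = tokens
```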
@Firstbober thank you, sir.
Oh interesting. I wonder if this is related to oobabooga/text-generation-webui#2088.
@AlphaAtlas I'm not sure they're related, as this issue predates the CUDA offloading merge. Do you mind opening a new issue here if there's a performance discrepancy? If you don't mind giving me a hand getting to the bottom of it, I'm very interested in fixing it.
@abetlen Yeah. I need to take care of some other things first, but then I will more formally test llama.cpp vs. llama-cpp-python with profiles like the one above, including tokens/sec for the same prompt, and post an issue.
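For anyone who wants to reproduce such a comparison, a rough tokens/sec measurement with the high-level API could look like this (the model path and prompt are placeholders):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=512)

prompt = "Explain what a KV cache is in one paragraph."
start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated} tokens in {elapsed:.2f} s -> {n_generated / elapsed:.1f} tok/s")
```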
I'm still seeing this issue with CPU inference. Even though there is a prefix match, I believe the model's latest response gets re-parsed, which is not the case when using llama.cpp/main or llama.cpp/server. I've even confirmed that it is fast with the llama-cpp-python example server.
@abetlen do you think it is feasible to make the high-level API do something similar to https://gist.github.com/Firstbober/d7f97e7f743a973c14425424e360eeda? It seems like this could improve the performance of oobabooga as well, which uses the high-level API.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I am creating a simple clone of the "main" example from the llama.cpp repo, which involves interactive mode with really fast inference of around 36 ms per token.
Current Behavior
Generating the first token takes around 10–12 seconds, and subsequent ones take around 200–300 ms. It should match the speed of the example from the llama.cpp repo.
Environment and Context
I am using a context size of 512, a prediction length of 256, and a batch size of 1024. The rest of the settings are defaults. I am also using CLBlast, which on llama.cpp gives me a 2.5x boost in performance.
Linux bober-desktop 6.3.1-x64v1-xanmod1-2 #1 SMP PREEMPT_DYNAMIC Sun, 07 May 2023 10:32:57 +0000 x86_64 GNU/Linux
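For reference, those settings would roughly map onto llama-cpp-python parameters like this (the model path is a placeholder, and CLBlast is enabled at build time rather than here):

```python
from llama_cpp import Llama

# n_ctx   -> context size of 512
# n_batch -> batch size of 1024
# max_tokens at call time -> prediction length of 256
llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=512, n_batch=1024)
out = llm("Hello", max_tokens=256)
```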