
Subsequent prompts are around 10x to 12x slower than on llama.cpp "main" example. #181

Closed
Firstbober opened this issue May 10, 2023 · 17 comments

Comments

@Firstbober

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am creating a simple clone of the "main" example from the llama.cpp repo, which involves interactive mode with really fast inference of around 36 ms per token.

Current Behavior

Generating the first token takes around 10–12 seconds and then subsequent ones take around 200-300 ms. It should match the speed of the example from the llama.cpp repo.

Environment and Context

I am using a context size of 512, a prediction length of 256, and a batch size of 1024. The rest of the settings are default. I am also using CLBlast, which on llama.cpp gives me a 2.5x boost in performance.
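
For reference, this is roughly how those settings map onto the high-level API (a sketch only; the model path and prompt below are placeholders, not my actual ones):

```python
from llama_cpp import Llama

# n_ctx / n_batch mirror the context and batch sizes mentioned above;
# max_tokens mirrors the prediction length.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=512, n_batch=1024)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=256)
print(out["choices"][0]["text"])
```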

  • AMD Ryzen 5 3600 6-Core Processor + RX 580 4 GB
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 3600 6-Core Processor
    CPU family:          23
    Model:               113
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  94%
    CPU max MHz:         4208,2031
    CPU min MHz:         2200,0000
    BogoMIPS:            7186,94
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es

Number of devices                                 1
  Device Name                                     gfx803
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2 
  Driver Version                                  3513.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         AMD Radeon RX 580 Series
  • Linux:

Linux bober-desktop 6.3.1-x64v1-xanmod1-2 #1 SMP PREEMPT_DYNAMIC Sun, 07 May 2023 10:32:57 +0000 x86_64 GNU/Linux

  • Versions:
Python 3.11.3

GNU Make 4.4.1
Built for x86_64-pc-linux-gnu

g++ (GCC) 13.1.1 20230429
@abetlen
Owner

abetlen commented May 10, 2023

@Firstbober as in, you're using the low-level API to implement a Python version of the main chat example? Do you mind sharing a snippet?

Also, I'm not sure if you've tried @SagsMug's example; is it also slower?

@Firstbober
Author

I am using the high-level API, but tomorrow I will check whether the example from @SagsMug is also slower. Thanks for letting me know!

@abetlen
Owner

abetlen commented May 10, 2023

@Firstbober check out this discussion: #49 (comment). py-spy should be able to tell you where the slowdown is coming from; please post the SVG if possible.

@Firstbober
Author

High level: record_hl (py-spy profile SVG)

Low level: record_ll (py-spy profile SVG)

From this test, I can tell there really is no difference in token generation speed. In my project, I call the high-level API every time a new message comes in. After reading the code of https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/low_level_api_llama_cpp.py and inserting prints into it, I found that the initial llama_eval takes most of the time, while the token generation itself is in line with the native ./main from llama.cpp.
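
The prints amounted to timing the llama_eval calls, roughly like this hypothetical helper (timed_eval is not part of the bindings, just an illustration of the measurement; it assumes the old-style llama_eval signature that still takes n_threads):

```python
import time
import llama_cpp

def timed_eval(ctx, tokens, n_past, n_threads):
    # Evaluate `tokens` starting at position n_past and print how long
    # the llama_eval call took.
    arr = (llama_cpp.llama_token * len(tokens))(*tokens)
    start = time.time()
    llama_cpp.llama_eval(ctx, arr, len(tokens), n_past, n_threads)
    print(f"llama_eval over {len(tokens)} tokens: {time.time() - start:.3f}s")
```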

I am not sure if this is entirely a problem of the high-level API or just intended behavior, but it seems that it resets most of its in-memory state on every __call__, while ./main doesn't. That would explain the faster subsequent responses there.

(video attachment: twitter-10-05-2023.webm)

@lucasjinreal

Hello, how can I run it without the prefix, just in API mode?

@Firstbober
Author

I achieved my goal. Subsequent prompts within the same, previously given context no longer slow generation down that much. I can't really tell the difference from the llama.cpp “main” example now!

For starters, I call my own function load_context, which takes an initial prompt, calls llama_eval on it, and adds its tokens to n_past.
Then there is the generate function, which takes a prompt and calls llama_eval on the full context again only if the tokens accumulated over multiple calls exceed n_ctx. Other than that, I only call llama_eval on the new prompt tokens and on the tokens sampled via llama_sample_token. I took much of the code from https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/low_level_api_llama_cpp.py.

I still have to test some stuff to make sure that I am not leaving any performance on the table, but I am still pretty content with the result. Also, I just checked whether the high-level API can do the same thing without passing the entire context as a prompt. It turns out it can't, which makes me sad and grateful at the same time.
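
Roughly, the flow looks like this. Note this is a simplified sketch rather than my actual code: the class and method names are just for illustration, greedy argmax sampling stands in for the full llama_sample_token chain, and it assumes the May 2023 ctypes bindings where llama_eval still takes an n_threads argument.

```python
import multiprocessing
import llama_cpp

N_THREADS = multiprocessing.cpu_count()


class CachedSession:
    def __init__(self, model_path: bytes):
        params = llama_cpp.llama_context_default_params()
        self.ctx = llama_cpp.llama_init_from_file(model_path, params)
        self.n_ctx = llama_cpp.llama_n_ctx(self.ctx)
        self.n_past = 0  # tokens already evaluated and kept in the KV cache

    def _tokenize(self, text: bytes, add_bos: bool):
        buf = (llama_cpp.llama_token * (len(text) + 1))()
        n = llama_cpp.llama_tokenize(self.ctx, text, buf, len(buf), add_bos)
        return list(buf[:n])

    def _eval(self, tokens):
        # Only the new tokens are fed to llama_eval; n_past tells llama.cpp
        # how much of the context is already in the KV cache.
        arr = (llama_cpp.llama_token * len(tokens))(*tokens)
        llama_cpp.llama_eval(self.ctx, arr, len(tokens), self.n_past, N_THREADS)
        self.n_past += len(tokens)

    def load_context(self, initial_prompt: bytes):
        # Pay the expensive full-prompt evaluation exactly once.
        self._eval(self._tokenize(initial_prompt, True))

    def generate(self, prompt: bytes, max_tokens: int = 256):
        tokens = self._tokenize(prompt, False)
        if self.n_past + len(tokens) + max_tokens >= self.n_ctx:
            # Context overflow handling is simplified here: just start over.
            self.n_past = 0
        self._eval(tokens)

        n_vocab = llama_cpp.llama_n_vocab(self.ctx)
        for _ in range(max_tokens):
            logits = llama_cpp.llama_get_logits(self.ctx)
            # Greedy pick for brevity; the real loop runs the usual
            # repetition-penalty / top-k / top-p / temperature chain and
            # then llama_sample_token.
            tok = max(range(n_vocab), key=lambda i: logits[i])
            if tok == llama_cpp.llama_token_eos():
                break
            yield llama_cpp.llama_token_to_str(self.ctx, tok)
            self._eval([tok])


# Usage (model path and prompts are placeholders):
#   session = CachedSession(b"./models/7B/ggml-model-q4_0.bin")
#   session.load_context(b"Transcript of a chat between a user and an assistant.\n")
#   for piece in session.generate(b"User: Hello!\nAssistant:"):
#       print(piece.decode("utf-8", errors="ignore"), end="", flush=True)
```

Because n_past carries over between generate calls, each new message only pays for its own tokens instead of re-evaluating the whole conversation.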

@lucasjinreal

@Firstbober did you make it exactly the same as main.cpp? Can you please share your Python code?

@Firstbober
Author

I will upload it to GitHub in the upcoming days when I integrate everything into my project.

@lucasjinreal

@Firstbober can you indicate which part you changed?

@Firstbober
Author

I just call llama_eval on all the context tokens and then cache everything else in my class. The core of my generate function is nearly the same as the one in the examples. It still has some tweaks I need to make. I will message you when I upload my code.

@lucasjinreal

@Firstbober thank u sir


@AlphaAtlas

Oh interesting. I wonder if this is related to oobabooga/text-generation-webui#2088

@abetlen
Owner

abetlen commented May 15, 2023

@AlphaAtlas not sure if they're related as this issue was from before the CUDA offloading merge.

Do you mind opening a new issue here? If there's a performance discrepancy and you don't mind giving me a hand getting to the bottom of it, I'm very interested in fixing it.

@AlphaAtlas

@abetlen Yeah. I need to do some stuff, but then I will more formally test llama.cpp vs llama-cpp-python with the profiles like that ^, including tokens/sec for the same prompt, and post an issue.

xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this issue Jun 13, 2023
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@khimaros
Contributor

I'm still seeing this issue with CPU inference. Even though there is a prefix match, I believe the model's latest response needs to be re-parsed, which is not the case when using llama.cpp/main or llama.cpp/server. I've even confirmed it is fast with the llama-cpp-python example server.

@khimaros
Contributor

khimaros commented May 8, 2024

@abetlen do you think it is feasible to make the high-level API do something similar to https://gist.github.com/Firstbober/d7f97e7f743a973c14425424e360eeda ? It seems like this could improve the performance of oobabooga as well, which uses the high-level API.
