Slowdown with tokens #6
See ggml-org/ggml#231 (comment) for a hint about how to fix it. According to Mr. GG, you'd need to do something similar to ggml-org#775.
It sounds like @jploski is already on the right track; I'll keep my fingers crossed in that regard.
Is it related to the issue described in this comment: https://news.ycombinator.com/item?id=36377873 ? "The primary issue is that the current upstream version (on Huggingface) hasn't implemented key-value caching correctly. KV caching is needed to bring the complexity down from O(n^3) to O(n^2). The issues are: (1) their implementation uses Torch's scaled dot-product attention, which uses incorrect causal masks when the query/key sizes are not the same (which is the case when generating with a cache). (2) They don't index the rotary embeddings correctly when using key-value cache, so the rotary embedding of the first token is used for all generated tokens. Together, this causes the model to output garbage and it only works when using it without KV caching, making it very slow."
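For illustration, here is a minimal plain-C++ sketch (not the actual Hugging Face or ggml code; all names are made up) of the two cache-related pitfalls described in that quote: the causal mask has to be offset by the number of cached tokens, and the rotary embedding has to use each token's absolute position rather than restarting at zero for every generation step.

```cpp
#include <cmath>
#include <vector>

// n_past = tokens already in the KV cache, i = index of a token within the new batch.

// 1) Causal mask: a new query row i may attend to cache column j only if
//    j <= n_past + i. A square n_new x n_new mask is wrong once n_past > 0.
bool attention_allowed(int i, int j, int n_past) {
    return j <= n_past + i;
}

// 2) Rotary embedding: the rotation angle must come from the absolute position
//    n_past + i; reusing position 0 for every generated token breaks the model.
void rope_inplace(std::vector<float>& x, int i, int n_past, float theta_base = 10000.0f) {
    const int pos = n_past + i;             // absolute position in the sequence
    const int dim = (int)x.size();
    for (int d = 0; d < dim / 2; ++d) {
        const float theta = pos * std::pow(theta_base, -2.0f * d / (float)dim);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[2*d], x1 = x[2*d + 1];
        x[2*d]     = x0 * c - x1 * s;
        x[2*d + 1] = x0 * s + x1 * c;
    }
}
```

If the cached code path reused position 0 for every generated token, rope_inplace above would effectively always be called with pos = 0, which matches the garbage-output symptom from the quote.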
With my latest commit, performance is almost doubled at higher context windows. This is still not the goal, but it's a good step ahead. We need the matrix multiplication to access the same K and V tensors without the need of repeating them.
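To make the repeat-vs-broadcast point concrete, here is a hedged scalar sketch (plain C++, not the ggml kernels): Falcon's attention shares one K/V head across a group of query heads, so the score computation can read the same K tensor for every query head instead of first materializing per-head copies with a repeat op.

```cpp
#include <vector>

// Hypothetical shapes: q is n_head x head_dim, k_cache is n_kv x head_dim and is
// a single K head shared by the query heads. Broadcasting means every head
// indexes the same k_cache rows; no REPEAT/REPEAT2 copy is materialized first.
std::vector<std::vector<float>> qk_scores(
        const std::vector<std::vector<float>>& q,
        const std::vector<std::vector<float>>& k_cache) {
    std::vector<std::vector<float>> scores(q.size(),
            std::vector<float>(k_cache.size(), 0.0f));
    for (size_t h = 0; h < q.size(); ++h)            // query heads
        for (size_t t = 0; t < k_cache.size(); ++t)  // cached key positions
            for (size_t d = 0; d < q[h].size(); ++d)
                scores[h][t] += q[h][d] * k_cache[t][d];  // same K rows for every head
    return scores;
}
```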
I believe this terrible slowdown is caused by the incorrect utilization of the KV cache. As it stands, with every new token all the previous tokens' query and key vectors are calculated and multiplied by each other, leading to quadratically increasing computation time. But I believe it would be sufficient to process only the last token (or more generally, the last N, taking prompt processing and batched token generation into consideration) and retrieve all the previous ones from the cache.
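As a minimal single-head sketch of that idea (illustrative C++ only, not the ggml graph): compute Q/K/V just for the new token, append K/V to the cache, and attend over everything cached, so the per-token cost grows linearly with the context length instead of quadratically.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical single-head decode step; names are illustrative, not ggml API.
struct KVCache {
    std::vector<std::vector<float>> k;  // one key vector per past token
    std::vector<std::vector<float>> v;  // one value vector per past token
};

std::vector<float> decode_step(KVCache& cache,
                               const std::vector<float>& q_new,   // query of the new token
                               const std::vector<float>& k_new,   // key of the new token
                               const std::vector<float>& v_new) { // value of the new token
    cache.k.push_back(k_new);
    cache.v.push_back(v_new);

    const size_t n_ctx = cache.k.size();
    const size_t dim   = q_new.size();
    const float  scale = 1.0f / std::sqrt((float)dim);

    // scores of the new token against all cached keys, then softmax
    std::vector<float> w(n_ctx, 0.0f);
    float max_w = -1e30f, sum = 0.0f;
    for (size_t t = 0; t < n_ctx; ++t) {
        for (size_t d = 0; d < dim; ++d) w[t] += q_new[d] * cache.k[t][d];
        w[t] *= scale;
        max_w = std::max(max_w, w[t]);
    }
    for (size_t t = 0; t < n_ctx; ++t) { w[t] = std::exp(w[t] - max_w); sum += w[t]; }

    // weighted sum of cached values
    std::vector<float> out(dim, 0.0f);
    for (size_t t = 0; t < n_ctx; ++t)
        for (size_t d = 0; d < dim; ++d) out[d] += (w[t] / sum) * cache.v[t][d];
    return out;
}
```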
Yes, I think it's topically related (and the problem already being present in the upstream Python version would explain why I also got it wrong in the translation, apart from my ignorance). But I suspect there are also some tangential concerns in that thread related to HuggingFace's KV cache implementation that are not actually relevant to fixing it in GGML.
Part of the slowdown was fixed yesterday; it's now twice as fast as before.
Last token, last layer and total sum of operations:
And for comparison, this is without the patch:
The largest slowdown came from MUL_MAT, followed by CONT (131 ms for token 300), then REPEAT2. CONT is a minor problem now, but I also looked into the caching of K/V; I bet we can cache it more efficiently.
@jploski took the time to figure out how to get the broadcasting PR to work with the interleaving repeat-type multiplication. That's a 1077-token generation on Falcon 40B q6_k:
The branch is not completely ready. It's still a mess with ggml differences and probably bugged in 10 places, but I need 8h of sleep. Really good job, Jan :)
Scratch that - it's already working as it should - the query vectors are only retrieved from cache for the new token(s), and they are multiplied by the key vectors of all the past tokens. So there is no redundant computation going on. I got myself confused by comparing the KQ product shapes for prompt lengths n and n+1, but it is to be expected that the KQ product shape increases quadratically with the initial prompt length, and it's a one-time operation. It remains to check if the slowdown is simply because of the increasing context size (as stated above, linearly more key vectors to multiply by with each new token), or if there is still some other issue, but at this moment I see none. I just committed a small fix (to my GGML falcon40b branch and to ggllm.cpp master), which eliminates an unnecessary 3d reshape op, but it did not substantially improve performance.
Okay, I think the next real big step is to move more into CUDA. I've spent a good 10 hours since yesterday; what looked like a quick integration turned into a multi-layered nightmare. I ended up rewriting the entire quantization tensors from 32 bit to 16 bit, "hacking" the cuBLAS function to split into a 16-bit cuBLAS call. When I was finished saving 50% of the VRAM, I noticed the tensor shape :-) Well, the good news is that the new kernels appear to work. cuBLAS operates in 16 bit now, and the VRAM buffer management is also optimized. Ah, and benchmarks: The PR is merged into master now.
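For reference, this is roughly what a 16-bit GEMM through cuBLAS looks like (a generic sketch with error handling omitted; the function name and layout here are illustrative, not the actual ggllm.cpp kernels):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Hedged sketch: multiply two FP16 matrices with FP32 accumulation via cublasGemmEx.
// A is m x k, B is k x n, C is m x n; all are column-major device buffers of __half.
void gemm_f16(cublasHandle_t handle, int m, int n, int k,
              const __half* dA, const __half* dB, __half* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,   // lda
                 dB, CUDA_R_16F, k,   // ldb
                 &beta,
                 dC, CUDA_R_16F, m,   // ldc
                 CUBLAS_COMPUTE_32F,  // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);
}
```

Keeping A/B/C in __half is where a roughly 50% VRAM saving on those buffers comes from, while accumulating in FP32 limits the precision loss.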
Ah yes, "just one more little tweak" (turns into an all-night debugging session) seems to be an "emergent feature" of large language models. ;-) Anyhow, the only big slowdown I'm seeing now is during the context swap. After generating at a lightning pace it stops for multiple seconds, then resumes. It's not Falcon-related, just a side effect of llama.cpp's original "recompute half-the-context on swap" approach. I wonder whether this could be optimized away as well by modelling the kv cache memory as a ring buffer... but this would probably require some changes (at least) to the rope implementation as well to support "wrap-around" position indexes.
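A rough sketch of what such a ring buffer could look like (hypothetical, nothing like this exists in the code yet): the storage slot wraps around modulo the cache size, while the absolute position, which RoPE and the causal mask would keep using, only ever grows.

```cpp
#include <vector>

// Hypothetical ring-buffer KV cache: instead of recomputing half the context on
// a swap, the oldest entries are overwritten in place. The absolute position
// (used for RoPE/masking) keeps growing; only the storage slot wraps around.
struct RingKV {
    int n_ctx;                              // ring capacity
    int n_total = 0;                        // absolute number of tokens seen so far
    std::vector<std::vector<float>> k, v;   // n_ctx slots of key/value vectors
    std::vector<int> pos;                   // absolute position stored in each slot

    RingKV(int n_ctx_, int dim)
        : n_ctx(n_ctx_),
          k(n_ctx_, std::vector<float>(dim)),
          v(n_ctx_, std::vector<float>(dim)),
          pos(n_ctx_, -1) {}

    // O(1) insertion: once the ring is full, the oldest entry is silently evicted.
    void put(const std::vector<float>& k_new, const std::vector<float>& v_new) {
        const int slot = n_total % n_ctx;   // wrap-around write position
        k[slot]   = k_new;
        v[slot]   = v_new;
        pos[slot] = n_total;                // keep the absolute position for RoPE/masking
        ++n_total;
    }
};
```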
That sounds like quite a cool idea, probably also better for consistency, as the ring buffer will change gradually instead of an instantaneous swap.
@jploski I just pushed a hotfix which disables broadcasting during batching (back to repeating); it must be some nasty problem in the tensor shapes. Here is a test case:
What do you mean by "fails"? I just tried the pre-hotfix commit d94c88d using a q5_1 version of 7B (and used --override-max-gpu 1 - is that what you meant?). It produced roughly the same output with -b 31, -b 32, and -b 33 (I think the differences are because of cuBLAS). Edit: never mind, I see the issue (garbage generated with a large batch size).
@jploski I ran a good 100 tests on various models and they always lose context with broadcasting. I tested 40B and 7B, 16-bit, normal and K quants, all with the same issue. In that test case with falcon-7b/q4_1:
I think I ran a hundred tests, almost all of them with the same problem: when the prompt is large enough and it's processed in batch mode, the whole context is lost. Sometimes I just got "DDDDDDDDDDD" as the result; usually it was language. Edit: I see you noticed it. I was really worried it only happened locally :)
I can see it even with the falcon-mini-shakespeare model. It seems to depend not just on the batch size, but also on the prompt length. Specifically, with a prompt length of 32 tokens I see the problem appearing with -b 32 (not -b 31). But removing a single token from the prompt (making it 31 tokens) makes -b 32 (and higher) work again. My test case:
Yes, it starts at -b 32. But if you have a prompt shorter than 32 tokens, you cannot batch-process it with 32, so that automatically reduces n_batch down to the prompt length.
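In other words (an illustrative snippet, not the exact ggllm.cpp loop), the prompt is evaluated in chunks of at most n_batch tokens, so a full chunk of 32 can only occur when at least 32 prompt tokens remain:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the real model forward pass.
static void evaluate(const std::vector<int>& tokens, int first, int n) {
    (void)tokens; (void)first; (void)n;
}

// Illustrative only: the effective batch size is automatically clamped to the
// remaining prompt length, so "-b 32" only matters for prompts of >= 32 tokens.
void eval_prompt(const std::vector<int>& tokens, int n_batch) {
    const int n_prompt = (int)tokens.size();
    for (int i = 0; i < n_prompt; i += n_batch) {
        const int n_eval = std::min(n_batch, n_prompt - i);
        evaluate(tokens, i, n_eval);  // a 31-token prompt with -b 32 runs as a single chunk of 31
    }
}
```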
BTW I noticed the Makefile generated with -DLLAMA_CUBLAS=0 does not compile; possibly another problem.
It appears to work for me; can you show me the error?
Sorry, it also seems to have something to do with the Debug build type (it fails only with both options):
@jploski just pushed a bugfix for it.
I note it fails in the same way (judging by model output) even with the "interleaved" broadcast.
Thanks - I noticed that with -DLLAMA_CUBLAS=0 the -b 32 problem is not reproducible.
Daaaamn! It does not really answer why this fails, but at least we know where!
Yes, that might be it. I noticed that the mini-model fails with -DLLAMA_CUBLAS=1 even if I disable broadcasting and use ggml_repeat2 exclusively.
For the record, I also checked with valgrind/memcheck, which did not detect any illegal memory accesses. To make valgrind work I had to use
Glad to hear about valgrind; I haven't used that in more than a decade. I was suspecting a problem of that sort. I'm getting a ggml dequantizer segfault when adding return false; into ggml_cuda_compute_forward() (CUDA flag enabled).
I just pushed an update patch. I also experimented with prompts from 10-500 tokens, and CPU-based broadcasting was always faster than repeat + cuBLAS. Thanks for your debugging; I was not aware of the indirect jump into cuBLAS (in ggml.c that's a kind of unclean indirection that skips the whole ggml_cuda_operation loop). Now switching GPUs off really switches them off for all operations.
With each token processed, the inference speed slows down a little bit; it starts to become noticeable at around 50 tokens on 40B Q3_K and adds up.