Longer and infinite output #71
Comments
Depending on how much memory you have, you can increase the context size to get longer outputs. On a 64 GB machine I was able to run a 12k context with the 7B model and a 2k context with the 65B model. You can change it here.
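For a rough sense of why RAM bounds the usable context: the KV cache grows linearly with context length. The sketch below uses the published LLaMA dimensions (32 layers / 4096 dims for 7B, 80 layers / 8192 dims for 65B) and assumes 4 bytes per cached element, as in early llama.cpp builds; treat the exact element size as an assumption.

```cpp
#include <cstdio>

// Rough KV-cache size: two tensors (K and V), each n_layer * n_ctx * n_embd elements.
// bytes_per_el = 4 assumes f32 storage (an assumption about early llama.cpp builds).
static double kv_cache_gib(int n_layer, int n_embd, int n_ctx, int bytes_per_el = 4) {
    return 2.0 * n_layer * (double) n_ctx * n_embd * bytes_per_el / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    std::printf("7B,  n_ctx = 12288: ~%.1f GiB KV cache\n", kv_cache_gib(32, 4096, 12288));
    std::printf("65B, n_ctx =  2048: ~%.1f GiB KV cache\n", kv_cache_gib(80, 8192, 2048));
    return 0;
}
```

These totals come on top of the model weights themselves, which is why the 65B model leaves far less headroom for extra context.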
Huh... I thought the context size was determined when the model was trained, due to the positional encoding used. (I am only a layman.) But #78 is still useful for when you eventually hit the limit of your context, right?
When trying large contexts, I often encounter
I played around a bit with increasing ctx_size, but that did not work. I suspect underlying memory UB (undefined behaviour) as the cause, as lldb seems to trap on some suspicious memory accesses.
Your link goes to this code snippet.
@drewcrawford for me, that error doesn't appear to be context-size related. I've run the same prompt at different context sizes and they all fail.
Typically, if you get the "not enough space in the context" error, you set the context too large, though on the larger models I have had to tweak this line too. The math around memory allocation in this thing doesn't scale perfectly to the larger models, and unfortunately my fork has substantially diverged from master and I'm too lazy to work on merging.
8192 context size on a quantized 13B model
Yes, the math for computing necessary memory for the
What's the best way to enable infinite output? Can we just shift out the old contexts in the K and V tensors (along the n_ctx dim) when they are full, or is there a better approach?
Hmm, I think yes - we need to shift the KV cache. I haven't implemented this yet in any of the |
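For concreteness, the naive version of that shift might look like the sketch below. The layout and names (one K and one V buffer per layer, n_ctx rows of n_embd floats) are assumptions for illustration, not the actual llama.cpp tensors, and a later comment in this thread explains why a plain shift like this is not sufficient on its own.

```cpp
#include <cstring>
#include <vector>

// Hypothetical per-layer KV cache stored as n_ctx rows of n_embd floats.
struct KVCacheLayer {
    std::vector<float> k; // n_ctx * n_embd
    std::vector<float> v; // n_ctx * n_embd
};

// Drop the oldest n_discard positions and move the rest to the front,
// freeing room at the end of the window for new tokens.
void shift_kv(KVCacheLayer &layer, int n_ctx, int n_embd, int n_discard) {
    const size_t row  = (size_t) n_embd;
    const size_t keep = (size_t) (n_ctx - n_discard) * row;
    std::memmove(layer.k.data(), layer.k.data() + (size_t) n_discard * row, keep * sizeof(float));
    std::memmove(layer.v.data(), layer.v.data() + (size_t) n_discard * row, keep * sizeof(float));
    // NOTE: as discussed further down the thread, the cached entries carry
    // (RoPE) positional information, so moving them without re-rotating
    // changes their meaning.
}
```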
The 65B model sometimes crashes with 2k, but usually works ...
The increased context doesn't increase memory use (still under 41 GB) until the context gets filled. Yet, on the 7B model the calculation overflows at the start for 16k or greater; usually, 15360 will work.
https://github.com/eous/llama.cpp/commit/e0213e08c03a3ac72cdec4596b872073b51655aa here is some easy stuff I pulled out of my local hackery if anyone wants to play with it |
Btw if anyone wants to slice/dice/refactor/cleanup/dissect/mixup/etc that changeset feel free, I don't need to be credited. |
It's only been 4 days and this is apparently already a dead link. Is that because this has already been added, or?
I'm wondering what happens when the context goes over the 2048-token limit (#274). The model seems to break down quickly, in unexpected ways. As someone not initiated into the field, I asked ChatGPT, but I'm not sure if I should believe it:
From what I gather, an excessive number of retrieved values are summed up in the output, resulting in a loss of signal. However, I find it surprising that the degradation is so fast and severe. What are the strategies for rolling or pruning the contexts? As briefly discussed, the trivial one is a rolling window (FIFO) that discards tokens older than n_ctx. AFAIK, the contexts in the KV memory are not position (row) dependent, and we should be able to overwrite old contexts by wrapping the index around. ChatGPT told me that another common strategy uses an LRU priority queue. Querying a context is not a binary but an analog process; I guess some sort of threshold can be used to decide when a specific context is queried, or maybe a score derived from an exponential smoothing of the past Query · Key similarities. I browsed some of the references ChatGPT provided on this matter and they seemed to be mostly hallucinated, so I wouldn't trust it.
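A minimal sketch of that wrap-around (FIFO) indexing idea, purely illustrative; the slot arithmetic below is an assumption about how one could index a fixed window, not how llama.cpp actually stores its cache:

```cpp
// Ring-buffer view of a fixed-size context: the i-th token overwrites slot
// i % n_ctx, so the most recent n_ctx tokens are always resident.
struct RingContext {
    int n_ctx;    // fixed window size
    int n_tokens; // total tokens seen so far

    int  slot_for_next_token() const { return n_tokens % n_ctx; }
    bool full() const { return n_tokens >= n_ctx; }
};
```

As the next comment shows, though, the cached entries are not position independent in practice, which is what makes this harder than plain index arithmetic.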
Hi @ggerganov! (and anyone else interested 👀) I wanted to reach out because I've been working on implementing the idea you mentioned about shifting the KV cache to support infinite output mode in LLaMA. I've run some experiments, and it seems like we might need to rethink the approach.

I implemented shifting for the KV cache (and I triple-checked everything), but after the window was shifted by a single token, the model started to output garbage. After a lot of testing and frustration, it hit me: positional encoding.

I realized that the embeddings stored in the memory_k and memory_v tensors indirectly store positionally encoded data. Therefore, shifting their contents messes with the semantics of those embeddings. While the code in llama_eval computes the values for cur_k and cur_v from the inpL before RoPE is applied, this is only true for the first layer. For every other layer, the contents of inpL are replaced by the output of the previous layer, meaning the tensors for every other layer already contain positional information in an indirect way.

I'm not entirely sure my reasoning is correct, but my results seem to validate the idea that shifting the KV cache might not be enough. I'm reaching out to you (and anyone else who understands this) to see if you agree with my conclusions and to ask if you have any suggestions on how to proceed.

I think it would be really beneficial to have a way to slide the context window for LLaMA inference. This could be the key to unlocking a true ChatGPT clone with LLaMA. While having a fixed 2048-token window is a good start, being able to slide that window would let the self-attention mechanism remember much more than just the last 2048 tokens.

So anyway, sorry for the somewhat incoherent wall of text. My point is, I wanted to reach out because I'm out of ideas here, but I'm eager to hear your thoughts and any suggestions you (or others reading this) might have. 😄
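For reference, the positional issue can be stated with RoPE's relative-position identity (standard RoPE algebra, not anything specific to this codebase):

```latex
% R_m denotes the RoPE rotation applied at absolute position m.
% Because the rotations compose (R_m^T R_n = R_{n-m}), the attention score
% depends only on the relative offset n - m:
\langle R_m q,\; R_n k \rangle \;=\; \langle q,\; R_{n-m} k \rangle
```

The cached keys (and, for every layer past the first, the hidden states feeding into the cache) were computed with their original absolute positions baked in, so moving rows to new slots without re-rotating silently changes n - m for every query/key pair, which is consistent with the garbage output described above.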
@setzer22 I only understand what you're saying in part, but reading your post made me think of something. In abstract terms, would it be possible in some way, instead of having a context that shifts, to have two contexts and do some sort of swap strategy? I'm sure that carries its own issues. Due to my ignorance I can't be more specific, but I'm throwing the idea out there just in case it applies in some way or inspires you to think of something else.
@setzer22 |
@tjohnman I'm unsure what you mean by a swap strategy here. My first guess is that swapping things out wouldn't work. It sounds similar to clearing the full context window, unless I'm misunderstanding something. 😅
@ggerganov I've only taken a very quick look at the Python code (assuming you mean https://github.com/facebookresearch/llama/), but I haven't seen anything referring to a sliding window, so I'm not sure if that's implemented there.
Can also confirm. Line 19 in e2d490d
Line 207 in 79b2b26
Guess the workaround is to set -n to an extremely high number.
Can confirm that on the latest commit b391579 the infinite generation mode works. However, as my CPU is pretty underwhelming (i5-6600K, 4c/4t), after ~1800 words the generation slows to such a crawl that not a single token gets generated in several minutes.

Note that I said words, because there isn't a log of how many tokens were generated. I'm thinking of adding an option where interjecting with Ctrl+C would print out debug info such as the number of tokens generated so far, since with my current hardware I'm not able to do things like attach a debugger and run traces; the performance penalty from debugging tools is too big to do anything useful.

This PR also looks very interesting for debugging purposes. In particular, the link between output length and speed degradation could be investigated to understand the cause of the bottleneck better.
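A sketch of that Ctrl+C debug idea, with all names made up for illustration (this is not the actual llama.cpp signal handling):

```cpp
#include <atomic>
#include <csignal>
#include <cstdio>

// SIGINT sets a flag; the generation loop notices it, stops, and reports how
// many tokens it has produced so far.
static std::atomic<bool> g_interrupted{false};
static void sigint_handler(int) { g_interrupted = true; }

void generation_loop_sketch() {
    std::signal(SIGINT, sigint_handler);
    int n_generated = 0;
    while (!g_interrupted) {
        // ... sample and emit one token here ...
        ++n_generated;
    }
    std::printf("\n[debug] tokens generated so far: %d\n", n_generated);
}
```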
Sounds like the context swap.
Yeah, the context swap is not free and causes a pause for me when it happens. Still, it beats the former behavior, or lack thereof. Long starting prompts and large contexts really start to eat into speed. I never thought I'd want to upgrade my CPU and RAM for CPU-based inference; really looking forward to breakthroughs and "OH" moments. I think a useful strategy for now is to really condense starting prompts so you cover as much as possible with fewer tokens.
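For readers landing here later, the swap being described works roughly like the sketch below. It is a simplified paraphrase with made-up names (last_tokens, n_keep, and so on), not the actual code in examples/main:

```cpp
#include <vector>

// When the window fills up, keep the first n_keep tokens (e.g. the initial
// prompt), drop the oldest half of everything after them, and queue the most
// recent tokens for re-evaluation so generation can continue. That
// re-evaluation is the pause mentioned above.
void context_swap(const std::vector<int> &last_tokens, int n_ctx, int n_keep,
                  int &n_past, std::vector<int> &to_reevaluate) {
    if ((int) last_tokens.size() < n_ctx) return; // window not full yet

    const int n_left = n_past - n_keep; // tokens eligible for dropping
    n_past = n_keep;                    // rewind past the kept prefix

    // Re-feed the most recent half of the dropped span to restore local context.
    to_reevaluate.assign(last_tokens.end() - n_left / 2, last_tokens.end());
}
```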
@rabidcopy |
Oh man. I reinstalled Linux a while back and didn't realize I still had regular
Do you need to pass any special flags to make or cmake to make this happen? I'm on Linux too, but no matter which packages I install, llama.cpp reports no BLAS support. Do I need my CPU to support it as well or something along those lines? |
make clean
LLAMA_OPENBLAS=1 make
Thank you. |
Actually, I'm having trouble compiling with OpenBLAS...
Edit: Never mind, solved. If you're on Arch, install openblas-lapack and let it replace openblas, cblas, and lapack.
Regarding the implementation in #71 (comment), one problem that I noticed is somewhat of a continuity (and consistency) degradation whenever the reset and restart happens, i.e. when the context length was exceeded and the prompt had to be reseeded (with or without). Is this due to the internal states (self-attention) not matching up with the previous input anymore, or is it just because the model behaves exactly as if it had been fed with? Not being an expert, neither in theory nor in implementation, I was just wondering whether a different approach could also be taken. I was thinking of some kind of round-robin approach at the input and output layers: instead of inputting tokens into one "line" with a maximum length of the context size that eventually has to be reset and restarted, could we let that "line" wrap around? I do not simply mean shuffling around the next
Do this (
Instead of the current approach (where
If I understand it correctly, this would not have a positional issue (since existing state in the model stays where it was). But I have no idea if the model would be able to cope with this situation at all. Essentially, this would be equivalent to the "shift everything to the left" approach when including all self-attention states but without the copying around that comes with it.
Unfortunately, I lack the necessary background to do a plausibility check here, but I hope that the general gist of my idea can be understood by someone with the knowledge to tell me whether this could possibly work and whether a draft implementation would be worthwhile. I apologize for my naivety in this post; I'm afraid I have only a rough idea of how LLMs work. PS: It seems some more details have been provided over at the llama-rs repository in rustformers/llm#77 (comment):
It may be that this is where I read what I thought about internal self-attention states, but I was unable to find it again until now. Then again, since this also doesn't provide a direct solution (but stresses the general idea that it would be useful to somehow preserve the implicit state from earlier inputs that is still stored in the model), I am not sure if this information is useful here 🙁
The idea that you propose is interesting and I think I understand conceptually what you mean. The drawbacks of the existing approach are correctly outlined in the
Agreed there. It's very hard to know anything for sure, and OpenAI isn't going to tell anyone; this is just a suspicion on my end based on interactions with the tool. One thing that makes me think they're not using something similar to the swap strategy implemented here is that there's never a clear point in the conversation where a lag spike occurs, but I'm also guessing there are ways to trick users by hiding the latency. They also seem to pull off other kinds of magic, like effortlessly reading through several pages of text and starting to generate right away in less than a second, so maybe their trick is just having super fast inference 🤷♂️
The lack of latency with ChatGPT can be explained by the high memory bandwidth of GPUs. On a GPU, the memory throughput is much higher than on a CPU, and since prompt evaluation is memory bound, I can easily see a high-end GPU being 10-100x faster than a CPU for large context swaps. I expect the CPU-GPU difference for single-token inference to be much smaller, though.
I see. That makes a lot of sense. And also explains why this project seems to be competitive in tokens/s with people running the oobabooga webui on GPU 🤔
Hi, I am facing this issue while doing CPU inference using the GPT4ALL-1.3groovy model.
I know this is closed, but I just wanted to leave my $0.02 experience in case others come along. I run a workstation with 566GB RAM and an Nvidia RTX 4070 Ti, usually using the server (./examples/server/). The 4070 Ti with 12GB is great for chat and responses and a reasonable model + prompt context size. If using completion instead of chat, I regularly have good luck running CPU mode with a
Example:
This will hint to the completion that it needs to keep going. It's great because you can sort of guide it or edit parts of the completion you would like adjusted.
@jboero Can you provide an example LLaMA model you are using that can have a 400k context size? I thought the generated context is limited by the LLaMA model you use. The max I know of is 192k, which I never get to touch because it's way above my 8GB VRAM and 64GB RAM. How do you get 400k tokens generated?
Sorry, I should have specified "prompt context", not model context. À la
I realize I'm quite late to this party... but I'm able to get an infinite response running llama.cpp on my Raspberry Pi 4. When I load it up, I use these commands.
I'm going to let you all know that I've been playing around with AI for literally the past 2 weeks, so I barely know what I'm doing, but I'm still learning (in case you looked at those commands and gagged, laughed, or asked yourself, "What in God's holy name is this moron doing?"). I'm kind of like the guy who, in an emergency, can lift a car off of someone, because in that moment I'm not thinking about all the reasons I can't lift a car. What I mean is, I think I got llama.cpp to work in the first place by brute force and ignorance, so I can't explain why it works; it just does for me. So I hope that helps anyone who knows less than I do, or opens doors for someone who knows more than I do.
@MrMBag Thank you, that really helped my understanding.
If we use -n 1000000 to have a very long output (for a story, for example), it stops generating quite fast, after around 30 lines, probably because of this line of code.
It would be nice if we could have longer outputs and also the possibility of infinite output, stopping only on Ctrl-C. We could maybe specify that -n 0 will trigger that infinite output mode. That issue is a bit related to issue #23.
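A tiny illustration of what the proposal amounts to inside the generation loop; the parameter names and the choice of treating a non-positive -n value as "no limit" are assumptions for the sketch, not the actual llama.cpp behaviour:

```cpp
// Keep generating until the user interrupts (Ctrl-C), the model emits an
// end-of-stream token, or the requested token budget is exhausted.
// A non-positive n_predict is interpreted as "no limit".
bool keep_generating(int n_generated, int n_predict, bool interrupted, bool eos) {
    const bool unlimited = (n_predict <= 0);
    return !interrupted && !eos && (unlimited || n_generated < n_predict);
}
```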