Please upgrade the KV cache size yes using --ctx-size
#6617
Comments
Do you have an appropriate response, then?
You are welcome to submit a new scenario using the server test framework to reproduce any issue you identified. It will help the maintainers understand it and eventually submit a patch. Contributions are welcome, and helping the community requires a non null effort. Regarding the closed discussion, some questions may require more attention than others. I did not notice here that the server was looping infinitely because the KV cache limit is reached. I am not sure this is unexpected. @ggerganov, probably I missed something
As stated: "If it can't handle a query, it should just reject it and move on." It shouldn't go into an infinite loop and stop serving requests. As for reproducing it: I don't know which of my queries are triggering it, as they're randomly generated (within given frameworks) and it queues up many queries in advance (perhaps others might have a specific one?). But you do have the error message. The error is in examples/server/server.cpp, in update_slots(), in response to a failure in llama_decode. The symptoms are that GPU activity drops to zero and the server just rapidly repeats that message while cycling through batch sizes. I don't think saying "helping the community requires a non null effort" to someone who has never refused a request to help is productive - nor am I the only person who has brought up this problem here. My command line:
. venv/bin/activate
So yes, if there's some more data gathering you want from me, given the above information, please let me know how you would like me to go about doing so.
You need to provide detailed steps to reproduce your issue. For the server, there is a dedicated framework for that. Again, if your KV cache is full, it is just not adapted to your usage. You did not provide information to prove that the server goes into an infinite loop in this situation. I advise you to change the way you are requesting help: we are a team of volunteers here, feedback like that is not appreciated, and above all it does not help at all.
All anyone here cares about is the server not going into an infinite loop, which should be considered a de minimis requirement for a server. Closing an infinite-loop bug as "not planned" on everyone who brings it up (i.e., not just me) is beyond bizarre for a server maintainer. Especially when the server runs a non-deterministic process whose output the user has little clue about in advance (it's not like users ask the server "please write 20 million tokens"; it's "do some simple task", but then the server gets stuck in a loop repeating itself). This (server #2 in this case) is not okay for a server. Again: I have never refused a request to gather more data. All I ask is that you not close this bug, which is hitting multiple people and paralyzing their servers, as "not planned". In the meantime, I'll keep working on seeing if I can figure out a consistently repeatable way to reproduce it. It happens on average once every 15 hours or so to me.
@phymbert #6603 was closed reasonably - the user asked how to increase the context size and an answer was provided. A similar topic was discussed in #5737 (comment). @enn-nafnlaus First, try to update to latest
Thanks for this :) Can't update to the latest version at the moment (I've moved and the server has no net access), but will try --n-predict to see if that works as a workaround (note: I don't know the number of input tokens in advance on the server, so I'll have to set --n-predict to the max context). And will keep trying to see if I can figure out how to reproduce it. Is there any way to have the server print out what queries it's processing, so I can know which one(s) is/are causing the problem? I see it seems to be creating a file called "llama.log", but I can't specify the filename, so I assume that all running servers write to the same logfile...
@ggerganov, thanks. We have a test feature
Are you sure that this is wrong usage? I think the logs hint very strongly that there is an infinite loop in
In any case, it is also not OK to generate indefinitely with repeated context shifts if the model never generates an EOS. At the very least, that's a denial of service vulnerability.
Yes, either
It is not an infinite loop inside
Wouldn't you see log messages such as this if it was truly the model generating indefinitely with context shifts?
If the slot was released properly, that should be the end of it, but the same thing happens again and again.
n_ctx is 16k - see my command line above. It very much is an infinite loop: it gets stuck on processing a single query indefinitely, all incoming queries just sit there, GPU usage drops to 0, and the server spits out spam until I restart it. The instant it hits n_batch == 1, it jumps back to n_batch 512 failures, continuing over and over in a loop. Spam is pumped out at an immense speed (note that the logfile hit 73 gigs overnight), cycling through this over and over, until I notice that it's gotten into this state and kill the server.
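For anyone skimming, here is a minimal, self-contained sketch of the retry pattern being described - not the actual server.cpp code. fake_decode is a made-up stand-in for llama_decode that always fails, which is effectively what happens once the KV cache has no free space for the batch:

```cpp
#include <cstdio>

// Hypothetical stand-in for llama_decode: here it always reports failure,
// as effectively happens once the KV cache has no free space for the batch.
static int fake_decode(int /*n_tokens*/) {
    return 1;
}

int main() {
    const int n_batch_max = 512;
    int n_batch = n_batch_max;

    // Simplified retry pattern: on a decode failure, halve the batch and try
    // again. If the KV cache itself is exhausted, no batch size can succeed,
    // and without a bail-out condition this would spin forever, spamming the
    // log at full CPU speed while the GPU sits idle.
    for (int attempt = 0; attempt < 32; ++attempt) {   // capped only for this demo
        if (fake_decode(n_batch) != 0) {
            std::fprintf(stderr, "failed to decode, n_batch = %d, retrying\n", n_batch);
            n_batch /= 2;
            if (n_batch < 1) {
                n_batch = n_batch_max;                 // cycle restarts: 512 ... 1 -> 512
            }
            continue;
        }
        break;                                          // decode succeeded
    }
    return 0;
}
```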
These logs unfortunately are only in
@enn-nafnlaus please try again with the current version of the server, and if it happens again, try to collect logs that include both stdout and stderr. Your client may be ignoring error responses from the server.
I've reconstructed the generated text using the provided log, and in the end the generation indeed falls into EOS-less generation:
Even so, the
But it could be something that has already been fixed, since there was some work recently to improve
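To make the context-shift discussion concrete, here is a rough, self-contained sketch of the general idea; it is not the server's actual implementation, and the halving scheme and names are simplified for illustration. The point is that if the model never emits an EOS, only a cap like n_predict ends the generation:

```cpp
#include <cstdio>
#include <vector>

// Illustrative context shift: keep the first n_keep tokens and discard half
// of the rest, so generation can continue past the context limit.
static void shift_context(std::vector<int> &tokens, int n_keep) {
    const int n_left    = (int) tokens.size() - n_keep;
    const int n_discard = n_left / 2;
    if (n_discard <= 0) {
        return;
    }
    tokens.erase(tokens.begin() + n_keep, tokens.begin() + n_keep + n_discard);
}

int main() {
    const int n_ctx     = 16;    // tiny context for the demo
    const int n_keep    = 4;
    const int n_predict = 64;    // per-request cap; remove it and this loop never ends

    std::vector<int> tokens(8, 1);          // pretend prompt
    for (int i = 0; i < n_predict; ++i) {
        if ((int) tokens.size() >= n_ctx) {
            shift_context(tokens, n_keep);  // make room instead of stopping
        }
        tokens.push_back(2);                // pretend sampled token; never EOS
    }
    std::printf("generated %d tokens, context now holds %zu\n",
                n_predict, tokens.size());
    return 0;
}
```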
Client sees only timeouts. Server is truly idle, as you can see in the nvtop screenshot, while it spits out megs-per-second of spam. (Do note that, as can be seen, I have two instances of the server running at once, one for each GPU, in case that might be related.) Will update as soon as the server gets net connected and will report back. In the meantime I've added in -v --n-predict 16384 and 2>&1 | tee server.log. Thanks for your help - hopefully we'll ultimately figure out what's sending it into a loop. :)
I'll note that I don't think this is a hallucination / lack-of-EOS-generation issue, because GPU usage drops to 0 when this happens, whereas if it were generating but just failing to produce an EOS, GPU usage should still be pegged. The error messages also come way too fast (megs per second) for there to be actual token generation during this time. I will note one possibility worth considering, which is that I'm running (as noted earlier) two servers, one on each GPU. And I think it's only been server #2 that's been failing. Now, they're also each running separate tasks. To isolate whether it's a server + GPU issue or a task issue that's triggering this bug, I've also swapped which task is running on which server/GPU. So when it fails next (assuming --n-predict doesn't prevent failures), it should isolate which aspect is the problem.
The command line that you showed before does not include
I have been trying to reproduce the conditions necessary for the
Sorry, I had to type that in by hand, because I wasn't connected to the server (the lack of net access is really frustrating); part of the command line got lost. I also edited the paths and ports for security reasons. I have my laptop on the local network now, and it's mobile tethered, so here's the current command line, copied directly (but again, with paths and port changed):
./server -v --model /path/to/TheBloke_Mixtral-8x7B-Instruct-v0.1-GGUF/mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf --port 1234 --n-gpu-layers 9999 --batch-size 2048 --threads 4 --threads-batch 1 --numa --mlock --ctx-size 16384 -cb --n-predict 16384
Note that the -v and --n-predict are new. Here's how things look normally when the bug hasn't hit: GPUs pegged. Compare to the earlier image after the bug hits.
Maybe I should try directly mobile-tethering the server so I can update. The current version of llama.cpp is from 29 January. ED: Success. Got the server tethered and updated llama.cpp to the most recent commit on master.
It is probably this: https://github.com/ggerganov/llama.cpp/pull/5708/files#r1501799090 And I just lost 2h more here. No, now we have better logs :)
If my understanding is correct and that's committed (it looks like it is), hopefully that'll do the trick. Will update when either the issue happens again, or it fails to happen again. :)
Nope, another failure, sadly. This time on the first server / GPU, so it's not bound to the second server or second GPU. Unfortunately, I don't think the output of tee from the server is very useful. Here's what happens at the transition point between token generation and the infinite loop: I had captured (in theory) a log of all queries sent to the server, via the client side, but I accidentally deleted it when restarting the server (facepalm). Will need to wait for the next infinite loop. That said, I'm not sure even that would be of use unless one were to replay all queries from the start, due to the nature of batching. And it's even worse if the problem has something to do with having two servers running. Anyway, this rules out the following solutions:
What might I try next? ED: Thinking about it last night, I decided to try to make it deterministic:
This should help ensure that infinite loops can be tracked down to a single query (assuming it still loops under these conditions) - albeit at the cost of slowing down my work.
I'm dealing with something similar, but mainly stress-testing the server with requests from 4 clients simultaneously (documents of up to 3000 tokens and relatively short questions of 8 tokens), generating a maximum of 512 tokens. @enn-nafnlaus According to the command you claim to be using to start the server:
. venv/bin/activate
CUDA_VISIBLE_DEVICES=1 ./server --model /path/to/TheBloke_Mixtral-8x7B-Instruct-v0.1-GGUF/mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf --port 1234 --batch-size 2048 --threads 4 --threads-batch 1 --numa --mlock --ctx-size 16384 -cb
I notice that you don't have the
Perhaps what I'm commenting on may not be relevant, but while I'm also actively conducting tests on the server, I may encounter strange behaviors and will try to fix them.
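As a side note on sizing for the scenario above, here is a back-of-the-envelope check. It assumes the server divides --ctx-size evenly across its parallel slots; treat that split as an assumption about behavior rather than a statement about the implementation, and the numbers come straight from the comment above:

```cpp
#include <cstdio>

int main() {
    // Numbers from the stress-test scenario described above.
    const int n_clients       = 4;
    const int n_prompt_tokens = 3000 + 8;   // document plus short question
    const int n_gen_tokens    = 512;        // per-request generation limit
    const int per_slot_needed = n_prompt_tokens + n_gen_tokens;

    // Assumption: --ctx-size is split evenly across parallel slots, so each
    // client effectively sees n_ctx / n_clients tokens of KV cache.
    const int n_ctx         = 16384;
    const int per_slot_have = n_ctx / n_clients;

    std::printf("needed per slot: %d, available per slot: %d -> %s\n",
                per_slot_needed, per_slot_have,
                per_slot_needed <= per_slot_have ? "fits" : "does not fit");
    return 0;
}
```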
See earlier in the thread - that command was hand-typed, not copy-pasted, due to net access issues; I accidentally forgot to type that part.
Well, this is annoying. After everything I did to try to make it fully deterministic (fixed seed, no continuous batching, one query submitted at a time and waiting for the response)... it's still not deterministic (unless I'm doing something wrong here). I printed out an exact URI and JSON for the final query, and it just completed normally. The only other thing I can think to try, and it's not pleasant, is shutting down one of my GPUs and its respective server, to see if the problem relates to having two GPUs and servers. Though that will obviously cut my output in half :( I'll be fully open here: I'll take any solution, even if it's an ugly hack. Like, for example, a timeout would be just fine. Or detecting that, "Hey, it's spit out the same messages at a rate of several megs per second, maybe things aren't okay at home". We don't actually have to solve the problem, just detect the failure.
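On the "just detect the failure" idea, below is a rough sketch of that kind of watchdog, not a patch against server.cpp. The Slot struct and handle_decode_failure are made-up names; the only point is counting consecutive decode failures and rejecting the request once a threshold is hit:

```cpp
#include <cstdio>

// Hypothetical slot state; the real server tracks far more than this.
struct Slot {
    int  consecutive_decode_failures = 0;
    bool active                      = true;
};

// Illustrative watchdog: if decoding keeps failing for the same request,
// give up and free the slot instead of retrying forever.
static bool handle_decode_failure(Slot &slot, int max_failures) {
    if (++slot.consecutive_decode_failures < max_failures) {
        return true;   // allow another retry (e.g. with a smaller batch)
    }
    std::fprintf(stderr, "giving up on request after %d failed decode attempts\n",
                 slot.consecutive_decode_failures);
    slot.active = false;  // reject the request so other clients can be served
    return false;
}

int main() {
    Slot slot;
    // Pretend every decode attempt fails, as in the reported logs.
    while (slot.active) {
        if (!handle_decode_failure(slot, /*max_failures=*/8)) {
            break;
        }
    }
    return 0;
}
```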
Can confirm that it still goes into an infinite loop when only one server is running.
This seems plausible. @enn-nafnlaus Could you try running with
Will do!
Okay, this is the point where I discover, and then sheepishly admit, that - having gotten so used to working with Python projects - I forgot to run "make" after doing my last update. :Þ So I haven't actually yet ruled out that it was a version issue. Anyway, the ball is in my court now. Will report back. ED: Huh, the numa flag went away?
Have been running for two days now, and it hasn't gotten stuck in an infinite loop. :) Now I'm going to need to backtrack and see if I can narrow down whether it was the update or one (or both) of the two new flags you had me add that did the trick.
So please remove my name from the issue summary. Saying "apologies for my mistake" has never hurt anyone. You are welcome.
Why have you been so consistently rude about this? Georgi has been nothing but helpful and respectful. You've done nothing but gaslight, complain, insist that an infinite loop isn't a bug, demand proof that people are actually experiencing the bug they're reporting, and close other people's issues before waiting to see if they were actually fixed. Do you want me to actually find out whether it was the update or either of the two added flags that fixed it? Because I've spent the past days trying to help you track down this bug that's been hitting plenty of your users, and I was planning to continue trying to figure out what the fix was even though I now have a workaround for myself. But if you don't actually care...?
Edited for you. Yes, I don't care.
You may be able to resolve the issue by setting --ctx-size to a value larger than the -b value you specified, and by setting the prompt value with -c. This could be a temporary solution and I'm not sure about the underlying principle, so please don't put too much trust in it. However, I think it's worth trying.
Originally posted in #6603 (comment)
This is not an appropriate response to people having this problem. --ctx-size is a memory-limited operation; of course we'd set it higher if we could. Mine is at 16k and I still hit this problem.

The appropriate response to running out of tokens is to fail the query. It's not for the server to go into an infinite loop and stop all further processing. I've lost days' worth of processing time to this bug, when I log into my server and discover that it's no longer running because of this.

The server should never go into an infinite loop; I mean, obviously? If it can't handle a query, it should just reject it and move on.

EDIT: The poster was just running a very outdated server version. Always use --n-predict N to avoid infinite loops.