Bug: llama-box crashes when setting --ctx-size to 0 #21
Comments
Update: sending 4,096 characters also causes this issue (file attached). Here are some other example files for testing:
Update: setting the context size to 4096 seems to work around this issue. However, my original issue that prompted this investigation is still happening, so I will keep digging... Regarding this issue, I do believe it is still a problem. Setting the context window higher or lower should not impact the stability of llama-box, and if the user sets it to an unsupported value, the program should exit with an informative error at startup.
Please test with v0.0.104.
Running again with the latest version, sending that huge file of 1.6M '1's. Server command:
This issue can be reproduced using the same test steps on llama-box v0.0.106, although it doesn't occur every time.
Environment
A simpler way to reproduce this is with the following steps.
This issue happens the first time a long context shift occurs during decoding; we get a warning in the log before the crash:
Context shifting causes chaos; it would be better to leverage YaRN or customize RoPE to extend the context size without fine-tuning. ggml-org/llama.cpp#2054 (comment)
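For reference, a sketch of what that could look like with the reporter's setup, assuming llama-box inherits llama.cpp's RoPE/YaRN options (--rope-scaling and --yarn-orig-ctx, and the 32768 value for the model's native context, are assumptions here; check llama-box --help for the exact flags):
llama-box.exe --port 8082 -c 65536 --rope-scaling yarn --yarn-orig-ctx 32768 -np 2 --host 0.0.0.0 -m "models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf" --mmproj "models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf"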
@Finenyaco, please test with v0.0.107.
I have tested with my 1.6M ones file three times. There was no crash in any of the tests, so this seems fixed for me.
Verified in v0.0.107.
Hello
Summary of issue:
I've made a Python program that has quite a lot of complexity, and from time to time I see llama-box crash. I'm trying to narrow down the reason why this might happen, and this may be it.
Expectation:
When I set "-c" to "0", I expect the context window of the model to be used.
When a prompt is sent that is larger than the context window, I expect it to be truncated to the size of the context window, where 'older' tokens are discarded (unless "--no-context-shift" is used)
What actually happens:
I set "-c" to "0", and send a large prompt, llama-box crashes
System specs:
When this happens, CPU/RAM/VRAM usage are all OK, so it doesn't look like an out-of-memory (OOM) error.
Here is the minimum needed to reproduce the issue:
I am running llama-box with this command:
llama-box.exe --port 8082 -c 0 -np 2 --host 0.0.0.0 -m "models/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf" --mmproj "models/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf"
model: https://huggingface.co/bartowski/Qwen2-VL-7B-Instruct-abliterated-GGUF/blob/main/Qwen2-VL-7B-Instruct-abliterated-Q6_K_L.gguf
mmproj: https://huggingface.co/bartowski/Qwen2-VL-7B-Instruct-abliterated-GGUF/blob/main/mmproj-Qwen2-VL-7B-Instruct-abliterated-f16.gguf
I am sending the server this command:
curl http://localhost:8082/v1/chat/completions -H "Content-Type: application/json" -d "@lots_of_ones.txt"
And the file 'lots_of_ones.txt' contains 1,638,400 occurrences of the character '1' (along with a little JSON):
{"model": "hermes2", "messages": [{"role":"user", "content": "1[...]1"}]}
Output from llama-box when it crashes: