ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
#2418
Comments
Same error. I set gpu_memory_utilization=0.75, but the responses are too short...
Having the same issue running CodeLlama-13b-Instruct-hf with the LangChain integration for vLLM.
Same error.
Same with Mistral-7B-v0.1: ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26064). Try increasing gpu_memory_utilization or decreasing max_model_len.
Same exception with
Set
I hard-coded a fixed value for max_model_len in vllm/config.py (line 104).
I have the same issue here.
I am having this problem with this command:
python -m vllm.entrypoints.openai.api_server --model abacusai/Smaug-72B-v0.1 --tensor-parallel-size 4 --trust-remote-code --gpu-memory-utilization 0.9 --host 0.0.0.0 --port 9002
but we get this:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (8512). Try increasing gpu_memory_utilization or decreasing max_model_len.
Is there a workaround to launch this from the command line?
Yes, it looks like you can add that as a flag on the command line (see line 22 in e433c11).
Thanks, will try that!
Oops. You'll want to use hyphens and not underscores (see line 143 in e433c11).
Yup, found that, LOL!
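For anyone who prefers the Python API over the CLI, here is a minimal sketch of the same workaround (the kwarg names are assumed to mirror the CLI flags, and the 8192 value is only an illustration chosen to fit under the 8512-token KV cache reported above):

```python
from vllm import LLM

# Sketch only: max_model_len plays the role of --max-model-len and
# gpu_memory_utilization the role of --gpu-memory-utilization.
llm = LLM(
    model="abacusai/Smaug-72B-v0.1",
    tensor_parallel_size=4,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    max_model_len=8192,  # must not exceed the KV cache capacity (8512 here)
)
```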
Same error, same fix. Weird...
Is there a solution to this problem now? I still encounter it with gemma-7b.
Maybe a lower model length should be fine; just keep watching the logs and then make the adjustment.
The documentation states that the gemma-7b model is supported, along with many other large models. Is it a machine-configuration issue? This is an RTX 4090 desktop computer.
No idea about that, mate. I'm currently using AliCloud Qwen1.5-7B-INT4; with the model length set to 1024 it works fine as expected.
Try to change
That didn't work. Can you post the modified files and code?
Line 30 in 24aecf4
Line 51 in 24aecf4
It works when I run this model with Hugging Face or vLLM on an RTX 4090. I have also run google/gemma-7b successfully with HF.
Hello, I used to use the same engine settings as follows: With 2 NVIDIA L4 GPUs it now shows the same error. Why, and how can I return to the previous configuration? I already ran a set of experiments on that configuration, and I must keep it the same.
I see the code at line 486 in e221910, and "model_max_length": 1000000000000000019884624838656 from https://huggingface.co/codellama/CodeLlama-13b-hf/blob/main/tokenizer_config.json.
Maybe you changed the max_model_len as in #322 (comment), but I'm not sure.
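As an aside, that huge model_max_length is not a real context limit: it is the Hugging Face tokenizer's "effectively unbounded" sentinel, int(1e30) printed exactly, so the usable length has to come from the model config or an explicit max_model_len. A quick sanity check:

```python
# The odd-looking model_max_length in tokenizer_config.json is just the
# tokenizer's "no limit" placeholder rather than a real constraint.
print(int(1e30))  # 1000000000000000019884624838656
```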
I am unsure whether this would suit me, as I need to keep the engine settings the same for the whole experiment. Could anyone clarify whether this trick changes model performance on the inference side?
Is max_model_len=2048 arbitrary, or is it simply the maximum number of tokens I can expect to inference?
So is max_model_len best set to the maximum number of tokens I may need to inference?
I am using the following code.
Getting the issue:
I have defined max_model_len smaller than my KV cache capacity, but it still gives the same issue. I am open to changing the loading parameters. Can anyone tell me what can be done here?
What is the "max_model_len" equivalent argument while initializing LLM class of vllm. LLM(max_model_len=2048) doesn't seem to work; there must be some other argument! |
Is there a way to programmatically get the maximum number of tokens that can be stored in the KV cache before running the model?
How can I increase the KV cache?
Oh, never mind, I found the code (lines 179 to 204 in 5ed3505).
Apparently vLLM runs a profiling pass of the model in order to find the available KV cache.
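You cannot read that number before the profiling pass has run, but once the engine has initialized you can query it from your own script. A rough sketch, assuming the attribute names below (they may differ between vLLM versions) and a placeholder model path:

```python
from vllm import LLM

# After construction the engine has already profiled memory and allocated
# its KV-cache blocks; capacity in tokens is num_gpu_blocks * block_size.
llm = LLM(model="/path/to/your/model", gpu_memory_utilization=0.9)  # placeholder path

cache_config = llm.llm_engine.cache_config  # assumed attribute path
kv_cache_tokens = cache_config.num_gpu_blocks * cache_config.block_size
print(f"KV cache can hold about {kv_cache_tokens} tokens")
```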
Thanks, it works! Do you know how to set this in the parameters of my own scripts instead of changing the official scripts?
On an A10G with a single GPU, I tried running Llama-3.1-8b-Instruct with vLLM using the following configuration:
I got the following error. Then I changed to the following configuration:
And got this error:
Why did the KV cache size change?
@markVaykhansky chunked prefill is enabled by default for Llama 3.1, but when you change max_model_len to 25600 it gets disabled, and the reserved memory changes.
This solved my problem. I had the same issue using an RTX 4090 24GB.
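Regarding the chunked-prefill point above: if you want the memory reservation to stay comparable across runs, one option is to pin the flag explicitly instead of relying on the default. A minimal sketch, assuming a vLLM version whose engine arguments include enable_chunked_prefill (the model id is illustrative):

```python
from vllm import LLM

# Keep chunked prefill explicitly enabled so lowering max_model_len does not
# silently toggle it and change how much memory is reserved.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=25600,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.9,
)
```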
I followed the Quickstart tutorial and deployed the Chinese-llama-alpaca-2 model using vLLM, and I got the following error.
***@***:~/Code/experiment/***/ToG$ CUDA_VISIBLE_DEVICES=0 python load_llm.py
INFO 01-11 15:51:02 llm_engine.py:70] Initializing an LLM engine with config: model='/home/***/***/models/alpaca-2', tokenizer='/home/***/***/models/alpaca-2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
INFO 01-11 15:51:18 llm_engine.py:275] # GPU blocks: 229, # CPU blocks: 512
Traceback (most recent call last):
  File "load_llm.py", line 8, in <module>
    llm = LLM(model='/home/***/***/models/alpaca-2')
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 309, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 284, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
My code is:
What's going on, and what do I need to do to fix the error?
I run the code on a single RTX 3090 (24G).
Looking forward to a reply!
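For what it's worth, the 3664 in this traceback follows directly from the block allocation shown in the log: the profiler allocated 229 GPU blocks, and with vLLM's default block size of 16 tokens that gives a KV cache smaller than the model's 4096-token max seq len. A quick check of the arithmetic:

```python
# Figures taken from the log above; 16 is vLLM's default block size.
num_gpu_blocks = 229
block_size = 16

kv_cache_tokens = num_gpu_blocks * block_size
print(kv_cache_tokens)         # 3664 -- the number in the error message
print(kv_cache_tokens < 4096)  # True, hence the ValueError
```

So either raise gpu_memory_utilization so that more blocks fit, or pass a max_model_len of at most 3664.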