ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
#2418
Comments
Same error. I set gpu_memory_utilization=0.75, but the responses are too short...
Having the same issue running CodeLlama-13b-Instruct-hf with the LangChain integration for vLLM.
Same error.
Same with Mistral-7B-v0.1: ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26064). Try increasing gpu_memory_utilization or decreasing max_model_len.
Same exception with
Set
I hard-coded a fixed value for max_model_len in vllm/config.py (line 104).
I have the same issue here.
I am having this problem with this command:
python -m vllm.entrypoints.openai.api_server --model abacusai/Smaug-72B-v0.1 --tensor-parallel-size 4 --trust-remote-code --gpu-memory-utilization 0.9 --host 0.0.0.0 --port 9002
but we get this:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (8512). Try increasing gpu_memory_utilization or decreasing max_model_len.
Is there a workaround to launch this from the command line?
Yes, it looks like you can add that as a flag on the command line (see line 22 in e433c11).
Thanks, will try that!
Oops. You'll want to use hyphens and not underscores (see line 143 in e433c11).
Yup, found that, LOL!
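For anyone who prefers the Python API over the CLI, here is a minimal sketch of the same workaround (the kwarg names are assumed to mirror the CLI flags, and the 8192 value is only an illustration chosen to fit under the 8512-token KV cache reported above):

```python
from vllm import LLM

# Sketch only: max_model_len plays the role of --max-model-len and
# gpu_memory_utilization the role of --gpu-memory-utilization.
llm = LLM(
    model="abacusai/Smaug-72B-v0.1",
    tensor_parallel_size=4,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    max_model_len=8192,  # must not exceed the KV cache capacity (8512 here)
)
```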
Same error, same fix. Weird...
Is there a solution to this problem now? I still encounter it with gemma-7b.
Maybe a lower model length should be fine; just keep watching the logs and then make the adjustment.
The documentation states that the gemma-7b model is supported, along with many other large models. Is it a machine-configuration issue? This is an RTX 4090 desktop computer.
No idea about that, mate. I'm currently using AliCloud Qwen1.5-7B-INT4; with the model length set to 1024 it works fine as expected.
Try to change
That didn't work. Can you post the modified files and code?
Line 30 in 24aecf4
Line 51 in 24aecf4
It works when I run this model with Hugging Face or vLLM on an RTX 4090. I have also run google/gemma-7b successfully with HF.
Hello, I used to use the same engine settings as follows: With 2 NVIDIA L4 GPUs it now shows the same error. Why, and how can I return to the previous configuration? I already ran a set of experiments on that configuration, and I must keep it the same.
I see the code at line 486 in e221910, and "model_max_length": 1000000000000000019884624838656 from https://huggingface.co/codellama/CodeLlama-13b-hf/blob/main/tokenizer_config.json.
Maybe you changed the max_model_len as in #322 (comment), but I'm not sure.
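As an aside, that huge model_max_length is not a real context limit: it is the Hugging Face tokenizer's "effectively unbounded" sentinel, int(1e30) printed exactly, so the usable length has to come from the model config or an explicit max_model_len. A quick sanity check:

```python
# The odd-looking model_max_length in tokenizer_config.json is just the
# tokenizer's "no limit" placeholder rather than a real constraint.
print(int(1e30))  # 1000000000000000019884624838656
```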
I am unsure whether this would suit me, as I need to keep the engine settings the same for the whole experiment. Could anyone clarify whether this trick changes model performance on the inference side?
Is max_model_len=2048 arbitrary, or is it simply the maximum number of tokens I can expect to inference?
So is max_model_len best set to the maximum number of tokens I may need to inference?
I am using the following code.
Getting the issue:
I have defined max_model_len smaller than my KV cache capacity, but it still gives the same issue. I am open to changing the loading parameters. Can anyone tell me what can be done here?
What is the "max_model_len" equivalent argument while initializing LLM class of vllm. LLM(max_model_len=2048) doesn't seem to work; there must be some other argument! |
Is there a way to programmatically get the maximum number of tokens that can be stored in the KV cache before running the model?
How can I increase the KV cache?
Oh, never mind, I found the code (lines 179 to 204 in 5ed3505).
Apparently vLLM runs a profiling pass of the model in order to find the available KV cache.
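You cannot read that number before the profiling pass has run, but once the engine has initialized you can query it from your own script. A rough sketch, assuming the attribute names below (they may differ between vLLM versions) and a placeholder model path:

```python
from vllm import LLM

# After construction the engine has already profiled memory and allocated
# its KV-cache blocks; capacity in tokens is num_gpu_blocks * block_size.
llm = LLM(model="/path/to/your/model", gpu_memory_utilization=0.9)  # placeholder path

cache_config = llm.llm_engine.cache_config  # assumed attribute path
kv_cache_tokens = cache_config.num_gpu_blocks * cache_config.block_size
print(f"KV cache can hold about {kv_cache_tokens} tokens")
```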
Thanks, it works! Do you know how to set this in the parameters of my own scripts instead of changing the official scripts?
On an A10G with a single GPU, I tried running Llama-3.1-8b-Instruct with vLLM using the following configuration:
I got the following error. Then I changed to the following configuration:
And got this error:
Why did the KV cache size change?
@markVaykhansky chunked prefill is enabled by default for Llama 3.1, but when you change max_model_len to 25600 it gets disabled, and the reserved memory changes.
This solved my problem. I had the same issue using an RTX 4090 24GB.
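Regarding the chunked-prefill point above: if you want the memory reservation to stay comparable across runs, one option is to pin the flag explicitly instead of relying on the default. A minimal sketch, assuming a vLLM version whose engine arguments include enable_chunked_prefill (the model id is illustrative):

```python
from vllm import LLM

# Keep chunked prefill explicitly enabled so lowering max_model_len does not
# silently toggle it and change how much memory is reserved.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=25600,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.9,
)
```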
I followed the Quickstart tutorial and deployed the Chinese-llama-alpaca-2 model using vLLM, and I got the following error.
***@***:~/Code/experiment/***/ToG$ CUDA_VISIBLE_DEVICES=0 python load_llm.py
INFO 01-11 15:51:02 llm_engine.py:70] Initializing an LLM engine with config: model='/home/***/***/models/alpaca-2', tokenizer='/home/***/***/models/alpaca-2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
INFO 01-11 15:51:18 llm_engine.py:275] # GPU blocks: 229, # CPU blocks: 512
Traceback (most recent call last):
  File "load_llm.py", line 8, in <module>
    llm = LLM(model='/home/***/***/models/alpaca-2')
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 309, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/***/anaconda3/envs/lys-llm-env/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 284, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
My code is:
What's going on, and what do I need to do to fix the error?
I run the code on a single RTX 3090 (24G).
Looking forward to a reply!
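For what it's worth, the 3664 in this traceback follows directly from the block allocation shown in the log: the profiler allocated 229 GPU blocks, and with vLLM's default block size of 16 tokens that gives a KV cache smaller than the model's 4096-token max seq len. A quick check of the arithmetic:

```python
# Figures taken from the log above; 16 is vLLM's default block size.
num_gpu_blocks = 229
block_size = 16

kv_cache_tokens = num_gpu_blocks * block_size
print(kv_cache_tokens)         # 3664 -- the number in the error message
print(kv_cache_tokens < 4096)  # True, hence the ValueError
```

So either raise gpu_memory_utilization so that more blocks fit, or pass a max_model_len of at most 3664.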