[Bug]: vLLM crashes on tokenized embedding input #11375

@FriedrichBethke

Your current environment

Using the docker image vllm/vllm-openai:v0.6.2

Model Input Dumps

No response

🐛 Describe the bug

We are facing issues when hosting Salesforce/SFR-Embedding-Mistral.
Due to a configuration error, the wrong tokenizer was used (note the garbage characters in the decoded prompt in the error log below), producing the following request body:

{
  "input": [[2, 29497, 25, 220, 931, 11194, 7699, 20, 12, 9335, 482, 510, 44, 18683, 60, 1817, 369, 356, 8653, 3350, 4819, 482, 510, 2485, 1129, 73, 404, 569, 1896, 2221, 32057, 483, 263, 916, 91381, 14, 931, 11194, 7699, 20, 12, 9335, 933, 567, 35185, 25]],
  "model": "sfr-embedding-mistral",
  "encoding_format": "base64"
}
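For context, a quick check (assuming the 32,000-token Mistral-7B vocabulary, which SFR-Embedding-Mistral is based on) shows that several of the IDs in this request cannot exist in the model's embedding table, which is consistent with the `indexSelectLargeIndex` assertion below:

```python
# Token IDs from the failing request body above.
token_ids = [2, 29497, 25, 220, 931, 11194, 7699, 20, 12, 9335, 482, 510,
             44, 18683, 60, 1817, 369, 356, 8653, 3350, 4819, 482, 510,
             2485, 1129, 73, 404, 569, 1896, 2221, 32057, 483, 263, 916,
             91381, 14, 931, 11194, 7699, 20, 12, 9335, 933, 567, 35185, 25]

VOCAB_SIZE = 32_000  # Mistral-7B vocabulary size (base of SFR-Embedding-Mistral)

# Any ID >= VOCAB_SIZE makes the embedding lookup index out of bounds,
# which is exactly what `srcIndex < srcSelectDimSize` asserts against.
out_of_range = [t for t in token_ids if t >= VOCAB_SIZE]
print(out_of_range)  # [32057, 91381, 35185]
```

The IDs look like output of an incompatible tokenizer with a much larger vocabulary, which matches the configuration error described above.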

Sending this request triggers the following error, which crashes and restarts the container every time:

INFO 12-19 01:10:59 logger.py:36] Received request embd-bbc1cecb453e46349ce78ae3338c25a3-0: prompt: '</s>管   op answersPlayer\x11\t rozurext) coins9info that onNetwork wurde sigurext shortidentFers              Br["porter under\x0b op answersPlayer\x11\t rozPro &\x16', params: PoolingParams(additional_metadata=None), prompt_token_ids: [2, 29497, 25, 220, 931, 11194, 7699, 20, 12, 9335, 482, 510, 44, 18683, 60, 1817, 369, 356, 8653, 3350, 4819, 482, 510, 2485, 1129, 73, 404, 569, 1896, 2221, 32057, 483, 263, 916, 91381, 14, 931, 11194, 7699, 20, 12, 9335, 933, 567, 35185, 25], lora_request: None, prompt_adapter_request: None.
INFO 12-19 01:10:59 engine.py:288] Added request embd-bbc1cecb453e46349ce78ae3338c25a3-0.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same assertion repeats for threads [1,0,0] through [31,0,0] ...]
CRITICAL 12-19 01:10:59 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
ERROR 12-19 01:10:59 engine.py:157] RuntimeError('CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 12-19 01:10:59 engine.py:157] Traceback (most recent call last):
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 12-19 01:10:59 engine.py:157]     self.run_engine_loop()
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 12-19 01:10:59 engine.py:157]     request_outputs = self.engine_step()
ERROR 12-19 01:10:59 engine.py:157]                       ^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 12-19 01:10:59 engine.py:157]     raise e
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 12-19 01:10:59 engine.py:157]     return self.engine.step()
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1264, in step
ERROR 12-19 01:10:59 engine.py:157]     outputs = self.model_executor.execute_model(
ERROR 12-19 01:10:59 engine.py:157]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 12-19 01:10:59 engine.py:157]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 12-19 01:10:59 engine.py:157]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 12-19 01:10:59 engine.py:157]     output = self.model_runner.execute_model(
ERROR 12-19 01:10:59 engine.py:157]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-19 01:10:59 engine.py:157]     return func(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/embedding_model_runner.py", line 115, in execute_model
ERROR 12-19 01:10:59 engine.py:157]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 12-19 01:10:59 engine.py:157]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 01:10:59 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 01:10:59 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama_embedding.py", line 41, in forward
ERROR 12-19 01:10:59 engine.py:157]     return self.model.forward(input_ids, positions, kv_caches,
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 329, in forward
ERROR 12-19 01:10:59 engine.py:157]     hidden_states, residual = layer(
ERROR 12-19 01:10:59 engine.py:157]                               ^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 01:10:59 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 01:10:59 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 251, in forward
ERROR 12-19 01:10:59 engine.py:157]     hidden_states = self.self_attn(
ERROR 12-19 01:10:59 engine.py:157]                     ^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 01:10:59 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 01:10:59 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 181, in forward
ERROR 12-19 01:10:59 engine.py:157]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 12-19 01:10:59 engine.py:157]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 01:10:59 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 01:10:59 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 98, in forward
ERROR 12-19 01:10:59 engine.py:157]     return self.impl.forward(query,
ERROR 12-19 01:10:59 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/xformers.py", line 595, in forward
ERROR 12-19 01:10:59 engine.py:157]     out = self._run_memory_efficient_xformers_forward(
ERROR 12-19 01:10:59 engine.py:157]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/xformers.py", line 717, in _run_memory_efficient_xformers_forward
ERROR 12-19 01:10:59 engine.py:157]     attn_bias = BlockDiagonalCausalMask.from_seqlens(
ERROR 12-19 01:10:59 engine.py:157]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/attn_bias.py", line 726, in from_seqlens
ERROR 12-19 01:10:59 engine.py:157]     q_seqinfo = _SeqLenInfo.from_seqlens(q_seqlen, device=device)
ERROR 12-19 01:10:59 engine.py:157]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/attn_bias.py", line 358, in from_seqlens
ERROR 12-19 01:10:59 engine.py:157]     min_seqlen, max_seqlen, seqstart_py, seqstart = cls._get_seqstart(
ERROR 12-19 01:10:59 engine.py:157]                                                     ^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/attn_bias.py", line 346, in _get_seqstart
ERROR 12-19 01:10:59 engine.py:157]     seqstart = torch.tensor(seqstart_py, dtype=torch.int32, device=device)
ERROR 12-19 01:10:59 engine.py:157]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] RuntimeError: CUDA error: device-side assert triggered
ERROR 12-19 01:10:59 engine.py:157] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 12-19 01:10:59 engine.py:157] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 12-19 01:10:59 engine.py:157] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 12-19 01:10:59 engine.py:157] 
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]

Help on how to configure vLLM to reject such inputs, or on resolving the issue entirely, would be highly appreciated.

Thanks in advance!
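Until the server validates token IDs itself, a client-side guard along these lines (a hypothetical helper, assuming the model's vocabulary size is known to the client) can catch such requests before they reach `/v1/embeddings`:

```python
def validate_token_ids(batches, vocab_size):
    """Raise ValueError if any token ID falls outside [0, vocab_size)."""
    for i, ids in enumerate(batches):
        bad = [t for t in ids if not 0 <= t < vocab_size]
        if bad:
            raise ValueError(
                f"batch {i}: token IDs {bad} are outside the model "
                f"vocabulary (size {vocab_size}); wrong tokenizer?"
            )

# Usage: check the payload before POSTing it to the server.
payload = {"input": [[2, 29497, 25, 91381]], "model": "sfr-embedding-mistral"}
try:
    validate_token_ids(payload["input"], vocab_size=32_000)
except ValueError as e:
    print(e)  # batch 0: token IDs [91381] are outside the model vocabulary ...
```

This only works around the crash on the client side; the underlying issue is that the engine passes unvalidated IDs into the CUDA embedding lookup.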

