Closed as not planned
Labels
bug (Something isn't working), stale (Over 90 days of inactivity)
Description
Your current environment
Using the docker image vllm/vllm-openai:v0.6.2
Model Input Dumps
No response
🐛 Describe the bug
We are facing issues when hosting Salesforce/SFR-Embedding-Mistral.
Due to a configuration error, a wrong tokenizer was used (note the garbled characters in the request logged below), producing the following request body:
{
  "input": [[2, 29497, 25, 220, 931, 11194, 7699, 20, 12, 9335, 482, 510, 44, 18683, 60, 1817, 369, 356, 8653, 3350, 4819, 482, 510, 2485, 1129, 73, 404, 569, 1896, 2221, 32057, 483, 263, 916, 91381, 14, 931, 11194, 7699, 20, 12, 9335, 933, 567, 35185, 25]],
  "model": "sfr-embedding-mistral",
  "encoding_format": "base64"
}
Sending this request causes the following error, which leads to a restart of the container every time:
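The token IDs in this request point at the mismatch directly: SFR-Embedding-Mistral is Mistral-7B-based, whose vocabulary is (to my understanding) 32000 tokens, so IDs such as 91381 cannot index its embedding table. A minimal check along these lines illustrates it (the vocab size of 32000 is an assumption, not taken from this report):

```python
# Sketch: verify that prompt token IDs are valid for the target model's
# vocabulary before sending them to /v1/embeddings.
prompt_token_ids = [2, 29497, 25, 220, 931, 11194, 7699, 20, 12, 9335,
                    482, 510, 44, 18683, 60, 1817, 369, 356, 8653, 3350,
                    4819, 482, 510, 2485, 1129, 73, 404, 569, 1896, 2221,
                    32057, 483, 263, 916, 91381, 14, 931, 11194, 7699, 20,
                    12, 9335, 933, 567, 35185, 25]

VOCAB_SIZE = 32000  # assumed Mistral-7B vocabulary size

# Any ID outside [0, VOCAB_SIZE) would index past the embedding table,
# which is exactly what the indexSelectLargeIndex assertion below guards.
invalid = [t for t in prompt_token_ids if not 0 <= t < VOCAB_SIZE]
print(invalid)  # prints [32057, 91381, 35185]
```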
INFO 12-19 01:10:59 logger.py:36] Received request embd-bbc1cecb453e46349ce78ae3338c25a3-0: prompt: '</s>管 op answersPlayer\x11\t rozurext) coins9info that onNetwork wurde sigurext shortidentFers Br["porter under\x0b op answersPlayer\x11\t rozPro &\x16', params: PoolingParams(additional_metadata=None), prompt_token_ids: [2, 29497, 25, 220, 931, 11194, 7699, 20, 12, 9335, 482, 510, 44, 18683, 60, 1817, 369, 356, 8653, 3350, 4819, 482, 510, 2485, 1129, 73, 404, 569, 1896, 2221, 32057, 483, 263, 916, 91381, 14, 931, 11194, 7699, 20, 12, 9335, 933, 567, 35185, 25], lora_request: None, prompt_adapter_request: None.
INFO 12-19 01:10:59 engine.py:288] Added request embd-bbc1cecb453e46349ce78ae3338c25a3-0.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [983,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
CRITICAL 12-19 01:10:59 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO: 127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
ERROR 12-19 01:10:59 engine.py:157] RuntimeError('CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 12-19 01:10:59 engine.py:157] Traceback (most recent call last):
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 12-19 01:10:59 engine.py:157] self.run_engine_loop()
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 12-19 01:10:59 engine.py:157] request_outputs = self.engine_step()
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 12-19 01:10:59 engine.py:157] raise e
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 12-19 01:10:59 engine.py:157] return self.engine.step()
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1264, in step
ERROR 12-19 01:10:59 engine.py:157] outputs = self.model_executor.execute_model(
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 12-19 01:10:59 engine.py:157] output = self.driver_worker.execute_model(execute_model_req)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 12-19 01:10:59 engine.py:157] output = self.model_runner.execute_model(
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-19 01:10:59 engine.py:157] return func(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/embedding_model_runner.py", line 115, in execute_model
ERROR 12-19 01:10:59 engine.py:157] hidden_states = model_executable(**execute_model_kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 01:10:59 engine.py:157] return self._call_impl(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 01:10:59 engine.py:157] return forward_call(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama_embedding.py", line 41, in forward
ERROR 12-19 01:10:59 engine.py:157] return self.model.forward(input_ids, positions, kv_caches,
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 329, in forward
ERROR 12-19 01:10:59 engine.py:157] hidden_states, residual = layer(
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 01:10:59 engine.py:157] return self._call_impl(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 01:10:59 engine.py:157] return forward_call(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 251, in forward
ERROR 12-19 01:10:59 engine.py:157] hidden_states = self.self_attn(
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 01:10:59 engine.py:157] return self._call_impl(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 01:10:59 engine.py:157] return forward_call(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 181, in forward
ERROR 12-19 01:10:59 engine.py:157] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 01:10:59 engine.py:157] return self._call_impl(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 01:10:59 engine.py:157] return forward_call(*args, **kwargs)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 98, in forward
ERROR 12-19 01:10:59 engine.py:157] return self.impl.forward(query,
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/xformers.py", line 595, in forward
ERROR 12-19 01:10:59 engine.py:157] out = self._run_memory_efficient_xformers_forward(
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/xformers.py", line 717, in _run_memory_efficient_xformers_forward
ERROR 12-19 01:10:59 engine.py:157] attn_bias = BlockDiagonalCausalMask.from_seqlens(
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/attn_bias.py", line 726, in from_seqlens
ERROR 12-19 01:10:59 engine.py:157] q_seqinfo = _SeqLenInfo.from_seqlens(q_seqlen, device=device)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/attn_bias.py", line 358, in from_seqlens
ERROR 12-19 01:10:59 engine.py:157] min_seqlen, max_seqlen, seqstart_py, seqstart = cls._get_seqstart(
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/attn_bias.py", line 346, in _get_seqstart
ERROR 12-19 01:10:59 engine.py:157] seqstart = torch.tensor(seqstart_py, dtype=torch.int32, device=device)
ERROR 12-19 01:10:59 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 01:10:59 engine.py:157] RuntimeError: CUDA error: device-side assert triggered
ERROR 12-19 01:10:59 engine.py:157] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 12-19 01:10:59 engine.py:157] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 12-19 01:10:59 engine.py:157] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 12-19 01:10:59 engine.py:157]
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1]
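For reference, the log's own debugging hint (CUDA_LAUNCH_BLOCKING=1) can be applied when launching the container, so the failing kernel is reported at its actual call site. A sketch; the port and GPU flags are assumptions for a typical setup:

```shell
# Re-run the same image with synchronous CUDA launches for a precise stack trace.
docker run --gpus all -p 8000:8000 \
  -e CUDA_LAUNCH_BLOCKING=1 \
  vllm/vllm-openai:v0.6.2 \
  --model Salesforce/SFR-Embedding-Mistral \
  --served-model-name sfr-embedding-mistral
```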
Help would be highly appreciated on how to configure vLLM so that it rejects such inputs instead of crashing, or on how to resolve the issue entirely.
Thanks in advance!
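Until the server validates token IDs itself, one possible mitigation is a client-side guard before the request is sent. A minimal sketch using only the standard library; the endpoint URL is the vLLM default and the 32000 vocab size is an assumption:

```python
import json
import urllib.request

VOCAB_SIZE = 32000  # assumed vocabulary size of SFR-Embedding-Mistral


def safe_embed(batches, url="http://localhost:8000/v1/embeddings",
               model="sfr-embedding-mistral"):
    """Reject token-ID batches that would index past the embedding table,
    then forward the request to the OpenAI-compatible embeddings endpoint."""
    for batch in batches:
        bad = [t for t in batch if not 0 <= t < VOCAB_SIZE]
        if bad:
            raise ValueError(f"out-of-range token ids: {bad}")
    body = json.dumps({"input": batches, "model": model,
                       "encoding_format": "base64"}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With this guard, the malformed request from this report would fail fast on the client with a `ValueError` instead of crashing the engine.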