Your current environment
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100S-PCIE-32GB
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 15
On-line CPU(s) list: 0-14
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 15
Stepping: 7
BogoMIPS: 5786.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat vnmi umip pku ospke avx512_vnni md_clear arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 480 KiB (15 instances)
L1i cache: 480 KiB (15 instances)
L2 cache: 60 MiB (15 instances)
L3 cache: 240 MiB (15 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-14
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-14 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
After making a call to POST /v1/chat/completions with the following content:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/models/Meta-Llama-3-8B-Instruct",
"logit_bias": {
"AI": -100
},
"messages": [
{
"role": "system",
"content": "You are a a helpful assistant."
},
{
"role": "user",
"content": "What can I do with AI? Provide a very short answer."
}
]
}'
(Note: the logit_bias parameter above is invalid; per the OpenAI API, the mapping should be from integer token IDs to integer bias values, not from strings to ints: https://platform.openai.com/docs/api-reference/chat/create#chat-create-logit_bias. The model used is https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.)
vLLM returns an error and then falls into AsyncEngineDeadError. From this point, the inference server is unable to serve any request, and /health returns an HTTP 500 Internal Server Error.
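For reference, here is a minimal sketch (not part of the original report) of how the same request could be sent with a well-formed logit_bias, using integer token IDs as keys. It assumes the vLLM server from the report is listening on localhost:8000, that the tokenizer files are readable from the same model path, and that the requests and transformers packages are installed; the choice of " AI" as the text to look up is purely illustrative.

# Sketch: the same chat request with a well-formed logit_bias, i.e.
# token IDs (string keys in JSON, as required by the JSON format)
# mapped to bias values.
import requests
from transformers import AutoTokenizer

MODEL = "/models/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Look up the token ID(s) for the text to suppress instead of passing
# the raw string "AI" as a key.
ai_token_ids = tokenizer.encode(" AI", add_special_tokens=False)
logit_bias = {str(token_id): -100 for token_id in ai_token_ids}

payload = {
    "model": MODEL,
    "logit_bias": logit_bias,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",
         "content": "What can I do with AI? Provide a very short answer."},
    ],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.status_code, resp.json())

A request shaped like this should be accepted, since the keys can be parsed by int(); only the string-keyed variant above triggers the crash described below.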
Logs:
2024-06-25 12:27:28.192 TRACE: 172.19.0.1:56910 - HTTP connection made
2024-06-25 12:27:28.214 TRACE: 172.19.0.1:56910 - ASGI [1281] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.4'}, 'http_version': '1.1', 'server': ('172.19.0.3', 8000), 'client': ('172.19.0.1', 56910), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'POST', 'path': '/v1/chat/completions', 'raw_path': b'/v1/chat/completions', 'query_string': b''}
2024-06-25 12:27:28.221 TRACE: 172.19.0.1:56910 - ASGI [1281] Receive {'type': 'http.request', 'body': '<223 bytes>', 'more_body': False}
2024-06-25 12:27:28.236 INFO 06-25 10:27:28 async_llm_engine.py:561] Received request cmpl-dbbcd0dea34644228aab6c59085edc42: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat can I do with AI? Provide a very short answer.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8157, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 3923, 649, 358, 656, 449, 15592, 30, 40665, 264, 1633, 2875, 4320, 13, 128009, 128006, 78191, 128007, 271], lora_request: None.
2024-06-25 12:27:28.238 DEBUG 06-25 10:27:28 async_llm_engine.py:524] Got new requests!
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] Engine background task failed
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] Traceback (most recent call last):
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] return_value = task.result()
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 529, in run_engine_loop
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] has_requests_in_progress = await asyncio.wait_for(
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] return fut.result()
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 503, in engine_step
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] request_outputs = await self.engine.step_async()
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] output = await self.model_executor.execute_model_async(
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] output = await make_async(self.driver_worker.execute_model
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] result = self.fn(*self.args, **self.kwargs)
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] return func(*args, **kwargs)
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] output = self.model_runner.execute_model(seq_group_metadata_list,
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] return func(*args, **kwargs)
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 747, in execute_model
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] logits = self.model.compute_logits(hidden_states, sampling_metadata)
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 377, in compute_logits
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] logits = self.logits_processor(self.lm_head.weight, hidden_states,
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] return self._call_impl(*args, **kwargs)
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] return forward_call(*args, **kwargs)
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/logits_processor.py", line 59, in forward
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] logits = _apply_logits_processors(logits, sampling_metadata)
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/logits_processor.py", line 116, in _apply_logits_processors
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] logits_row = logits_processor(past_tokens_ids,
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/protocol.py", line 245, in logit_bias_logits_processor
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] logits[int(token_id)] += bias
2024-06-25 12:27:28.328 ERROR 06-25 10:27:28 async_llm_engine.py:52] ValueError: invalid literal for int() with base 10: 'AI'
2024-06-25 12:27:28.331 Exception in callback functools.partial(<function _log_task_completion at 0x751ff1613760>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x751fd98b9ea0>>)
2024-06-25 12:27:28.331 handle: <Handle functools.partial(<function _log_task_completion at 0x751ff1613760>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x751fd98b9ea0>>)>
2024-06-25 12:27:28.331 Traceback (most recent call last):
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
2024-06-25 12:27:28.331 return_value = task.result()
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 529, in run_engine_loop
2024-06-25 12:27:28.331 has_requests_in_progress = await asyncio.wait_for(
2024-06-25 12:27:28.331 File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-06-25 12:27:28.331 return fut.result()
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 503, in engine_step
2024-06-25 12:27:28.331 request_outputs = await self.engine.step_async()
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
2024-06-25 12:27:28.331 output = await self.model_executor.execute_model_async(
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
2024-06-25 12:27:28.331 output = await make_async(self.driver_worker.execute_model
2024-06-25 12:27:28.331 File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-06-25 12:27:28.331 result = self.fn(*self.args, **self.kwargs)
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-25 12:27:28.331 return func(*args, **kwargs)
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
2024-06-25 12:27:28.331 output = self.model_runner.execute_model(seq_group_metadata_list,
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-25 12:27:28.331 return func(*args, **kwargs)
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 747, in execute_model
2024-06-25 12:27:28.331 logits = self.model.compute_logits(hidden_states, sampling_metadata)
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 377, in compute_logits
2024-06-25 12:27:28.331 logits = self.logits_processor(self.lm_head.weight, hidden_states,
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-25 12:27:28.331 return self._call_impl(*args, **kwargs)
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-25 12:27:28.331 return forward_call(*args, **kwargs)
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/logits_processor.py", line 59, in forward
2024-06-25 12:27:28.331 logits = _apply_logits_processors(logits, sampling_metadata)
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/logits_processor.py", line 116, in _apply_logits_processors
2024-06-25 12:27:28.331 logits_row = logits_processor(past_tokens_ids,
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/protocol.py", line 245, in logit_bias_logits_processor
2024-06-25 12:27:28.331 logits[int(token_id)] += bias
2024-06-25 12:27:28.331 ValueError: invalid literal for int() with base 10: 'AI'
2024-06-25 12:27:28.331
2024-06-25 12:27:28.331 The above exception was the direct cause of the following exception:
2024-06-25 12:27:28.331
2024-06-25 12:27:28.331 Traceback (most recent call last):
2024-06-25 12:27:28.331 File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
2024-06-25 12:27:28.331 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
2024-06-25 12:27:28.331 raise AsyncEngineDeadError(
2024-06-25 12:27:28.331 vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
2024-06-25 12:27:28.333 INFO 06-25 10:27:28 async_llm_engine.py:167] Aborted request cmpl-dbbcd0dea34644228aab6c59085edc42.
2024-06-25 12:27:28.335 TRACE: 172.19.0.1:56910 - ASGI [1281] Send {'type': 'http.response.start', 'status': 400, 'headers': '<...>'}
2024-06-25 12:27:28.336 INFO: 172.19.0.1:56910 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
2024-06-25 12:27:28.337 TRACE: 172.19.0.1:56910 - ASGI [1281] Send {'type': 'http.response.body', 'body': '<124 bytes>'}
2024-06-25 12:27:28.339 TRACE: 172.19.0.1:56910 - ASGI [1281] Completed
2024-06-25 12:27:28.697 TRACE: 10.0.3.98:50040 - HTTP connection lost
2024-06-25 12:27:29.886 TRACE: 10.0.1.65:50586 - HTTP connection made
2024-06-25 12:27:29.889 TRACE: 10.0.1.65:50586 - ASGI [1282] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.4'}, 'http_version': '1.1', 'server': ('172.19.0.3', 8000), 'client': ('10.0.1.65', 50586), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'GET', 'path': '/health', 'raw_path': b'/health', 'query_string': b''}
2024-06-25 12:27:29.895 DEBUG 06-25 10:27:29 async_llm_engine.py:837] Starting health check...
2024-06-25 12:27:29.897 TRACE: 10.0.1.65:50586 - ASGI [1282] Send {'type': 'http.response.start', 'status': 500, 'headers': '<...>'}
2024-06-25 12:27:29.898 INFO: 10.0.1.65:50586 - "GET /health HTTP/1.1" 500 Internal Server Error
2024-06-25 12:27:29.899 TRACE: 10.0.1.65:50586 - ASGI [1282] Send {'type': 'http.response.body', 'body': '<21 bytes>'}
2024-06-25 12:27:29.900 TRACE: 10.0.1.65:50586 - ASGI [1282] Raised exception
2024-06-25 12:27:29.905 ERROR: Exception in ASGI application
2024-06-25 12:27:29.905 Traceback (most recent call last):
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
2024-06-25 12:27:29.905 result = await app( # type: ignore[func-returns-value]
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
2024-06-25 12:27:29.905 return await self.app(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/message_logger.py", line 84, in __call__
2024-06-25 12:27:29.905 raise exc from None
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/message_logger.py", line 80, in __call__
2024-06-25 12:27:29.905 await self.app(scope, inner_receive, inner_send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
2024-06-25 12:27:29.905 await super().__call__(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
2024-06-25 12:27:29.905 await self.middleware_stack(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
2024-06-25 12:27:29.905 raise exc
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
2024-06-25 12:27:29.905 await self.app(scope, receive, _send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
2024-06-25 12:27:29.905 await self.app(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
2024-06-25 12:27:29.905 await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
2024-06-25 12:27:29.905 raise exc
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
2024-06-25 12:27:29.905 await app(scope, receive, sender)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
2024-06-25 12:27:29.905 await self.middleware_stack(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
2024-06-25 12:27:29.905 await route.handle(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
2024-06-25 12:27:29.905 await self.app(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
2024-06-25 12:27:29.905 await wrap_app_handling_exceptions(app, request)(scope, receive, send)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
2024-06-25 12:27:29.905 raise exc
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
2024-06-25 12:27:29.905 await app(scope, receive, sender)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
2024-06-25 12:27:29.905 response = await func(request)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
2024-06-25 12:27:29.905 raw_response = await run_endpoint_function(
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
2024-06-25 12:27:29.905 return await dependant.call(**values)
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 84, in health
2024-06-25 12:27:29.905 await openai_serving_chat.engine.check_health()
2024-06-25 12:27:29.905 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 839, in check_health
2024-06-25 12:27:29.905 raise AsyncEngineDeadError("Background loop is stopped.")
2024-06-25 12:27:29.905 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
2024-06-25 12:27:29.905 TRACE: 10.0.1.65:50586 - HTTP connection lost
I am able to reproduce the bug 100% of the time.
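The last frame of the traceback shows the root cause: the raw string key 'AI' from the request reaches logit_bias_logits_processor unvalidated, int('AI') raises ValueError inside the engine loop, and the loop dies instead of the request being rejected at parse time. Below is a minimal sketch of the kind of up-front validation that would turn this into an ordinary 400 response; the helper name, exception type, and vocabulary size are illustrative and not vLLM's actual code.

from typing import Dict, Optional


class InvalidLogitBiasError(ValueError):
    """Raised when a logit_bias key is not an integer token ID."""


def validate_logit_bias(logit_bias: Optional[Dict[str, float]],
                        vocab_size: int) -> Dict[int, float]:
    """Convert and range-check logit_bias keys before any logits
    processor is built, so malformed input never reaches the engine."""
    if not logit_bias:
        return {}
    clean: Dict[int, float] = {}
    for key, bias in logit_bias.items():
        try:
            token_id = int(key)
        except ValueError as exc:
            raise InvalidLogitBiasError(
                f"logit_bias key {key!r} is not an integer token ID") from exc
        if not 0 <= token_id < vocab_size:
            raise InvalidLogitBiasError(
                f"token ID {token_id} is out of range for vocab size {vocab_size}")
        clean[token_id] = float(bias)
    return clean


# The request from this report would now fail fast with a clear error
# instead of taking down the whole engine.
try:
    validate_logit_bias({"AI": -100}, vocab_size=128256)
except InvalidLogitBiasError as err:
    print(f"400 Bad Request: {err}")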