Environment
Engine Configuration:
V1 LLM engine (v0.9.1) with config:
- model='/tmp/model/'
- speculative_config=SpeculativeConfig(method='eagle3', model='/tmp/model/eagle_head/', num_spec_tokens=5)
- tensor_parallel_size=8
- pipeline_parallel_size=1
- quantization=compressed-tensors
- max_seq_len=131072
Prefix caching and chunked prefill are enabled (the default V1 behavior). The issue also occurred in vLLM v0.8.5.post1 and has been hard to reproduce.
Model: Llama-3.3-70B-Instruct
Hardware: 8 H200 GPUs
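
For reference, a minimal sketch of how the configuration above can be built through the offline `LLM` API. The paths and values are copied from the engine config; the `speculative_config` dict keys follow the documented EAGLE-style setup, and the prompt and sampling parameters are illustrative only:

```python
# Sketch reproducing the reported engine configuration (values from the
# report; prompt/sampling are illustrative). Prefix caching and chunked
# prefill are left at their V1 defaults (enabled).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/tmp/model/",                   # Llama-3.3-70B-Instruct
    tensor_parallel_size=8,
    quantization="compressed-tensors",
    max_model_len=131072,
    speculative_config={
        "method": "eagle3",
        "model": "/tmp/model/eagle_head/",
        "num_speculative_tokens": 5,
    },
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
```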
🐛 Describe the bug
An IndexError is raised when chunked prefill and speculative decoding are used together. The failing lookup is reached via the partial-prefill branch in vllm/vllm/v1/worker/gpu_model_runner.py (line 1612 at commit 9fb52e5, the `# Partial prefill (rare case).` comment).
Error Logs
Worker Stack Trace (across multiple ranks):
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
    output = self.model_runner.execute_model(scheduler_output)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1428, in execute_model
    next_token_id = req_state.get_token_id(seq_len)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_input_batch.py", line 53, in get_token_id
    return self.output_token_ids[idx - self.num_prompt_tokens]
IndexError: list index out of range
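
To make the failure mode concrete, here is a standalone illustration of the index computation performed in `get_token_id`. This is not vLLM source and all numbers are hypothetical; it only shows that when `seq_len` points past the last token actually recorded in `output_token_ids` (e.g. because scheduled speculative tokens are counted), the subtraction produces the same `IndexError`:

```python
# Hypothetical values; not vLLM source.
num_prompt_tokens = 100
output_token_ids = [11, 22]   # only 2 output tokens recorded so far

def get_token_id(idx: int) -> int:
    # Mirrors the failing line:
    #   return self.output_token_ids[idx - self.num_prompt_tokens]
    return output_token_ids[idx - num_prompt_tokens]

print(get_token_id(101))      # idx - num_prompt_tokens == 1 -> 22, fine

try:
    # If seq_len also counts the 5 scheduled speculative tokens, it can
    # point past the last recorded output token.
    get_token_id(105)         # idx - num_prompt_tokens == 5, but len(list) == 2
except IndexError as e:
    print(e)                  # "list index out of range", as in the report
```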
Engine Core Stack Trace:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 508, in run_engine_core
    engine_core.run_busy_loop()
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 535, in run_busy_loop
    self._process_engine_step()
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 560, in _process_engine_step
    outputs, model_executed = self.step_fn()
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 231, in step
    model_output = self.execute_model(scheduler_output)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 217, in execute_model
    raise err
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 211, in execute_model
    return self.model_executor.execute_model(scheduler_output)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/executor/multiproc_executor.py", line 163, in execute_model
    (output, ) = self.collective_rpc("execute_model")
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
    result = get_response(w, dequeue_timeout)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
    raise RuntimeError("Worker failed with error 'list index out of range'")
RuntimeError: Worker failed with error 'list index out of range'
AsyncLLM Stack Trace:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
    outputs = await engine_core.get_output_async()
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
    raise self._format_exception(outputs) from None
Final Error:
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Scheduler State at Time of Error:
SchedulerOutput(
  scheduled_new_reqs=[],
  scheduled_cached_reqs=[
    CachedRequestData(
      req_id='f1f4e063-786b-4cf9-8ebc-548f201ed419',
      resumed_from_preemption=False,
      new_token_ids=[t1],
      new_block_ids=[[]],
      num_computed_tokens=t2
    ),
    CachedRequestData(
      req_id='93ae9bbb-f390-4f02-9d16-ddcba4e3b6d2',
      resumed_from_preemption=False,
      new_token_ids=[t3],
      new_block_ids=[[t4]],
      num_computed_tokens=t5
    )
  ],
  num_scheduled_tokens={
    'f1f4e063-786b-4cf9-8ebc-548f201ed419': 6,
    '93ae9bbb-f390-4f02-9d16-ddcba4e3b6d2': 6
  },
  total_num_scheduled_tokens=12,
  scheduled_spec_decode_tokens={
    'f1f4e063-786b-4cf9-8ebc-548f201ed419': [t6, t7, t8, t9, t10],
    '93ae9bbb-f390-4f02-9d16-ddcba4e3b6d2': [t11, t12, t13, t14, t15]
  }
)
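
As a sanity check on the numbers above (token IDs redacted as t1..t15): with num_spec_tokens=5, each request is scheduled for 1 regular token plus 5 speculative tokens, which matches the per-request count of 6 and the total of 12. Assuming the standard accounting for a speculative decode step:

```python
# Consistency check of the scheduler state (assumes one regular token
# plus num_spec_tokens speculative tokens scheduled per decode step).
num_spec_tokens = 5
per_request = 1 + num_spec_tokens
assert per_request == 6           # num_scheduled_tokens for each req_id
assert 2 * per_request == 12      # total_num_scheduled_tokens
```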