
[Bug]: IndexError: list index out of range on chunked prefill with speculative decoding #20531


Description

@saidrhs

Environment

Engine Configuration:

V1 LLM engine (v0.9.1) with config:
- model='/tmp/model/'
- speculative_config=SpeculativeConfig(method='eagle3', model='/tmp/model/eagle_head/', num_spec_tokens=5)
- tensor_parallel_size=8
- pipeline_parallel_size=1
- quantization=compressed-tensors
- max_seq_len=131072

Prefix caching and chunked prefill are enabled (the default V1 behavior). The issue also occurred on vLLM v0.8.5.post1 and has been hard to reproduce.

Model: Llama-3.3-70B-Instruct

Hardware: 8 H200 GPUs
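
For context, a roughly equivalent engine can be launched through the offline LLM API as sketched below. The keyword spellings (e.g. num_speculative_tokens) are an assumption and may differ slightly between vLLM versions; the model paths are the placeholders from the report.

# Sketch of an equivalent offline engine launch (assumption: these keyword
# spellings follow the vLLM >= 0.8 LLM API and may differ across versions;
# paths are the placeholders from the report). Chunked prefill and prefix
# caching are left at their V1 defaults (enabled).
from vllm import LLM

llm = LLM(
    model="/tmp/model/",
    tensor_parallel_size=8,
    max_model_len=131072,
    quantization="compressed-tensors",
    speculative_config={
        "method": "eagle3",
        "model": "/tmp/model/eagle_head/",
        "num_speculative_tokens": 5,
    },
)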

🐛 Describe the bug

An IndexError: list index out of range is raised intermittently when chunked prefill and speculative decoding are active together. The failing call is req_state.get_token_id(seq_len) in gpu_model_runner.execute_model, on the code path the source annotates with # Partial prefill (rare case).

Error Logs

Worker Stack Trace (across multiple ranks):

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
    output = self.model_runner.execute_model(scheduler_output)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1428, in execute_model
    next_token_id = req_state.get_token_id(seq_len)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_input_batch.py", line 53, in get_token_id
    return self.output_token_ids[idx - self.num_prompt_tokens]
IndexError: list index out of range
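
The two deepest frames show where the lookup overflows. A standalone toy that mimics that two-branch prompt/output lookup (names mirror the trace, but the class below is illustrative only and not vLLM code) raises the same exception once the requested index runs past the tokens the request actually holds:

# Toy reproduction of the failing lookup (illustrative only, not vLLM code).
# In the real path the index is a per-request sequence length; the trace
# suggests it can exceed num_prompt_tokens + len(output_token_ids) when
# chunked prefill and speculative decoding overlap.
from dataclasses import dataclass, field


@dataclass
class ToyRequestState:
    prompt_token_ids: list[int]
    output_token_ids: list[int] = field(default_factory=list)

    @property
    def num_prompt_tokens(self) -> int:
        return len(self.prompt_token_ids)

    def get_token_id(self, idx: int) -> int:
        if idx < self.num_prompt_tokens:
            return self.prompt_token_ids[idx]
        # Raises IndexError when idx >= num_prompt_tokens + len(output_token_ids).
        return self.output_token_ids[idx - self.num_prompt_tokens]


req = ToyRequestState(prompt_token_ids=list(range(8)), output_token_ids=[101])
print(req.get_token_id(7))   # last prompt token: ok
print(req.get_token_id(8))   # first output token: ok
try:
    req.get_token_id(9)      # one past the tokens the request holds
except IndexError as e:
    print(f"IndexError: {e}")  # "list index out of range", as in the report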

Engine Core Stack Trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 508, in run_engine_core
    engine_core.run_busy_loop()
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 535, in run_busy_loop
    self._process_engine_step()
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 560, in _process_engine_step
    outputs, model_executed = self.step_fn()
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 231, in step
    model_output = self.execute_model(scheduler_output)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 217, in execute_model
    raise err
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 211, in execute_model
    return self.model_executor.execute_model(scheduler_output)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/executor/multiproc_executor.py", line 163, in execute_model
    (output, ) = self.collective_rpc("execute_model")
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
    result = get_response(w, dequeue_timeout)
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
    raise RuntimeError("Worker failed with error 'list index out of range'")
RuntimeError: Worker failed with error 'list index out of range'

AsyncLLM Stack Trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
    outputs = await engine_core.get_output_async()
  File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
    raise self._format_exception(outputs) from None

Final Error:

vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Scheduler State at Time of Error:

SchedulerOutput(
    scheduled_new_reqs=[],
    scheduled_cached_reqs=[
        CachedRequestData(
            req_id='f1f4e063-786b-4cf9-8ebc-548f201ed419',
            resumed_from_preemption=False,
            new_token_ids=[t1],
            new_block_ids=[[]],
            num_computed_tokens=t2
        ),
        CachedRequestData(
            req_id='93ae9bbb-f390-4f02-9d16-ddcba4e3b6d2',
            resumed_from_preemption=False,
            new_token_ids=[t3],
            new_block_ids=[[t4]],
            num_computed_tokens=t5
        )
    ],
    num_scheduled_tokens={
        'f1f4e063-786b-4cf9-8ebc-548f201ed419': 6,
        '93ae9bbb-f390-4f02-9d16-ddcba4e3b6d2': 6
    },
    total_num_scheduled_tokens=12,
    scheduled_spec_decode_tokens={
        'f1f4e063-786b-4cf9-8ebc-548f201ed419': [t6, t7, t8, t9, t10],
        '93ae9bbb-f390-4f02-9d16-ddcba4e3b6d2': [t11, t12, t13, t14, t15]
    }
)
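
The concrete counts are redacted (t1-t15), but each request was scheduled 6 tokens (1 new token plus 5 speculative tokens). Assuming the index the model runner passes to get_token_id is seq_len = num_computed_tokens + num_scheduled_tokens for the request (an assumption based on the trace, not confirmed from source), the crash corresponds to the bounds check below failing. This is a hypothetical helper, not part of vLLM:

# Hypothetical sanity check (not vLLM code). Assumes the failing index is
# seq_len = num_computed_tokens + num_scheduled_tokens, which here would be
# num_computed_tokens + 6 (1 new token + 5 speculative tokens) per request.
def seq_len_in_bounds(num_computed_tokens: int,
                      num_scheduled_tokens: int,
                      num_prompt_tokens: int,
                      num_output_tokens: int) -> bool:
    # get_token_id(seq_len) is in range only if seq_len addresses a token the
    # request already holds (a prompt token or a previously sampled output).
    seq_len = num_computed_tokens + num_scheduled_tokens
    return seq_len < num_prompt_tokens + num_output_tokens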

