
[Feature]: Tree-Attention Support for Speculative Decoding #18327

@yesredpig

Description

🚀 The feature, motivation and pitch

Summary

We propose adding tree-attention-based speculative decoding support to vLLM to improve inference throughput, token acceptance rates, and memory efficiency during generation. This approach is inspired by SpecInfer ([arXiv:2305.09781](https://arxiv.org/abs/2305.09781)) and EAGLE-2 ([arXiv:2406.16858](https://arxiv.org/abs/2406.16858)), both of which demonstrate state-of-the-art performance using token tree structures and topology-aware attention for parallel decoding.


Motivation

Current speculative decoding strategies in vLLM rely on batch expansion or multi-head proposals. These approaches face key limitations:

  • ❌ Low token acceptance rates, especially in long sequences or large models
  • ❌ Redundant computation and memory traffic due to duplicate KV cache updates
  • ❌ Inefficient parallelism across speculative paths

Both SpecInfer and EAGLE-2 use tree-based token structures combined with topology-aware masking, enabling efficient evaluation of multiple speculative paths in a single kernel invocation. This improves both token acceptance rates and decoding speed.
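
To make the masking idea concrete, here is a minimal sketch of a topology-aware tree attention mask, assuming the token tree is represented as a parent-pointer array (the representation and function name are illustrative, not vLLM's actual API):

```python
import torch

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """mask[i, j] is True iff node j is node i itself or one of its
    ancestors, so each speculative branch attends only to tokens on its
    own root-to-node path (the shared prompt prefix is handled separately)."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:                     # walk up to the root (parent == -1)
            mask[i, j] = True
            j = parents[j]
    return mask

# Root token 0 with two branches 0->1->3 and 0->2->4:
# nodes 3 and 4 each see their own branch but never each other.
print(tree_attention_mask([-1, 0, 0, 1, 2]).int())
```

Because all branches share one mask, a single attention kernel invocation can score the whole tree instead of launching one pass per speculative path.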


Reference: SpecInfer and EAGLE-2

🔍 SpecInfer (ASPLOS’24)

  • Introduces a tree-structured speculative decoding algorithm
  • Uses topology-aware attention masking and single-kernel tree scoring
  • Achieves higher token throughput and better acceptance across large LLMs

Compared with standard speculative decoding, SpecInfer reports:

  • Up to 3× throughput improvement
  • Significant latency reduction per generated token
  • Works efficiently on long sequences and large target models
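
The verification side of the algorithm can be summarized as follows. This is a simplified sketch that assumes greedy (argmax) matching rather than SpecInfer's full multi-step speculative sampling, and all names are illustrative:

```python
# One batched, tree-masked target-model pass scores every node; we then keep
# the deepest node whose entire root-to-node path was accepted.

def longest_accepted_path(parents: list[int],
                          draft_tokens: list[int],
                          target_tokens: list[int]) -> list[int]:
    """parents[i]       -- index of node i's parent (-1 for a child of the prompt)
    draft_tokens[i]  -- token the draft model proposed at node i
    target_tokens[i] -- token the target model predicts at node i's position
    Returns the longest fully accepted root-to-node path as node indices."""
    accepted = [False] * len(parents)
    best: list[int] = []
    for i in range(len(parents)):          # parents precede children in index order
        parent_ok = parents[i] == -1 or accepted[parents[i]]
        if parent_ok and draft_tokens[i] == target_tokens[i]:
            accepted[i] = True
            path, j = [], i                # reconstruct this node's path
            while j != -1:
                path.append(j)
                j = parents[j]
            path.reverse()
            if len(path) > len(best):
                best = path
    return best

# Two branches 0->1 and 0->2->3; the target model agrees on nodes 0, 2, 3.
print(longest_accepted_path(parents=[-1, 0, 0, 2],
                            draft_tokens=[5, 7, 9, 4],
                            target_tokens=[5, 8, 9, 4]))  # -> [0, 2, 3]
```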

🔍 EAGLE-2 (EMNLP’24)

EAGLE-2 is another recent method that uses tree attention to accelerate speculative decoding. It dynamically builds a hierarchical token tree and applies a mask-aware attention mechanism to support efficient tree traversal and scoring. The algorithm generates draft tokens in parallel within the tree structure and shows competitive performance when decoding large models.
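
As an illustration of the dynamic tree construction, here is a minimal sketch of top-k tree expansion ranked by cumulative path probability. It assumes a hypothetical `draft_probs(path)` callable that returns the draft model's next-token distribution as `{token_id: probability}`; none of these names come from the EAGLE-2 codebase:

```python
import heapq
import math

def expand_draft_tree(prefix: list[int], draft_probs, depth: int, top_k: int):
    """Grow a draft tree level by level, keeping at each level only the
    top_k candidate paths ranked by cumulative log-probability, so
    low-confidence branches are pruned before verification."""
    frontier = [(0.0, list(prefix))]       # (cumulative logprob, token path)
    tree = [list(prefix)]
    for _ in range(depth):
        candidates = []
        for logp, path in frontier:
            for tok, p in draft_probs(path).items():
                candidates.append((logp + math.log(p), path + [tok]))
        # Global top-k across all branches at this level (EAGLE-2 additionally
        # re-ranks all drafted nodes to pick the final verification tree).
        frontier = heapq.nlargest(top_k, candidates, key=lambda c: c[0])
        tree.extend(path for _, path in frontier)
    return tree
```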

EAGLE-2 is:

  • 4× faster than vanilla decoding (13B).
  • 1.4× faster than EAGLE-1 (13B).

Expected Benefits

✅ Significantly increased decoding throughput (tokens/sec)
✅ Improved token acceptance rates from the target model
✅ Lower HBM usage via shared prefix reuse (see the sketch after this list)
✅ Strong scalability with large sequence lengths and large model sizes
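
A back-of-the-envelope sketch of the shared-prefix saving: batch expansion stores one KV-cache entry per (path, position), while a token tree stores one entry per unique tree node, so shared prefixes are cached once. The numbers below are illustrative, not measured:

```python
def kv_entries_batch_expansion(paths: list[list[int]]) -> int:
    return sum(len(p) for p in paths)      # every path re-stores its prefix

def kv_entries_tree(paths: list[list[int]]) -> int:
    # Unique prefixes across all paths == nodes of the token tree.
    return len({tuple(p[:i + 1]) for p in paths for i in range(len(p))})

# Four length-4 drafts that share their first two tokens:
paths = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7], [1, 2, 6, 8]]
print(kv_entries_batch_expansion(paths))   # 16 KV entries
print(kv_entries_tree(paths))              # 8 KV entries
```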
