Description
🚀 The feature, motivation and pitch
Summary
We propose adding tree-attention-based speculative decoding support to vLLM to improve inference throughput, token acceptance rates, and memory efficiency during generation. This approach is inspired by SpecInfer ([arXiv:2305.09781](https://arxiv.org/abs/2305.09781)) and EAGLE-2 ([arXiv:2406.16858](https://arxiv.org/abs/2406.16858)), both of which demonstrate state-of-the-art performance by combining token tree structures with topology-aware attention for parallel decoding.
Motivation
Current speculative decoding strategies in vLLM rely on batch expansion or multi-head proposals. These approaches face key limitations:
- ❌ Low token acceptance rates, especially in long sequences or large models
- ❌ Redundant computation and memory traffic due to duplicate KV cache updates
- ❌ Inefficient parallelism across speculative paths
Both SpecInfer and EAGLE-2 propose tree-based token structures combined with topology-aware attention masking, so that multiple speculative paths can be evaluated efficiently in a single kernel invocation. This improves both the token acceptance rate and decoding speed.
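To make the masking idea concrete, here is a minimal, illustrative sketch (not vLLM code): each drafted token attends to the shared prefix and to its ancestors in the token tree, but never to sibling branches. The function name and tree encoding (`parents` holds parent indices within the tree, `-1` meaning the node attaches directly to the last prefix token) are assumptions for illustration only.

```python
import torch

def build_tree_attention_mask(prefix_len: int, parents: list[int]) -> torch.Tensor:
    """parents[i] is the index of token i's parent within the tree,
    or -1 if its parent is the last prefix token."""
    num_tree_tokens = len(parents)
    total = prefix_len + num_tree_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Prefix tokens use ordinary causal attention.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool)
    )

    # Every tree token can attend to the whole shared prefix.
    mask[prefix_len:, :prefix_len] = True

    # Each tree token attends to itself and to all of its ancestors,
    # so sibling branches never see each other.
    for i, parent in enumerate(parents):
        row = prefix_len + i
        mask[row, row] = True
        j = parent
        while j != -1:
            mask[row, prefix_len + j] = True
            j = parents[j]
    return mask

# Example: a 4-token prefix and a small tree:
# token 0 is the root, tokens 1 and 2 are its children, token 3 is a child of 1.
mask = build_tree_attention_mask(prefix_len=4, parents=[-1, 0, 0, 1])
print(mask.int())
```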
References: SpecInfer and EAGLE-2
🔍 SpecInfer (ASPLOS’24)
- Introduces a tree-structured speculative decoding algorithm
- Uses topology-aware attention masking and single-kernel tree scoring (see the simplified verification sketch below)
- Achieves higher token throughput and better acceptance across large LLMs
Throughput and latency comparison between SpecInfer and standard speculative decoding:
- Up to 3× throughput improvement
- Significant latency reduction per generated token
- Works efficiently on long sequences and large target models
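As a rough illustration of how tree drafts could be verified against the target model, the sketch below walks the token tree greedily and accepts the longest root-to-leaf path that matches the target model's argmax predictions, plus one bonus token. This is a simplification of SpecInfer's actual multi-step stochastic verification; the function name, arguments, and tree encoding are hypothetical.

```python
def verify_tree_greedy(tree_tokens, parents, children, target_argmax, root_target):
    """tree_tokens[i]: drafted token id at node i
    parents[i]: parent node index, -1 for root nodes
    children[i]: list of child node indices of node i
    target_argmax[i]: target model's argmax token conditioned on node i's path
    root_target: target model's argmax token right after the shared prefix
    Returns the accepted token ids (matched drafts plus one bonus token)."""
    accepted = []
    expected = root_target          # token the target model wants next
    frontier = [i for i, p in enumerate(parents) if p == -1]
    while True:
        match = next((i for i in frontier if tree_tokens[i] == expected), None)
        if match is None:
            break                   # no drafted branch matches; stop here
        accepted.append(tree_tokens[match])
        expected = target_argmax[match]   # target's next choice along this path
        frontier = children[match]
    accepted.append(expected)       # bonus token from the target model
    return accepted

# Example: roots draft tokens 7 and 9; node 0 (token 7) has a child drafting 3.
tokens = verify_tree_greedy(
    tree_tokens=[7, 9, 3], parents=[-1, -1, 0], children=[[2], [], []],
    target_argmax=[3, 8, 5], root_target=7)
print(tokens)  # [7, 3, 5]
```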
🔍 EAGLE-2 (EMNLP’24)
EAGLE-2 is another recent method that uses tree attention to accelerate speculative decoding. It builds a hierarchical token tree and applies a mask-aware attention mechanism for efficient tree traversal and scoring, enabling draft tokens to be generated in parallel along the tree and showing competitive performance for decoding large models (a rough sketch of this kind of top-k tree expansion appears after the speedup numbers below).
EAGLE-2 is:
- 4x faster than vanilla decoding (13B).
- 1.4x faster than EAGLE-1 (13B).
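Below is a rough sketch of the confidence-guided tree expansion EAGLE-2 describes, under simplifying assumptions: at each depth, candidate children are scored by cumulative draft probability and only the global top-k are kept, so the tree concentrates on the most promising branches. `draft_topk` is a hypothetical stand-in for the draft model's per-node proposal function, not a vLLM or EAGLE API.

```python
import heapq

def grow_token_tree(draft_topk, depth: int, beam: int, branch: int):
    """draft_topk(path) -> list of (token_id, prob) proposals for that path.
    Returns a list of (path, cumulative_prob) nodes in the token tree."""
    nodes = [((), 1.0)]           # root: empty path with probability 1
    frontier = [((), 1.0)]
    for _ in range(depth):
        candidates = []
        for path, score in frontier:
            for token, prob in draft_topk(path)[:branch]:
                candidates.append((path + (token,), score * prob))
        # Keep only the highest-scoring candidates at this depth (global top-k).
        frontier = heapq.nlargest(beam, candidates, key=lambda x: x[1])
        nodes.extend(frontier)
    return nodes

# Toy draft model: always proposes the same two tokens with fixed probabilities.
toy_draft = lambda path: [(1, 0.6), (2, 0.3)]
for path, score in grow_token_tree(toy_draft, depth=2, beam=3, branch=2):
    print(path, round(score, 3))
```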
Expected Benefits
✅ Significantly increased decoding throughput (tokens/sec)
✅ Improved token acceptance rates from the target model
✅ Lower HBM usage via shared prefix reuse (a rough estimate is sketched after this list)
✅ Strong scalability with large sequence lengths and large model sizes
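As a back-of-the-envelope illustration of the shared-prefix point above (all model dimensions below are assumed, not measured), compare the KV-cache bytes read when verifying 8 candidates with batch expansion, where each expanded sequence re-reads the prefix KV, against a single tree-attention pass, where the prefix KV is read once for the whole tree:

```python
# Illustrative arithmetic only; layer/head/dtype sizes are assumptions,
# not measurements of any particular model.
def kv_bytes_read(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # keys + values for `tokens` positions across all layers
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes

prefix_len = 2048      # shared prompt/prefix length
num_candidates = 8     # speculative candidates to verify
tree_tokens = 8        # total drafted tokens in the token tree

# Batch expansion: each candidate sequence re-reads the prefix KV cache.
batch_expansion = num_candidates * kv_bytes_read(prefix_len + 1)
# Tree attention: the prefix KV is read once, plus the small tree itself.
tree_attention = kv_bytes_read(prefix_len + tree_tokens)

print(f"batch expansion: {batch_expansion / 2**30:.2f} GiB read")
print(f"tree attention:  {tree_attention / 2**30:.2f} GiB read")
```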
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.