Description
🚀 The feature, motivation and pitch
Summary
We propose adding tree-attention-based speculative decoding support to vLLM to improve inference throughput, token acceptance rates, and memory efficiency during generation. This approach is inspired by SpecInfer ([arXiv:2305.09781](https://arxiv.org/abs/2305.09781)) and EAGLE-2 ([arXiv:2406.16858](https://arxiv.org/abs/2406.16858)), both of which demonstrate state-of-the-art performance by combining token tree structures with topology-aware attention for parallel decoding.
Motivation
Current speculative decoding strategies in vLLM rely on batch expansion or multi-head proposals. These approaches face key limitations:
- ❌ Low token acceptance rates, especially in long sequences or large models
- ❌ Redundant computation and memory traffic due to duplicate KV cache updates
- ❌ Inefficient parallelism across speculative paths
Both SpecInfer and EAGLE-2 propose tree-based token structures combined with topology-aware attention masking, so that multiple speculative paths can be evaluated efficiently in a single kernel invocation. This improves both the token acceptance rate and decoding speed.
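To make the masking idea concrete, here is a minimal, illustrative sketch (not vLLM code): each drafted token attends to the shared prefix and to its ancestors in the token tree, but never to sibling branches. The function name and tree encoding (`parents` holds parent indices within the tree, `-1` meaning the node attaches directly to the last prefix token) are assumptions for illustration only.

```python
import torch

def build_tree_attention_mask(prefix_len: int, parents: list[int]) -> torch.Tensor:
    """parents[i] is the index of token i's parent within the tree,
    or -1 if its parent is the last prefix token."""
    num_tree_tokens = len(parents)
    total = prefix_len + num_tree_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Prefix tokens use ordinary causal attention.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool)
    )

    # Every tree token can attend to the whole shared prefix.
    mask[prefix_len:, :prefix_len] = True

    # Each tree token attends to itself and to all of its ancestors,
    # so sibling branches never see each other.
    for i, parent in enumerate(parents):
        row = prefix_len + i
        mask[row, row] = True
        j = parent
        while j != -1:
            mask[row, prefix_len + j] = True
            j = parents[j]
    return mask

# Example: a 4-token prefix and a small tree:
# token 0 is the root, tokens 1 and 2 are its children, token 3 is a child of 1.
mask = build_tree_attention_mask(prefix_len=4, parents=[-1, 0, 0, 1])
print(mask.int())
```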
References: SpecInfer and EAGLE-2
🔍 SpecInfer (ASPLOS’24)
- Introduces a tree-structured speculative decoding algorithm
- Uses topology-aware attention masking and single-kernel tree scoring (see the simplified verification sketch below)
- Achieves higher token throughput and better acceptance across large LLMs
Throughput and latency comparison between SpecInfer and standard speculative decoding:
- Up to 3× throughput improvement
- Significant latency reduction per generated token
- Works efficiently on long sequences and large target models
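As a rough illustration of how tree drafts could be verified against the target model, the sketch below walks the token tree greedily and accepts the longest root-to-leaf path that matches the target model's argmax predictions, plus one bonus token. This is a simplification of SpecInfer's actual multi-step stochastic verification; the function name, arguments, and tree encoding are hypothetical.

```python
def verify_tree_greedy(tree_tokens, parents, children, target_argmax, root_target):
    """tree_tokens[i]: drafted token id at node i
    parents[i]: parent node index, -1 for root nodes
    children[i]: list of child node indices of node i
    target_argmax[i]: target model's argmax token conditioned on node i's path
    root_target: target model's argmax token right after the shared prefix
    Returns the accepted token ids (matched drafts plus one bonus token)."""
    accepted = []
    expected = root_target          # token the target model wants next
    frontier = [i for i, p in enumerate(parents) if p == -1]
    while True:
        match = next((i for i in frontier if tree_tokens[i] == expected), None)
        if match is None:
            break                   # no drafted branch matches; stop here
        accepted.append(tree_tokens[match])
        expected = target_argmax[match]   # target's next choice along this path
        frontier = children[match]
    accepted.append(expected)       # bonus token from the target model
    return accepted

# Example: roots draft tokens 7 and 9; node 0 (token 7) has a child drafting 3.
tokens = verify_tree_greedy(
    tree_tokens=[7, 9, 3], parents=[-1, -1, 0], children=[[2], [], []],
    target_argmax=[3, 8, 5], root_target=7)
print(tokens)  # [7, 3, 5]
```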
🔍 EAGLE-2 (EMNLP’24)
EAGLE-2 is another recent method that uses tree attention to accelerate speculative decoding. It builds a hierarchical token tree and applies a mask-aware attention mechanism for efficient tree traversal and scoring, enabling draft tokens to be generated in parallel along the tree and showing competitive performance for decoding large models (a rough sketch of this kind of top-k tree expansion appears after the speedup numbers below).
EAGLE-2 is:
- 4x faster than vanilla decoding (13B).
- 1.4x faster than EAGLE-1 (13B).
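Below is a rough sketch of the confidence-guided tree expansion EAGLE-2 describes, under simplifying assumptions: at each depth, candidate children are scored by cumulative draft probability and only the global top-k are kept, so the tree concentrates on the most promising branches. `draft_topk` is a hypothetical stand-in for the draft model's per-node proposal function, not a vLLM or EAGLE API.

```python
import heapq

def grow_token_tree(draft_topk, depth: int, beam: int, branch: int):
    """draft_topk(path) -> list of (token_id, prob) proposals for that path.
    Returns a list of (path, cumulative_prob) nodes in the token tree."""
    nodes = [((), 1.0)]           # root: empty path with probability 1
    frontier = [((), 1.0)]
    for _ in range(depth):
        candidates = []
        for path, score in frontier:
            for token, prob in draft_topk(path)[:branch]:
                candidates.append((path + (token,), score * prob))
        # Keep only the highest-scoring candidates at this depth (global top-k).
        frontier = heapq.nlargest(beam, candidates, key=lambda x: x[1])
        nodes.extend(frontier)
    return nodes

# Toy draft model: always proposes the same two tokens with fixed probabilities.
toy_draft = lambda path: [(1, 0.6), (2, 0.3)]
for path, score in grow_token_tree(toy_draft, depth=2, beam=3, branch=2):
    print(path, round(score, 3))
```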
Expected Benefits
✅ Significantly increased decoding throughput (tokens/sec)
✅ Improved token acceptance rates from the target model
✅ Lower HBM usage via shared prefix reuse (a rough estimate is sketched after this list)
✅ Strong scalability with large sequence lengths and large model sizes
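As a back-of-the-envelope illustration of the shared-prefix point above (all model dimensions below are assumed, not measured), compare the KV-cache bytes read when verifying 8 candidates with batch expansion, where each expanded sequence re-reads the prefix KV, against a single tree-attention pass, where the prefix KV is read once for the whole tree:

```python
# Illustrative arithmetic only; layer/head/dtype sizes are assumptions,
# not measurements of any particular model.
def kv_bytes_read(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # keys + values for `tokens` positions across all layers
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes

prefix_len = 2048      # shared prompt/prefix length
num_candidates = 8     # speculative candidates to verify
tree_tokens = 8        # total drafted tokens in the token tree

# Batch expansion: each candidate sequence re-reads the prefix KV cache.
batch_expansion = num_candidates * kv_bytes_read(prefix_len + 1)
# Tree attention: the prefix KV is read once, plus the small tree itself.
tree_attention = kv_bytes_read(prefix_len + tree_tokens)

print(f"batch expansion: {batch_expansion / 2**30:.2f} GiB read")
print(f"tree attention:  {tree_attention / 2**30:.2f} GiB read")
```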
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.