Closed
🚀 The feature, motivation and pitch
Description
Eagle3 acceleration for GPU has been implemented and merged in vllm-project/vllm#16937. However, the NPU implementation is still missing. Eagle3 is currently a state-of-the-art (SOTA) speculative decoding technique, and supporting it on NPU would significantly improve the inference performance and efficiency of models running on NPU devices.
Alternatives
Proposed Solution:
- Implement the draft model and its forward pass on NPU.
- Ensure the draft model implementation is functional and meets the basic requirements.
- Ensure paged attention for the draft model is optimized for NPU and performs efficiently.
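
For context, here is a minimal sketch of the generic draft-then-verify loop that a draft model plugs into. It is not the Eagle3 algorithm and not vLLM code: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, the "models" are plain Python functions over token IDs, and acceptance is simple greedy matching. It only illustrates why the draft model's forward pass needs to be cheap and correct on the target device.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each proposal. A real system would score all
    #    k positions in one batched forward pass; this sketch calls it per position.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Repeat speculative steps until max_new_tokens tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


# Toy models (assumptions for illustration only): the target counts upward
# modulo 10, and the draft agrees except right after token 5.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 5 else (prefix[-1] + 1) % 10
```

With greedy matching, the output is identical to decoding with the target model alone; the draft model only changes how many target steps are needed.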
Additional context
- GPU Implementation: Completed and merged in vllm-project/vllm#16937.
- NPU Implementation: Not yet implemented.