Speculative decoding with lookahead #2790
base: main
Conversation
Hi @jjjjohnson Could you help resolve the conflicts? Thanks.
Done
Could you share any performance results?
I find this PR cannot run DeepSeek V3. Have you tested this model?
No. What is the error message?
MLA crashes, without showing a very useful error message.
db61dbe to f775d00
I find this PR cannot run Llama 8B with the triton backend. The error is: File "/data/peng/sglang/python/sglang/srt/speculative/lookahead_utils.py", line 160, in verify. Does this PR support the triton backend?
I think MLA attention does not support the tree mask, so this PR does not work with DeepSeek.
Lookahead depends on FlashInfer's tree-mask attention; the triton backend does not currently support tree masks.
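For context on the tree-mask discussion above, here is a minimal sketch of the kind of attention mask that tree-based verification requires: each draft token may attend only to itself and its ancestors in the draft token tree. The `parent`-array representation and function name are illustrative assumptions, not the actual FlashInfer or PR layout.

```python
# Illustrative tree attention mask for draft-tree verification.
# parent[i] is the parent index of draft node i (-1 for a root).
# NOTE: hypothetical helper, not the PR's or FlashInfer's real API.

def build_tree_mask(parent):
    """Return an n x n boolean mask: mask[i][j] is True iff node i
    may attend to node j, i.e. j is i itself or an ancestor of i."""
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:       # walk from node i up to the root
            mask[i][j] = True
            j = parent[j]
    return mask

# A 4-node draft tree: node 0 is the root, nodes 1 and 2 branch
# from 0, and node 3 continues from node 1.
mask = build_tree_mask([-1, 0, 0, 1])
for row in mask:
    print("".join("1" if x else "0" for x in row))
```

A backend that only supports causal (lower-triangular) masks cannot express the branching rows above, which is why a kernel with arbitrary-mask support is needed here.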
Motivation
N-gram-based speculative decoding is very effective in retrieval-augmented generation (RAG). The cost of generating draft tokens is low compared to EAGLE, so it has great potential for accelerating token generation in RAG. Ant Group has proposed a Trie-based retrieval and verification mechanism. They claim that lookahead built on vLLM achieves a 1.6x speedup in the single-query setting on a real-life scenario. I want to adopt lookahead in SGLang.
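The Trie-based retrieval idea above can be sketched as follows: n-grams from the prompt and previously generated tokens are inserted into a trie, and at each decoding step the current suffix is looked up to retrieve candidate draft tokens cheaply (no draft model forward pass). The class and method names below are illustrative assumptions, not the actual API of this PR.

```python
# Hedged sketch of trie-based n-gram draft retrieval, the core idea
# of lookahead speculative decoding. NGramTrie is a hypothetical
# name for illustration only.

class NGramTrie:
    """Stores token n-grams seen so far and retrieves continuations
    of a given suffix as cheap draft-token candidates."""

    def __init__(self):
        self.children = {}  # token -> nested child dict

    def insert(self, tokens, max_depth=4):
        # Insert every sliding window of up to max_depth tokens.
        for start in range(len(tokens)):
            node = self.children
            for tok in tokens[start:start + max_depth]:
                node = node.setdefault(tok, {})

    def retrieve(self, suffix, draft_len=3):
        # Walk the trie along the suffix, then greedily follow one
        # branch for up to draft_len tokens as the draft sequence.
        node = self.children
        for tok in suffix:
            if tok not in node:
                return []       # suffix never seen: no draft tokens
            node = node[tok]
        draft = []
        while node and len(draft) < draft_len:
            tok, node = next(iter(node.items()))
            draft.append(tok)
        return draft

trie = NGramTrie()
trie.insert([1, 2, 3, 4, 2, 3, 5])
print(trie.retrieve([2, 3]))  # tokens previously seen after (2, 3)
```

Draft tokens retrieved this way are then verified in a single target-model forward pass (the tree-mask attention discussed in the comments above), which is why retrieval cost stays negligible relative to EAGLE-style draft models.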
Related resources
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Overall workflow
Features
Checklist