Speculative decoding with lookahead #2790

Open

jjjjohnson wants to merge 2 commits into main
Conversation

@jjjjohnson (Contributor) commented Jan 8, 2025

Motivation

n-gram based speculative decoding is very effective in retrieval-augmented generation (RAG). The cost of generating draft tokens is low compared to EAGLE, so it has great potential for accelerating token generation in RAG. Ant Group has proposed a Trie-based retrieval and verification mechanism; they implemented lookahead on top of vLLM for the single-query case and report a 1.6x speedup in a real-life scenario. I want to bring lookahead to SGLang.

Related resources

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy

Overall workflow

[workflow diagram]

Features

  • No need to train a draft model.
  • The trie is updated with both prompt tokens and output tokens.
  • Draft tokens are generated by a frequency-based sort over the current prompt tokens and ALL historical output tokens (with eviction); a sketch follows this list.
  • Both single-branch and multi-branch drafting are supported.
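
For illustration, here is a minimal sketch of the idea behind the bullets above: a trie whose node counts are updated from prompt and output tokens, and a single-branch draft retrieved by following the most frequent child at each step. This is a hypothetical sketch, not the code in this PR; the class names, the max_ngram parameter, and the omitted eviction policy are assumptions.

from collections import defaultdict

class _TrieNode:
    def __init__(self):
        self.children = defaultdict(_TrieNode)
        self.count = 0  # how often this continuation has been seen

class LookaheadTrie:
    """Hypothetical sketch of a frequency-sorted lookahead trie."""

    def __init__(self, max_ngram=8):
        self.root = _TrieNode()
        self.max_ngram = max_ngram

    def insert(self, tokens):
        # Index every n-gram window of the prompt/output token stream.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_ngram]:
                node = node.children[tok]
                node.count += 1

    def retrieve(self, context, num_draft=4):
        # Match the longest suffix of the context first, then greedily
        # follow the highest-count child to propose draft tokens.
        for start in range(max(0, len(context) - self.max_ngram), len(context)):
            node = self.root
            matched = True
            for tok in context[start:]:
                if tok not in node.children:
                    matched = False
                    break
                node = node.children[tok]
            if matched:
                draft = []
                while node.children and len(draft) < num_draft:
                    tok, node = max(node.children.items(), key=lambda kv: kv[1].count)
                    draft.append(tok)
                if draft:
                    return draft
        return []

The real implementation additionally supports multi-branch retrieval and eviction of stale entries, which this sketch leaves out.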

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@jjjjohnson (Contributor, Author) commented:

import sglang as sgl
import time

def main():
    # Sample prompts.
    prompts = [
        '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你是谁?<|im_end|>\n<|im_start|>assistant\n'
    ]

    sampling_params = {
        "temperature": 0.7,
        "repetition_penalty": 1,
        "max_new_tokens": 256,
        "top_k": 1,
        "stop_token_ids": [151645, 151644, 151643],
    }

    model_path = "Qwen/Qwen2-7B-Instruct"

    # Create an LLM with LOOKAHEAD speculative decoding enabled.
    llm = sgl.Engine(
        model_path=model_path,
        speculative_one_branch=True,
        disable_cuda_graph=False,
        speculative_num_draft_tokens=4,
        speculative_algorithm="LOOKAHEAD",
        mem_fraction_static=0.60,
        watchdog_timeout=1e8,
        log_level="info",
    )

    for idx in range(5):
        start = time.time()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.time() - start
        completion_tokens = 0
        # Print the outputs and per-run throughput.
        for prompt, output in zip(prompts, outputs):
            completion_tokens += output["meta_info"]["completion_tokens"]
            print(f"{output['text']}")
            print("======================")
        print(f"{idx=}!!!!!!!!! tps = {completion_tokens / elapsed}\n\n")

if __name__ == "__main__":
    main()

@zhyncs (Member) commented Jan 11, 2025

Hi @jjjjohnson Could you help resolve the conflicts? Thanks.

@jjjjohnson (Contributor, Author) replied:

Hi @jjjjohnson Could you help resolve the conflicts? Thanks.

Done

@merrymercy (Contributor) commented:

Could you share any performance results?

@merrymercy mentioned this pull request on Jan 15, 2025
@jjjjohnson (Contributor, Author) commented Jan 16, 2025

Could you share any performance results?

Sure!
Since lookahead speculative decoding caches input and output tokens, I ran sglang.bench_serving for 2 turns and disabled random.shuffle(dataset) so that both turns send identical requests, then compared the performance against normal decoding.
Note: lookahead speculative decoding is turned off when the batch size is > 4, so I limit the max concurrency and request rate.

[benchmark summary screenshot]

Start Server:

Normal decode:

python -m sglang.launch_server --model-path /mnt/workspace/model_hub/Qwen2-7B-Instruct --trust-remote-code --tp 1

Lookahead speculative decode:

python -m sglang.launch_server --model-path /mnt/workspace/model_hub/Qwen2-7B-Instruct \
      --trust-remote-code --tp 1 --speculative-num-draft-tokens 4 --speculative-algorithm LOOKAHEAD --speculative-one-branch

Benchmark:

python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --dataset-path /oss/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 --max-concurrency 3 --request-rate 2

Result:

Normal decode, first turn: [bench_serving output screenshot]

Normal decode, second turn: [bench_serving output screenshot]

Lookahead speculative decode, first turn: [bench_serving output screenshot]

Lookahead speculative decode, second turn: [bench_serving output screenshot]

Review comments were left on python/sglang/srt/server_args.py, python/sglang/srt/managers/scheduler.py, and python/sglang/srt/speculative/lookahead_cache.py.
@mpjlu (Contributor) commented Feb 7, 2025

I found that this PR cannot run DeepSeek V3. Have you tested this model?

@jjjjohnson (Contributor, Author) replied:

I found that this PR cannot run DeepSeek V3. Have you tested this model?

No. What is the error message?

@mpjlu (Contributor) commented Feb 7, 2025

I found that this PR cannot run DeepSeek V3. Have you tested this model?

No. What is the error message?

MLA crashes, and it does not show a very useful error message.

@mpjlu (Contributor) commented Feb 11, 2025

I found that this PR cannot run Llama 8B with the Triton backend; the error is:

File "/data/peng/sglang/python/sglang/srt/speculative/lookahead_utils.py", line 160, in verify
    batch.seq_lens_sum = batch.seq_lens.sum().item()
RuntimeError: CUDA error: an illegal memory access was encountered

Does this PR support the Triton backend?

@coolhok (Contributor) commented Feb 12, 2025

mla

I think MLA attention does not support the tree mask, so this PR does not work with DeepSeek.

@coolhok (Contributor) commented Feb 12, 2025

I found that this PR cannot run Llama 8B with the Triton backend; the error is:

File "/data/peng/sglang/python/sglang/srt/speculative/lookahead_utils.py", line 160, in verify
    batch.seq_lens_sum = batch.seq_lens.sum().item()
RuntimeError: CUDA error: an illegal memory access was encountered

Does this PR support the Triton backend?

Lookahead depends on FlashInfer's tree-mask attention; the Triton backend does not currently support tree masks.
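
For context, a tree mask is what lets all draft branches be verified in a single forward pass: each draft token may attend only to the shared prefix and to its ancestors in the draft tree. Below is a minimal, hypothetical illustration of such a mask; the parent indices are made up and this is not SGLang's or FlashInfer's actual mask layout.

import numpy as np

# parents[i] is the in-draft parent of draft token i, or -1 for a root that
# hangs directly off the last verified token. Two branches fork after token 0.
parents = [-1, 0, 0, 1, 2]
n = len(parents)

mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    mask[i, i] = True          # each draft token attends to itself
    p = parents[i]
    while p != -1:             # ...and to every ancestor on its branch
        mask[i, p] = True
        p = parents[p]

print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 0 1 0 0]
#  [1 1 0 1 0]
#  [1 0 1 0 1]]

Roughly speaking, the backend applies this per-request mask on top of the usual causal attention over the cached prefix; if the attention kernel cannot take such a custom mask (as noted above for the Triton backend), multi-branch verification cannot run.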

Labels: enhancement (New feature or request), high priority