[Model][Speculative Decoding] Integrate PARD into vLLM #18541
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Hi @LiuXiaoxuanPKU @njhill, I'd really appreciate it if you could help review this PR when you have some time. Let me know if anything needs clarification. Thanks a lot!
Hi @zihaoanllm, thanks for making this PR. The code path you have here is v0; as of 0.9.x, all v0 code paths are frozen and will only accept bug fixes. I would suggest implementing this under the v1 code path instead.
Hi @aarnphm, thanks for your response. We've currently implemented a basic version under v0. Speculative decoding and multi-model KV cache support in v1 are still under development, and I do plan to integrate this into v1 in the future. In the meantime, merging this method into v0 would be helpful for our current usage and would also facilitate the future integration into v1.
Update Test Result

Test code:

import argparse
import json
import os
from vllm import LLM, SamplingParams
import requests
from transformers import AutoTokenizer
from vllm.inputs import TokensPrompt
import numpy as np
os.environ.update({
"VLLM_USE_V1": "0"
})
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="unsloth/Meta-Llama-3.1-8B-Instruct")
parser.add_argument("--draft", type=str, default="amd/PARD-Llama-3.2-1B")
parser.add_argument("--benchmark", type=str, default="humaneval")
parser.add_argument("--max_num_seqs", type=int, default=1)
parser.add_argument("--num_prompts", type=int, default=80)
parser.add_argument("--num_spec_tokens", type=int, default=8)
parser.add_argument("--tp", type=int, default=1)
parser.add_argument("-t", "--token", type=int, default=512)
parser.add_argument("--temp", type=float, default=0)
parser.add_argument("--ar", action='store_true')
parser.add_argument("-r", "--reasoning", action='store_true')
parser.add_argument("--disable-warmup", action='store_true')
return parser.parse_args()
def main():
args = parse_args()
prompts = []
for line in requests.get(f'https://raw.githubusercontent.com/AMD-AIG-AIMA/PARD/master/datas/bmk/{args.benchmark}.jsonl').text.splitlines():
if line:
prompts.append(json.loads(line)['data'])
prompts = prompts[:args.num_prompts]
datas = [[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": prompt}] for prompt in prompts]
tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True)
texts = []
for data in datas:
text = tokenizer.apply_chat_template(
data,
tokenize=False,
add_generation_prompt=True,
enable_thinking=args.reasoning,
)
texts.append(text)
batch_input_ids = tokenizer(texts, return_attention_mask=False)['input_ids']
batch_input_ids = [TokensPrompt(prompt_token_ids=ids) for ids in batch_input_ids]
llm = LLM(
model=args.model,
enable_prefix_caching=False,
tensor_parallel_size=args.tp,
max_model_len=8192,
max_num_seqs=args.max_num_seqs,
gpu_memory_utilization=0.8,
speculative_config=None if args.ar else {
"model": args.draft,
"num_speculative_tokens": args.num_spec_tokens
},
compilation_config={
"splitting_ops": [],
"compile_sizes": [],
"cudagraph_capture_sizes": [
256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,
                120,112,104,96,88,80,72,64,56,48,40,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1
],
"max_capture_size": 256
},
)
sampling_params = SamplingParams(temperature=args.temp, max_tokens=args.token)
## warmup
if not args.disable_warmup:
print("warmup...")
outputs = llm.generate(batch_input_ids, sampling_params=sampling_params)
# inference
print("inference...")
outputs = llm.generate(batch_input_ids, sampling_params=sampling_params)
# speed
speed = []
for output in outputs:
speed.append([len(output.outputs[0].token_ids), (output.metrics.finished_time - output.metrics.first_token_time)])
print(f"[anwer]:\n {output.outputs[0].text}")
print(f"\n\n{'='*100}\n\n")
print(f'[speed]: {np.array(speed)[:,0].sum() / np.array(speed)[:,1].sum()}\n')
# accepted
if not args.ar:
acceptance_counts = [0] * (args.num_spec_tokens + 1)
for output in outputs:
for step, count in enumerate(
output.metrics.spec_token_acceptance_counts):
acceptance_counts[step] += count
print(f"[acceptance length]: {(sum(acceptance_counts) / acceptance_counts[0])}")
print(f"\n\n{'='*100}\n\n")
print(args.__dict__)
print(f"\n\n{'='*100}\n\n")
if __name__ == "__main__":
    main()

Test Result
I want @WoosukKwon's opinion on this as well! I'm not too familiar with v0 spec decode, hence I don't have a strong opinion here. Given that the general consensus is that v0 is going to be removed soon, I'm not sure if we would want to accept new features at this point. Spec decode in V1 should be supported, at least with eagle3 and medusa (limited support), but if I understand correctly, the multi-model KV cache is a requirement for this spec decode method?
While trying to integrate this into v1, I found that it currently seems to support only KV caches with a single shape (code link). Both PARD and vanilla SpecDec rely on a small draft model alongside a large target model, and their attention structures are often different, which makes v1 incompatible for now. The eagle series works because the head structure matches the target model, and Medusa's draft model doesn't use a KV cache. I'm still getting familiar with vLLM, but I'd be happy to help integrate this method into v1 in the future. Please feel free to correct me if I've misunderstood anything. Thanks!
@zihaoanllm - the results on HumanEval look good. Could you please also share the AL on MT-Bench, which would make it easier to compare with EAGLE-1/3, which we already have?
Ah, I think we don't yet have a plan to support draft models that have a different architecture than the target model in v1 (sorry about this, I should have read through the blog post more thoroughly 😃).
I found that the previous test script overcounted one bonus token (it seems a recent PR changed the calculation method). The accepted length on HumanEval should be corrected from 6.6 to 5.6 and from 7.2 to 6.2; throughput is unaffected. I've already updated the earlier post accordingly. Below are the evaluation code and results on MT-Bench, where I've also included the earlier Eagle results for comparison.

Evaluation code:

# pard.py
import argparse
import json
import os
from vllm import LLM, SamplingParams
import requests
from transformers import AutoTokenizer
from vllm.inputs import TokensPrompt
import numpy as np
os.environ.update({
"VLLM_USE_V1": "0"
})
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="unsloth/Meta-Llama-3.1-8B-Instruct")
parser.add_argument("--draft", type=str, default="amd/PARD-Llama-3.2-1B")
parser.add_argument("--benchmark", type=str, default="humaneval")
parser.add_argument("--max_num_seqs", type=int, default=1)
parser.add_argument("--num_prompts", type=int, default=80)
parser.add_argument("--num_spec_tokens", type=int, default=8)
parser.add_argument("--tp", type=int, default=1)
parser.add_argument("-t", "--token", type=int, default=512)
parser.add_argument("--temp", type=float, default=0)
parser.add_argument("--ar", action='store_true')
parser.add_argument("-r", "--reasoning", action='store_true')
parser.add_argument("--disable-warmup", action='store_true')
return parser.parse_args()
def main():
args = parse_args()
prompts = []
if args.benchmark == 'mt_bench':
for line in requests.get(f'https://raw.githubusercontent.com/SafeAILab/EAGLE/refs/heads/main/eagle/data/mt_bench/question.jsonl').text.splitlines():
if line:
prompts.append(json.loads(line)['turns'][0])
else:
for line in requests.get(f'https://raw.githubusercontent.com/AMD-AIG-AIMA/PARD/master/datas/bmk/{args.benchmark}.jsonl').text.splitlines():
if line:
prompts.append(json.loads(line)['data'])
prompts = prompts[:args.num_prompts]
datas = [[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": prompt}] for prompt in prompts]
tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True)
texts = []
for data in datas:
text = tokenizer.apply_chat_template(
data,
tokenize=False,
add_generation_prompt=True,
enable_thinking=args.reasoning,
)
texts.append(text)
batch_input_ids = tokenizer(texts, return_attention_mask=False)['input_ids']
batch_input_ids = [TokensPrompt(prompt_token_ids=ids) for ids in batch_input_ids]
llm = LLM(
model=args.model,
enable_prefix_caching=False,
tensor_parallel_size=args.tp,
max_model_len=8192,
max_num_seqs=args.max_num_seqs,
gpu_memory_utilization=0.8,
speculative_config=None if args.ar else {
"model": args.draft,
"num_speculative_tokens": args.num_spec_tokens
},
compilation_config={
"splitting_ops": [],
"compile_sizes": [],
"cudagraph_capture_sizes": [
256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,
                120,112,104,96,88,80,72,64,56,48,40,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1
],
"max_capture_size": 256
},
)
sampling_params = SamplingParams(temperature=args.temp, max_tokens=args.token)
## warmup
if not args.disable_warmup:
print("warmup...")
outputs = llm.generate(batch_input_ids, sampling_params=sampling_params)
# inference
print("inference...")
outputs = llm.generate(batch_input_ids, sampling_params=sampling_params)
# speed
speed = []
for output in outputs:
speed.append([len(output.outputs[0].token_ids), (output.metrics.finished_time - output.metrics.first_token_time)])
print(f"[anwer]:\n {output.outputs[0].text}")
print(f"\n\n{'='*100}\n\n")
print(f'[speed]: {np.array(speed)[:,0].sum() / np.array(speed)[:,1].sum()}\n')
# accepted
if not args.ar:
acceptance_counts = [0] * (args.num_spec_tokens + 1)
for output in outputs:
for step, count in enumerate(
output.metrics.spec_token_acceptance_counts):
acceptance_counts[step] += count
print(f"[acceptance length]: {(sum(acceptance_counts) / acceptance_counts[0])}")
print(f"\n\n{'='*100}\n\n")
print(args.__dict__)
print(f"\n\n{'='*100}\n\n")
if __name__ == "__main__":
main()
@zihaoanllm - you are using vLLM V0 metrics when computing AL for PARD, right? The V0 metric incorrectly counts AL, which leads to a higher reported AL, as per this doc. Relevant paragraph from the doc:
Basically, if 3 tokens were proposed and the 1st and 3rd tokens matched, then the accepted mask is [1, 0, 1].
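To make the difference concrete, here is a tiny standalone sketch (not vLLM code; based on the description above and the mask example) of how the two counting conventions differ for a single spec-decode step:

```python
# Illustration only: turning an accepted-token mask into a per-step count.
# Summing the whole mask (the V0-style count described above) also counts tokens
# accepted after a rejection; the stricter convention stops at the first rejection.

def accepted_until_first_rejection(mask: list[int]) -> int:
    count = 0
    for bit in mask:
        if bit == 0:
            break
        count += 1
    return count

mask = [1, 0, 1]  # 3 proposed tokens; 1st and 3rd matched, 2nd rejected
print(sum(mask))                             # 2 -> inflated count
print(accepted_until_first_rejection(mask))  # 1 -> tokens that are actually usable
```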
@ekagra-ranjan Yes, we're using vLLM v0 for inference. Thanks for pointing out the issue with the AL metric calculation. To add some context, I’ve also measured AL using other inference methods, and while v0 does tend to report slightly higher AL, the overall impact seems to be minor. Here's a quick comparison:
@aarnphm @WoosukKwon Just wondering, do you have a rough plan or timeline for when v1 might support using a draft model with a different architecture from the target model? We're quite interested in adopting v1 for our speculative decoding setup, and support for mismatched architectures would be a key enabler for us.
I did talk with Woosuk about supporting separate draft models, and it seems like this would complicate matters a lot in v1. The main problem with a standalone draft model is that it is pretty difficult to maintain the KV cache when the target and draft models have different KV cache shapes; there isn't a guaranteed or easy solution for this. Another problem is that when using smaller or different generations of draft models, you might want different TP/DP degrees (or implement something like PARD), which in turn is pretty tricky. V0 circumvented this problem by having separate "draft" workers, but that approach is pretty brittle and has a lot more problems. So I'm not entirely sure we can support draft models yet, unless we have a better way to manage the KV cache and address said problems.
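To make the shape-mismatch point concrete, here is a small illustrative calculation (the configs below are assumptions for the sketch, not read from any real model): the per-token KV footprint depends on layer count, KV-head count, and head size, so paged KV blocks sized for one model generally do not fit the other.

```python
# Illustrative only: per-token KV-cache footprint for two hypothetical configs.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # keys + values, across all layers, for one token
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

target = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)  # assumed target config
draft = kv_bytes_per_token(num_layers=16, num_kv_heads=8, head_dim=64)    # assumed draft config
print(target, draft)  # 131072 vs 32768 bytes -> different per-token footprints, incompatible block shapes
```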
Thanks a lot for the detailed explanation! |
Hi @zihaoanllm - could you share whether vLLM V0 has an existing script to measure the AL, or whether there isn't one and you had to write your own here?
The current version of vLLM has removed the v0 SPD test script. I am using the test code from an older version. |
@zihaoanllm - could you please point me to the old code for computing AL on V0? I tried but couldn't find it.
Hi @zihaoanllm, any insights on why this PR was closed?
Since v0 is deprecated and v1 currently does not support heterogeneous draft models, a new PR will be opened when the time is right. If you need to use v0, please refer to: model/integrate-pard-0521. |
Description:
This PR integrates PARD into vLLM. PARD (PARallel Draft model) is a speculative decoding method that enables low-cost adaptation of autoregressive draft models into parallel draft models. It improves inference efficiency by allowing the draft model to predict multiple future tokens in a single forward pass, significantly reducing decoding latency. For detailed technical information, please refer to the technical report, GitHub repository, and blog post.
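For readers new to the idea, here is a minimal conceptual sketch (illustrative only, not the integration code in this PR; `draft_model` is assumed to be a HuggingFace-style causal LM trained PARD-style to fill in appended mask tokens): the draft input is padded with mask tokens so that a single forward pass yields draft logits for several future positions at once.

```python
# Conceptual sketch of a parallel draft step (illustrative, not the PR's code).
import torch


def parallel_draft_step(draft_model, input_ids: torch.Tensor,
                        mask_token_id: int, k: int) -> torch.Tensor:
    """Propose k draft tokens with one forward pass of the draft model."""
    batch = input_ids.shape[0]
    masks = torch.full((batch, k), mask_token_id,
                       dtype=input_ids.dtype, device=input_ids.device)
    padded = torch.cat([input_ids, masks], dim=1)

    logits = draft_model(padded).logits      # (batch, seq_len + k, vocab)
    # Assumes the PARD-trained draft model predicts the future tokens at the
    # masked positions, so the last k positions hold the k draft predictions.
    return logits[:, -k:, :].argmax(dim=-1)  # greedy draft tokens, shape (batch, k)
```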

AR and AR+ represent baseline auto-regressive generation using Transformers and Transformers+, respectively. VSD denotes vanilla speculative decoding. PARD refers to the proposed method in this work.

Support Model Series
Supports acceleration for models across various sizes in the following series: Llama3, Deepseek-R1-distilled-Qwen, and Qwen1.5/2/2.5.
Summary of changes:
1. vllm/spec_decode/multi_step_worker.py:
Added the pard_infer function to support PARD-based speculative decoding. Key logic includes:
- KV cache recomputation & seq_group_metadata_list update: part of PARD's KV cache is derived from mask tokens and needs to be recomputed; new mask tokens are introduced accordingly.
- Proposal generation: uses SpeculativeProposals to obtain token proposals.
- Draft model forward: only a single forward pass of the draft model is needed to generate multiple tokens.
- Logits shape alignment: aligns the logits shapes of the target and draft models when they differ (an illustrative sketch follows this list).
- Output conversion: converts intermediate results into the standard output format.
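As an example of the logits-alignment step, here is a minimal standalone sketch (assumed shapes and behavior, not the code in multi_step_worker.py): when the draft vocabulary is smaller than the target's, the draft logits can be padded with -inf so the scorer compares both models over the same token space.

```python
# Illustrative sketch of aligning draft logits to the target vocab size
# (assumed behavior; not copied from the PR).
import torch


def align_draft_logits(draft_logits: torch.Tensor, target_vocab_size: int) -> torch.Tensor:
    draft_vocab_size = draft_logits.shape[-1]
    if draft_vocab_size >= target_vocab_size:
        return draft_logits[..., :target_vocab_size]
    pad = draft_logits.new_full(
        (*draft_logits.shape[:-1], target_vocab_size - draft_vocab_size),
        float("-inf"))  # padded token ids get zero probability after softmax
    return torch.cat([draft_logits, pad], dim=-1)


draft_logits = torch.randn(2, 8, 32000)   # (batch, num_spec_tokens, draft vocab)
print(align_draft_logits(draft_logits, 32064).shape)  # torch.Size([2, 8, 32064])
```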
2. vllm/spec_decode/batch_expansion.py:
Modified score_proposal to support the following (a toy sketch of the semantics follows this list):
- keep_index: if not None, only results for the specified indices are returned; outputs from the recomputed KV cache are excluded.
- return_output: if True, the full target_sampler_output is returned.
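Below is a toy, standalone illustration of the keep_index / return_output semantics described above (names, data, and the function body are made up for illustration; this is not the vLLM implementation in batch_expansion.py).

```python
# Toy illustration only: how keep_index filters results and return_output
# additionally exposes the raw target sampler output.
from typing import Optional, Sequence


def score_proposal(all_scores: list[float], target_sampler_output: dict,
                   keep_index: Optional[Sequence[int]] = None,
                   return_output: bool = False):
    scores = all_scores
    if keep_index is not None:
        # Only return results for the requested indices; entries that exist
        # purely for KV-cache recomputation are left out of keep_index.
        scores = [all_scores[i] for i in keep_index]
    if return_output:
        return scores, target_sampler_output
    return scores


scores = score_proposal([0.1, 0.7, 0.4], {"sampled_tokens": [11, 22, 33]},
                        keep_index=[0, 2])
print(scores)  # [0.1, 0.4]
```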
3. vllm/spec_decode/spec_decode_worker.py:
Added a new interface to enable PARD inference.
Test
Test Code
Test Result