[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill #10132
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
cc: @tdoublep, who has worked on the MLPSpeculator. If you have time, I'd appreciate your review of this PR. Thanks!
Thanks for the PR!
I am wondering if we can split this PR into two: 1) one for enabling MLPSpeculator/Medusa, and 2) one for enabling prompt logprobs.
The logic for enabling prompt log probabilities appears to be non-trivial. I'm wondering if this feature (chunked_prefill + sd + prompt_logprobs) is actively being requested. If not, can we consider postponing it for now, given the significant complexity involved in implementing it?
cc: @LiuXiaoxuanPKU
@sroy745 I should be able to take a look later this week.
Thanks for reviewing this!
No problem on my side; let's wait for a second opinion before splitting the PR.
AFAIK at least folks at IBM have shown immediate interest in this, so let's wait for more input on this matter too.
Thanks. I was not aware of this feature request. Sounds good to include it given the feature request; I will continue with my review.
Thanks for the PR! Left some comments.
Since this PR makes changes to the batch_expansion and mqa_scorer, I am wondering if we can run the sd benchmark with and without this PR and ensure that there is no impact on vanilla sd performance.
vllm/spec_decode/batch_expansion.py (Outdated)
# Add all terminal chunks sizes as well as decodes with no
# speculation to get out tokens and skip over prompt ones.
seq_meta = contracted_seq_group_metadata_list
nospec_sizes = torch.tensor([
How are we handling non-terminal chunks here? Don't we need to ignore non-terminal chunks among the prefill sequences? If so, how are we ensuring that?
We don't ignore non-terminal chunks here; we actually have to pick their corresponding "output" token (which is always -1) so that it can be post-processed. These -1s are simply discarded later, but we're just complying with the current state of post-processing.
Basically it's only needed to have a matching number of input requests and outputs (tokens/probs).
I rephrased the comment to make it hopefully a bit clearer.
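For illustration, here's a minimal sketch of that alignment idea (names are hypothetical, not the actual batch_expansion code): non-terminal prefill chunks contribute a -1 placeholder so there is exactly one output per input sequence group.

```python
# Hypothetical sketch: emit one output entry per input sequence group.
# Non-terminal prefill chunks have no sampled token, so a -1 placeholder
# is emitted for them and simply discarded later during post-processing.
from dataclasses import dataclass
from typing import List


@dataclass
class SeqGroupMeta:
    is_prompt: bool          # True for prefill chunks
    is_terminal_chunk: bool  # True if this chunk completes the prompt
    sampled_token: int = -1  # real token id only for decodes / terminal chunks


def collect_output_tokens(seq_groups: List[SeqGroupMeta]) -> List[int]:
    """Return exactly one token id per input sequence group."""
    out: List[int] = []
    for sg in seq_groups:
        if sg.is_prompt and not sg.is_terminal_chunk:
            out.append(-1)  # placeholder, dropped by post-processing
        else:
            out.append(sg.sampled_token)
    return out


if __name__ == "__main__":
    groups = [
        SeqGroupMeta(is_prompt=True, is_terminal_chunk=False),
        SeqGroupMeta(is_prompt=True, is_terminal_chunk=True, sampled_token=42),
        SeqGroupMeta(is_prompt=False, is_terminal_chunk=False, sampled_token=7),
    ]
    assert collect_output_tokens(groups) == [-1, 42, 7]
```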
# Split loop into prefill|decode for readability.
start_loc, i = 0, 0
while i < len(target_seq_group_metadata_list
I am wondering if we can split this into two separate methods: one for handling the prefills and the other for the decodes?
Easily; I just wanted to highlight the fact that we're still only looping once.
On second thought, I think the only option to keep things clean and avoid repetition is to split the prefills|decodes list first and then process each in its own function; but then again, there aren't that many lines here anyway.
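A rough sketch of that split-first structure (purely illustrative, with made-up names; it assumes prefills are ordered before decodes in the batch, as the scheduler does today):

```python
# Illustrative sketch: partition the batch once, then handle each part
# in its own helper instead of branching inside a single loop.
from typing import Dict, List, Tuple


def split_prefills_decodes(
        seq_groups: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    prefills = [sg for sg in seq_groups if sg["is_prompt"]]
    decodes = [sg for sg in seq_groups if not sg["is_prompt"]]
    return prefills, decodes


def process_prefills(prefills: List[Dict]) -> List[int]:
    # e.g. collect prompt logprobs and -1 placeholders for non-terminal chunks
    return [
        sg["token"] if sg["is_terminal_chunk"] else -1 for sg in prefills
    ]


def process_decodes(decodes: List[Dict]) -> List[int]:
    # e.g. collect speculative / accepted tokens
    return [sg["token"] for sg in decodes]


def process_batch(seq_groups: List[Dict]) -> List[int]:
    prefills, decodes = split_prefills_decodes(seq_groups)
    return process_prefills(prefills) + process_decodes(decodes)
```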
# scheduling on baseline too, we get slightly different logprobs, ending
# up sampling different tokens at the tail (ie top tokens don't change).
# TL;DR: sd+cp == org+cp but sd+cp != org..is this expected?
maybe_enable_chunked_prefill(prefill_chunk_size, baseline_llm_kwargs)
I guess the scheduling changes can change the batching, which in turn can lead to a different output? FAQ #3 here refers to a similar issue: https://docs.vllm.ai/en/latest/usage/faq.html.
Yeah, that was my guess too.
QQ: if we are using greedy decoding, shouldn't only one token have probability 1 and all other tokens have probability 0?
We can sample greedily, but the output is still a regular distribution; regardless, in this test we also compare the ranking and probabilities of the tokens that were not sampled (not top-1).
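Small illustration of that point in plain PyTorch (not the test code itself): greedy decoding only affects which token is picked, while the full log-softmax distribution is still there and its ranking can be compared between runs.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
logprobs = torch.log_softmax(logits, dim=-1)  # full distribution still exists

greedy_token = torch.argmax(logprobs).item()  # greedy pick, index 0 here

# Ranks/logprobs of the tokens that were *not* sampled can still be compared
# between two runs (e.g. with and without chunked prefill).
topk = torch.topk(logprobs, k=3)
print(greedy_token, topk.indices.tolist(), topk.values.tolist())
```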
Hmm, I've seen issues like that quite frequently with fp16, but it normally goes away in fp32 (which I think this test is running in). There is still no guarantee you will get exactly the same logprobs, though.
Sure, good idea. Let me address the review changes, then I can post some numbers on that.
seq_group_meta.token_chunk_size)
prompt_token_ids = prompt_token_ids[start:end]
prompt_logprobs = [
create_logprobs_output(
QQ: if the user does not need logprobs, why do we still create fake logprobs here?
TBH I am not sure either; I suppose it's just to comply with the post-processing code. I would love to remove it, but we'd need a separate PR because it's already there. Maybe @tjohnson31415 knows about this.
I've added tests for Medusa (so that the PR content actually matches its title) but disabled chunked prefill compatibility with EAGLE, as we still have some issues to address there.
Thanks for adding support for this! I have a few questions.
I may be missing something, but it seems like most of the changes here relate to enabling prompt logprobs with chunked prefill + spec decode, and the changes related to MLPSpec/Medusa (e.g., the hidden states stuff) are a relatively small piece?
@@ -418,15 +441,19 @@ def test_mlp_different_k(vllm_runner, common_llm_kwargs,
# Use smaller output len for fast test.
32,
])
# test with chunk size >= `speculative_disable_by_batch_size`
Why would this be a case that we need to test? Isn't `prefill_chunk_size` measured in tokens, and `speculative_disable_by_batch_size` measured in number of sequences?
Good point! I will change the comment. I was referring to the arg to `maybe_enable_chunked_prefill`, which sets both the number of tokens as well as the number of sequences. In practice you end up "converging" to `batch_size = max_num_seqs = prefill_chunk_size` as you get all decodes with size 1.
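For context, a sketch of what such a test helper could look like under that assumption (this is not necessarily the exact helper in the vLLM test suite):

```python
def maybe_enable_chunked_prefill(chunk_size: int, llm_kwargs: dict) -> None:
    """Assumed behavior: a positive chunk_size enables chunked prefill and
    caps both the per-step token budget and the number of sequences."""
    if chunk_size > 0:
        llm_kwargs.update(
            enable_chunked_prefill=True,
            max_num_batched_tokens=chunk_size,
            max_num_seqs=chunk_size,
        )


llm_kwargs: dict = {}
maybe_enable_chunked_prefill(4, llm_kwargs)
# llm_kwargs now caps both tokens and sequences at 4, so once the batch is
# all decodes (1 token each) it converges to batch_size == prefill_chunk_size.
```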
Could you update the comment here, since it's confusing?
Also, why is max_num_seqs = prefill_chunk_size?
No reason; this is just an arbitrary test value that I kept from existing tests.
# Scoring model may also return logprobs for prompt tokens
# for each request, when chunked prefill is enabled.
Can we also generate logprobs using spec decode if chunked prefill is not enabled?
Yes, previously available features shouldn't be affected by this change. We were able to get away with fewer lines of code because we either had all prompts (so no speculation, see https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/spec_decode_worker.py#L654) or all decodes.
In particular, the addition here was necessary because the speculator now runs prefills too, so it needed a way to report the prompt_logprobs back to the worker.
start = 1 if seq_data._num_computed_tokens == 0 \
    else seq_data._num_computed_tokens
Why do we skip the first location when `seq_data._num_computed_tokens == 0` but not otherwise?
We have no probability for the first token of the prompt: the model only gives p(x_i | x_{i-1}, ...), so there is no prior p(x_0).
Same logic as https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/spec_decode_worker.py#L591.
Thanks for the review!
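To make the off-by-one concrete, a generic sketch (not the vLLM implementation): the logits at position i condition on tokens 0..i and therefore score token i+1, so nothing ever scores token 0.

```python
import torch

torch.manual_seed(0)
vocab_size = 10
prompt = torch.tensor([3, 1, 4, 1, 5])          # token ids of the prompt
logits = torch.randn(len(prompt), vocab_size)   # one logits row per position
logprobs = torch.log_softmax(logits, dim=-1)

# Row i scores token i+1; token 0 has no predecessor, hence `start = 1`.
scored = logprobs[:-1].gather(1, prompt[1:, None]).squeeze(1)
assert scored.numel() == len(prompt) - 1        # one logprob per token except x_0
```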
I am afraid so. I was hoping support for prompt logprobs could be added with less effort, but it ended up having to be quite invasive.
LGTM. Left one comment about a TP > 1 test. PTAL.
I am wondering if we could do one round of testing around the following:
- Compare the TPOT with and without this PR for vanilla sd runs (with and without prompt logprobs) and make sure there is no regression.
- Do a sanity run for an MLPSpeculator with target model tp >= 1.
- Do a sanity run for chunked_prefill + sd with a regular draft model and target model tp >= 1 (with and without logprobs).
I've added the benchmark results @sroy745, let me know what you think.
Thanks for sharing the results. LGTM. There are some spec_decoding test failures. PTAL.
Thanks for the heroic efforts on this @NickLucche, and the detailed reviews @sroy745 @tdoublep.
@LiuXiaoxuanPKU is giving this a final look over.
Thanks for the work here. I'm good with it now; just some very minor things.
Follow-up to #9291, an attempt at fixing prompt_logprobs and enabling hidden-state-based speculators (MLPSpeculator/Medusa).
The main issue with prompt_logprobs is that it changes the output of a mixed prefill-decode batch to have #prompt_tokens + #decode_tokens entries instead of just #sampling_entries (terminal chunks only), and the current code was not accounting for that.
My approach currently relies on splitting prefill and decode processing to account for that; I'm really open to anything more elegant here.
Regarding hidden states, we have to disregard those coming from non-terminal chunks (the logits_processor already discards those; we simply have to adjust the code to reflect it) in order to store the last latent we actually care about.
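As a rough illustration of the hidden-state handling described above (a sketch under assumed tensor layouts, not the actual worker code): only the entries belonging to decodes or terminal prompt chunks are kept, mirroring what the logits processor already does.

```python
import torch


def keep_relevant_hidden_states(hidden_states: torch.Tensor,
                                keep_mask: torch.Tensor) -> torch.Tensor:
    """hidden_states: [num_entries, hidden_dim] (assumed layout).
    keep_mask: bool mask, True for decode steps and terminal prompt chunks.
    Non-terminal prefill chunks are dropped so that only the last latent
    we actually care about is stored for the MLPSpeculator/Medusa heads."""
    return hidden_states[keep_mask]


hidden = torch.randn(4, 8)
mask = torch.tensor([False, True, True, False])  # entries 1 and 2 are kept
assert keep_relevant_hidden_states(hidden, mask).shape == (2, 8)
```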
Benchmarks
Reporting results of benchmarks run on 4xA100-80GB with the following configuration (MultistepSpec regression check):
Benchmark client command:
python3 benchmarks/benchmark_serving.py \
    --backend vllm \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tokenizer meta-llama/Meta-Llama-3.1-70B-Instruct \
    --num-prompts 20 \
    --endpoint /v1/completions \
    --save-result --request-rate 2/4/6/8
Median TPOT is reported here: slightly worse on median TPOT but slightly better on throughput (consistent with the MQAScorer too). Overall I'd say performance is similar.
[Benchmark plot: median TPOT comparison]
Detail of the request rate = 8 run follows.
For the sake of completeness I also ran the same TP=4 benchmark with --enforce-eager=True to force MQAScorer (comparison).
All results are here: https://drive.google.com/file/d/1WOndRnE9STbr7TNNm-jBmHTyARKhkqIc/view?usp=sharing.