[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling #3103
Conversation
cool!
Ready for review. cc @LiuXiaoxuanPKU @ymwangg @robertgshaw2-neuralmagic @Yard1
Thanks for the great work! Just some minor questions & comments.
i for i, (_, proposal_len) in enumerate(
    zip(seq_group_metadata_list, proposal_lens_list))
if proposal_len == 0
]
Nit: we can merge the two for loops above.
Seems there are two concerns here:
- performance of two loops
- readability of two loops
I'll defer performance optimization until later; I'll put these into a helper function to make it easier to read.
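For illustration, a minimal sketch of the kind of helper that could do both passes at once; the function name and signature are hypothetical, not the PR's actual code:

```python
from typing import List, Tuple

def split_by_proposal_len(
        proposal_lens: List[int]) -> Tuple[List[int], List[int]]:
    """Split batch indices into speculative and non-speculative groups
    in a single pass over the proposal lengths."""
    spec_indices: List[int] = []
    non_spec_indices: List[int] = []
    for i, proposal_len in enumerate(proposal_lens):
        (spec_indices if proposal_len > 0 else non_spec_indices).append(i)
    return spec_indices, non_spec_indices

# Sequences with proposal_len == 0 fall back to normal decoding.
spec, non_spec = split_by_proposal_len([3, 0, 3, 0])
assert spec == [0, 2] and non_spec == [1, 3]
```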
@@ -0,0 +1,347 @@
from typing import List, Tuple, Optional, Dict
Currently, there is some complexity from handling spec and non-spec sequences through separate code paths. In the future, we will remove this complexity by introducing variable proposal lengths and the FlashInfer kernel. Maybe we can add some comments about this?
Good point, will add some comment!
    accepted_token_ids: torch.Tensor,  # shape: [batch_size, k+1]
    k: int,
) -> List[SamplerOutput]:
    """Given the accepted token ids, create a list of SamplerOutput.
Thanks for adding comments for almost all the functions! Really appreciate it!
😄
seq_data = next(iter(seq_group_metadata.seq_data.values()))
seq_len = seq_data.get_len()

if seq_len + max_proposal_len < self._max_model_len:
Maybe add a comment here saying:
(1) we want to account for the different max model lengths of the draft and target models, and
(2) for now, proposal_lens can only be max_proposal_len or 0; it cannot be a length between 0 and max_proposal_len.
Good points, will add!
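To make point (2) concrete, a hedged sketch of the all-or-nothing proposal-length rule described above; the function name is hypothetical and not the code in this PR:

```python
def get_proposal_len(seq_len: int, max_proposal_len: int,
                     max_model_len: int) -> int:
    """Return the number of tokens to speculate for one sequence.

    For now the proposal length is all-or-nothing: either the full
    max_proposal_len (when the speculated tokens still fit within the
    model's max length) or 0 (skip speculation for this sequence).
    Intermediate lengths are not produced.
    """
    if seq_len + max_proposal_len < max_model_len:
        return max_proposal_len
    return 0

# A sequence near the model length limit gets no speculation at all.
assert get_proposal_len(seq_len=10, max_proposal_len=5, max_model_len=100) == 5
assert get_proposal_len(seq_len=98, max_proposal_len=5, max_model_len=100) == 0
```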
vllm/spec_decode/batch_expansion.py
Outdated
if non_spec_indices:
    all_tokens[non_spec_indices, 0] = non_spec_target_token_ids
    all_probs[non_spec_indices, 1:, :] = non_spec_target_probs
A bit confused by the 1 here, why starting from 1?
This is a bug! Should be `:1`. Saved me some headache during correctness testing, thank you 😄
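For clarity, a small self-contained sketch of the corrected slice; shapes and tensor values here are illustrative placeholders, not the PR's actual variables:

```python
import torch

batch_size, k, vocab_size = 4, 3, 8
non_spec_indices = [1, 3]

all_tokens = torch.full((batch_size, k + 1), -1, dtype=torch.long)
all_probs = torch.zeros(batch_size, k + 1, vocab_size)

# Non-speculative sequences contribute exactly one decoded token each, so
# only position 0 (slice :1) should be written -- not positions 1: onward.
non_spec_target_token_ids = torch.tensor([5, 2])
non_spec_target_probs = torch.rand(len(non_spec_indices), 1, vocab_size)

all_tokens[non_spec_indices, 0] = non_spec_target_token_ids
all_probs[non_spec_indices, :1, :] = non_spec_target_probs
```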
vllm/spec_decode/batch_expansion.py
Outdated
all_tokens = torch.ones(
    original_bs, k + 1, device=self._device, dtype=torch.long) * -1
instead of torch.ones * -1, do torch.full with -1 as the fill value
added, thanks
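A minimal sketch of the suggested change, with placeholder values standing in for original_bs, k, and self._device:

```python
import torch

original_bs, k = 4, 3
device = "cpu"  # stand-in for self._device

# Equivalent to torch.ones(...) * -1, but allocates and fills in one step
# and avoids the extra elementwise multiply.
all_tokens = torch.full((original_bs, k + 1), -1,
                        device=device, dtype=torch.long)
```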
Hi, I was wondering why batch expansion is designed to support only top-1 cases and not tree-based batches. Any specific reasons for that?
This PR implements a vLLM Worker which invokes a draft worker to obtain proposals, invokes a target worker to obtain probabilities of each proposal, and then applies rejection sampling to accept/reject each speculated token. It is a part of the speculative decoding contribution by Anyscale to vLLM, see #2188 for more info.
High-level design
The high-level design is as follows:
Currently, only the "draft model" approach to speculative decoding is implemented with top-1 proposals from the draft model, and lossless rejection sampling. In the future, other proposal approaches may be added, such as Medusa/Eagle (requiring top-k proposals/tree attention scoring), Lookahead, RAG, etc. The key contribution of this PR is a light framework for proposing, scoring, and verifying speculative tokens using non-contiguous KV memory.
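As background, the acceptance rule behind lossless rejection sampling can be sketched as follows. This is a simplified, single-sequence illustration and not the PR's batched rejection sampler; it also omits the bonus token sampled when all k proposals are accepted:

```python
import torch

def rejection_sample_one_seq(
        draft_token_ids: torch.Tensor,  # [k]
        draft_probs: torch.Tensor,      # [k, vocab_size]
        target_probs: torch.Tensor,     # [k, vocab_size]
) -> torch.Tensor:
    """Accept/reject one sequence's k proposed tokens losslessly.

    Token x_i is accepted with probability min(1, p_target(x_i) / p_draft(x_i)).
    At the first rejection, a replacement token is drawn from the normalized
    residual distribution max(p_target - p_draft, 0) and the remaining
    proposals are discarded.
    """
    accepted = []
    for i, tok in enumerate(draft_token_ids.tolist()):
        p_target = target_probs[i, tok]
        p_draft = draft_probs[i, tok]  # > 0, since the draft sampled this token
        accept_prob = min(1.0, (p_target / p_draft).item())
        if torch.rand(1).item() < accept_prob:
            accepted.append(tok)
        else:
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            replacement = torch.multinomial(residual / residual.sum(), 1).item()
            accepted.append(replacement)
            break
    return torch.tensor(accepted, dtype=torch.long)
```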
Notes for reviewers
What is "batch expansion"?
This PR does not use MQA for scoring proposal tokens. Instead, it uses the single-query PagedAttention kernel (aka, normal vLLM decode attention) to perform scoring of the proposal tokens. This was done because at the time of implementation, we did not yet have performant MQA kernels for non-contiguous KV memory. We now have an abundance of these (notably, FlashAttention and FlashInfer, along with Triton implementations, e.g. in #2607). Batch expansion should be replaced by these to obtain some efficiency gain in verification time.
More details on batch expansion and the optimization opportunity can be found here.
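As a rough mental model (not the actual implementation in vllm/spec_decode/batch_expansion.py), batch expansion turns each sequence with k proposed tokens into k+1 single-query decode entries, one per proposal prefix, so the existing single-query attention kernel can score every proposal position in one target-model forward pass:

```python
from typing import List

def expand_batch(token_ids: List[List[int]],
                 proposals: List[List[int]]) -> List[List[int]]:
    """Toy batch expansion: one entry per proposal prefix per sequence."""
    expanded = []
    for seq, prop in zip(token_ids, proposals):
        for i in range(len(prop) + 1):
            # The target model scores the next-token distribution after each
            # prefix; the expanded batch is (k + 1)x the original batch size,
            # which is the inefficiency MQA-style kernels would remove.
            expanded.append(seq + prop[:i])
    return expanded

# One sequence with k=2 proposed tokens becomes 3 single-query entries.
print(expand_batch([[10, 11]], [[7, 8]]))
# [[10, 11], [10, 11, 7], [10, 11, 7, 8]]
```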