
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding #3250

Merged: 26 commits merged into vllm-project:main on Apr 1, 2024

Conversation

@cadedaniel (Collaborator) commented Mar 7, 2024

This PR introduces the concept of lookahead scheduling. With lookahead scheduling, each sequence in a decode batch is allocated KV slots that do not yet have any token assigned ("empty slots"). Speculative decoding fills these KV slots with the KV of speculative tokens when running the target model. Furthermore, if speculative decoding uses a proposal method that has its own KV cache, that method will also use these KV slots during normal autoregressive generation.


See these step-by-step examples explaining how Lookahead scheduling works.

Note: we could use scratch space for these KVs; however, in the case where tokens are accepted we would need to copy the accepted KV from the scratch space to the allocated KV slots. By allocating the slots ahead of time, we avoid the complexity of scheduling such a memcpy.
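
As a toy illustration of the slot accounting (hypothetical block size and function, not the actual vLLM block manager), allocating lookahead slots simply means reserving room for tokens that do not exist yet:

# Toy illustration of lookahead slot accounting (hypothetical; the real
# vLLM block manager tracks block tables and physical blocks).
BLOCK_SIZE = 16  # assumed number of KV slots per block

def num_blocks_needed(seq_len: int, num_lookahead_slots: int) -> int:
    """Blocks required for every generated token plus the empty lookahead
    slots that the target model will fill with speculative tokens' KV."""
    total_slots = seq_len + num_lookahead_slots
    return -(-total_slots // BLOCK_SIZE)  # ceiling division

# A 30-token sequence with 5 lookahead slots needs 3 blocks instead of 2,
# because the empty slots are allocated ahead of time.
assert num_blocks_needed(30, 0) == 2
assert num_blocks_needed(30, 5) == 3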

Testing

  • This PR finishes the copy-on-write scheduler integration from [Core] [Bugfix] Refactor block manager subsystem for better testability #3492 (since now there can be multiple CoW per append_slots).
    • A test is added where we verify v1/v2 block manager output equality when there are copy-on-writes: test_v1_v2_greedy_equality_with_cow
  • test_lookahead_greedy_equality_with_preemption tests equality of generation with lookahead enabled vs. disabled, and includes preemption (see the sketch below this list).
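
A minimal sketch of the greedy-equality pattern these tests follow (hypothetical harness and names, not the actual vLLM test code):

# Hypothetical equality harness: generate greedily with and without the
# feature under test (e.g. lookahead slots) and require identical output.
from typing import Callable, List

def assert_greedy_equality(
        baseline_generate: Callable[[List[str], int], List[str]],
        candidate_generate: Callable[[List[str], int], List[str]],
        prompts: List[str],
        max_tokens: int) -> None:
    baseline = baseline_generate(prompts, max_tokens)
    candidate = candidate_generate(prompts, max_tokens)
    # Greedy decoding is deterministic, so enabling the feature must not
    # change the generated tokens.
    assert baseline == candidate, f"{baseline} != {candidate}"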

Temporary flag

A temporary flag --num-lookahead-slots is added to facilitate testing. It will be removed in PR 6/9 of the speculative decoding oss plan.

@cadedaniel cadedaniel changed the title [WIP] [Speculative decoding 4/9] Scheduler allocates >1 slot per sequence per step [WIP] [Speculative decoding 4/9] Lookahead scheduling for speculative decoding Mar 28, 2024
@cadedaniel cadedaniel changed the title [WIP] [Speculative decoding 4/9] Lookahead scheduling for speculative decoding [Speculative decoding 4/9] Lookahead scheduling for speculative decoding Mar 28, 2024
@cadedaniel cadedaniel marked this pull request as ready for review March 28, 2024 02:16
@cadedaniel (Collaborator Author)

Ready for review cc @LiuXiaoxuanPKU

@LiuXiaoxuanPKU (Collaborator) left a comment

Thanks for the PR! Just left some questions.

num_touched_blocks = 0
for seq in seq_group.get_seqs(status=SequenceStatus.RUNNING):
    block_table = self.block_tables[seq.seq_id]
    num_new_tokens = seq.get_len() - block_table.num_full_slots
Collaborator

what's the definition of num_new_tokens here?

Collaborator Author

This is the number of tokens that do not have a slot allocated in the block table. Will add a comment.

Collaborator

From my understanding, num_new_tokens == len(unseen_token_ids)?

Collaborator Author

Yes. I will rename them so the naming is consistent.

Collaborator Author

added
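
For intuition, a toy version of that relationship (simplified stand-ins, not the actual vLLM classes):

# Toy block table illustrating num_new_tokens == len(unseen_token_ids)
# (simplified stand-in; the real BlockTable maps slots to physical blocks).
class ToyBlockTable:
    def __init__(self) -> None:
        self.num_full_slots = 0  # slots that already hold a token's KV

    def append_token_ids(self, token_ids) -> None:
        self.num_full_slots += len(token_ids)

seq_token_ids = [101, 7, 42, 9, 13]        # seq.get_len() == 5
table = ToyBlockTable()
table.append_token_ids(seq_token_ids[:3])  # 3 tokens already have slots

unseen_token_ids = seq_token_ids[table.num_full_slots:]
num_new_tokens = len(seq_token_ids) - table.num_full_slots
assert num_new_tokens == len(unseen_token_ids) == 2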

    self,
    seq: Sequence,
    num_lookahead_slots: int,
) -> Optional[Tuple[int, int]]:
Collaborator

num_lookahead_slots is not used in this function; is that expected?

Collaborator Author

Not expected; I will add an ensure_num_empty_slots call with num_lookahead_slots and see why the test didn't catch this.

Collaborator Author

fixed; added test.
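
For reference, a toy sketch of what ensuring empty slots for lookahead amounts to (hypothetical and simplified; the real ensure_num_empty_slots in vllm/core allocates physical blocks):

# Toy sketch of ensure_num_empty_slots semantics (hypothetical, simplified).
BLOCK_SIZE = 16

class ToyBlockTable:
    def __init__(self, num_full_slots: int, num_allocated_slots: int) -> None:
        self.num_full_slots = num_full_slots            # slots holding token KV
        self.num_allocated_slots = num_allocated_slots  # slots backed by blocks

    def ensure_num_empty_slots(self, num_empty: int) -> None:
        # Allocate whole blocks until there is room for num_empty extra KVs.
        while self.num_allocated_slots < self.num_full_slots + num_empty:
            self.num_allocated_slots += BLOCK_SIZE

table = ToyBlockTable(num_full_slots=30, num_allocated_slots=32)
table.ensure_num_empty_slots(5)  # reserve room for 5 speculative tokens' KV
assert table.num_allocated_slots == 48  # a third block was allocated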

vllm/core/block_manager_v2.py (resolved)
        )
        return scheduler_outputs

    def _can_append_slots(self, seq_group: SequenceGroup) -> bool:
Collaborator

Nit: I feel the two functions below, _can_append_slots and _can_swap_in, are a bit shallow and do not hide much complexity. Maybe we can call them directly above?

@cadedaniel (Collaborator Author) left a comment

PR feedback applied, @LiuXiaoxuanPKU


@cadedaniel cadedaniel enabled auto-merge (squash) April 1, 2024 18:25
@cadedaniel cadedaniel disabled auto-merge April 1, 2024 20:34
@cadedaniel cadedaniel enabled auto-merge (squash) April 1, 2024 22:02
@cadedaniel cadedaniel merged commit 93deb0b into vllm-project:main Apr 1, 2024
33 checks passed
@cadedaniel cadedaniel deleted the multi-step-scheduler branch April 1, 2024 22:59
@animan42

@cadedaniel really awesome series of changes! I assume the answer is no, but does the draft model also have its own KV cache? If yes, where is it created and updated?

@cadedaniel (Collaborator Author)

Thanks! Great question. The draft model can have KV, and by default it does. The spec decode worker has a proposer worker, which can be a normal vLLM Worker with a KV cache and everything. The number of blocks in the draft and target KV caches is calculated by this function:

def determine_num_available_blocks(self) -> Tuple[int, int]:
    """Determine the number of cache blocks to use.

    This is done by profiling the scorer model (which is typically the
    larger of the two). Then the total memory which would be used by the
    scorer cache is divided evenly between the proposer and scorer model KV,
    such that the number of blocks is equal in both KV caches.
    """
    num_gpu_blocks, num_cpu_blocks = (
        self.scorer_worker.determine_num_available_blocks())

    scorer_cache_block_size_bytes = (
        self.scorer_worker.get_cache_block_size_bytes())
    proposer_cache_block_size_bytes = (
        self.proposer_worker.get_cache_block_size_bytes())

    new_num_gpu_blocks = split_num_cache_blocks_evenly(
        scorer_cache_block_size_bytes, proposer_cache_block_size_bytes,
        num_gpu_blocks)
    return new_num_gpu_blocks, num_cpu_blocks
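
To make the even split concrete, here is a hedged sketch of the arithmetic (an assumption about what split_num_cache_blocks_evenly computes; the actual vLLM implementation may differ):

# Hedged sketch of the even split (assumed arithmetic, not the exact vLLM code).
def split_num_cache_blocks_evenly(scorer_block_bytes: int,
                                  proposer_block_bytes: int,
                                  total_num_gpu_blocks: int) -> int:
    """Pick N so that N scorer blocks plus N proposer blocks fit in the
    memory the scorer-only cache would have used."""
    scorer_cache_bytes = total_num_gpu_blocks * scorer_block_bytes
    return scorer_cache_bytes // (scorer_block_bytes + proposer_block_bytes)

# Example: blocks of 160 bytes (scorer) and 40 bytes (proposer); 1000
# scorer-only blocks become 800 blocks for each of the two caches.
assert split_num_cache_blocks_evenly(160, 40, 1000) == 800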

As for the mapping between token IDs and block indices, we keep things simple by using the same block mapping for the draft and target models. We can do this because of the code above, which ensures that the draft and target KV caches have the same amount of logical KV space. With this, the draft model's KV is populated in lockstep with the target model's.

There are more details, like how proposal KV are handled, but this is the gist!

@xunfeng1980

  File "/data/anaconda3/envs/qwen-q/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/anaconda3/envs/qwen-q/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/vllm/vllm/entrypoints/openai/api_server.py", line 157, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 347, in from_engine_args
    engine = cls(
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/data/vllm/vllm/engine/async_llm_engine.py", line 421, in _init_engine
    return engine_class(*args, **kwargs)
  File "/data/vllm/vllm/engine/llm_engine.py", line 119, in __init__
    self.model_executor = executor_class(
  File "/data/vllm/vllm/executor/gpu_executor.py", line 37, in __init__
    assert (not speculative_config
AssertionError: Speculative decoding not yet supported for GPU backend

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024