[Performance] [Speculative decoding] Support draft model on different tensor-parallel size than target model #4933

GeauxEric · 2024-05-21T05:03:01Z

FIX #4632

cadedaniel · 2024-05-23T17:41:33Z

Thanks! Ping me when this PR is ready!

wooyeonlee0 · 2024-05-27T00:46:24Z

Great! I'm looking forward to this feature :)

GeauxEric · 2024-05-27T22:16:13Z

I'm not very familiar with distributed training and inference, so I spent some time reading the codebase.

@cadedaniel, I have two questions about the expected behavior.

First question. For the proposal model, does its TP_size also need to satisfy world_size == TP_size * PP_size? If so, then with TP_size == 1 we would need PP_size > 1, which is not supported yet. A temporary solution is to let each worker have its own copy of the small draft model.

Second question. If we do need to run distributed inference for the draft model, there are now two models (scoring and proposal). Does that mean the function initialize_model_parallel (shown below) needs to be invoked twice on each worker, once for the scoring model and once for the proposal model?

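from typing import Optional


def initialize_model_parallel(
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
    backend: Optional[str] = None,
) -> None:
    """
    Initialize model parallel groups.
    """
    ...  # rest of the docstring and the body are cut off in the original comment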

cadedaniel · 2024-06-03T20:15:12Z

> First question. For the proposal model, does its TP_size also need to satisfy world_size == TP_size * PP_size? If so, then with TP_size == 1 we would need PP_size > 1, which is not supported yet. A temporary solution is to let each worker have its own copy of the small draft model.

Let's leave out the PP case for now. In the future we can add more configurations that benefit PP latency. You can assume PP size is always 1.
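
As a rough illustration of the constraint under discussion, the sketch below checks world_size == TP_size * PP_size with PP fixed at 1, and lists the single-rank groups that would give every worker its own copy of a TP-1 draft model. The names check_parallel_config and single_rank_groups are hypothetical helpers for this example and are not part of vLLM.

```python
from typing import List


def check_parallel_config(world_size: int, tp_size: int, pp_size: int = 1) -> None:
    """Raise ValueError unless world_size == tp_size * pp_size."""
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive")
    if world_size != tp_size * pp_size:
        raise ValueError(
            f"world_size ({world_size}) != tp_size ({tp_size}) * pp_size ({pp_size})"
        )


def single_rank_groups(world_size: int) -> List[List[int]]:
    """One group per rank, so each worker holds a full copy of the draft model."""
    return [[rank] for rank in range(world_size)]


# Target model: TP=4 and PP=1 on 4 GPUs satisfies the constraint.
check_parallel_config(world_size=4, tp_size=4)

# Draft model with TP=1: every rank gets its own single-member group.
draft_groups = single_rank_groups(4)
```

A draft model with TP_size == 1 fails the same check against the 4-rank world (4 != 1 * 1), which is why it would need its own process groups instead of reusing the target model's.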

> Second question. If we do need to run distributed inference for the draft model, there are now two models (scoring and proposal). Does that mean the function initialize_model_parallel needs to be invoked twice on each worker, once for the scoring model and once for the proposal model?

Good question. In my internal fork we had the ability to skip initialization the second time. See should_init_distributed_env=False in the following code.

    def init_model(self):
        """Initialize the model on all ranks.
        This also creates a single-rank process group containing only the
        self process.
        """
        world_rank = torch.distributed.get_rank()
        self._single_tp_group = torch.distributed.new_group([world_rank])

        with patch_tensor_parallel_group(self._single_tp_group):
            self._worker.init_model(should_init_distributed_env=False)
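
The snippet above does not show how patch_tensor_parallel_group works. One plausible shape, sketched here purely as an assumption, is a context manager that temporarily swaps a module-level tensor-parallel group and restores the previous one on exit. The global _TP_GROUP and the get/set helpers are hypothetical stand-ins, and plain strings are used in place of real torch.distributed process groups so that the example runs without a distributed setup.

```python
from contextlib import contextmanager
from typing import Any, Iterator, Optional

# Hypothetical stand-in for the module-level tensor-parallel group.
_TP_GROUP: Optional[Any] = None


def set_tensor_parallel_group(group: Any) -> None:
    global _TP_GROUP
    _TP_GROUP = group


def get_tensor_parallel_group() -> Optional[Any]:
    return _TP_GROUP


@contextmanager
def patch_tensor_parallel_group(group: Any) -> Iterator[None]:
    """Temporarily make `group` the active TP group, then restore the old one."""
    global _TP_GROUP
    previous = _TP_GROUP
    _TP_GROUP = group
    try:
        yield
    finally:
        # Restore even if model initialization raises.
        _TP_GROUP = previous


# The target model's group is active by default.
set_tensor_parallel_group("target_tp_group")

with patch_tensor_parallel_group("single_rank_group"):
    # Code in this block sees the single-rank group.
    group_inside = get_tensor_parallel_group()

group_after = get_tensor_parallel_group()
```

Restoring in a finally clause keeps the target model's group intact even when draft-model initialization fails partway through.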

Then in the spec decode worker we initialize the larger model first.

vllm/vllm/spec_decode/spec_decode_worker.py, lines 160 to 163 in cafb8e0:

    # The scorer worker model is initialized first in case the proposer
    # model has a smaller TP degree than the target worker.
    self.scorer_worker.init_device()
    self.proposer_worker.init_device()
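
To make the ordering concrete, here is a toy sketch, not vLLM code, in which the scorer sets up the shared distributed environment and the proposer skips that step. ToyDistributedEnv, ToyWorker, and init_spec_decode_workers are hypothetical names, and passing should_init_distributed_env to init_device is an assumption made for the example (the comment above passes it to init_model).

```python
class ToyDistributedEnv:
    """Stand-in for global distributed state that must be set up only once."""

    def __init__(self) -> None:
        self.init_calls = 0

    def initialize(self) -> None:
        if self.init_calls > 0:
            raise RuntimeError("distributed environment already initialized")
        self.init_calls += 1


class ToyWorker:
    def __init__(self, name: str, env: ToyDistributedEnv) -> None:
        self.name = name
        self.env = env
        self.ready = False

    def init_device(self, should_init_distributed_env: bool = True) -> None:
        # Only the first worker on a rank sets up the shared environment.
        if should_init_distributed_env:
            self.env.initialize()
        self.ready = True


def init_spec_decode_workers(scorer: ToyWorker, proposer: ToyWorker) -> None:
    # Scorer (larger model) first, so the full-size environment exists
    # before the proposer, which may use a smaller TP degree, starts.
    scorer.init_device()
    proposer.init_device(should_init_distributed_env=False)


env = ToyDistributedEnv()
scorer = ToyWorker("scorer", env)
proposer = ToyWorker("proposer", env)
init_spec_decode_workers(scorer, proposer)
```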

EricDingNVD added 2 commits May 20, 2024 21:58

[speculative decoding] add argument for draft model tensor parallel size

a9bf47c

format

ac3164d

GeauxEric marked this pull request as draft May 21, 2024 05:03

rkooo567 requested a review from cadedaniel May 21, 2024 10:40

GeauxEric added 2 commits May 21, 2024 21:56

default to target model's tp size

9a30354

add unittest

27a449e

GeauxEric mentioned this pull request May 27, 2024

[Performance] [Speculative decoding]: Support draft model on different tensor-parallel size than target model #4632

Closed

GeauxEric marked this pull request as ready for review May 27, 2024 22:16

GeauxEric marked this pull request as draft June 4, 2024 20:45
