
[Feature] Generation Inputs: input_embeds #745

Open
AlekseyKorshuk opened this issue Jul 26, 2024 · 12 comments
Labels
enhancement New feature or request good first issue Good for newcomers high priority

Comments

@AlekseyKorshuk

AlekseyKorshuk commented Jul 26, 2024

Motivation

I propose to add input_embeds as an optional input to the generation params.

Why is this important

Nowadays there are many Vision Language Models (VLMs), and most share the same architecture: a vision tower, a projector, and an LLM. The vision tower plus projector simply prepares embeddings for the "image" tokens, so why not let model developers handle the preparation of input_embeds for the LLM themselves?
Many new models, such as PaliGemma and Florence, also let users work with bounding boxes and segmentation masks, which makes it quite complicated to add all the different processors and conversation templates to the codebase.
By allowing the user to provide input_embeds instead of a list of messages or text prompts, you reduce your own maintenance burden in the future.
Another point is that VLM developers can focus on caching image embeddings while building on top of SGLang, enabling even higher throughput.

vLLM users requested this feature a long time ago, and the topic gained a lot of positive attention from the community.

This unique feature would make SGLang the go-to framework for all VLMs.

I am happy to help implement this if you point me to the right places in the codebase. Thank you for your time and consideration 🤗

Proposed usages

# Via the OpenAI-compatible API (a proposed extension; here `client` would be an
# openai.OpenAI instance pointed at the SGLang server):
response = client.chat.completions.create(
    model="default",
    input_embeds=[...],  # precomputed per-token embeddings instead of `messages`
    temperature=0.8,
    max_tokens=64,
)

# Or via the native backend/runtime interface (also proposed):
backend.run(input_embeds=input_embeds)

@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can either specify text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The embeddings for input_ids; if specified, input_ids should also be provided
    input_embeds: Optional[Union[List[List[List[float]]], List[List[float]]]] = None
    # The image input. It can be a file name, a url, or base64 encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params.
    sampling_params: Union[List[Dict], Dict] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # The start location of the prompt for return_logprob.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # The number of top logprobs to return.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
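
To make the expected shape of input_embeds concrete, here is a minimal sketch (not SGLang code; the dimensions and components below are illustrative stand-ins) of how a VLM developer might assemble per-token embeddings by splicing projected image features into the text embeddings:

import torch
import torch.nn as nn

hidden_size = 4096   # hidden size of the target LLM (illustrative)
vision_dim = 1024    # output dim of the vision tower (illustrative)

# Stand-ins for the real components; in practice these come from the VLM checkpoint.
text_embedding = nn.Embedding(32000, hidden_size)  # the LLM's token embedding table
projector = nn.Linear(vision_dim, hidden_size)     # maps vision features into LLM space

text_token_ids = torch.tensor([1, 887, 526, 263])  # prompt tokens surrounding the image
image_features = torch.randn(576, vision_dim)      # patch features from the vision tower

# Splice the projected image embeddings between the text embeddings,
# where the "image" placeholder tokens would normally sit.
embeds = torch.cat(
    [
        text_embedding(text_token_ids[:2]),
        projector(image_features),
        text_embedding(text_token_ids[2:]),
    ],
    dim=0,
)

# Serialize to the nested-list format proposed for GenerateReqInput.input_embeds
# (one list of hidden_size floats per token).
input_embeds = embeds.detach().tolist()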

Related resources

@joshpxyne

+1!

@jsdir

jsdir commented Jul 26, 2024

+1

@zhyncs zhyncs added the backlog label Jul 26, 2024
@tunahfishy

!!!

@ummagumm-a

having this feature would be nice, indeed

@merrymercy merrymercy added enhancement New feature or request high priority and removed backlog labels Jul 27, 2024
@merrymercy
Contributor

merrymercy commented Jul 27, 2024

Great suggestions. Let's prioritize this one. I can share some ideas and pointers.

High-level Idea

Since many parts of the existing code rely on the concept of "input_ids: List[int]," it is not easy to fully change all of them, as this will create many problematic "if/else" conditions. I think one possible implementation idea is to create some random fake "input_ids" to make most of the existing code runnable. Then, during the actual forward pass, we can feed input_embeds instead of calling the embedding layer to encode input_ids.

You can learn more about this idea by looking at how the existing Llava implementation directly feeds input_embeds into the underlying Llama:

return self.language_model(
    input_ids, positions, input_metadata, input_embeds=input_embeds
)

if input_embeds is None:
    hidden_states = self.embed_tokens(input_ids)
else:
    hidden_states = input_embeds
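
To see the "fake input_ids" idea in isolation, here is a small self-contained sketch (not SGLang code; TinyLM and all names below are made up for illustration) showing how placeholder ids can keep an id-based pipeline running while the forward pass consumes precomputed embeddings:

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 128, hidden: int = 16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, input_embeds=None):
        # Same branch as the Llava/Llama snippet above: skip the embedding
        # lookup whenever precomputed embeddings are supplied.
        if input_embeds is None:
            hidden_states = self.embed_tokens(input_ids)
        else:
            hidden_states = input_embeds
        return self.lm_head(hidden_states)

model = TinyLM()
seq_len, hidden = 5, 16
input_embeds = torch.randn(1, seq_len, hidden)              # e.g. from a vision tower + projector
fake_input_ids = torch.zeros(1, seq_len, dtype=torch.long)  # placeholders; only their length matters
logits = model(fake_input_ids, input_embeds=input_embeds)   # shape: [1, seq_len, vocab_size]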

Implementation

The inference of a request starts with GenerateReqInput from the HTTP server; it then goes through several important classes: TokenizerManager, ModelTpServer, ModelRunner, Req, and InferBatch. To implement your change, we need to update all of these places.

  1. Implement your proposed changes to GenerateReqInput:

       class GenerateReqInput:

  2. Skip the input tokenization in TokenizerManager:

       input_ids = (
           self.tokenizer.encode(input_text)
           if obj.input_ids is None
           else obj.input_ids
       )
       if index is not None and obj.input_ids:
           input_ids = obj.input_ids[index]

  3. When creating the Req, record the input_embeds. This may also be a good place to generate the fake input_ids mentioned above.

       req = Req(recv_req.rid, recv_req.input_text, recv_req.input_ids)

  4. When preparing the inputs of a prefill batch, save input_embeds into InferBatch. In SGLang, "prefill" is also called "extend".

       def prepare_for_extend(self, vocab_size: int, int_token_logit_bias: torch.Tensor):

  5. When running the actual forward pass, feed input_embeds to the model.

       def forward_extend(self, batch: Batch):
           input_metadata = InputMetadata.create(
               self,
               forward_mode=ForwardMode.EXTEND,
               req_pool_indices=batch.req_pool_indices,
               seq_lens=batch.seq_lens,
               prefix_lens=batch.prefix_lens,
               position_ids_offsets=batch.position_ids_offsets,
               out_cache_loc=batch.out_cache_loc,
               top_logprobs_nums=batch.top_logprobs_nums,
               return_logprob=batch.return_logprob,
           )
           return self.model.forward(
               batch.input_ids, input_metadata.positions, input_metadata
           )

This is my rough idea. I haven't implemented it yet, so there may be some mistakes. I hope it is helpful.
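
As a concrete illustration of step 2 above, here is a hedged sketch of how the tokenization step could be skipped when embeddings are supplied; the helper name and structure are hypothetical and do not mirror the actual TokenizerManager code:

import random

def resolve_input_ids(obj, tokenizer, vocab_size):
    # Return token ids for a request. When precomputed embeddings are
    # provided, skip tokenization and emit random placeholder ids (the
    # "fake input_ids" above) whose only purpose is to carry the correct
    # sequence length through the id-based parts of the pipeline.
    if getattr(obj, "input_embeds", None) is not None:
        return [random.randrange(vocab_size) for _ in obj.input_embeds]
    if obj.input_ids is not None:
        return obj.input_ids
    return tokenizer.encode(obj.text)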

@Ying1123
Member

Ying1123 commented Aug 4, 2024

@AlekseyKorshuk any updates?

@AlekseyKorshuk
Author

Last week was quite busy for me, so unfortunately I have not started yet.


github-actions bot commented Oct 4, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

@github-actions github-actions bot closed this as completed Oct 4, 2024
@merrymercy merrymercy reopened this Oct 6, 2024
@RinRin-32
Contributor

RinRin-32 commented Oct 15, 2024

> Great suggestions. Let's prioritize this one. I can share some ideas and pointers. [...]
>
> 1. Implement your proposed changes to GenerateReqInput https://github.com/sgl-project/sglang/blob/3fdab91912fb271c20642e21c2055df0e23d514e/python/sglang/srt/managers/io_struct.py#L15
> 2. Skip the input tokenization in `TokenizerManager` https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/managers/tokenizer_manager.py#L142-L148
> 3. When creating the `Req`, record the `input_embeds`. Maybe here is also a good place to generate the fake input_ids mentioned above. https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/managers/controller/tp_worker.py#L263
> 4. When preparing the inputs of a prefill batch, save `input_embeds` into `InferBatch`. In SGLang, "prefill" is also called "extend". https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/managers/controller/infer_batch.py#L313
> 5. When running the actual forward pass, feed `input_embeds` to the model. https://github.com/sgl-project/sglang/blob/0736b270202696b8f865e2915aadc36d3d51811b/python/sglang/srt/managers/controller/model_runner.py#L295-L309
>
> This is my rough idea. I haven't implemented it yet, so there may be some mistakes. I hope it is helpful.

Hello, I implemented this according to the high-level overview above and managed to get input_embeds working and generating responses.

My current issue is that I can only generate with input_embeds once; if I use input_embeds to generate again, I get this error:

[08:49:39 TP0] Traceback (most recent call last):
  File "/data/rin_experiements/sglang/python/sglang/srt/managers/scheduler.py", line 994, in run_scheduler_process
    scheduler.event_loop()
  File "/data/rin_experiements/sglang/python/sglang/srt/managers/scheduler.py", line 242, in event_loop
    self.forward_step()
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/rin_experiements/sglang/python/sglang/srt/managers/scheduler.py", line 292, in forward_step
    self.forward_prefill_batch(new_batch)
  File "/data/rin_experiements/sglang/python/sglang/srt/managers/scheduler.py", line 592, in forward_prefill_batch
    logits_output, next_token_ids = self.tp_worker.forward_batch_generation(
  File "/data/rin_experiements/sglang/python/sglang/srt/managers/tp_worker.py", line 114, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/data/rin_experiements/sglang/python/sglang/srt/model_executor/model_runner.py", line 521, in forward
    return self.forward_extend(forward_batch)
  File "/data/rin_experiements/sglang/python/sglang/srt/model_executor/model_runner.py", line 496, in forward_extend
    return self.model.forward(
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/rin_experiements/sglang/python/sglang/srt/models/qwen2.py", line 290, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/rin_experiements/sglang/python/sglang/srt/models/qwen2.py", line 256, in forward
    hidden_states, residual = layer(
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/rin_experiements/sglang/python/sglang/srt/models/qwen2.py", line 208, in forward
    hidden_states = self.self_attn(
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/rin_experiements/sglang/python/sglang/srt/models/qwen2.py", line 157, in forward
    attn_output = self.attn(q, k, v, forward_batch)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/rin_experiements/sglang/python/sglang/srt/layers/radix_attention.py", line 60, in forward
    return forward_batch.attn_backend.forward(q, k, v, self, forward_batch)
  File "/data/rin_experiements/sglang/python/sglang/srt/layers/attention/__init__.py", line 41, in forward
    return self.forward_extend(q, k, v, layer, forward_batch)
  File "/data/rin_experiements/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 222, in forward_extend
    forward_batch.token_to_kv_pool.set_kv_buffer(
  File "/data/rin_experiements/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 200, in set_kv_buffer
    self.k_buffer[layer_id][loc] = cache_k
RuntimeError: shape mismatch: value tensor of shape [6, 4, 128] cannot be broadcast to indexing result of shape [1, 4, 128]

Do you have any recommendations on how to navigate the repository for fixes?

Update

It turns out that using --disable-radix solves my issue, presumably because the radix cache matches the placeholder input_ids against the previous request and only schedules the non-cached suffix, while the full input_embeds are still fed to the model.

@majunze2001

@RinRin-32 Do you have a commit/branch? I am interested in taking a further look.

@RinRin-32
Contributor

@majunze2001 Sure thing! My organization worked on a fork of 0.3.2. I was discouraged from opening a pull request because the 0.3.3 structure changed drastically. Looking at the current main, my implementation would likely work there. I'll make the pull request in a week or two and link it here.

The main changes I worked on are in
python/sglang/srt/managers/
io_struct.py
schedule_batch.py
scheduler.py
tokenizer_manager.py

python/sglang/srt/model_executor/
forward_batch_info.py
model_runner.py

@RinRin-32
Contributor

RinRin-32 commented Nov 16, 2024

@majunze2001 I've just made my pull request. Please check it out at #2052.

There are still some flaws, such as the lack of args for serving with input_embeds; I've documented this in the pull request. I tried to keep the if/else conditions to a minimum, hoping other contributors can help optimize it!
