[Feature] Generation Inputs: input_embeds #745
Comments
+1!

+1

!!!

Having this feature would be nice, indeed.
Great suggestions. Let's prioritize this one. I can share some ideas and pointers.

High-level idea

Since many parts of the existing code rely on the concept of "input_ids: List[int]", it is not easy to fully change all of them, as this would create many problematic if/else conditions. I think one possible implementation idea is to create some random fake "input_ids" to make most of the existing code runnable. Then, during the actual forward pass, we can feed `input_embeds` instead of the embeddings looked up from the fake `input_ids`.

You can learn more about this idea by looking at how the existing Llava implementation directly feeds `input_embeds` into the language model:

sglang/python/sglang/srt/models/llava.py, lines 241 to 243 at 0736b27
sglang/python/sglang/srt/models/llama2.py, lines 258 to 261 at 0736b27
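The pattern those snippets show is roughly the following. This is a simplified sketch, not the actual sglang code; the embedding table and the linear layers stand in for the real model internals:

```python
from typing import Optional

import torch
from torch import nn


class LanguageModelSketch(nn.Module):
    """Simplified sketch of the llama2.py pattern: the token-embedding
    lookup is skipped whenever precomputed input_embeds are provided."""

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Stand-in for the real transformer layers.
        self.layers = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(2))

    def forward(
        self,
        input_ids: torch.Tensor,                      # may be fake placeholder ids
        input_embeds: Optional[torch.Tensor] = None,  # precomputed embeddings
    ) -> torch.Tensor:
        # Core of the idea: input_ids keep the length/batching machinery happy,
        # but the hidden states come from input_embeds when they are given.
        if input_embeds is None:
            hidden_states = self.embed_tokens(input_ids)
        else:
            hidden_states = input_embeds
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states
```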
Implementation

The inference of a request starts with …

This is my rough idea. I haven't implemented it yet, so there may be some mistakes. I hope it is helpful.
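To make the fake-`input_ids` part of this concrete, a request object could pair placeholder ids with the real embeddings. This is a minimal sketch with hypothetical names, not existing sglang classes:

```python
from dataclasses import dataclass
from typing import List, Optional

import torch


@dataclass
class GenerateRequestSketch:
    """Hypothetical request object: downstream code that expects
    input_ids: List[int] keeps working unchanged, while the forward
    pass substitutes the real embeddings."""
    input_ids: List[int]
    input_embeds: Optional[torch.Tensor] = None


def make_request(input_embeds: torch.Tensor, pad_token_id: int = 0) -> GenerateRequestSketch:
    # Fabricate one placeholder id per embedding vector so that length-based
    # logic (batching, KV cache allocation, position ids) sees a normal request.
    seq_len = input_embeds.shape[0]
    return GenerateRequestSketch(
        input_ids=[pad_token_id] * seq_len,
        input_embeds=input_embeds,
    )
```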
@AlekseyKorshuk any updates?

Last week was quite busy for me, so unfortunately I have not started yet.

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Hello, I implemented this according to the high-level overview above and managed to get input_embeds working and generating responses. My current issue is that I can only generate using input_embeds once; if I use input_embeds to generate again, I get this error:

Do you have any recommendations on how to navigate the repository for fixes?

Update: turns out using --disable-radix solves my issue.
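For reference, the radix cache is disabled with a server launch flag; in recent sglang versions the full flag name is `--disable-radix-cache`, and the model path below is just an example (worth verifying both against the version you run):

```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --disable-radix-cache
```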
@RinRin-32 Do you have a commit/branch? I am interested to take a further look.
@majunze2001 Sure thing! My organization worked based on a fork of 0.3.2. I was discouraged from making a pull request, seeing that the 0.3.3 structure changed drastically. Looking at the current main, my implementation would likely work there. I'll make the pull request in a week or two and link it here. The main changes I worked on are in python/sglang/srt/model_executor/
@majunze2001 I've just made my pull request. There are still some flaws, like the lack of args for serving using input_embeds; I've documented this in the pull request.
Motivation

I propose to add `input_embeds` as an optional input to the generation params.

Why is this important

Nowadays there are a lot of Vision Language Models (VLMs), and they all share a similar architecture: vision tower, projector, LLM. This means the vision tower plus projector just prepares embeddings for the "image" tokens. So why not let model developers handle the preparation of `input_embeds` for the LLM themselves? (A sketch of what this looks like on the developer's side follows below.)

Many new models let the user work with bounding boxes and segmentation masks, like PaliGemma and Florence, which makes it quite complicated to add all the different processors and conversation templates to the codebase. By allowing the user to provide `input_embeds` instead of a list of messages or text prompts, you reduce your own headache in the future.

Another point is that VLM developers can focus on caching image embeddings while building on top of SGLang, allowing even higher throughput.
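As a sketch of what that developer-side preparation could look like (all modules here are illustrative stand-ins with arbitrary dimensions; a real VLM would use its own vision tower and projector):

```python
import torch
from torch import nn

# Illustrative stand-ins; a real VLM supplies its own trained modules.
vision_tower = nn.Linear(1024, 4096)  # image features -> vision hidden states
projector = nn.Linear(4096, 4096)     # vision hidden states -> LLM embedding space
embed_tokens = nn.Embedding(32000, 4096)


def build_input_embeds(
    image_features: torch.Tensor,  # (num_image_tokens, 1024), e.g. from a ViT
    prompt_ids: torch.Tensor,      # (prompt_len,) token ids of the text prompt
    image_pos: int,                # insertion point of the image tokens
) -> torch.Tensor:
    """Splice projected image embeddings into the text embeddings; the result
    is the input_embeds tensor this proposal would hand to SGLang."""
    image_embeds = projector(vision_tower(image_features))
    text_embeds = embed_tokens(prompt_ids)
    return torch.cat(
        [text_embeds[:image_pos], image_embeds, text_embeds[image_pos:]], dim=0
    )
```

Since the image embeddings depend only on the image, a developer can cache them (e.g., keyed by an image hash) and reuse them across requests, which is the throughput win mentioned above.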
vLLM users requested this feature a long time ago, and the topic gained a lot of positive attention from the community (see the related resources below).

This unique feature would make SGLang the main framework for all VLMs.
I am happy to help implement this if you point me in the right direction in the codebase. Thank you for your time and consideration 🤗
Proposed usages
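As a rough illustration only (the request shape below is hypothetical, not an existing SGLang API field; it reuses the `build_input_embeds` helper sketched above):

```python
import requests
import torch

# Toy inputs for illustration; shapes match the build_input_embeds helper above.
image_features = torch.randn(16, 1024)
prompt_ids = torch.tensor([1, 2, 3, 4])

input_embeds = build_input_embeds(image_features, prompt_ids, image_pos=0)

# Hypothetical payload: the "input_embeds" field is exactly what this issue
# proposes to add alongside the existing "text" input of /generate.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "input_embeds": input_embeds.detach().tolist(),
        "sampling_params": {"max_new_tokens": 64},
    },
)
print(response.json())
```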
Related resources
LLM.generate(): vllm-project/vllm#416