Support generation from input embedding #1265
base: main
Conversation
We conducted several tests and confirmed that the performance degradation was not significant. Specifically, we ran the benchmark 5 times each for the main branch and the feature branch using the command below.
Hello authors, I have tested this PR and completed the alignment with the latest prepare_inputs function.
Hi @WoosukKwon,
I have run the added test code and it all passes, as shown below.
I know you are very busy with the large amount of interest in and requests for vLLM.
@pfldy2850 Using the same call code: the official output is significantly better than the output from the embedding version of the code. The code has not changed, so I am not sure whether the vLLM installation is incorrect or there is some other reason. Please help take a look, thank you very much! If you need any other information, I can provide it.
Hello @will-wiki, I have installed the latest vLLM release, v0.2.4, and run the script you provided.
I think the changes made to the internlm modeling file after the version you tested account for the differences. I hope this explanation helps you troubleshoot the problem.
We've been using this branch in production and it works like a charm. Thanks so much for your contribution. Can't wait for it to be merged!
Thanks for this! Any plans to merge this into main anytime soon?
Hello @zhuohan123, I just saw that you created an issue for the vLLM Q1 2024 roadmap. If you have any plans to consider this feature or to merge this PR, please let us know.
This PR would be super valuable for us. @pfldy2850, do you plan to adjust it to the current master branch? I see it is a bit outdated.
```python
    - stream: whether to stream the results or not.
    - other fields: the sampling parameters (See `SamplingParams` for details).
    """
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
    prompt_embeds = request_dict.pop("prompt_embeds", None)
    if prompt_embeds is not None:
        prompt_embeds = torch.tensor(prompt_embeds).to("cuda")
```
This loads the embeddings in float32 and eats all the GPU memory.
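A minimal sketch of a possible fix, assuming the model runs in half precision (the dtype is an assumption here, not something the PR specifies), is to cast while converting so the tensor is never materialized as float32:

```python
# Hedged sketch: cast the incoming embeddings to the (assumed) model dtype
# during conversion instead of letting torch.tensor default to float32.
prompt_embeds = torch.tensor(prompt_embeds, dtype=torch.float16, device="cuda")
```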
@@ -29,16 +30,27 @@ async def generate(request: Request) -> Response:
```python
    The request should be a JSON object with the following fields:
    - prompt: the prompt to use for the generation.
    - prompt_embeds: the prompt embedding to use for the generation
      instead of the prompt.
    - stream: whether to stream the results or not.
    - other fields: the sampling parameters (See `SamplingParams` for details).
    """
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
```
This throws an error when only prompt_embeds are passed.
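A small adjustment would avoid that; a sketch, assuming the intent is that either field may be supplied:

```python
# Hedged sketch: make "prompt" optional so a request carrying only
# "prompt_embeds" does not raise a KeyError, while still rejecting
# requests that provide neither field.
prompt = request_dict.pop("prompt", None)
prompt_embeds = request_dict.pop("prompt_embeds", None)
if prompt is None and prompt_embeds is None:
    raise ValueError("Either 'prompt' or 'prompt_embeds' must be provided.")
```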
Thanks a lot for doing this. I tested it and ran into some issues, but fixed them locally.
Also, for some reason torch.cuda.is_available() returned False when serializing, so I had to set CUDA_VISIBLE_DEVICES in the Ray init.
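For reference, one way to forward that variable to Ray workers is through a runtime environment; this is a sketch of the workaround described above, and the device list is just a placeholder:

```python
# Hedged sketch: pass CUDA_VISIBLE_DEVICES to Ray workers via runtime_env
# so that torch.cuda.is_available() sees the GPUs inside the workers.
import ray

ray.init(runtime_env={"env_vars": {"CUDA_VISIBLE_DEVICES": "0,1"}})
```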
@zhuohan123 Do you have plans for this? It would be really helpful to me for this PR to get merged. I can help push it through if you need.
We are doing this in the PR for LLaVA support: #3042. Please take a look and let us know if you have any suggestions!
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
This PR implements the feature of generating text from embedding input (popularly known as inputs_embeds).
This is related to #369 and #416.
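For illustration, here is a minimal client-side sketch of how the feature might be exercised against the modified api_server; the model name, server URL, and the max_tokens field are assumptions for the example, not part of this PR:

```python
# Hedged sketch: build input embeddings from a Hugging Face model's embedding
# table and send them to the /generate endpoint instead of a text prompt.
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Hello, my name is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# (seq_len, hidden_size) embeddings for the prompt tokens.
with torch.no_grad():
    prompt_embeds = model.get_input_embeddings()(input_ids)[0]

response = requests.post(
    "http://localhost:8000/generate",  # placeholder server address
    json={
        "prompt_embeds": prompt_embeds.tolist(),
        "max_tokens": 32,
    },
)
print(response.json())
```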
More to do