
[Core] generate from input embeds #6869

Open
Nan2018 wants to merge 126 commits into main from feature-input-embeds

Conversation


@Nan2018 Nan2018 commented Jul 27, 2024

adds support for passing prompt_embeds to LLM.generate as

llm.generate({"prompt_embeds": input_embeds}, sampling_params)

or

llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)

This enables use cases where only the embedding layer is fine-tuned, so the same model backend can serve multiple custom-tuned embedding layers.

FIX #416
FIX #8323

Inspired by #1265, which is now very outdated.
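As an illustration of the fine-tuned-embedding use case, here is a minimal end-to-end sketch. The model name, the tokenization, and the way the tuned embedding weights would be loaded are placeholders, not part of this PR; only the {"prompt_embeds": ...} input format comes from the PR itself.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

# Placeholder model; any decoder-only model supported by vLLM would do.
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)

# Suppose only this embedding layer was fine-tuned; swap in the tuned
# weights here while the rest of the model stays unchanged.
embedding_layer = hf_model.get_input_embeddings()

prompts = ["Hello, my name is", "The capital of France is"]
with torch.no_grad():
    inputs_embeds = [
        embedding_layer(torch.tensor(tokenizer(p)["input_ids"]))
        for p in prompts
    ]

llm = LLM(model=model_name)
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(
    [{"prompt_embeds": e} for e in inputs_embeds], sampling_params
)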


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@Nan2018
Author

Nan2018 commented Aug 8, 2024

@WoosukKwon @ywang96 @robertgshaw2-neuralmagic

the failed tests with
ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
seem unrelated to my changes, and I can't reproduce them locally.

other than that, this is ready for review

@ywang96
Member

ywang96 commented Aug 8, 2024

@WoosukKwon @ywang96 @robertgshaw2-neuralmagic

the failed tests with ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating seem unrelated to my changes, and I can't reproduce them locally.

other than that, this is ready for review

This is due to a recent change in transformers that deprecated the default chat template; it should have been fixed by #7238. Can you merge your branch with main again?

@Nan2018 Nan2018 force-pushed the feature-input-embeds branch from 3f79009 to dfd9301 on September 4, 2024 at 22:27
@DarkLight1337
Member

if we want to support this, we might need to factor out something like self.process_input, just like we factor out self.sample(). The model runner needs to call process_input directly, and the model's forward always sees hidden_states as input.

Is the idea to call self.process_input outside of torch.compile so that the rest of the operations can still benefit from torch.compile?

@youkaichao
Member

Is the idea to call self.process_input outside of torch.compile so that the rest of the operations can still benefit from torch.compile?

yes, correct.
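For illustration, a minimal sketch of that split with a toy module (process_input, the layer sizes, and the class itself are made up here, not vLLM's actual interfaces): the embedding lookup or embeds pass-through runs eagerly in the model runner, and the compiled forward only ever sees hidden_states.

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # Illustrative only; not the vLLM model interface.
    def __init__(self, vocab_size: int = 128, hidden_size: int = 64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def process_input(self, input_ids, inputs_embeds=None):
        # Runs eagerly (outside torch.compile): choose between token-id
        # lookup and caller-provided embeddings per batch.
        if inputs_embeds is not None:
            return inputs_embeds
        return self.embed_tokens(input_ids)

    def forward(self, hidden_states):
        # Compiled region: the graph only ever consumes hidden_states,
        # so it does not depend on which input form a request used.
        return self.layers(hidden_states)

model = ToyModel()
compiled_forward = torch.compile(model.forward)

input_ids = torch.randint(0, 128, (4, 16))
hidden_states = model.process_input(input_ids)            # eager
out = compiled_forward(hidden_states)                      # compiled

custom_embeds = torch.randn(4, 16, 64)
out = compiled_forward(model.process_input(None, custom_embeds))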

@DarkLight1337
Member

I have moved the .all().item() call outside of get_input_embeds, see if it can support torch.compile now.

assert input_ids is not None, msg

hidden_states = embeddings_module(input_ids)
hidden_states[inputs_embeds_masks] = inputs_embeds

Member

This line is not compatible with compilation: the shape of inputs_embeds does not match that of hidden_states.

Member

So basically, in-place operations aren't allowed?

Member

In-place is fine, but we need all operations to have the same batch size.
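To make that concrete, here is a small sketch (tensor names and shapes are made up) contrasting the shape-varying in-place scatter above with a shape-stable merge in which every operation keeps the full batch size:

import torch

num_tokens, hidden_size = 8, 16
hidden_states = torch.randn(num_tokens, hidden_size)   # from the embedding lookup
mask = torch.zeros(num_tokens, dtype=torch.bool)
mask[:3] = True                                         # tokens that carry embeds

# Shape-varying version (hard to compile): the right-hand side has
# mask.sum() rows, which differs from batch to batch.
#   hidden_states[mask] = inputs_embeds_subset

# Shape-stable version: inputs_embeds is padded to the full batch, so every
# tensor keeps the (num_tokens, hidden_size) shape across batches.
inputs_embeds = torch.randn(num_tokens, hidden_size)
hidden_states = torch.where(mask.unsqueeze(-1), inputs_embeds, hidden_states)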

@youkaichao
Member

the requirement for compilation is:

During the lifespan of the model, the computation graph must not change. If hidden_states comes from input_ids, it needs to always come from input_ids; it cannot come from input_ids for some requests while coming from input_embeds for others.

See #9946 for an example of how to make a vision-language model compatible with compilation.

In vision-language models, this part of the logic is not in the compiled region; we only compile the language tower.

@lzl-mt

lzl-mt commented Nov 7, 2024

Excited to use this feature!

@DarkLight1337
Member

DarkLight1337 commented Nov 7, 2024

the requirement for compilation is:

During the lifespan of the model, the computation graph must not change. If hidden_states comes from input_ids, it needs to always come from input_ids; it cannot come from input_ids for some requests while coming from input_embeds for others.

See #9946 for an example of how to make a vision-language model compatible with compilation.

In vision-language models, this part of the logic is not in the compiled region; we only compile the language tower.

Hmm, I see. I'm quite against changing the semantics of forward to only work on embedded inputs though since that would cause some confusion for developers who are used to working with HF models. Could we instead selectively torch.compile individual methods? (Not sure why the current decorator is hardcoded to compile forward method)

@OswaldoBornemann

@DarkLight1337 It seems that this feature is almost ready to be used?

@DarkLight1337
Member

@DarkLight1337 It seems that this feature is almost ready to be used?

It is incompatible with the recent changes to vLLM internals. I'm waiting for the dust to settle for V2 engine before picking this back up.

@lzl-mt

lzl-mt commented Nov 12, 2024

@DarkLight1337 It seems that this feature is almost ready to be used?

It is incompatible with the recent changes to vLLM internals. I'm waiting for the dust to settle for V2 engine before picking this back up.

Can I run it directly using your branch?

@DarkLight1337
Member

Can I run it directly using your branch?

No, the current branch is broken because I have already merged in those changes from main.

@DarkLight1337
Member

I think the latest working commit is 49fe3f7

@lzl-mt

lzl-mt commented Nov 12, 2024

Can I run it directly using your branch?

No, the current branch is broken because I have already merged in those changes from main.

Umm.. Is there a working commit I can reset back to?

@Nan2018
Author

Nan2018 commented Nov 14, 2024

Hmm, I see. I'm quite against changing the semantics of forward to only work on embedded inputs though since that would cause some confusion for developers who are used to working with HF models. Could we instead selectively torch.compile individual methods? (Not sure why the current decorator is hardcoded to compile forward method)

Perhaps an option is to move the embedding logic to ModelForCausalLM.forward and compile the Model classes instead of the ModelForCausalLM classes?
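For illustration, a rough sketch of that layering with toy classes (ToyTransformer and ToyForCausalLM are made-up names, not vLLM classes): the outer ForCausalLM wrapper resolves token ids versus embeddings eagerly, and only the inner Model, which always receives inputs_embeds, is compiled, so its graph never changes shape of logic between requests.

import torch
import torch.nn as nn

class ToyTransformer(nn.Module):
    # Inner "Model" class: always consumes inputs_embeds, so its graph is
    # stable and safe to compile.
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU())

    def forward(self, inputs_embeds):
        return self.layers(inputs_embeds)

class ToyForCausalLM(nn.Module):
    # Outer "ForCausalLM" class: does the token-id / embeds branching eagerly.
    def __init__(self, vocab_size: int = 128, hidden_size: int = 64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.model = ToyTransformer(hidden_size)

    def forward(self, input_ids=None, inputs_embeds=None):
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        return self.model(inputs_embeds)

lm = ToyForCausalLM()
lm.model = torch.compile(lm.model)   # compile only the inner Model class

hidden = lm(input_ids=torch.randint(0, 128, (2, 8)))
hidden = lm(inputs_embeds=torch.randn(2, 8, 64))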

@toilaluan

@DarkLight1337 Do you guys have an ETA for this feature?

@DarkLight1337
Member

@DarkLight1337 Do you guys have an ETA for this feature?

No ETA yet; we'll see once the V2 re-arch is done.

@DarkLight1337
Member

I think this PR will eventually be superseded by the follow-up work to #10374, which should make it easy to support embedding inputs.

@DaoD

DaoD commented Nov 25, 2024

Hi there, is there any update on this PR?

@DarkLight1337
Member

Hi there, is there any update on this PR?

Please read the above comment.


mergify bot commented Dec 11, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Nan2018.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 11, 2024
@CandiedCode

@DarkLight1337 Since #10374 has landed, is this PR going to be updated so that this work can be finished, or is there a new PR you can reference for tracking?

@DarkLight1337
Member

DarkLight1337 commented Dec 19, 2024

@DarkLight1337 Since #10374 has landed, is this PR going to be updated so that this work can be finished, or is there a new PR you can reference for tracking?

Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

@CandiedCode

CandiedCode commented Dec 19, 2024

Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

Thanks for the reference, @DarkLight1337. Is it also possible to add this to the roadmap for additional visibility?

@ywang96
Member

ywang96 commented Dec 19, 2024

Thanks for your interest. At the moment, V1 isn't stable enough to implement embedding inputs yet. I would point you to #8779 to check the progress, but that RFC is somewhat outdated. Instead, you can search for recent PRs with the [V1] tag.

Thanks for the reference, @DarkLight1337. Is it also possible to add this to the roadmap for additional visibility?

Hey @CandiedCode! Thanks for following up on this.

IMO, supporting embeddings as input is not technically difficult in itself, but we do want to be careful with the design so it works with all the other features we want to natively support in vLLM, especially now that we're going through the re-architecture. I have briefly discussed it with @WoosukKwon in #11032 (comment).

In particular, some issues we still need to design and address:

  • What happens if a batch contains both token-id inputs and embedding inputs?
  • Prefix caching (we currently use token ids as the hash key; see the sketch after this list)
  • Spec decode (we assume draft models output token ids that are then accepted by the main model)
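To illustrate the prefix-caching point, a toy sketch (not vLLM's actual block-hashing code): token-id blocks give exact, cheap hash keys, whereas embedding inputs would need something like a content hash over raw tensor bytes, where any numerical difference defeats reuse.

import hashlib
import torch

def hash_token_block(prev_hash, token_ids):
    # Token ids are exact integers: identical prefixes always map to the
    # same key, which is what makes prefix caching cheap today.
    return hash((prev_hash, tuple(token_ids)))

def hash_embed_block(prev_hash, embeds):
    # With embedding inputs there are no token ids to key on; one
    # conceivable fallback is hashing the raw bytes, but any numerical
    # difference (dtype, device, nondeterminism) changes the key.
    digest = hashlib.sha256(embeds.cpu().numpy().tobytes()).hexdigest()
    return hash((prev_hash, digest))

block = torch.randn(4, 8)
print(hash_token_block(0, [1, 2, 3, 4]) == hash_token_block(0, [1, 2, 3, 4]))  # True
print(hash_embed_block(0, block) == hash_embed_block(0, block.clone()))        # True
print(hash_embed_block(0, block) == hash_embed_block(0, block * (1 + 1e-6)))   # False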
