[Core] Support serving encoder/decoder models #7258
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
Some explanations
    logit_bias: Dict[int, float],
    token_ids: List[int],
    logits: torch.Tensor,
) -> torch.Tensor:
This error appeared after updating mypy, so I had to fix it in this PR.
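For context, the snippet above is the parameter list of a logit-bias logits processor. A hedged, self-contained sketch of what such a callable does (simplified, not the actual implementation touched by this PR) could look like:

import torch
from typing import Dict, List

def apply_logit_bias(
    logit_bias: Dict[int, float],
    token_ids: List[int],  # previously generated token IDs; unused in this sketch
    logits: torch.Tensor,
) -> torch.Tensor:
    # Add each per-token bias to the corresponding entry of the logits vector.
    for token_id, bias in logit_bias.items():
        logits[token_id] += bias
    return logits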
) -> List[Tuple[PromptInputs, PromptInputs]]:
    return [(enc_dec_prompt['encoder_prompt'],
             enc_dec_prompt['decoder_prompt'])
            for enc_dec_prompt in enc_dec_prompts]
Since vllm.inputs could import utils, but these functions require vllm.inputs, I have moved them to vllm.inputs.parse to avoid circular imports.
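To illustrate the dependency direction being preserved, a hedged sketch (module contents are simplified placeholders, not the real vLLM code):

# vllm/utils.py          <- generic helpers, no dependency on vllm.inputs
# vllm/inputs/data.py    <- prompt type definitions, which may import from vllm.utils
# vllm/inputs/parse.py   <- parsing helpers that need the types defined above
#
# Keeping the parsing helpers in vllm.utils would create the cycle
#     vllm.utils -> vllm.inputs -> vllm.utils
# whereas vllm.inputs.parse only imports "downward":
#     vllm.inputs.parse -> vllm.inputs.data -> vllm.utils

# After this PR, the parsing helpers are imported explicitly, for example:
from vllm.inputs.parse import parse_and_batch_prompt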
raise ValueError("prompt must be a string, array of strings, " | ||
"array of tokens, or array of token arrays") | ||
|
||
|
These have been moved to vllm.inputs.parse.
raise ValueError(f"Invalid prompt {prompt}") | ||
|
||
|
This is no longer required as I have moved the parsing logic inside the engine class itself.
AsyncLLMEngine
Hi @DarkLight1337 these improvements are great. I just had a few questions/suggestions. Overall LGTM.
vllm/engine/llm_engine.py (Outdated)
decoder_prompt, decoder_prompt_ids, decoder_mm_data = decoder_comps

if encoder_mm_data is not None or decoder_mm_data is not None:
    raise ValueError("Multi-modal data is not supported for "
Nit: currently the encoder/decoder infrastructure doesn't support multi-modal models (fixing this is a near-term goal). Perhaps "Multi-modal inputs are not supported for encoder-decoder models"?
Technically the existing multi-modal models use a vision encoder, so I wanted to make it clear that "encoder" here refers to language encoder.
The point I am about to make is not critical, so don't block merging on addressing this, but just for context: the limitation is really that encoder/decoder models with cross-attention do not currently support multi-modal inputs.
So for example, a model like Llava has an encoder but not cross-attention, and obviously vLLM supports Llava (and did even before encoder/decoder model support was introduced.)
However, a vision model setup like the one shown in the diagram (taken from a blog post) requires cross-attention between ViT and RoBERTa in order to decode a description of the image. Even though this is a vision model, it would not currently be supported by vLLM, because it requires both multi-modal support and encoder/decoder cross-attention, which is a currently-unsupported combination (for example, EncoderDecoderModelRunner has an assert which checks that multimodality is not enabled).
Of course, this inability to combine multi-modal and encoder/decoder support will need to be addressed by the time we add the Whisper model to vLLM, which is a near-term goal.
Thanks for providing more context, I see what you mean by encoder-decoder in multi-modal context now. Yeah, we will need to address this soon.
I'll update the error to say "multi-modal" encoder-decoder models are not supported yet.
Just saw this comment - yes, like @afeldman-nm said, this is something we should start thinking about, since Llama 3.1 multi-modal has exactly the same inference pattern as what Andrew described here.
@ywang96 FYI I just released an RFC giving an overview of next steps for vLLM's encoder/decoder support, and referenced this discussion in the section on multimodality support: #7366 (comment). Feel free to leave any feedback that you have.
completion = await client.completions.create(
    model=model_name,
    prompt=[0, 0, 0, 0, 0],
    max_tokens=5,
    temperature=0.0,
)
It is great that we support this now. Does it make sense maybe to have two different test prompts, one being a singleton encoder prompt, the other being an explicit encoder/decoder prompt? This would help confirm that the prompt processing pipeline works as it should for the async scenario (not that I can think of a reason it wouldn't.)
An example prompt could be:
{
    'encoder_prompt': 'The rain in spain',
    'decoder_prompt': [0, 0, 0, 0, 0],
}
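As a reference for such a test, here is a hedged sketch of passing an explicit encoder/decoder prompt to the engine directly, using the offline LLM API for brevity; the model name and the exact prompt schema accepted after this PR are assumptions:

from vllm import LLM, SamplingParams

# Assumed example encoder/decoder model; any supported enc/dec model would do.
llm = LLM(model="facebook/bart-base")

outputs = llm.generate(
    {
        "encoder_prompt": "The rain in spain",
        # Token IDs are assumed to be wrapped in a TokensPrompt-style dict.
        "decoder_prompt": {"prompt_token_ids": [0, 0, 0, 0, 0]},
    },
    SamplingParams(temperature=0.0, max_tokens=5),
)

for output in outputs:
    print(output.outputs[0].text)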
I don't think this input format is specified in the OpenAI API, so users shouldn't run into problems unless they use the async engine directly. Since this PR is pretty big already, let's leave this for future work.
Okay then LGTM.
@@ -334,17 +438,19 @@ async def add_request_async(
        trace_headers: Optional[Mapping[str, str]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> None:
        """Async version of :meth:`add_request`."""
This should not change in this PR
Sorry, I don't quite get what you mean by this. Could you elaborate?
I don't think this function needs changes in this PR - just a nit
I see - this is just to keep the order of arguments consistent with the new ordering of parameters in process_model_inputs_async (which has been updated alongside process_model_inputs).
I think this looks good. Separately @DarkLight1337 (for follow up) perhaps we should create a new class called
I think that it would be best to keep the same class structure as the base
    request_id=request_id,
)

encoder_comps, decoder_comps = await asyncio.gather(
Does the order matter here?
Yes, the results come back in the same order as the awaitables passed in - that's just how asyncio.gather works.
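A small self-contained demonstration of this ordering guarantee (not vLLM code, just plain asyncio):

import asyncio

async def slow(tag: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return tag

async def main() -> None:
    # "encoder" finishes last, but it is still the first element of the result.
    encoder_out, decoder_out = await asyncio.gather(
        slow("encoder", 0.2),
        slow("decoder", 0.1),
    )
    print(encoder_out, decoder_out)  # -> encoder decoder

asyncio.run(main())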
This PR cleans up the parsing logic introduced in #4942 to pass mypy, and updates the async engine to follow the same code structure, enabling encoder/decoder models to be served using the OpenAI-compatible server. I have added a basic test accordingly.

Other related changes:
- The parsing functions have been moved from vllm.inputs.data to a new module vllm.inputs.parse. To declutter import statements, these functions now have to be imported from vllm.inputs.parse explicitly rather than from vllm.inputs.
- Failures in test_bart.py due to mismatched text have been fixed by padding the vLLM output string with BOS/EOS tokens.
- Added the is_list_of helper to vllm.utils (ported from #7126). This avoids the need for type casts in parse_and_batch_prompt.
- Replaced TypeGuard with the newly-introduced TypeIs construct. The minimum versions of mypy and typing_extensions have been bumped accordingly.

cc @afeldman-nm
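For reference, a minimal sketch of what an is_list_of-style helper using TypeIs can look like; the actual helper in vllm.utils may differ in signature and behavior:

from typing import Any, List, Type, TypeVar, Union

from typing_extensions import TypeIs

T = TypeVar("T")

def is_list_of(value: Any, typ: Type[T]) -> TypeIs[List[T]]:
    # True only if value is a list whose elements are all instances of typ.
    return isinstance(value, list) and all(isinstance(v, typ) for v in value)

def handle(prompt: Union[List[int], List[str]]) -> None:
    if is_list_of(prompt, int):
        # mypy narrows prompt to List[int] here...
        print(sum(prompt))
    else:
        # ...and, since TypeIs (unlike TypeGuard) also narrows the negative
        # branch, to List[str] here.
        print(" ".join(prompt))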