[Feature]: Support Phi4Flash model in V1 #23996
Conversation
Force-pushed from 6dc9863 to e1b5279.
Thanks for picking this up!
Can we try to minimize the amount of whitespace and unnecessary diff changes to facilitate the review process?
There seems to be a fundamental issue with how the conv state and ssm state are managed. Please take a look at how it works for mamba1 or mamba2.
I was also expecting that we would need to port the differential attention backend to V1. Currently it is only implemented for V0.
Same here - the attention metadata should not contain the ssm state (in fact, looking at the definition of the class below, it does not). I don't understand how this code works at all?!
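To make the ownership question concrete, here is a minimal, self-contained sketch (the class names are hypothetical, not vLLM's real API): the conv/ssm tensors live in an engine-managed cache that persists across steps, while the per-step metadata carries only the slot indices needed to address that cache.

```python
# Hypothetical sketch only -- MambaStateCache and SSMMetadata are illustrative
# names, NOT vLLM classes. The point is the ownership split: persistent state
# lives in an engine-managed cache; metadata carries only per-step indices.
from dataclasses import dataclass

import torch


@dataclass
class SSMMetadata:
    # Per-step scheduling info only: which cache slot each request uses.
    state_indices: torch.Tensor  # shape [num_seqs], dtype long


class MambaStateCache:
    """Engine-owned and persistent across steps; never stored in metadata."""

    def __init__(self, num_slots: int, conv_dim: int, conv_width: int,
                 ssm_heads: int, ssm_state_size: int, device: str = "cpu"):
        self.conv_state = torch.zeros(num_slots, conv_dim, conv_width - 1,
                                      device=device)
        self.ssm_state = torch.zeros(num_slots, ssm_heads, ssm_state_size,
                                     device=device)

    def gather(self, metadata: SSMMetadata):
        # The mixer reads/writes state through the slot indices from metadata.
        idx = metadata.state_indices
        return self.conv_state[idx], self.ssm_state[idx]
```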
I have now tried to use mamba_mixer.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from e1b5279 to f5acf2c.
attn_output = self.attn(q,
                        k,
                        v,
                        kv_cache=kv_cache,
We don't need to pass kv_cache and attn_metadata to the attention kernel now.
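If the V1 Attention layer here behaves like the other V1 model integrations (an assumption on my part, not verified against this branch), the call site can drop those arguments entirely:

```python
# Sketch: in V1 the Attention layer is expected to pull its KV cache and
# attention metadata from the forward context, so only q/k/v are passed.
attn_output = self.attn(q, k, v)
```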
This pull request has merge conflicts that must be resolved before it can be merged.
attn_metadata: AttentionMetadata,
):

if not self.yoco_cross:  # need to generate kv-cache
Can you remove the KV-sharing-related if-else like this? The V1 engine can handle KV sharing automatically.
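As a rough sketch of what this could look like (assuming the Attention layer's kv_sharing_target_layer_name argument is the intended mechanism; both the argument and the layer path below are assumptions, not verified against this branch), a cross-decoder layer would declare at construction time which layer's KV cache it reuses instead of branching on self.yoco_cross inside forward():

```python
# Hypothetical sketch, not verified code from this branch: the cross-decoder
# layer points at the self-decoder layer that owns the KV cache, and the
# engine handles the sharing; no if/else is needed in the forward pass.
self.attn = Attention(
    self.num_heads,
    self.head_dim,
    self.scale,
    num_kv_heads=self.num_kv_heads,
    cache_config=cache_config,
    prefix=f"{prefix}.attn",
    # Assumed owner-layer name for illustration only.
    kv_sharing_target_layer_name="model.layers.0.self_attn.attn",
)
```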
Hi @aditchawdhary, are you still actively working on this? With PR #25400, there is now no way to run phi4flash in vLLM any more.
I tried to run msft/phi4 mini flash reasoning with V0 from the branch it was originally merged on, and I could not get it to run. Do you have any insight on how to get this to work?
Hi @aditchawdhary, thanks a lot for your contribution on the V1 integration! We have reverted the Hugging Face repo here and can now successfully serve with the following command. Could you please try it and see if this works out for you?
Purpose
Adds support for the Microsoft Phi4Flash model in vLLM's V1 engine architecture.
Addressing #23957
Test Plan
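The description leaves this section empty. As a hedged sketch of the kind of offline smoke test that could back the numbers below (the model name, prompt set, sampling parameters, and engine flags are all assumptions, not taken from this PR), something like the following could be used:

```python
# Minimal smoke test / throughput sketch, not the author's actual test plan.
# Model name and engine flags are assumptions for illustration.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",  # assumed checkpoint
    trust_remote_code=True,
    tensor_parallel_size=4,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
)
prompts = ["Explain chunked prefill in one paragraph."] * 32
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Sanity-check output and report rough decode throughput.
print(outputs[0].outputs[0].text[:200])
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tok/s")
```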
Test Result
V1-Specific Features
- Chunked Prefill: Working (232-405 tok/s with tensor parallelism)
- Prefix Caching: Working (with memory management)
- Combined Features: Working (chunked prefill + prefix caching)
- Tensor Parallelism: Working across 4 GPUs
Performance Results
Configuration: A40 GPU, tensor_parallel_size=4, V1 engine enabled
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.