Conversation

@aditchawdhary (Contributor) commented Aug 31, 2025

Purpose

Adds support for Microsoft Phi4Flash model in vLLM's V1 engine architecture.
Addressing #23957

Test Plan

export VLLM_USE_V1=1

# Test the actual Phi4Flash model that the PR targets
python3 << 'EOF'
from vllm import LLM, SamplingParams

print("Testing microsoft/Phi-4-mini-flash-reasoning with V1 engine...")

llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",
    trust_remote_code=True,
    max_model_len=1024,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.25
)

outputs = llm.generate(["What is AI?"], SamplingParams(max_tokens=50))
print("Result:", outputs[0].outputs[0].text)
print("SUCCESS: Phi-4-mini-flash-reasoning works with V1 engine!")
EOF

Test Results

V1-Specific Features

  • Chunked Prefill: Working (232-405 tok/s with tensor parallelism)
  • Prefix Caching: Working (with memory management)
  • Combined Features: Working (chunked prefill + prefix caching; a configuration sketch follows the performance numbers below)
  • Tensor Parallelism: Working across 4 GPUs

Performance Results
Configuration: 4× A40 GPUs, tensor_parallel_size=4, V1 engine enabled

  • Chunked Prefill: 405.5 tok/s
  • Standard Config: 118-230 tok/s
  • Load Time: ~45s (includes torch.compile optimization)
  • Memory Usage: ~1.82 GiB per GPU (7.3 GiB total distributed)
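
A minimal sketch of how the combined-features configuration above could be reproduced, assuming the same environment as the test plan (enable_chunked_prefill and enable_prefix_caching are standard vLLM engine arguments; the prompts and token counts are illustrative, not the exact benchmark inputs):

from vllm import LLM, SamplingParams

# V1 engine with chunked prefill and prefix caching enabled explicitly.
llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",
    trust_remote_code=True,
    max_model_len=1024,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.25,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
)

# A shared prefix across prompts lets the second request hit the prefix cache.
shared_prefix = "You are a concise assistant. "
prompts = [shared_prefix + "What is AI?", shared_prefix + "What is machine learning?"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=50))
for out in outputs:
    print(out.outputs[0].text)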

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify bot added the v1 label Aug 31, 2025
@aditchawdhary changed the title from "[Feature]: Support Phi4Flash model in V1 #23957" to "[Feature]: Support Phi4Flash model in V1" Sep 2, 2025
@aditchawdhary changed the title from "[Feature]: Support Phi4Flash model in V1" to "[Feature]: Support Phi4FlashReasoning model in V1" Sep 2, 2025
@aditchawdhary changed the title from "[Feature]: Support Phi4FlashReasoning model in V1" to "[Feature]: Support Phi4Flash model in V1" Sep 2, 2025
@aditchawdhary marked this pull request as ready for review September 2, 2025 01:01
@tdoublep (Member) left a comment:

Thanks for picking this up!

Can we try to minimize the amount of whitespace and unnecessary diff to facilitate the review process?

There seems to be a fundamental issue with how the conv state and ssm state are managed. Please take a look at how it works for mamba1 or mamba2.

I was also expecting that we would need to port the differential attention backend to V1. Currently it is only implemented for V0.

A Member left a comment:

Same here - the attention metadata should not contain the ssm state (in fact, looking at the definition of the class below, it does not). I don't understand how this code works at all?!

@aditchawdhary (Contributor, Author) replied:

I have tried to use mamba_mixer now

@mergify bot commented Sep 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aditchawdhary.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 3, 2025
@mergify bot removed the needs-rebase label Sep 3, 2025
attn_output = self.attn(q,
                        k,
                        v,
                        kv_cache=kv_cache,
A Collaborator left a comment:

We don't need to pass kv_cache and attn_metadata to the attention kernel now.
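
For reference, a minimal sketch of what the simplified call could look like on the V1 path (this is not the PR's actual diff; it assumes self.attn is a vllm.attention.Attention module):

attn_output = self.attn(q, k, v)  # KV cache and attention metadata are picked up from the forward context in V1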

@mergify bot commented Sep 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aditchawdhary.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 3, 2025
        attn_metadata: AttentionMetadata,
    ):

        if not self.yoco_cross:  # need to generate kv-cache
A Collaborator left a comment:

Can you remove the KV-sharing-related if-else like this? The V1 engine can handle KV sharing automatically.
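
A hedged sketch of how the cross-layer KV sharing could instead be declared at construction time, assuming vLLM's Attention constructor accepts a kv_sharing_target_layer_name argument; the layer names and indices below are illustrative, not the PR's code:

from vllm.attention import Attention

# Cross-decoder layers reuse the KV cache written by a self-decoder layer, so
# no yoco_cross branch is needed in forward(); the V1 engine handles the
# sharing once the target layer is declared here.
self.attn = Attention(
    num_heads,
    head_size,
    scale,
    cache_config=cache_config,
    kv_sharing_target_layer_name=f"model.layers.{target_layer_idx}.self_attn.attn",  # hypothetical target
    prefix=f"{prefix}.attn",
)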

@congcongchen123 (Contributor) commented:

Hi @aditchawdhary, are you still actively working on this? With PR #25400, there is now no way to run Phi4Flash in vLLM any more.

@aditchawdhary (Contributor, Author) replied:

> Hi @aditchawdhary, are you still actively working on this? With PR #25400, there is now no way to run Phi4Flash in vLLM any more.

I tried to run microsoft/Phi-4-mini-flash-reasoning with V0 from the branch it was originally merged on, and I could not get it to run. Do you have any insight on how to get this working?

@renll commented Sep 28, 2025:

Hi @aditchawdhary, thanks a lot for your contribution on the V1 integration! We have reverted the huggingface repo here and can now successfully serve the model with the following command:

VLLM_ATTENTION_BACKEND="DIFFERENTIAL_FLASH_ATTN" vllm serve microsoft/Phi-4-mini-flash-reasoning \
    --host 127.0.0.1 --port 26500 \
    --trust-remote-code \
    --served-model-name BENCHMARK_MODEL_NAME \
    --tensor-parallel-size 1 \
    --max-model-len 104208 \
    --max-seq-len-to-capture 104208 \
    --no-enable-chunked-prefill \
    --no-enable-prefix-caching

Could you please try and see if this works out for you?
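
A quick way to sanity-check the served endpoint, assuming the openai Python client and the host, port, and served model name from the command above:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API under /v1 on the serve host/port.
client = OpenAI(base_url="http://127.0.0.1:26500/v1", api_key="EMPTY")
resp = client.completions.create(
    model="BENCHMARK_MODEL_NAME",
    prompt="What is AI?",
    max_tokens=50,
)
print(resp.choices[0].text)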
