Conversation

@aditchawdhary (Contributor) commented Aug 31, 2025

Purpose

Adds support for Microsoft Phi4Flash model in vLLM's V1 engine architecture.
Addressing #23957

Test Plan

export VLLM_USE_V1=1

# Test the actual Phi4Flash model that the PR targets
python3 << 'EOF'
from vllm import LLM, SamplingParams

print("Testing microsoft/Phi-4-mini-flash-reasoning with V1 engine...")

llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",
    trust_remote_code=True,
    max_model_len=1024,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.25
)

outputs = llm.generate(["What is AI?"], SamplingParams(max_tokens=50))
print("Result:", outputs[0].outputs[0].text)
print("SUCCESS: Phi-4-mini-flash-reasoning works with V1 engine!")
EOF

Test Results

V1-Specific Features

  • Chunked Prefill: Working (232-405 tok/s with tensor parallelism)
  • Prefix Caching: Working (with memory management)
  • Combined Features: Working (chunked prefill + prefix caching; a configuration sketch follows the performance numbers below)
  • Tensor Parallelism: Working across 4 GPUs

Performance Results
Configuration: 4× A40 GPUs, tensor_parallel_size=4, V1 engine enabled

  • Chunked Prefill: 405.5 tok/s
  • Standard Config: 118-230 tok/s
  • Load Time: ~45s (includes torch.compile optimization)
  • Memory Usage: ~1.82 GiB per GPU (7.3 GiB total distributed)
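
A minimal sketch of how the combined-features configuration above could be reproduced, assuming the same environment as the test plan (enable_chunked_prefill and enable_prefix_caching are standard vLLM engine arguments; the prompts and token counts are illustrative, not the exact benchmark inputs):

from vllm import LLM, SamplingParams

# V1 engine with chunked prefill and prefix caching enabled explicitly.
llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",
    trust_remote_code=True,
    max_model_len=1024,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.25,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
)

# A shared prefix across prompts lets the second request hit the prefix cache.
shared_prefix = "You are a concise assistant. "
prompts = [shared_prefix + "What is AI?", shared_prefix + "What is machine learning?"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=50))
for out in outputs:
    print(out.outputs[0].text)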

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify bot added the v1 label Aug 31, 2025
@aditchawdhary changed the title from "[Feature]: Support Phi4Flash model in V1 #23957" to "[Feature]: Support Phi4Flash model in V1" Sep 2, 2025
@aditchawdhary changed the title from "[Feature]: Support Phi4Flash model in V1" to "[Feature]: Support Phi4FlashReasoning model in V1" Sep 2, 2025
@aditchawdhary changed the title from "[Feature]: Support Phi4FlashReasoning model in V1" to "[Feature]: Support Phi4Flash model in V1" Sep 2, 2025
@aditchawdhary marked this pull request as ready for review September 2, 2025 01:01
@tdoublep (Member) left a comment:

Thanks for picking this up!

Can we try to minimize the amount of whitespace and unnecessary diff to facilitate the review process?

There seems to be a fundamental issue with how the conv state and ssm state are managed. Please take a look at how it works for mamba1 or mamba2.

I was also expecting that we would need to port the differential attention backend to V1. Currently it is only implemented for V0.

A Member left a comment:

Same here - the attention metadata should not contain the ssm state (in fact, looking at the definition of the class below, it does not). I don't understand how this code works at all?!

@aditchawdhary (Contributor, Author) replied:

I have tried to use mamba_mixer now

@mergify bot commented Sep 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aditchawdhary.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 3, 2025
@mergify bot removed the needs-rebase label Sep 3, 2025
attn_output = self.attn(q,
                        k,
                        v,
                        kv_cache=kv_cache,
A Collaborator left a comment:

We don't need to pass kv_cache and attn_metadata to the attention kernel now.
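
For reference, a minimal sketch of what the simplified call could look like on the V1 path (this is not the PR's actual diff; it assumes self.attn is a vllm.attention.Attention module):

attn_output = self.attn(q, k, v)  # KV cache and attention metadata are picked up from the forward context in V1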

@mergify bot commented Sep 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aditchawdhary.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 3, 2025
        attn_metadata: AttentionMetadata,
    ):

        if not self.yoco_cross:  # need to generate kv-cache
A Collaborator left a comment:

Can you remove the KV-sharing-related if-else like this? The V1 engine can handle KV sharing automatically.
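
A hedged sketch of how the cross-layer KV sharing could instead be declared at construction time, assuming vLLM's Attention constructor accepts a kv_sharing_target_layer_name argument; the layer names and indices below are illustrative, not the PR's code:

from vllm.attention import Attention

# Cross-decoder layers reuse the KV cache written by a self-decoder layer, so
# no yoco_cross branch is needed in forward(); the V1 engine handles the
# sharing once the target layer is declared here.
self.attn = Attention(
    num_heads,
    head_size,
    scale,
    cache_config=cache_config,
    kv_sharing_target_layer_name=f"model.layers.{target_layer_idx}.self_attn.attn",  # hypothetical target
    prefix=f"{prefix}.attn",
)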

@congcongchen123 (Contributor) commented:

Hi @aditchawdhary, are you still actively working on this? With PR #25400, there is now no way to run Phi4Flash in vLLM any more.

@aditchawdhary (Contributor, Author) replied:

> Hi @aditchawdhary, are you still actively working on this? With PR #25400, there is now no way to run Phi4Flash in vLLM any more.

I tried to run microsoft/Phi-4-mini-flash-reasoning with V0 from the branch it was originally merged on, and I could not get it to run. Do you have any insight on how to get this working?

@renll commented Sep 28, 2025:

Hi @aditchawdhary, thanks a lot for your contribution on the V1 integration! We have reverted the huggingface repo here and can now successfully serve the model with the following command:

VLLM_ATTENTION_BACKEND="DIFFERENTIAL_FLASH_ATTN" vllm serve microsoft/Phi-4-mini-flash-reasoning \
    --host 127.0.0.1 --port 26500 \
    --trust-remote-code \
    --served-model-name BENCHMARK_MODEL_NAME \
    --tensor-parallel-size 1 \
    --max-model-len 104208 \
    --max-seq-len-to-capture 104208 \
    --no-enable-chunked-prefill \
    --no-enable-prefix-caching

Could you please try and see if this works out for you?
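
A quick way to sanity-check the served endpoint, assuming the openai Python client and the host, port, and served model name from the command above:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API under /v1 on the serve host/port.
client = OpenAI(base_url="http://127.0.0.1:26500/v1", api_key="EMPTY")
resp = client.completions.create(
    model="BENCHMARK_MODEL_NAME",
    prompt="What is AI?",
    max_tokens=50,
)
print(resp.choices[0].text)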
