[Model] New model support for microsoft/Phi-4-mini-flash-reasoning #20702

congcongchen123 · 2025-07-09T19:06:23Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

New Model for https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning

co-author: @aatkinson and @renll

Test Plan

Test Result

(Optional) Documentation Update

github-actions · 2025-07-09T19:06:32Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Summary of Changes

Hello @congcongchen123, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces full support for the microsoft/Phi-4-mini-flash-reasoning model, which utilizes a novel hybrid architecture integrating both FlashAttention and Mamba (State Space Model) layers. The changes involve implementing a new 'Differential FlashAttention' backend, integrating Mamba's selective scan and causal convolution operations, and enhancing the KV cache management system to support cross-layer sharing, optimizing memory usage for this complex model.

Highlights

New Model Support: Adds comprehensive support for the microsoft/Phi-4-mini-flash-reasoning model, which features a hybrid architecture combining FlashAttention and Mamba (SSM) layers.
Differential FlashAttention Backend: Introduces a specialized attention backend (DifferentialFlashAttentionBackend) designed for the Phi-4 model, implementing a unique 'differential' attention mechanism with split QKV and KV cache handling.
Mamba (SSM) Integration: Implements the Phi4Mamba module to support the State Space Model components of the Phi-4 architecture, including causal convolution and selective scan operations, with specific handling for KV cache sharing (yoco_kv, yoco_cross).
Cross-Layer KV Cache Sharing: Enhances the KV cache management system to support cross-layer KV cache sharing, allowing certain attention layers to reuse the KV cache from other layers, which can lead to significant memory savings for the Phi-4 model's hybrid architecture.
Dynamic SSM z Parameter: Modifies the Mamba selective scan CUDA kernel and its Python wrapper to dynamically handle the z parameter, making it optional rather than a mandatory input, which improves flexibility.
Conditional Hidden State Pruning: Adds a new parameter to the LogitsProcessor to conditionally prune hidden states, enabling optimization for specific model architectures like Phi-4 where intermediate states might be truncated.
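
The "Dynamic SSM z Parameter" highlight above can be illustrated with a toy reference scan. This is a minimal pure-Python sketch, not the CUDA kernel in csrc/mamba/mamba_ssm/selective_scan_fwd.cu: it assumes a single channel with a scalar state, and `selective_scan_ref`, `silu`, and every argument name are hypothetical. It only shows how a gate `z` can be applied when present and skipped when it is `None`.

```python
import math
from typing import Optional, Sequence


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def selective_scan_ref(
    u: Sequence[float],
    delta: Sequence[float],
    a: float,
    b: Sequence[float],
    c: Sequence[float],
    z: Optional[Sequence[float]] = None,
) -> list:
    """Toy single-channel, scalar-state selective scan (illustrative only).

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * b_t * u_t
    y_t = c_t * h_t, multiplied by silu(z_t) only when z is provided.
    """
    h = 0.0
    out = []
    for t in range(len(u)):
        h = math.exp(delta[t] * a) * h + delta[t] * b[t] * u[t]
        y = c[t] * h
        if z is not None:  # the gate is optional, not a required input
            y *= silu(z[t])
        out.append(y)
    return out
```

Calling the function without `z` returns the ungated outputs; passing `z` applies the gate element by element.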

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

The pull request introduces a new model architecture, Phi4Flash, with a hybrid Mamba/Attention mechanism and cross-layer KV cache sharing. The changes involve modifications to CUDA kernels, attention backends, model layers, and configuration files. The review focuses on ensuring the correctness, maintainability, and performance of the new implementation, particularly concerning the handling of conditional logic, error conditions, and integration with existing vLLM features.

csrc/mamba/mamba_ssm/selective_scan_fwd.cu

vllm/model_executor/layers/logits_processor.py

gemini-code-assist · 2025-07-09T19:09:14Z

vllm/model_executor/models/phi4flash.py

The condition `if i == self.config.num_hidden_layers // 2 + 2:` is used to conditionally execute code during a profile run. Ensure that this condition accurately identifies the intended layer for profiling and that the profiling logic does not inadvertently affect the behavior of other layers or the overall model execution.

gemini-code-assist · 2025-07-09T19:09:14Z

vllm/attention/layer.py

The NotImplementedError for cross-layer KV sharing in V0 has been commented out. Confirm that cross-layer KV sharing is indeed correctly handled or disabled elsewhere, and that this change does not inadvertently enable unsupported functionality.

vllm/attention/backends/differential_flash_attn.py

vllm/model_executor/models/registry.py

heheda12345

I like the current KV sharing implementation; it is quite clean. But I am not sure whether `cross_layer_shared_block_tables` is necessary.

Please add the required test as mentioned here https://docs.vllm.ai/en/latest/contributing/model/tests.html#required-tests
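
For readers unfamiliar with the cross-layer KV sharing under review, the idea can be sketched with a toy allocator. This is an illustration under stated assumptions, not the code in this PR: `ToyLayer`, `ToyKVCachePool`, and the `kv_sharing_target` field are hypothetical names, and a Python list stands in for a real KV cache tensor.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyLayer:
    """Hypothetical attention layer descriptor."""
    name: str
    # Name of the earlier layer whose KV cache this layer reuses, if any.
    kv_sharing_target: Optional[str] = None


@dataclass
class ToyKVCachePool:
    """Allocates one toy cache per layer that owns its KV cache."""
    caches: dict = field(default_factory=dict)

    def bind(self, layers: list) -> dict:
        bound = {}
        for layer in layers:
            if layer.kv_sharing_target is None:
                # Owning layer: allocate a fresh (toy) cache.
                self.caches[layer.name] = []
                bound[layer.name] = self.caches[layer.name]
            else:
                # Sharing layer: reuse the target's cache, allocate nothing.
                if layer.kv_sharing_target not in self.caches:
                    raise ValueError(
                        f"{layer.name} shares KV with "
                        f"{layer.kv_sharing_target}, which has no cache yet")
                bound[layer.name] = self.caches[layer.kv_sharing_target]
        return bound


layers = [ToyLayer("layers.0"), ToyLayer("layers.1", kv_sharing_target="layers.0")]
pool = ToyKVCachePool()
bound = pool.bind(layers)
```

In this sketch the second layer is bound to the very same object as the first, so only one cache is allocated for the two layers; that is the source of the memory savings mentioned in the summary.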

vllm/attention/layer.py

vllm/attention/backends/differential_flash_attn.py

vllm/model_executor/models/phi4flash.py

vllm/worker/model_runner.py

vllm/utils/__init__.py

csrc/mamba/mamba_ssm/selective_scan_fwd.cu

luccafong · 2025-07-11T00:24:21Z

Is this runnable in V1?
@sarckk can we try with V1 on this model with your PR #19719 and see if any gains?

heheda12345 · 2025-07-11T02:19:25Z

> can we try with V1 on this model with your PR #19719 and see if any gains?

You can use gemma-3n to test #19719. This PR is not runnable with V1 right now, and I think it will take some effort to enable V1.

tests/models/registry.py

Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

ZV-Liu · 2025-07-23T08:25:31Z

After installing vLLM from source (vllm-0.10.0rc2.dev35+g32142b3c6.cu124.dist-info) and starting it with the following command, the error below still occurs. Thanks.

vllm serve /openbayes/input/input0/Phi-4-mini-flash-reasoning \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 8 \
  --max-model-len 512 \
  --tensor-parallel-size 1 \
  --served-model-name Phi-4-mini-flash-reasoning \
  --trust-remote-code

  File "/output/vllm/vllm/model_executor/models/phi4flash.py", line 519, in <lambda>
    lambda prefix: SambaYDecoderLayer(config,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/output/vllm/vllm/model_executor/models/phi4flash.py", line 452, in __init__
    self.attn = SambaYAttention(config,
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/output/vllm/vllm/model_executor/models/phi4flash.py", line 165, in __init__
    self.attn = Attention(
                ^^^^^^^^^^
  File "/output/vllm/vllm/attention/layer.py", line 167, in __init__
    self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'differential_flash_attention_config'

ccongcong321 · 2025-07-23T17:43:15Z

Hi @ZV-Liu, you need to set the VLLM_ATTENTION_BACKEND environment variable to DIFFERENTIAL_FLASH_ATTN.
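
As a minimal sketch of the advice above, the variable can be set in the environment passed to the server process. The helper name `build_serve_command` is hypothetical, and only the `vllm serve` arguments already shown in this thread are used; the launch line is left as a comment so the snippet runs without vLLM installed.

```python
import os
import shlex


def build_serve_command(model_path: str,
                        backend: str = "DIFFERENTIAL_FLASH_ATTN"):
    """Return (env, argv) for launching `vllm serve` with an explicit backend."""
    env = dict(os.environ)  # copy, so the current process is not modified
    env["VLLM_ATTENTION_BACKEND"] = backend
    argv = ["vllm", "serve", model_path, "--trust-remote-code"]
    return env, argv


env, argv = build_serve_command("microsoft/Phi-4-mini-flash-reasoning")
print("VLLM_ATTENTION_BACKEND=" + env["VLLM_ATTENTION_BACKEND"],
      shlex.join(argv))
# To actually launch: subprocess.run(argv, env=env, check=True)
```

The same effect can be had from a shell by exporting the variable before running the `vllm serve` command.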

hmellor · 2025-07-23T18:53:16Z

@ccongcong321 shouldn't that have happened automatically? I see it being set in this PR.

ccongcong321 · 2025-07-23T20:07:30Z

@hmellor, it is not selected automatically for microsoft/Phi-4-mini-flash-reasoning.
The current automatic backend selection logic in vLLM is not designed to gracefully support selecting a specific backend for an individual model. As a result, supporting that for this model would require an intrusive approach: modifying the selection code directly and adding a series of if-else conditions.
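
To make the trade-off concrete, here is a hypothetical sketch of what a per-model override could look like. None of this is vLLM code: `BACKEND_OVERRIDES`, `select_backend`, the default backend name, and the architecture string are assumptions for illustration, and vLLM's real selector is structured differently.

```python
import os
from typing import Optional

# Hypothetical per-architecture overrides; not part of vLLM.
BACKEND_OVERRIDES = {"Phi4FlashForCausalLM": "DIFFERENTIAL_FLASH_ATTN"}


def select_backend(architecture: str,
                   default: str = "FLASH_ATTN",
                   env: Optional[dict] = None) -> str:
    """Pick a backend name for a model architecture.

    Order of precedence in this sketch: an explicit
    VLLM_ATTENTION_BACKEND value, then a per-model override,
    then the default.
    """
    env = os.environ if env is None else env
    explicit = env.get("VLLM_ATTENTION_BACKEND")
    if explicit:
        return explicit
    return BACKEND_OVERRIDES.get(architecture, default)
```

A table like this keeps model-specific knowledge in one place, but it still means the generic selector has to know about individual models, which is the intrusiveness described above.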

ZV-Liu · 2025-07-24T02:02:03Z

@ccongcong321 When calling the API through vLLM, is there a switch similar to enable_thinking in the Qwen3 series models?

hmellor · 2025-07-24T10:49:16Z

Ah, my mistake: it's only auto-selected in the test in tests/models/test_initialization.py.
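
A test can select the backend for its own duration without affecting other tests. The following is a sketch using only the standard library; it is not the code in tests/models/test_initialization.py, and the helper name `run_with_backend` is hypothetical.

```python
import os
from unittest import mock


def run_with_backend(fn, backend: str = "DIFFERENTIAL_FLASH_ATTN"):
    """Call fn() with VLLM_ATTENTION_BACKEND set, then restore the environment."""
    with mock.patch.dict(os.environ, {"VLLM_ATTENTION_BACKEND": backend}):
        return fn()


# Example: observe the value that code running inside the test would see.
seen = run_with_backend(lambda: os.environ["VLLM_ATTENTION_BACKEND"])
```

Because `mock.patch.dict` restores the original mapping on exit, the override does not leak into a server started outside the test, which is consistent with the behavior reported in this thread.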

ZV-Liu · 2025-07-25T06:38:41Z

AssertionError: Phi4Flash currently does not support prefix caching

I have set VLLM_ATTENTION_BACKEND to DIFFERENTIAL_FLASH_ATTN and successfully started the service. I remember using the V1 engine. However, when I tried to start it again, the following error occurred:

"Phi4Flash currently does not support prefix caching."

renll · 2025-07-26T23:09:53Z

> AssertionError: Phi4Flash currently does not support prefix caching
>
> I have set VLLM_ATTENTION_BACKEND to DIFFERENTIAL_FLASH_ATTN and successfully started the service. I remember using the V1 engine. However, when I tried to start it again, the following error occurred:
>
> "Phi4Flash currently does not support prefix caching."

Unfortunately, we do not support prefix caching because of the Mamba layers, and we currently support only the V0 engine. cc @congcongchen123
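
The two constraints stated above can be captured in a small pre-flight check before launching. This is a hypothetical helper, not part of vLLM: the function name and its boolean parameters are assumptions, and the messages simply restate what this thread reports.

```python
def check_phi4flash_settings(enable_prefix_caching: bool, use_v1: bool) -> None:
    """Fail fast on settings this thread reports as unsupported for Phi4Flash."""
    if enable_prefix_caching:
        raise ValueError("Phi4Flash currently does not support prefix caching")
    if use_v1:
        raise ValueError("Phi4Flash currently supports only the V0 engine")


# Passes silently: prefix caching off, V0 engine.
check_phi4flash_settings(enable_prefix_caching=False, use_v1=False)
```

How these settings map onto a given vLLM release is an assumption to verify locally; in releases from this period the engine was typically chosen with the VLLM_USE_V1 environment variable and prefix caching with the enable_prefix_caching engine argument, so check the documentation of your installed version.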

sarckk · 2025-07-28T02:28:06Z

@congcongchen123 @renll is V1 support planned?

ccongcong321 · 2025-07-28T17:45:31Z

@sarckk, yes, we are planning to support V1 once we have more bandwidth.

tdoublep · 2025-08-29T12:31:55Z

@ccongcong321 Was there a specific reason why this model didn't reuse the MambaMixer2 layer? I'm asking because it would be much easier to enable this model in V1 if it did not have a custom Mamba layer. We are looking to strip out the V0 code in the very near future.

aatkinson · 2025-08-29T15:24:12Z

Good point. Likely because development of this model predated Bamba, which is when MambaMixer2 was introduced. Rebasing was more approachable than reimplementing everything, and it ensured parity. There may be other reasons, though.

congcongchen123 requested review from alexm-redhat, comaniac, njhill, youkaichao and zhuohan123 as code owners July 9, 2025 19:06

gemini-code-assist bot reviewed Jul 9, 2025

tlrmchlsmth reviewed Jul 9, 2025

vllm/attention/backends/differential_flash_attn.py (outdated, resolved)

congcongchen123 force-pushed the congcongchen/phi4-mini-shadow branch from 479eb8a to 0e20e17 Compare July 9, 2025 23:38

heheda12345 added the new-model Requests to new models label Jul 10, 2025

heheda12345 mentioned this pull request Jul 10, 2025

[Misc] Automatically tag PRs to add new models #20222

Merged

4 tasks

Isotr0py reviewed Jul 10, 2025

vllm/model_executor/models/registry.py (outdated, resolved)

congcongchen123 requested a review from hmellor as a code owner July 10, 2025 08:31

mergify bot added the documentation Improvements or additions to documentation label Jul 10, 2025

heheda12345 reviewed Jul 10, 2025

heheda12345 mentioned this pull request Jul 10, 2025

[V1] Partial prefill skip for layers reusing shared KV cache #19719

Open

tlrmchlsmth reviewed Jul 10, 2025

csrc/mamba/mamba_ssm/selective_scan_fwd.cu (outdated, resolved)

congcongchen123 requested review from DarkLight1337 and ywang96 as code owners July 11, 2025 09:08

mergify bot added the rocm Related to AMD ROCm label Jul 11, 2025

DarkLight1337 reviewed Jul 11, 2025

tests/models/registry.py (outdated, resolved)

congcongchen123 added 6 commits July 11, 2025 20:42

initial commit

b7f8e0c

Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

Add a new backend

961e638

Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

Use differential flash attn backend

c264085

Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

clean up code

344190f

Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

clean up code

3f89641

Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

clean up

5a00414

Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

DarkLight1337 added this to the v0.10.0 milestone Jul 14, 2025

saattrupdan mentioned this pull request Jul 22, 2025

[MODEL EVALUATION REQUEST] microsoft/Phi-4-mini-flash-reasoning EuroEval/EuroEval#1097

Open

4 tasks

x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025

[Model] New model support for microsoft/Phi-4-mini-flash-reasoning (v…

6166a25

…llm-project#20702) Signed-off-by: Congcong Chen <congcongchen@microsoft.com> Signed-off-by: x22x22 <wadeking@qq.com>

Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025

[Model] New model support for microsoft/Phi-4-mini-flash-reasoning (v…

2c3d9e2

…llm-project#20702) Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025

[Model] New model support for microsoft/Phi-4-mini-flash-reasoning (v…

8dc3bba

…llm-project#20702) Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

ca1207 mentioned this pull request Aug 24, 2025

[Model] New model support for Motif-1-Tiny #23414

Merged

4 tasks

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025

[Model] New model support for microsoft/Phi-4-mini-flash-reasoning (v…

2598cd5

…llm-project#20702) Signed-off-by: Congcong Chen <congcongchen@microsoft.com>

ca1207 mentioned this pull request Oct 23, 2025

Support Motif MotifTechnologies/vllm#5

Open

WyldeCat mentioned this pull request Oct 23, 2025

[Model] Re-support MotifForCausalLM #27396

Open

5 tasks
