
Conversation

Contributor

@FENP FENP commented Oct 16, 2025

Purpose

For the FlashMLA backend, #26541 set the default value of reorder_batch_threshold to 512:

class FlashMLAMetadataBuilder(MLACommonMetadataBuilder[FlashMLAMetadata]):
    cudagraph_support: ClassVar[AttentionCGSupport] = AttentionCGSupport.UNIFORM_BATCH
    query_len_support: ClassVar[QueryLenSupport] = QueryLenSupport.UNIFORM
    reorder_batch_threshold: int = 512  # process small prefills with decode pathway
    # ^ TODO(matt): tune this

However, DCP supports reorder_batch_threshold > 1 only when the FlashAttnMLA backend is used (#25049). As a result, the following assertion fails when the FlashMLA backend is used:

if self.reorder_batch_threshold is not None:
    # NOTE(lucas): currently no backend supports the custom masking
    # required for DCP with q_len > 1, so we assert here. Remove this
    # assert once custom mask support is added to FA3.
    if (
        self.dcp_world_size > 1
        and envs.VLLM_ATTENTION_BACKEND != "FLASH_ATTN_MLA"
    ):
        assert self.reorder_batch_threshold == 1, (
            "DCP not support reorder_batch_threshold > 1 now."
        )

This PR temporarily fixes the issue by setting reorder_batch_threshold back to 1.

Looking forward to DCP supporting reorder_batch_threshold > 1 with FlashMLA in the future :).
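For reference, a minimal, self-contained sketch of the revert described above (the enum and base-class stubs stand in for the real vLLM types so the snippet runs on its own; they are not the actual vLLM definitions):

```python
from enum import Enum, auto
from typing import ClassVar


# Stubs standing in for the real vLLM enums, only so the sketch is runnable.
class AttentionCGSupport(Enum):
    UNIFORM_BATCH = auto()


class QueryLenSupport(Enum):
    UNIFORM = auto()


class MLACommonMetadataBuilder:
    reorder_batch_threshold: int = 1


class FlashMLAMetadataBuilder(MLACommonMetadataBuilder):
    cudagraph_support: ClassVar[AttentionCGSupport] = AttentionCGSupport.UNIFORM_BATCH
    query_len_support: ClassVar[QueryLenSupport] = QueryLenSupport.UNIFORM
    # Reverted from 512: DCP currently only supports a threshold > 1 on the
    # FlashAttnMLA backend, so FlashMLA goes back to 1 for now.
    reorder_batch_threshold: int = 1


print(FlashMLAMetadataBuilder.reorder_batch_threshold)  # 1
```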

Test Plan

export VLLM_ATTENTION_BACKEND="FLASHMLA"
vllm serve /deepseek-ai/DeepSeek-R1/ --gpu-memory-utilization 0.9 --tensor-parallel-size 8 --decode-context-parallel-size 8

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/ossfs/workspace/DeepSeek-R1/", "messages": [{"role": "user", "content": "who is there"}], "temperature": 0.0, "max_tokens": 100}'

Test Result

main

...
ERROR 10-16 20:49:34 [multiproc_executor.py:700]  assert self.reorder_batch_threshold == 1, (
ERROR 10-16 20:49:34 [multiproc_executor.py:700] AssertionError: DCP not support reorder_batch_threshold > 1 now.
...
INFO:     127.0.0.1:46314 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

this PR

INFO:     127.0.0.1:47140 - "POST /v1/chat/completions HTTP/1.1" 200 OK

cc @minosfuture @MatthewBonanni @youkaichao @LucasWilkinson


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@FENP FENP requested a review from LucasWilkinson as a code owner October 16, 2025 12:51
@mergify mergify bot added the v1 label Oct 16, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug that causes an assertion error when using the FlashMLA backend with Decode Context Parallelism (DCP) and a reorder_batch_threshold greater than 1. The fix correctly identifies this unsupported configuration and resets reorder_batch_threshold to 1, along with query_len_support, to prevent the crash. My review includes a suggestion to improve the maintainability of the implementation by using a class attribute for feature detection instead of checking the class name as a string. This will make the code more robust against future refactoring.

Comment on lines 561 to 565

if (
    self.dcp_world_size > 1 and self.reorder_batch_threshold > 1
    and self.__class__.__name__ != "FlashAttnMLAMetadataBuilder"
):

Severity: high

Checking the class name as a string (self.__class__.__name__) is fragile and can lead to silent bugs if the class is ever renamed. A more robust and maintainable approach is to use a class attribute to indicate feature support.

You can define a class attribute in MLACommonMetadataBuilder and override it in the specific subclass that supports this feature.

For example:

  1. Add a new class attribute to MLACommonMetadataBuilder (e.g., right after reorder_batch_threshold):
class MLACommonMetadataBuilder(AttentionMetadataBuilder[M]):
    ...
    reorder_batch_threshold: int = 1
    _supports_dcp_and_reorder: ClassVar[bool] = False
    ...
  2. In the FlashAttnMLAMetadataBuilder class, override this attribute:
class FlashAttnMLAMetadataBuilder(MLACommonMetadataBuilder):
    ...
    _supports_dcp_and_reorder: ClassVar[bool] = True
    ...
  3. Then, update the condition here to use this new attribute, which is more idiomatic and safer.
Suggested change
- if (
-     self.dcp_world_size > 1 and self.reorder_batch_threshold > 1
-     and self.__class__.__name__ != "FlashAttnMLAMetadataBuilder"
- ):
+ if (
+     self.dcp_world_size > 1 and self.reorder_batch_threshold > 1
+     and not self._supports_dcp_and_reorder
+ ):
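To illustrate why the attribute-based check is more robust than matching on `__class__.__name__`, here is a self-contained sketch of the pattern (the class names are shortened stand-ins and `effective_threshold` is a hypothetical helper, not part of vLLM):

```python
from typing import ClassVar


class MLACommonMetadataBuilderSketch:
    reorder_batch_threshold: int = 1
    # Subclasses that support DCP with reorder_batch_threshold > 1 flip this.
    _supports_dcp_and_reorder: ClassVar[bool] = False


class FlashAttnMLABuilderSketch(MLACommonMetadataBuilderSketch):
    reorder_batch_threshold: int = 512
    _supports_dcp_and_reorder: ClassVar[bool] = True


def effective_threshold(builder: MLACommonMetadataBuilderSketch, dcp_world_size: int) -> int:
    # The runtime check no longer depends on a class-name string, so renaming
    # a builder class cannot silently break it.
    if dcp_world_size > 1 and not builder._supports_dcp_and_reorder:
        return 1
    return builder.reorder_batch_threshold


print(effective_threshold(MLACommonMetadataBuilderSketch(), dcp_world_size=8))  # 1
print(effective_threshold(FlashAttnMLABuilderSketch(), dcp_world_size=8))       # 512
```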

@FENP FENP force-pushed the bugfix/flashmla_reorder_batch_threshold branch from 8d76fee to 4af3874 on October 16, 2025 12:56
Contributor

@MatthewBonanni MatthewBonanni left a comment

In the following code from gpu_model_runner.py,

if self.reorder_batch_threshold is not None:
    # NOTE(lucas): currently no backend supports the custom masking
    # required for DCP with q_len > 1, so we assert here. Remove this
    # assert once custom mask support is added to FA3.
    if (
        self.dcp_world_size > 1
        and envs.VLLM_ATTENTION_BACKEND != "FLASH_ATTN_MLA"
    ):
        assert self.reorder_batch_threshold == 1, (
            "DCP not support reorder_batch_threshold > 1 now."
        )

Could we instead just change that assert to set the threshold? The metadata builder's threshold won't be updated, but what ultimately matters is the GPU model runner's threshold, i.e.:

if self.reorder_batch_threshold is not None:
    # NOTE(lucas): currently no backend supports the custom masking
    # required for DCP with q_len > 1, so we assert here. Remove this
    # assert once custom mask support is added to FA3.
    if (
        self.dcp_world_size > 1
        and envs.VLLM_ATTENTION_BACKEND != "FLASH_ATTN_MLA"
    ):
        logger.warning(
            "This backend does not support DCP with q_len > 1. "
            "Setting reorder_batch_threshold to 1."
        )
        self.reorder_batch_threshold = 1
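As a self-contained illustration of this warn-and-clamp approach (RunnerSketch and clamp_reorder_threshold are hypothetical stand-ins, not the real GPUModelRunner API):

```python
import logging

logger = logging.getLogger("sketch")


class RunnerSketch:
    """Stand-in for a model runner that may need to clamp its threshold."""

    def __init__(self, dcp_world_size: int, backend: str, reorder_batch_threshold: int):
        self.dcp_world_size = dcp_world_size
        self.backend = backend
        self.reorder_batch_threshold = reorder_batch_threshold

    def clamp_reorder_threshold(self) -> None:
        # Instead of asserting, warn and fall back to the supported value.
        if self.reorder_batch_threshold is not None:
            if self.dcp_world_size > 1 and self.backend != "FLASH_ATTN_MLA":
                logger.warning(
                    "This backend does not support DCP with q_len > 1. "
                    "Setting reorder_batch_threshold to 1."
                )
                self.reorder_batch_threshold = 1


runner = RunnerSketch(dcp_world_size=8, backend="FLASHMLA", reorder_batch_threshold=512)
runner.clamp_reorder_threshold()
print(runner.reorder_batch_threshold)  # 1
```

With this shape, the unsupported configuration degrades gracefully instead of crashing the server with an AssertionError.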

@LucasWilkinson

… enable DCP

Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
@FENP FENP force-pushed the bugfix/flashmla_reorder_batch_threshold branch from 4af3874 to a2d5ef0 on October 16, 2025 13:55
Contributor

@minosfuture minosfuture left a comment

Thanks for the fix!

