Fix illegal address access in paddle.incubate.nn.functional.fused_rotary_position_embedding #74347

LCStayingdullCircuit · 2025-07-31T09:29:07Z

PR Category

Execute Infrastructure

PR Types

Bug fixes

Description

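This PR fixes the CUDA error 700 (illegal address) raised by the fused_rotary_position_embedding function.

Root cause:
During the computation, the code used the batch_size of query (q) as the batch_size_stride for the key (k) and value (v) tensors. When q's batch_size is larger than that of k or v, the kernel accesses GPU memory out of bounds, which triggers the CUDA error.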
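The arithmetic behind the out-of-bounds access can be sketched in plain Python. This is a simplified model, not the actual kernel: it assumes contiguous tensors and assumes the kernel visits every flat element index of q and reads the same index from k. The helper name `out_of_bounds_reads` is hypothetical.

```python
from math import prod


def out_of_bounds_reads(q_shape, k_shape):
    """Count flat indices of q's iteration space that lie past the end of k.

    Simplified model (an assumption, not the real kernel): every flat
    index valid for q is also used to read from k's buffer.
    """
    return max(0, prod(q_shape) - prod(k_shape))


# Shapes taken from the test case in this PR description.
q_shape = (1682, 8, 2, 16)
k_shape = (168, 8, 2, 16)

# q has 430592 elements and k has 43008, so under this model
# 387584 reads would land outside k's allocation.
print(out_of_bounds_reads(q_shape, k_shape))  # 387584
```

When q's batch size is smaller than or equal to k's, the same function returns 0, which matches the observation that such cases do not hit the error.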

The problem is not limited to large-tensor scenarios; the following test case also reproduces the error:

paddle.incubate.nn.functional.fused_rotary_position_embedding(Tensor([1682, 8, 2, 16],"float32"), Tensor([168, 8, 2, 16],"float32"), Tensor([168, 8, 2, 16],"float32"), Tensor([1, 8, 1, 16],"float32"), Tensor([1, 8, 1, 16],"float32"), position_ids=None, use_neox_rotary_style=True, time_major=False, )

Solution:
Require that q's batch_size not exceed the batch_size of k and v.
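As a rough illustration of this constraint, the check could look like the following in Python. This is only a sketch of the rule stated above, not the C++ code in the PR: the function name `check_rope_batch_sizes` is hypothetical, and it assumes the batch dimension is dim 0 (the time_major=False layout) and that k and v may be omitted.

```python
def check_rope_batch_sizes(q_shape, k_shape=None, v_shape=None):
    """Raise ValueError if q's batch size exceeds that of k or v.

    Assumes the batch dimension is dim 0 (time_major=False).
    Tensors passed as None are skipped.
    """
    q_batch = q_shape[0]
    for name, shape in (("k", k_shape), ("v", v_shape)):
        if shape is None:
            continue
        if q_batch > shape[0]:
            raise ValueError(
                f"The batch_size of q ({q_batch}) must be less than "
                f"or equal to {name}'s ({shape[0]})."
            )


# Equal batch sizes pass, and so does a smaller q batch.
check_rope_batch_sizes((168, 8, 2, 16), (168, 8, 2, 16), (168, 8, 2, 16))
check_rope_batch_sizes((2, 8, 2, 16), (4, 8, 2, 16))
```

Calling it with the shapes from the failing test case (q batch 1682, k and v batch 168) raises ValueError before any kernel would be launched.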

paddle-bot · 2025-07-31T09:29:12Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

lshpku · 2025-07-31T09:37:25Z

paddle/phi/kernels/fusion/gpu/fused_rope_kernel.cu

+        common::errors::InvalidArgument("The batch_size of q (%d) must be less "
+                                        "than or equal to k's (%d) to "
+                                        "prevent out-of-bounds memory access.",
+                                        batch_size,
+                                        k_batch_size));


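What is the reason for this restriction: is it because our operator implementation is incomplete and we have no choice, or does the algorithm itself not allow q's batch_size to exceed k's?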

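Our documentation does not describe this in detail. From what I have learned, a k or v batch_size smaller than q's batch_size is not reasonable; strictly speaking, the sizes should be equal. I kept the case where the k/v batch_size is larger because test cases with q_batch_size < k/v_batch_size pass, so the current change should be the smallest possible modification.
Supporting q_batch_size > k/v_batch_size would need additional handling, something like a broadcasting mechanism.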

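Then this error message is not clear enough. It should say that the input violates the definition, so that users know they made a mistake, rather than saying it is there to prevent out-of-bounds access, which makes it sound as if the fault is ours.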

LCStayingdullCircuit · 2025-08-02T06:42:47Z

/re-run all-failed

LCStayingdullCircuit added 2 commits July 31, 2025 17:12

error 700:fused_rotary_position_embedding test=develop

b3d0010

error 700:fused_rotary_position_embedding test=develop

8e24b32

lshpku reviewed Jul 31, 2025

error 700:fused_rotary_position_embedding test=develop

8a824c5

lshpku previously approved these changes Aug 1, 2025

bugfix:fused_rotary_position_embedding test=develop

bdbf895

LCStayingdullCircuit dismissed lshpku’s stale review via bdbf895 August 1, 2025 07:51

lshpku approved these changes Aug 1, 2025

lshpku approved these changes Aug 2, 2025

lshpku merged commit 8e5cba3 into PaddlePaddle:develop Aug 2, 2025
70 of 71 checks passed

LCStayingdullCircuit mentioned this pull request Aug 4, 2025

Removed unreasonable test cases for fused_rotary_position_embedding PFCCLab/PaddleAPITest#496

Merged
