[Stride] Set up DenseTensorIterator And Support Stride Kernel For Elementwise_Add #74637

Eddie-Wang1120 · 2025-08-15T07:32:57Z

PR Category

Execute Infrastructure

PR Types

Others

Description

pcard-67164

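PR summary: optimizing the operator computation mechanism with DenseTensorIterator

I. Introduction

This PR builds on Paddle's DenseTensor data structure to design and implement an operator computation mechanism based on DenseTensorIterator. For non-contiguous inputs, the goals are better performance, lower GPU memory usage, and correctness behavior aligned with the wider ecosystem.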

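II. Overall Design

III. Implementation

This PR implements the Paddle TensorIterator from the design together with its general-purpose CUDA stride kernel. Together they improve both the performance and the correctness of the elementwise_add operator. The implementation is described below from top to bottom.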

1. Strided Kernel

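This part registers and implements a stride kernel on the GPU.

Location: paddle/phi/kernels/stride/elementwise_kernel.cu
Approach: defined with PD_REGISTER_KERNEL and implemented in elementwise_kernel.cu
Function naming convention: XXStrideKernel
Extension mechanism: to make it easy to generate more StrideKernels in bulk later, the definitions are wrapped in macros, so a new function only needs a line such as DEFINE_CUDA_BINARY_ELEMENTWISE_STRIDE_OP(Add)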
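
As a rough illustration of the macro-based extension mechanism, the host-side sketch below generates one function per operator name. The macro DEFINE_BINARY_STRIDE_OP_SKETCH, the StridedView type, and the functors are hypothetical stand-ins invented for this example; the real DEFINE_CUDA_BINARY_ELEMENTWISE_STRIDE_OP in elementwise_kernel.cu targets the GPU and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical functors standing in for the Functor types the PR reuses.
template <typename T>
struct AddFunctor {
  T operator()(T a, T b) const { return a + b; }
};
template <typename T>
struct MultiplyFunctor {
  T operator()(T a, T b) const { return a * b; }
};

// Hypothetical 1-D strided view: element i lives at data[i * stride].
template <typename T>
struct StridedView {
  const T* data;
  int64_t numel;
  int64_t stride;
};

// Shared implementation: reads both inputs through their own strides.
template <typename T, typename Functor>
std::vector<T> BinaryStrideImpl(const StridedView<T>& x,
                                const StridedView<T>& y) {
  assert(x.numel == y.numel);
  std::vector<T> out(static_cast<size_t>(x.numel));
  Functor op;
  for (int64_t i = 0; i < x.numel; ++i) {
    out[static_cast<size_t>(i)] =
        op(x.data[i * x.stride], y.data[i * y.stride]);
  }
  return out;
}

// One macro line per operator generates a function that follows the
// XXStrideKernel naming convention.
#define DEFINE_BINARY_STRIDE_OP_SKETCH(name)                   \
  template <typename T>                                        \
  std::vector<T> name##StrideKernel(const StridedView<T>& x,   \
                                    const StridedView<T>& y) { \
    return BinaryStrideImpl<T, name##Functor<T>>(x, y);        \
  }

DEFINE_BINARY_STRIDE_OP_SKETCH(Add)
DEFINE_BINARY_STRIDE_OP_SKETCH(Multiply)
```

With this shape, supporting another binary operator costs one functor plus one macro line, which is the property the description is after.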

2. Risk Control (Flag)

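This part implements a flag for risk management, guarding against correctness and performance regressions in case the new feature proves unstable.

Location: paddle/common/flags.cc
Flag name: FLAGS_use_stride_compute_kernel
Usage:
When a Paddle program runs with FLAGS_use_stride_compute_kernel=1, the kernel dispatches to the new implementation and computes on non-contiguous inputs directly
Otherwise it falls back to the previous behavior: make the inputs contiguous first, then compute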
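
The two paths selected by the flag can be sketched on the host as follows. The global g_use_stride_compute_kernel is a hypothetical stand-in for FLAGS_use_stride_compute_kernel, and View, MakeContiguous, and AddDispatch are names made up for this illustration, not Paddle APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for FLAGS_use_stride_compute_kernel.
static bool g_use_stride_compute_kernel = false;

// Hypothetical 1-D strided view over a float buffer (stride in elements).
struct View {
  const float* data;
  int64_t numel;
  int64_t stride;
};

// Fallback helper: materialize a contiguous copy of a strided view.
std::vector<float> MakeContiguous(const View& v) {
  std::vector<float> out(static_cast<size_t>(v.numel));
  for (int64_t i = 0; i < v.numel; ++i) {
    out[static_cast<size_t>(i)] = v.data[i * v.stride];
  }
  return out;
}

// Returns x + y. *temp_elems reports how many temporary elements the
// chosen path had to copy before computing.
std::vector<float> AddDispatch(const View& x, const View& y,
                               size_t* temp_elems) {
  assert(x.numel == y.numel);
  std::vector<float> out(static_cast<size_t>(x.numel));
  if (g_use_stride_compute_kernel) {
    // New path: read the inputs through their strides, no temporaries.
    *temp_elems = 0;
    for (int64_t i = 0; i < x.numel; ++i) {
      out[static_cast<size_t>(i)] =
          x.data[i * x.stride] + y.data[i * y.stride];
    }
  } else {
    // Previous path: make both inputs contiguous first, then add.
    const std::vector<float> xc = MakeContiguous(x);
    const std::vector<float> yc = MakeContiguous(y);
    *temp_elems = xc.size() + yc.size();
    for (size_t i = 0; i < out.size(); ++i) out[i] = xc[i] + yc[i];
  }
  return out;
}
```

Both branches produce the same values; the difference is the extra copies, which is where the memory and time savings described in this PR would come from.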

3. Paddle TensorIterator (DenseTensorIterator)

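This part implements a class that preprocesses arbitrary tensors before computation, currently named DenseTensorIterator.

Location:
paddle/phi/kernels/funcs/dense_tensor_iterator.cc
paddle/phi/kernels/funcs/dense_tensor_iterator.h
Main function: DenseTensorIterator ensures that tensors have the best shape/stride for the subsequent computation before they enter the general-purpose CUDA kernel, and it also guarantees the output tensor's contiguity and alignment with the ecosystem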
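
The description does not list the individual preprocessing steps. One step that tensor iterators of this kind commonly perform, shown here purely as an assumption, is expressing each operand's strides in terms of the common broadcast shape, with stride 0 on broadcast dimensions so the same element is re-read. BroadcastStrides is a hypothetical helper, not a function from dense_tensor_iterator.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed preprocessing step: align an operand's strides to `out_shape`.
// Dimensions are right-aligned; missing leading dimensions and size-1
// dimensions that get broadcast receive stride 0.
std::vector<int64_t> BroadcastStrides(const std::vector<int64_t>& shape,
                                      const std::vector<int64_t>& strides,
                                      const std::vector<int64_t>& out_shape) {
  assert(shape.size() == strides.size());
  assert(shape.size() <= out_shape.size());
  const size_t offset = out_shape.size() - shape.size();
  std::vector<int64_t> result(out_shape.size(), 0);
  for (size_t i = 0; i < shape.size(); ++i) {
    const int64_t dim = shape[i];
    const int64_t out_dim = out_shape[offset + i];
    assert(dim == out_dim || dim == 1);  // must be broadcast-compatible
    result[offset + i] = (dim == 1 && out_dim != 1) ? 0 : strides[i];
  }
  return result;
}
```

For example, an operand of shape {3, 1} with strides {1, 1} broadcast to {2, 3, 4} would get strides {0, 1, 0}, so a generic kernel can index every operand with the same coordinates.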

4. CUDA strided_kernel_impl

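This part implements a general-purpose CUDA kernel that consumes the shape and stride produced by DenseTensorIterator and performs the computation defined by a Functor.

Location: paddle/phi/kernels/stride/elementwise_kernel.cu
Approach: index computation uses the OffsetCalculator introduced during the Slice optimization, and the computation itself reuses the Functor implementations from KP
Compile-size optimization: to avoid a large increase in binary size, this PR drops the KP-related templates and keeps only the type and Functor template parameters, which keeps the compiled size under control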
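
The core of a strided elementwise kernel is turning a linear output index into a per-operand element offset. The host-side sketch below shows that calculation and a loop that stands in for the CUDA kernel. LinearIndexToOffset, StridedBinary, and this AddFunctor are hypothetical names for illustration; Paddle's OffsetCalculator and the KP functors are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map a linear index over `shape` to an element offset using `strides`.
// Coordinates are peeled off from the innermost (last) dimension outward.
int64_t LinearIndexToOffset(int64_t linear_index,
                            const std::vector<int64_t>& shape,
                            const std::vector<int64_t>& strides) {
  assert(shape.size() == strides.size());
  int64_t offset = 0;
  for (size_t d = shape.size(); d-- > 0;) {
    offset += (linear_index % shape[d]) * strides[d];
    linear_index /= shape[d];
  }
  return offset;
}

// Hypothetical functor in the style of the ones the kernel reuses.
struct AddFunctor {
  float operator()(float a, float b) const { return a + b; }
};

// Host loop standing in for the CUDA kernel: one iteration per output
// element, each input read through its own strides, output contiguous.
template <typename Functor>
std::vector<float> StridedBinary(const std::vector<int64_t>& shape,
                                 const float* x,
                                 const std::vector<int64_t>& x_strides,
                                 const float* y,
                                 const std::vector<int64_t>& y_strides,
                                 Functor op) {
  int64_t numel = 1;
  for (int64_t dim : shape) numel *= dim;
  std::vector<float> out(static_cast<size_t>(numel));
  for (int64_t i = 0; i < numel; ++i) {
    out[static_cast<size_t>(i)] =
        op(x[LinearIndexToOffset(i, shape, x_strides)],
           y[LinearIndexToOffset(i, shape, y_strides)]);
  }
  return out;
}
```

Only the element type and the functor vary between operators in this sketch, which mirrors the stated choice to template on type and Functor alone.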

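IV. Impact

1. Correctness

When FLAGS_use_stride_compute_kernel=1, elementwise_add determines its output according to the following rules:

For the two operands of a binary elementwise operation, a prerequisite is that both can be broadcast to a common shape. We call this shape the broadcast shape.

If both operands already have the broadcast shape, the output stride is determined by the first operand (order matters)
If one of the operands has to be broadcast to the broadcast shape, the output stride is determined by the operand whose shape equals the broadcast shape (order does not matter)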
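
The two rules above can be written down as a small selection function. This is a sketch of the stated rules only: Operand, BroadcastShape, ContiguousStrides, and OutputStrides are hypothetical names, and the final fallback (both operands need broadcasting) is not covered by the description, so returning contiguous strides there is purely an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical operand description: shape plus strides in elements.
struct Operand {
  std::vector<int64_t> shape;
  std::vector<int64_t> strides;
};

// Right-aligned broadcast of two shapes.
std::vector<int64_t> BroadcastShape(const std::vector<int64_t>& a,
                                    const std::vector<int64_t>& b) {
  const size_t n = std::max(a.size(), b.size());
  std::vector<int64_t> out(n, 1);
  for (size_t i = 0; i < n; ++i) {
    // Missing leading dimensions count as 1.
    const int64_t da = i < n - a.size() ? 1 : a[i - (n - a.size())];
    const int64_t db = i < n - b.size() ? 1 : b[i - (n - b.size())];
    assert(da == db || da == 1 || db == 1);
    out[i] = (da == 1) ? db : da;
  }
  return out;
}

// Row-major strides for a shape.
std::vector<int64_t> ContiguousStrides(const std::vector<int64_t>& shape) {
  std::vector<int64_t> s(shape.size(), 1);
  for (size_t d = shape.size(); d-- > 1;) s[d - 1] = s[d] * shape[d];
  return s;
}

std::vector<int64_t> OutputStrides(const Operand& x, const Operand& y) {
  const std::vector<int64_t> bshape = BroadcastShape(x.shape, y.shape);
  // Rule 1: both operands have the broadcast shape, so the first wins.
  // Rule 2: otherwise the operand that already has the broadcast shape wins.
  if (x.shape == bshape) return x.strides;
  if (y.shape == bshape) return y.strides;
  // ASSUMPTION: this case is not described in the PR; contiguous is a guess.
  return ContiguousStrides(bshape);
}
```

Checking x first implements both rules at once: when both shapes match the broadcast shape the first operand decides, and when only one matches, that operand decides regardless of argument order.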

When FLAGS_use_stride_compute_kernel=0, elementwise_add always outputs contiguous tensors.

2. Performance

Transpose + Add benchmark

Shape	Before (μs)	After (μs)	Improvement
64×108×12288	2262.79	1243.12	81.9% faster

As_strided + Add benchmark

Shape	Before (μs)	After (μs)	Improvement
54×32×12288	651.61	451.13	44.3% faster

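V. Risk Management

To ensure correctness when FLAGS_use_stride_compute_kernel=1, new unit tests were added for elementwise_add.

Test location: test/legacy_test/test_elementwise_add_op.py
Coverage: the tests cover transpose/as_strided, broadcasting, 0-dim, 0-size, and other cases, to guarantee correctness as far as possible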

paddle-bot · 2025-08-15T07:33:03Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

codecov-commenter · 2025-08-15T12:59:22Z

Codecov Report

❌ Patch coverage is 0% with 272 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8321bbb). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
paddle/phi/kernels/funcs/dense_tensor_iterator.cc	0.00%	261 Missing ⚠️
paddle/phi/kernels/funcs/dense_tensor_iterator.h	0.00%	11 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop   #74637   +/-   ##
==========================================
  Coverage           ?    0.00%           
==========================================
  Files              ?        2           
  Lines              ?      272           
  Branches           ?        0           
==========================================
  Hits               ?        0           
  Misses             ?      272           
  Partials           ?        0

Eddie-Wang1120 · 2025-08-18T01:28:19Z

/re-run all-failed

xiaoguoguo626807 · 2025-08-18T02:15:56Z

paddle/common/flags.cc

+ * Example:
+ * Note: Whether use DensetensorIterator.
+ */
+PHI_DEFINE_EXPORTED_bool(use_densetensor_iterator,


Suggest renaming this to use_stride_compute_kernel. A name taken from the internal design is hard for external users to understand.

zyfncg · 2025-08-18T11:19:39Z

paddle/phi/kernels/funcs/dense_tensor_iterator.cc

+  return numel;
+}
+
+void* DenseTensorIteratorBase::data_ptr(int64_t arg) const {


Why doesn't this return const void*?

Changed to const void*.

zyfncg · 2025-08-18T11:28:41Z

paddle/phi/kernels/funcs/dense_tensor_iterator.h

+
+enum class FastSetupType : uint8_t { NONE, CONTIGUOUS };
+
+struct DenseOperandInfo {


Why is struct used here?

Unified: all of them are now struct.

zyfncg · 2025-08-18T11:29:49Z

paddle/phi/kernels/funcs/dense_tensor_iterator.h

+struct DenseTensorIteratorBase {
+  void build(DenseTensorIteratorConfig&);
+  int ndim() const { return static_cast<int>(shape_.size()); }
+  std::vector<int64_t> shape() const { return shape_; }


Return a const & to avoid the copy overhead.
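
To illustrate the difference this comment points at, here is a minimal sketch with a hypothetical IteratorSketch class (not the real DenseTensorIteratorBase) that exposes the same member both ways.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical class used only to contrast the two accessor styles.
class IteratorSketch {
 public:
  explicit IteratorSketch(std::vector<int64_t> shape)
      : shape_(std::move(shape)) {}

  // By value: every call copies the whole vector.
  std::vector<int64_t> shape_copy() const { return shape_; }

  // By const reference: callers read the member directly, no copy.
  const std::vector<int64_t>& shape() const { return shape_; }

 private:
  std::vector<int64_t> shape_;
};
```

The reference version stays valid only as long as the iterator object lives, which is the usual trade-off for accessors called on hot paths.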

zyfncg · 2025-08-18T11:30:03Z

paddle/phi/kernels/funcs/dense_tensor_iterator.h

+  int64_t numel() const;
+  int ntensors() const { return static_cast<int>(operands_.size()); }
+  bool is_contiguous() const;
+  std::vector<int64_t> strides(int64_t arg) const {


zyfncg · 2025-08-18T11:30:46Z

paddle/phi/kernels/funcs/dense_tensor_iterator.h

+                              std::vector<int64_t> strides) override;
+};
+
+class DenseTensorIteratorConfig final {


Please add some comments describing what each of these classes does and how they relate to one another.

zyfncg · 2025-08-18T11:38:09Z

paddle/phi/kernels/stride/elementwise_kernel.cu

+#include "paddle/phi/common/complex.h"
+#include "paddle/phi/common/float16.h"
+#endif
+#include "paddle/phi/api/lib/data_transform.h"


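api is an upper-layer interface and kernel is a lower-layer interface. Having the lower layer depend on the upper layer creates a circular dependency.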

Eddie-Wang1120 · 2025-08-19T02:48:17Z

/re-run all-failed

XiaoguangHu01

LGTM

…mentwise_Add (PaddlePaddle#74637) * add densetensor_iterator * add HIP config * set flag to true * fix stride kernel bug * add strided input test * change flag name and add standard kernel defination * refine * fix codestyle

add densetensor_iterator

bedbd5f

add HIP config

e2b7c35

Eddie-Wang1120 added 3 commits August 15, 2025 14:06

set flag to true

b8eb732

fix stride kernel bug

1e2e25d

add strided input test

7d806e2

xiaoguoguo626807 reviewed Aug 18, 2025

Eddie-Wang1120 added 2 commits August 18, 2025 03:13

change flag name and add standard kernel defination

7065323

refine

674a2f7

Eddie-Wang1120 changed the title ~~[WIP] Set up DenseTensorIterator And Support Stride Kernel For Elementwise_Add~~ [Stride] Set up DenseTensorIterator And Support Stride Kernel For Elementwise_Add Aug 18, 2025

xiaoguoguo626807 previously approved these changes Aug 18, 2025

XieYunshen added the skip-ci: coverage label Aug 18, 2025

luotao1 previously approved these changes Aug 18, 2025

zyfncg reviewed Aug 18, 2025

fix codestyle

e4e09f9

Eddie-Wang1120 dismissed stale reviews from luotao1 and xiaoguoguo626807 via e4e09f9 August 18, 2025 12:48

zyfncg approved these changes Aug 19, 2025

swgu98 removed the skip-ci: coverage label Aug 19, 2025

XiaoguangHu01 approved these changes Aug 19, 2025

luotao1 approved these changes Aug 19, 2025

xiaoguoguo626807 merged commit 2434431 into PaddlePaddle:develop Aug 19, 2025
134 of 141 checks passed

