@Eddie-Wang1120 commented Aug 15, 2025

PR Category

Execute Infrastructure

PR Types

Others

Description

pcard-67164

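PR summary: optimizing the operator compute mechanism with DenseTensorIterator

I. Introduction

Building on Paddle's DenseTensor data structure, this PR designs and implements an operator compute mechanism based on DenseTensorIterator. For non-contiguous inputs it aims to improve performance, reduce GPU memory usage, and align correctness behavior with the wider ecosystem.

II. Overall design

[Figure: flowchart of the overall design (image not preserved)]

III. Implementation

This PR mainly implements the Paddle TensorIterator from the design, together with the associated generic CUDA stride kernel. Together they improve both the performance and the correctness of the elementwise_add operator. The implementation is described below from top to bottom.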

1. Strided Kernel

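This part registers and implements a stride kernel on the GPU.

  • Location: paddle/phi/kernels/stride/elementwise_kernel.cu
  • Approach: the kernel is registered with PD_REGISTER_KERNEL and implemented in elementwise_kernel.cu
  • Function naming rule: XXStrideKernel
  • Extension mechanism: to make it easy to generate more StrideKernels later, the definition is wrapped in macros, so a new function can be generated with, for example, DEFINE_CUDA_BINARY_ELEMENTWISE_STRIDE_OP(Add)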
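
The macro-based extension mechanism can be illustrated with a minimal CPU-only sketch. Everything here is hypothetical and simplified: DEFINE_BINARY_STRIDE_OP_SKETCH, BinaryStrideImpl, and the two functors are stand-ins invented for illustration, not the PR's actual DEFINE_CUDA_BINARY_ELEMENTWISE_STRIDE_OP or KP functors. It only shows how one macro line per operator can expand to a function that follows the XXStrideKernel naming rule.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the functors that the PR reuses from KP.
template <typename T>
struct AddFunctor {
  T operator()(T a, T b) const { return a + b; }
};
template <typename T>
struct MultiplyFunctor {
  T operator()(T a, T b) const { return a * b; }
};

// Hypothetical shared implementation: applies Functor element by element,
// reading each input through its own element stride (1-D for brevity).
template <typename T, typename Functor>
void BinaryStrideImpl(const T* x, int64_t x_stride,
                      const T* y, int64_t y_stride,
                      T* out, int64_t n) {
  Functor f;
  for (int64_t i = 0; i < n; ++i) {
    out[i] = f(x[i * x_stride], y[i * y_stride]);
  }
}

// Simplified stand-in for the PR's macro: each use expands to a function
// named <Name>StrideKernel that forwards to the shared implementation.
#define DEFINE_BINARY_STRIDE_OP_SKETCH(name)                        \
  template <typename T>                                             \
  void name##StrideKernel(const T* x, int64_t x_stride,             \
                          const T* y, int64_t y_stride,             \
                          T* out, int64_t n) {                      \
    BinaryStrideImpl<T, name##Functor<T>>(x, x_stride, y, y_stride, \
                                          out, n);                  \
  }

DEFINE_BINARY_STRIDE_OP_SKETCH(Add)
DEFINE_BINARY_STRIDE_OP_SKETCH(Multiply)
```

With this shape, supporting another binary operator only needs a functor and one more macro line; the per-operator body is never written by hand.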

2. Risk Control (Flag)

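This part implements a flag for risk management, guarding against correctness and performance regressions in case the new feature turns out to be unstable.

  • Location: paddle/common/flags.cc
  • Flag name: FLAGS_use_stride_compute_kernel
  • Usage:
    • When a Paddle program runs with FLAGS_use_stride_compute_kernel=1, the kernel is dispatched to the new implementation, which operates directly on non-contiguous inputs
    • Otherwise it falls back to the previous behavior: the inputs are first made contiguous and then computed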
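
The two code paths selected by the flag can be sketched as follows. This is a simplified CPU illustration under stated assumptions: View2D, MakeContiguous, Add, and the plain bool use_stride_compute_kernel are hypothetical stand-ins, not Paddle's dispatch code, and both branches write a contiguous output here for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2-D view: element (i, j) lives at data[i*stride0 + j*stride1].
struct View2D {
  const float* data;
  int64_t rows;
  int64_t cols;
  int64_t stride0;
  int64_t stride1;
};

// Fallback step: copy a strided view into a fresh contiguous buffer.
std::vector<float> MakeContiguous(const View2D& v) {
  std::vector<float> out(static_cast<size_t>(v.rows * v.cols));
  for (int64_t i = 0; i < v.rows; ++i) {
    for (int64_t j = 0; j < v.cols; ++j) {
      out[i * v.cols + j] = v.data[i * v.stride0 + j * v.stride1];
    }
  }
  return out;
}

// Stand-in for FLAGS_use_stride_compute_kernel.
bool use_stride_compute_kernel = false;

std::vector<float> Add(const View2D& x, const View2D& y) {
  std::vector<float> out(static_cast<size_t>(x.rows * x.cols));
  if (use_stride_compute_kernel) {
    // New path: read both inputs through their strides, no temporaries.
    for (int64_t i = 0; i < x.rows; ++i) {
      for (int64_t j = 0; j < x.cols; ++j) {
        out[i * x.cols + j] = x.data[i * x.stride0 + j * x.stride1] +
                              y.data[i * y.stride0 + j * y.stride1];
      }
    }
  } else {
    // Previous path: make both inputs contiguous first, then add.
    const std::vector<float> xc = MakeContiguous(x);
    const std::vector<float> yc = MakeContiguous(y);
    for (size_t k = 0; k < out.size(); ++k) out[k] = xc[k] + yc[k];
  }
  return out;
}
```

The fallback branch allocates and fills two temporary buffers before computing, which is the extra time and memory the strided path avoids.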

3. Paddle TensorIterator (DenseTensorIterator)

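This part implements a class that preprocesses arbitrary tensors before computation; it is currently named DenseTensorIterator.

  • Location:
    • paddle/phi/kernels/funcs/dense_tensor_iterator.cc
    • paddle/phi/kernels/funcs/dense_tensor_iterator.h
  • Main function: DenseTensorIterator ensures that tensors have the best shape/stride for the subsequent computation before they enter the generic CUDA kernel, and it keeps the output tensor's contiguity behavior aligned with the ecosystem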
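
The description does not spell out what "best shape/stride" means. One common preprocessing step in tensor iterators is coalescing adjacent dimensions so the kernel does less index arithmetic; the sketch below shows that technique as an assumption, not as a description of what dense_tensor_iterator.cc actually does. IterDims and CoalesceDims are hypothetical names, and strides are in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one shared shape, one stride vector per operand.
struct IterDims {
  std::vector<int64_t> shape;
  std::vector<std::vector<int64_t>> strides;
};

// Merge dimension d into the previous (outer) dimension whenever, for every
// operand, stepping the outer dimension once equals walking the whole of d:
// outer_stride == stride[d] * shape[d].
IterDims CoalesceDims(const IterDims& in) {
  IterDims out;
  out.strides.resize(in.strides.size());
  for (size_t d = 0; d < in.shape.size(); ++d) {
    bool can_merge = !out.shape.empty();
    for (size_t t = 0; can_merge && t < in.strides.size(); ++t) {
      can_merge = out.strides[t].back() == in.strides[t][d] * in.shape[d];
    }
    for (size_t t = 0; t < in.strides.size(); ++t) {
      if (can_merge) {
        out.strides[t].back() = in.strides[t][d];  // merged dim keeps inner stride
      } else {
        out.strides[t].push_back(in.strides[t][d]);
      }
    }
    if (can_merge) {
      out.shape.back() *= in.shape[d];
    } else {
      out.shape.push_back(in.shape[d]);
    }
  }
  return out;
}
```

For example, two contiguous operands of shape [2, 3, 4] collapse to a single dimension of 24, while an operand that is a slice of a larger tensor keeps the dimension across which its memory is not dense.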

4. CUDA strided_kernel_impl

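This part implements a generic CUDA kernel that takes the shape and stride produced by DenseTensorIterator and performs the computation defined by a Functor.

  • Location: paddle/phi/kernels/stride/elementwise_kernel.cu
  • Approach: index computation uses the OffsetCalculator introduced during the Slice optimization, and the computation itself reuses the Functor implementations from KP
  • Compile-size optimization: to avoid a large increase in binary size, this PR drops the KP-related templates and keeps only the type and Functor templates, which keeps the compiled size under control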
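
The index computation can be sketched on the CPU as below. This is an illustration of the general idea (turn a linear index into a per-operand element offset, then apply the functor), not the PR's OffsetCalculator or kernel; OffsetOf and StridedBinaryLoop are hypothetical names, and the sequential loop stands in for what would be one GPU thread per index.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Convert a linear index over `shape` (row-major order) into an element
// offset for a tensor with the given strides.
int64_t OffsetOf(int64_t linear,
                 const std::vector<int64_t>& shape,
                 const std::vector<int64_t>& strides) {
  int64_t offset = 0;
  for (int d = static_cast<int>(shape.size()) - 1; d >= 0; --d) {
    offset += (linear % shape[d]) * strides[d];
    linear /= shape[d];
  }
  return offset;
}

// Apply a binary functor to two strided inputs and write a strided output.
template <typename T, typename Functor>
void StridedBinaryLoop(const std::vector<int64_t>& shape,
                       const T* x, const std::vector<int64_t>& x_strides,
                       const T* y, const std::vector<int64_t>& y_strides,
                       T* out, const std::vector<int64_t>& out_strides,
                       Functor f) {
  int64_t numel = 1;
  for (int64_t s : shape) numel *= s;
  // On the GPU, each value of i would be handled by one thread.
  for (int64_t i = 0; i < numel; ++i) {
    out[OffsetOf(i, shape, out_strides)] =
        f(x[OffsetOf(i, shape, x_strides)], y[OffsetOf(i, shape, y_strides)]);
  }
}
```

Only the element type and the functor are template parameters in this sketch, which mirrors the stated goal of keeping template instantiations, and therefore binary size, small. A 0-size shape yields numel 0 and no iterations; an empty (0-dim) shape yields a single element at offset 0.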

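IV. Impact

1. Correctness impact

With FLAGS_use_stride_compute_kernel=1, elementwise_add determines its output layout by the following rules:

The two operands of a binary elementwise operation must be broadcastable to a common shape; we call this shape the broadcast shape.

  • If both operands already have the broadcast shape, the output stride is determined by the first operand (order matters)
  • If one of the operands has to be broadcast to the broadcast shape, the output stride is determined by the operand whose shape equals the broadcast shape (order does not matter)

With FLAGS_use_stride_compute_kernel=0, elementwise_add always outputs a contiguous tensor.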
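
The output rules above can be written down as a small decision function. This is a sketch under assumptions: BroadcastShape, ContiguousStrides, and OutputStrides are hypothetical helpers, "determined by" is simplified to copying the chosen operand's strides, and the case where both operands need broadcasting is not covered by the description, so the sketch falls back to contiguous strides there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Dims = std::vector<int64_t>;

// Right-aligned broadcasting of two shapes.
Dims BroadcastShape(const Dims& a, const Dims& b) {
  const size_t n = std::max(a.size(), b.size());
  Dims out(n, 1);
  for (size_t i = 0; i < n; ++i) {
    const int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
    const int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
    if (da != db && da != 1 && db != 1) {
      throw std::invalid_argument("shapes are not broadcastable");
    }
    out[n - 1 - i] = (da == 1) ? db : da;
  }
  return out;
}

// Row-major strides (in elements) for a contiguous tensor.
Dims ContiguousStrides(const Dims& shape) {
  Dims s(shape.size(), 1);
  for (int d = static_cast<int>(shape.size()) - 2; d >= 0; --d) {
    s[d] = s[d + 1] * shape[d + 1];
  }
  return s;
}

Dims OutputStrides(const Dims& x_shape, const Dims& x_strides,
                   const Dims& y_shape, const Dims& y_strides) {
  const Dims bshape = BroadcastShape(x_shape, y_shape);
  // Both operands have the broadcast shape: the first one decides.
  // Only x has it: x decides.
  if (x_shape == bshape) return x_strides;
  // Only y has it: y decides, regardless of operand order.
  if (y_shape == bshape) return y_strides;
  // Assumption: not specified in the description; use contiguous strides.
  return ContiguousStrides(bshape);
}
```

Swapping two full-shape operands changes the result, while swapping a full-shape operand with one that needs broadcasting does not.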

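2. Performance impact

Transpose + Add performance test

| Shape | Before (μs) | After (μs) | Improvement |
|---|---|---|---|
| 64×108×12288 | 2262.79 | 1243.12 | 81.9% faster |

As_strided + Add performance test

| Shape | Before (μs) | After (μs) | Improvement |
|---|---|---|---|
| 54×32×12288 | 651.61 | 451.13 | 44.3% faster |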

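V. Risk management

To ensure correctness with FLAGS_use_stride_compute_kernel=1, new unit tests were added for elementwise_add.

  • Test location: test/legacy_test/test_elementwise_add_op.py
  • Coverage: the tests cover transpose/as_strided, broadcasting, 0-dim, 0-size, and other cases, to ensure correctness as far as possible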

paddle-bot bot commented Aug 15, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

codecov-commenter commented Aug 15, 2025

Codecov Report

❌ Patch coverage is 0% with 272 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8321bbb). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddle/phi/kernels/funcs/dense_tensor_iterator.cc | 0.00% | 261 Missing ⚠️ |
| paddle/phi/kernels/funcs/dense_tensor_iterator.h | 0.00% | 11 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #74637   +/-   ##
==========================================
  Coverage           ?    0.00%           
==========================================
  Files              ?        2           
  Lines              ?      272           
  Branches           ?        0           
==========================================
  Hits               ?        0           
  Misses             ?      272           
  Partials           ?        0           

@Eddie-Wang1120 (Contributor Author)

/re-run all-failed

* Example:
* Note: Whether use DensetensorIterator.
*/
PHI_DEFINE_EXPORTED_bool(use_densetensor_iterator,

Contributor:
Suggest renaming this to use_stride_compute_kernel; an internal design name is hard for outside users to understand.

Contributor Author:
done

@Eddie-Wang1120 changed the title from "[WIP] Set up DenseTensorIterator And Support Stride Kernel For Elementwise_Add" to "[Stride] Set up DenseTensorIterator And Support Stride Kernel For Elementwise_Add" on Aug 18, 2025
luotao1 previously approved these changes Aug 18, 2025

return numel;
}

void* DenseTensorIteratorBase::data_ptr(int64_t arg) const {

Contributor:
Why doesn't this return const void*?

Contributor Author:
Changed to const void*.


enum class FastSetupType : uint8_t { NONE, CONTIGUOUS };

struct DenseOperandInfo {

Contributor:
Why is a struct used here?

Contributor Author:
These are now all consistently struct.


struct DenseTensorIteratorBase {
void build(DenseTensorIteratorConfig&);
int ndim() const { return static_cast<int>(shape_.size()); }
std::vector<int64_t> shape() const { return shape_; }

Contributor:
Return a const & here to avoid the copy overhead.

Contributor Author:
done


int64_t numel() const;
int ntensors() const { return static_cast<int>(operands_.size()); }
bool is_contiguous() const;
std::vector<int64_t> strides(int64_t arg) const {

Contributor:
Same as above.

Contributor Author:
done
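
The two accessor comments above ask for the same change, which can be sketched in isolation as follows. IteratorSketch is a hypothetical stand-in for illustration only, not the real DenseTensorIteratorBase; shape_copy() mirrors the by-value signature in the reviewed snippet and shape() the by-reference form the reviewer suggests.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an iterator class that stores a shape.
struct IteratorSketch {
  explicit IteratorSketch(std::vector<int64_t> shape)
      : shape_(std::move(shape)) {}

  // By value: every call allocates and copies the vector.
  std::vector<int64_t> shape_copy() const { return shape_; }

  // By const reference: callers read the member directly, with no copy.
  const std::vector<int64_t>& shape() const { return shape_; }

 private:
  std::vector<int64_t> shape_;
};
```

The returned reference is only valid while the iterator object is alive, which is the usual trade-off of this form.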

std::vector<int64_t> strides) override;
};

class DenseTensorIteratorConfig final {

Contributor:
Consider adding comments that explain what each of these classes does and how they relate to one another.

Contributor Author:
done

#include "paddle/phi/common/complex.h"
#include "paddle/phi/common/float16.h"
#endif
#include "paddle/phi/api/lib/data_transform.h"

Contributor:
The api layer is a higher-level interface and the kernel layer is a lower-level one; having the lower layer depend on the higher layer introduces a circular dependency.

Contributor Author:
done

@Eddie-Wang1120 (Contributor Author)

/re-run all-failed

@XiaoguangHu01 (Contributor) left a comment

LGTM

@xiaoguoguo626807 merged commit 2434431 into PaddlePaddle:develop on Aug 19, 2025
134 of 141 checks passed
Luckycheng222 pushed a commit to Luckycheng222/Paddle that referenced this pull request Aug 25, 2025
…mentwise_Add (PaddlePaddle#74637)

* add densetensor_iterator

* add HIP config

* set flag to true

* fix stride kernel bug

* add strided input test

* change flag name and add standard kernel defination

* refine

* fix codestyle
8 participants