[Stride] Set up DenseTensorIterator And Support Stride Kernel For Elementwise_Add #74637
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
Codecov Report
❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## develop #74637 +/- ##
==========================================
Coverage ? 0.00%
==========================================
Files ? 2
Lines ? 272
Branches ? 0
==========================================
Hits ? 0
Misses ? 272
Partials ? 0
/re-run all-failed
paddle/common/flags.cc
Outdated
 * Example:
 * Note: Whether use DensetensorIterator.
 */
PHI_DEFINE_EXPORTED_bool(use_densetensor_iterator,
Suggest renaming this to use_stride_compute_kernel; a name taken from the internal design is hard for external users to understand.
done
  return numel;
}

void* DenseTensorIteratorBase::data_ptr(int64_t arg) const {
Why doesn't this return const void*?
Changed to const void*.
enum class FastSetupType : uint8_t { NONE, CONTIGUOUS };

struct DenseOperandInfo {
Why is struct used here?
These have all been unified to struct.
struct DenseTensorIteratorBase {
  void build(DenseTensorIteratorConfig&);
  int ndim() const { return static_cast<int>(shape_.size()); }
  std::vector<int64_t> shape() const { return shape_; }
Return a const & here to avoid the copy overhead.
done
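The suggestion in this thread can be illustrated with a minimal sketch. `ShapeHolder` and `shape_copy()` are hypothetical names used only for this example and are not part of the PR; only the accessor signature mirrors the reviewed code.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the iterator class; only the accessors matter.
struct ShapeHolder {
  explicit ShapeHolder(std::vector<int64_t> shape) : shape_(std::move(shape)) {}

  // Returns by value: every call copies the whole vector.
  std::vector<int64_t> shape_copy() const { return shape_; }

  // Returns a const reference: no copy is made and the caller cannot
  // modify the member through it.
  const std::vector<int64_t>& shape() const { return shape_; }

 private:
  std::vector<int64_t> shape_;
};
```

The returned reference is only valid while the owning object is alive, so callers that need the shape to outlive the iterator should still take a copy.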
  int64_t numel() const;
  int ntensors() const { return static_cast<int>(operands_.size()); }
  bool is_contiguous() const;
  std::vector<int64_t> strides(int64_t arg) const {
Same as above.
done
      std::vector<int64_t> strides) override;
};

class DenseTensorIteratorConfig final {
Please add some comments explaining what each of these classes does and how they relate to one another.
done
#include "paddle/phi/common/complex.h"
#include "paddle/phi/common/float16.h"
#endif
#include "paddle/phi/api/lib/data_transform.h"
api is an upper-layer interface and kernel is a lower-layer one; having the lower layer depend on the upper layer creates a circular dependency.
done
e4e09f9
/re-run all-failed
XiaoguangHu01
left a comment
LGTM
2434431
into
PaddlePaddle:develop
…mentwise_Add (PaddlePaddle#74637)
* add densetensor_iterator
* add HIP config
* set flag to true
* fix stride kernel bug
* add strided input test
* change flag name and add standard kernel defination
* refine
* fix codestyle
PR Category
Execute Infrastructure
PR Types
Others
Description
pcard-67164
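PR summary: optimizing the operator compute mechanism with DenseTensorIterator
I. Introduction
Building on Paddle's DenseTensor data structure, this PR designs and implements an operator compute mechanism based on DenseTensorIterator. For non-contiguous inputs, it aims to improve performance, reduce GPU memory usage, and align correctness behavior with the wider ecosystem.
II. Overall Design
III. Implementation
This PR mainly implements the Paddle TensorIterator from the design and the general-purpose CUDA stride kernel that goes with it. Together they improve the performance and correctness of the elementwise_add operator. The implementation is described below from the top down.
1. Strided Kernel
This part registers and implements a stride kernel on GPU.
paddle/phi/kernels/stride/elementwise_kernel.cu
The kernel is defined with PD_REGISTER_KERNEL, and XXStrideKernel is implemented in elementwise_kernel.cu.
The function itself is generated with DEFINE_CUDA_BINARY_ELEMENTWISE_STRIDE_OP(Add).
2. Risk Control (Flag)
This part adds a flag for risk management, guarding against correctness and performance regressions in case the new feature proves unstable.
paddle/common/flags.cc
FLAGS_use_stride_compute_kernel
When a Paddle program runs with FLAGS_use_stride_compute_kernel=1, the kernel dispatches to the new implementation and computes on non-contiguous inputs directly.
3. Paddle TensorIterator (DenseTensorIterator)
This part implements a class that preprocesses arbitrary tensors before computation; it is currently named DenseTensorIterator.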
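One kind of preprocessing such an iterator typically performs is expanding each operand's strides to the common output shape. The sketch below is an assumption about what that step can look like, not the PR's code: `BroadcastStrides` is a hypothetical helper, and it assumes the operand has no more dimensions than the common shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not Paddle's API: expand an operand's strides to a
// common (already broadcast) shape. Dimensions the operand lacks, or where
// it has size 1 but the common shape does not, get stride 0 so that every
// index along them reads the same element.
std::vector<int64_t> BroadcastStrides(const std::vector<int64_t>& shape,
                                      const std::vector<int64_t>& strides,
                                      const std::vector<int64_t>& common_shape) {
  const std::size_t ndim = common_shape.size();
  const std::size_t offset = ndim - shape.size();  // leading dims to pad
  std::vector<int64_t> out(ndim, 0);
  for (std::size_t i = 0; i < shape.size(); ++i) {
    const bool broadcast = shape[i] == 1 && common_shape[offset + i] != 1;
    out[offset + i] = broadcast ? 0 : strides[i];
  }
  return out;
}
```

With zero strides in place, a kernel can treat broadcast and non-contiguous operands uniformly, since both are fully described by a shape and a stride vector.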
paddle/phi/kernels/funcs/dense_tensor_iterator.cc
paddle/phi/kernels/funcs/dense_tensor_iterator.h
4. CUDA strided_kernel_impl
This part implements a general-purpose CUDA kernel that takes the shape and stride produced by DenseTensorIterator and performs the computation defined by the Functor.
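The idea of driving an elementwise functor from a shape and per-operand strides can be sketched on the CPU as follows. This is an illustration under stated assumptions rather than the PR's kernel: `ElementOffset`, `AddFunctor`, and `StridedBinary` are hypothetical names, and the real implementation runs as a CUDA kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU analogue of an offset calculator: convert a linear index
// over `shape` into an element offset using per-dimension strides.
int64_t ElementOffset(int64_t linear, const std::vector<int64_t>& shape,
                      const std::vector<int64_t>& strides) {
  int64_t offset = 0;
  for (int64_t d = static_cast<int64_t>(shape.size()) - 1; d >= 0; --d) {
    offset += (linear % shape[d]) * strides[d];
    linear /= shape[d];
  }
  return offset;
}

// Illustrative functor; the PR reuses existing functor implementations.
struct AddFunctor {
  template <typename T>
  T operator()(const T& a, const T& b) const { return a + b; }
};

// Apply `func` elementwise; no operand needs to be contiguous.
template <typename T, typename Functor>
void StridedBinary(const T* x, const std::vector<int64_t>& x_strides,
                   const T* y, const std::vector<int64_t>& y_strides,
                   T* out, const std::vector<int64_t>& out_strides,
                   const std::vector<int64_t>& shape, Functor func) {
  int64_t numel = 1;
  for (int64_t s : shape) numel *= s;
  for (int64_t i = 0; i < numel; ++i) {
    out[ElementOffset(i, shape, out_strides)] =
        func(x[ElementOffset(i, shape, x_strides)],
             y[ElementOffset(i, shape, y_strides)]);
  }
}
```

For example, a 2x3 buffer viewed as its 3x2 transpose has strides {1, 3}; passing those strides lets the loop read the transposed view in place instead of first materializing a contiguous copy.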
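paddle/phi/kernels/stride/elementwise_kernel.cu
OffsetCalculator is used for index computation, after which the Functor implementations in KP are reused to perform the computation.
IV. Impact
1. Correctness impact
When FLAGS_use_stride_compute_kernel=1, elementwise_add handles its output according to the following rules. For the two operands of a binary elementwise operation, a necessary precondition is that both can be broadcast to a common size; we call this size the broadcast size.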
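The broadcast size can be computed as in the sketch below, which assumes the common NumPy-style rule (align shapes from the trailing dimension; sizes must match or one of them must be 1). `BroadcastShape` is a hypothetical helper, and Paddle's exact rules may differ in details.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: compute the common shape two operands broadcast to,
// or throw if they are incompatible.
std::vector<int64_t> BroadcastShape(const std::vector<int64_t>& a,
                                    const std::vector<int64_t>& b) {
  const std::size_t ndim = std::max(a.size(), b.size());
  std::vector<int64_t> out(ndim, 1);
  for (std::size_t i = 0; i < ndim; ++i) {
    // Walk from the trailing dimension; a missing dimension counts as 1.
    const int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
    const int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
    if (da != db && da != 1 && db != 1) {
      throw std::invalid_argument("shapes cannot be broadcast");
    }
    out[ndim - 1 - i] = (da == 1) ? db : da;
  }
  return out;
}
```

Under this rule, shapes {2, 1, 4} and {3, 1} broadcast to {2, 3, 4}, while {2, 3} and {4, 3} are rejected.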
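When FLAGS_use_stride_compute_kernel=0, elementwise_add always outputs contiguous tensors.
2. Performance impact
Transpose + Add performance test
As_strided + Add performance test
V. Risk Management
To ensure correctness when FLAGS_use_stride_compute_kernel=1, new unit tests were added for elementwise_add.
test/legacy_test/test_elementwise_add_op.py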