【Hackathon No.38】Optimize the GPU computation performance of the deformable_conv op for Paddle #218
Conversation
@@ -0,0 +1,91 @@
Poisson OP Performance Optimization Design Document
Please fix the title: it says Poisson, but this proposal is about deformable_conv.
Fixed.
# 1 Background and Motivation
The GPU version of deformable_conv currently in Paddle is implemented with cuBLAS plus CUDA kernels. As in the original paper authors' implementation, the kernels were ported from the CPU kernels to the GPU, and the CUDA code has not received any targeted optimization.
Typo: 讲 -> 将.
Fixed.
To survey this op's current performance in the Paddle framework (develop branch), the op's performance data for the various cases in [OP Benchmark](https://github.com/PaddlePaddle/benchmark/tree/master/api/tests_v2) is listed in table form (Tesla P4).
### Timing Analysis
The deformable_conv test was run through the benchmark, which executes both the forward and backward passes, so the time spent on each needs to be separated; the table below lists the core components.
Actually, when running the OP benchmark you can just set `--backward` to `False`; that way the numbers cover only the forward pass, which is easier to read.
Got it, thanks 🙏 The data has been updated.
Kernel runtime analysis: 65% of the time is spent in kernel No. 5, and the cuBLAS implementation itself leaves little room for improvement, so optimizing the two kernels individually is unlikely to reach the target.
CUDA API timing analysis: 63% of the time goes to synchronization and 26% to memory allocation. Reducing the number of thread synchronizations and cutting data movement between host memory and the CUDA device should therefore yield substantial gains, which makes optimization points 1 and 2 the main focus. Overall the execution is serial: each im2col must finish before its gemm runs, and only then does the next im2col + gemm round begin.
By "cutting data movement between memory and CUDA", do you mean copies between host and device? Could you also add the API timing measurements here?
Yes, it refers to host-to-device data copies. With only the forward pass measured, the API time distribution changes; I have added it to the timing section and revised the wording here as well. The API breakdown alone does not support any obvious conclusion.
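To make the serial structure concrete, here is a minimal host-side sketch of the flow as I read it from the description above; the names, shapes, and launch parameters are placeholders, not the actual Paddle code:

```cpp
#include <cublas_v2.h>

// Simplified declaration of the existing im2col kernel (the real Paddle
// kernel takes many more shape arguments).
__global__ void ModulatedDeformableIm2colGpuKernel(const float* input,
                                                   const float* offset,
                                                   const float* mask,
                                                   float* col_buffer);

// Sketch of the serial flow described above: stream ordering makes each
// gemm wait for its im2col, and the loop makes each step wait for the
// previous gemm, so nothing ever overlaps.
void DeformableConvForwardSerial(cublasHandle_t handle, const float* input,
                                 const float* offset, const float* mask,
                                 const float* weight, float* col_buffer,
                                 float* output, int batch_size,
                                 int im2col_step, int step_in, int step_out,
                                 int m, int n, int k, dim3 grid, dim3 block) {
  const float alpha = 1.f, beta = 0.f;
  for (int i = 0; i < batch_size / im2col_step; ++i) {
    // Expand one im2col_step worth of input into col_buffer.
    ModulatedDeformableIm2colGpuKernel<<<grid, block>>>(
        input + i * step_in, offset, mask, col_buffer);
    // Multiply by the filter weights: output_3d = weight * col_buffer.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, weight, m,
                col_buffer, k, &beta, output + i * step_out, m);
  }
}
```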
+ Optimization point 1: search for a better launch configuration by tuning the grid and block dimensions.
+ Optimization point 2: move the loop in deformable_conv_kernel_impl that multiplies pixel values by weights into ModulatedDeformableIm2colGpuKernel, fusing the parallel computation of col_buffer with the computation of output_3d to cut part of the data-movement overhead.
+ Optimization point 3: parallelize the (batch_size / im2col_step) loop iterations in deformable_conv_kernel_impl; with the current loop, the next step can only start after the previous im2col_step finishes, and that waiting is unnecessary.
+ Optimization point 4: separately optimize the loop in deformable_conv_kernel_impl that computes the pixel-weight products.
Please make the end-of-sentence punctuation consistent.
Fixed.
## 2.2 Host / Device Computation Flow
1. For optimization point 1: consider obtaining a near-optimal launch configuration via the GetGpuLaunchConfig1D method in Paddle's existing gpu_launch_config.h, or manually benchmark different block sizes (some optimization headroom likely; see the sketch below).
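A small sketch of what the tuning in step 1 amounts to. Whether GetGpuLaunchConfig1D uses exactly this CUDA API internally is my assumption; `cudaOccupancyMaxPotentialBlockSize` is simply the standard runtime call for the same idea:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for ModulatedDeformableIm2colGpuKernel.
__global__ void Im2colKernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Occupancy-driven launch configuration: the same idea that
// GetGpuLaunchConfig1D wraps (the Paddle helper's internals may differ).
void PickLaunchConfig(int64_t num_kernels, dim3* grid, dim3* block) {
  int min_grid_size = 0;  // smallest grid that still reaches full occupancy
  int block_size = 0;     // occupancy-friendly threads per block
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                     Im2colKernel, 0, 0);
  *block = dim3(block_size);
  // One thread per element, rounded up.
  *grid = dim3((num_kernels + block_size - 1) / block_size);
}
```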
2. For optimization point 2: pass two extra parameters into ModulatedDeformableIm2colGpuKernel on the host side; on the device side, continue computing the output after col_buffer is done (potentially large optimization headroom).
Could you describe this in more detail, e.g. with a diagram or pseudocode? Based on the description above, this optimization should be the centerpiece.
Pseudocode description added.
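The added pseudocode itself is not visible in this thread, so here is my hedged reading of what the fusion could look like. The two extra host-side parameters are assumed to be the filter weights and the output pointer, and the atomic accumulation merely stands in for the gemm reduction:

```cpp
// Hypothetical fused kernel for optimization point 2 (a sketch of the
// proposal, not the merged code). col_buffer has col_rows rows
// (in_channels * kh * kw) and spatial_size columns (out_h * out_w).
__global__ void ModulatedDeformableIm2colFusedKernel(
    const float* input, const float* offset, const float* mask,
    const float* weight,  // extra parameter 1 (assumed)
    float* output,        // extra parameter 2 (assumed)
    float* col_buffer, int num_kernels, int col_rows, int out_channels,
    int spatial_size) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= num_kernels) return;

  // Step 1: the original im2col work; the deformable bilinear sampling of
  // `input` at the offset/mask-adjusted location is omitted for brevity.
  float val = 0.f;
  col_buffer[idx] = val;

  // Step 2: instead of returning and running a separate gemm pass over
  // col_buffer, immediately accumulate this element's contribution into
  // the output (atomics stand in for the gemm's reduction).
  int col = idx % spatial_size;  // output spatial position
  int row = idx / spatial_size;  // row within the filter window
  for (int oc = 0; oc < out_channels; ++oc) {
    atomicAdd(&output[oc * spatial_size + col],
              weight[oc * col_rows + row] * val);
  }
}
```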
3. For optimization point 3: parallelize the entire im2col_step process as a new kernel covering both the im2col and gemm steps (potentially large optimization headroom).
Same as above.
Pseudocode description added.
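And a hedged sketch of one way to realize optimization point 3: give every step its own col_buffer slice and its own CUDA stream, so step i+1's im2col no longer queues behind step i's gemm (the doc's actual pseudocode may differ):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void ModulatedDeformableIm2colGpuKernel(const float* input,
                                                   const float* offset,
                                                   const float* mask,
                                                   float* col_buffer);

// One way to realize optimization point 3: each step runs on its own
// stream with a private col_buffer slice, so im2col and gemm of different
// steps overlap instead of serializing.
void DeformableConvForwardOverlapped(
    cublasHandle_t handle, const float* input, const float* offset,
    const float* mask, const float* weight, float* col_buffers, float* output,
    int num_steps, int step_in, int step_out, int col_buffer_size, int m,
    int n, int k, dim3 grid, dim3 block) {
  const float alpha = 1.f, beta = 0.f;
  std::vector<cudaStream_t> streams(num_steps);
  for (int i = 0; i < num_steps; ++i) {
    cudaStreamCreate(&streams[i]);
    float* col = col_buffers + i * col_buffer_size;  // private slice
    ModulatedDeformableIm2colGpuKernel<<<grid, block, 0, streams[i]>>>(
        input + i * step_in, offset, mask, col);
    // The gemm is queued on the same stream, so it still follows its own
    // im2col, but steps on different streams are free to overlap.
    cublasSetStream(handle, streams[i]);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, weight, m,
                col, k, &beta, output + i * step_out, m);
  }
  for (int i = 0; i < num_steps; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
  }
}
```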
## 3 Testing and Acceptance Criteria
Achieve a forward-pass speedup of more than 25%.
Could you explain the reasoning behind the estimated speedup of more than 25%?
Fixed.
Question: is there currently a way to use dynamic parallelism in Paddle? I would like to call BLAS inside a GPU kernel to compute the matrix multiplication.
cuBLAS functions are all invoked directly on the host side; internally they can be thought of as launching CUDA kernels themselves, so you cannot call a __global__ function from inside another __global__ function.
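For context, the usual host-side workaround is a strided-batched gemm, which covers all steps with one call and needs no dynamic parallelism (a sketch; not necessarily what this PR ended up doing):

```cpp
#include <cublas_v2.h>

// Host-side alternative to device-side BLAS: a single strided-batched gemm
// multiplies every col_buffer slice by the shared weight matrix at once.
void BatchedGemmAllSteps(cublasHandle_t handle, const float* weight,
                         const float* col_buffers, float* output, int m,
                         int n, int k, long long stride_col,
                         long long stride_out, int num_steps) {
  const float alpha = 1.f, beta = 0.f;
  cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                            &alpha, weight, m, /*strideA=*/0,  // shared A
                            col_buffers, k, stride_col, &beta, output, m,
                            stride_out, num_steps);
}
```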
LGTM
The proposal has been revised; thanks for reviewing.