[Hackathon No.33] Optimize the GPU performance of the erfinv op for Paddle #199

Merged: 7 commits, Aug 17, 2022

thunder95
Contributor

Optimize the GPU performance of the erfinv op for Paddle
Task: PaddlePaddle/Paddle#44072 (comment)

| Case No. | device | input_shape | input_type | Paddle Perf(ms) |
|---|---|---|---|---|
| 1 | RTX 2070s | [-1L, 204800L] | float32 | 0.1438 |
| 2 | RTX 2070s |[10L, 20L, 30L, 40L, 5L, 6L] | float64 8| 8.6485 |
Contributor

The `float64 8` entry here looks wrong.

Contributor Author

Typo; fixed.


PyTorch implements the erfinv operator on the GPU. Its overall forward performance (measured with PyTorch v1.12) is as follows:

| Case No. | device | input_shape | input_type | Paddle Perf(ms) |
Contributor

Shouldn't `Paddle Perf(ms)` here be `Pytorch Perf(ms)`?

Contributor Author

Typo; fixed.


## 2.1 Key modules and performance improvements

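The computation uses Paddle's internal Elementwise Kernel. The operator is optimized through vectorized reads, vectorized writes, and the thread configuration method in gpu_launch_config.h, with an expected speedup of 1.2x.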
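
The design text does not show kernel code. As a rough host-side analogue (not Paddle's Elementwise Kernel, and not CUDA), the sketch below shows the general pattern behind vectorized reads and writes: data moves in fixed-width packs, the operation is applied per element, and a scalar loop handles the tail. `Vec4`, `ElementwiseVec4`, and `ApplyElementwise` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical pack of 4 floats, aligned so it can be moved as one 16-byte
// unit. On a GPU a type such as float4 plays this role.
struct alignas(16) Vec4 {
  float v[4];
};

// Hypothetical elementwise driver: applies `op` to every element, moving data
// in packs of 4 and handling the leftover tail one element at a time.
template <typename Op>
void ElementwiseVec4(const float* in, float* out, std::size_t n, Op op) {
  assert(n == 0 || (in != nullptr && out != nullptr));
  const std::size_t packs = n / 4;
  for (std::size_t p = 0; p < packs; ++p) {
    Vec4 pack;
    std::memcpy(&pack, in + 4 * p, sizeof(Vec4));   // one wide read
    for (float& x : pack.v) x = op(x);              // per-element compute
    std::memcpy(out + 4 * p, &pack, sizeof(Vec4));  // one wide write
  }
  // Scalar tail for sizes that are not a multiple of 4.
  for (std::size_t i = 4 * packs; i < n; ++i) out[i] = op(in[i]);
}

// Convenience wrapper for std::vector inputs.
template <typename Op>
std::vector<float> ApplyElementwise(const std::vector<float>& in, Op op) {
  std::vector<float> out(in.size());
  ElementwiseVec4(in.data(), out.data(), in.size(), op);
  return out;
}
```

On a GPU the same idea has each thread load and store a wider type, so fewer memory transactions are issued per element; the actual launch configuration would come from gpu_launch_config.h as the text states.
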
Contributor

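Even with the estimated 1.2x speedup, the numbers would still fall short of torch's performance. It may be worth checking whether the two differ in their underlying C++ implementations.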

Contributor Author (thunder95, Aug 11, 2022)

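@JamesLim-sy I tried torch's C++ implementation and also an ndtri-based implementation; neither gave a noticeable improvement. In the end I used the CUDA built-in function, which gave a speedup of more than 2x, and an improvement of more than 1x over torch as well.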
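
The reply does not include code. For context, an ndtri-based erfinv can rest on the identity erfinv(x) = ndtri((x + 1) / 2) / sqrt(2), where ndtri is the inverse of the standard normal CDF. The host-side sketch below checks that identity with a slow bisection-based reference. `NormalCdf`, `NdtriReference`, and `ErfinvViaNdtri` are hypothetical names, and this is not the code from the PR. The CUDA math API exposes inverse error functions as `erfinvf` (float) and `erfinv` (double), which are presumably the built-ins referred to above.

```cpp
#include <cassert>
#include <cmath>

// Standard normal CDF, written with erfc for accuracy in the lower tail.
double NormalCdf(double z) { return 0.5 * std::erfc(-z / std::sqrt(2.0)); }

// Hypothetical reference ndtri (inverse normal CDF) for 0 < p < 1, found by
// bisection on NormalCdf. Slow but simple; a GPU kernel would not do this.
double NdtriReference(double p) {
  assert(p > 0.0 && p < 1.0);
  double lo = -40.0;
  double hi = 40.0;
  for (int i = 0; i < 200; ++i) {
    const double mid = 0.5 * (lo + hi);
    if (NormalCdf(mid) < p) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return 0.5 * (lo + hi);
}

// erfinv via the identity erfinv(x) = ndtri((x + 1) / 2) / sqrt(2), -1 < x < 1.
double ErfinvViaNdtri(double x) {
  return NdtriReference(0.5 * (x + 1.0)) / std::sqrt(2.0);
}
```

A reference like this is mainly useful for checking a kernel's output: feeding the result back through `std::erf` should recover the input to within rounding error.
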

@ZzSean left a comment

LGTM

@ZzSean merged commit 433c68b into PaddlePaddle:master on Aug 17, 2022