Speedup FP16 Gelu op using fast math and vectorized 8 kernel #38980

sneaxiy · 2022-01-15T16:27:03Z

PR types

Performance optimization

PR changes

OPs

Describe

Speed up the FP16 GELU op using: (1) a vectorized kernel that handles 8 FP16 values at a time, since the GPU has a PTX ld instruction that loads 4 x 32-bit (128 bits) of data at once; (2) the PTX fast tanh instruction tanh.approx.f32 to speed up the tanhf function. This optimization is enabled when FLAGS_use_fast_math=1.
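The arithmetic behind the two points above can be sketched in a few lines. This is only an illustrative host-side Python sketch, not the CUDA kernel from this PR: it assumes the fast path targets the standard tanh-based GELU approximation, and the names `gelu_erf`, `gelu_tanh`, `VEC_SIZE`, and `split_vectorized` are hypothetical helpers introduced here for illustration.

```python
import math

FP16_BYTES = 2        # one FP16 value occupies 2 bytes
VEC_LOAD_BYTES = 16   # one 4 x 32-bit load moves 16 bytes (128 bits)
VEC_SIZE = VEC_LOAD_BYTES // FP16_BYTES  # FP16 values per 128-bit load


def gelu_erf(x: float) -> float:
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based GELU approximation (assumed to be the form the fast path uses)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def split_vectorized(n: int, vec_size: int = VEC_SIZE) -> tuple[int, int]:
    """Return (number of full vec_size-wide chunks, leftover tail elements)."""
    return divmod(n, vec_size)


if __name__ == "__main__":
    print("FP16 values per 128-bit load:", VEC_SIZE)
    print("chunks and tail for 1003 elements:", split_vectorized(1003))
    for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
        print(x, gelu_erf(x), gelu_tanh(x))
```

A 128-bit load holds exactly 8 FP16 values, which is where the "vectorized 8" in the title comes from. A real kernel would additionally need 16-byte-aligned pointers and a fallback for tail elements that do not fill a full chunk; those details are not described in this PR text.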

paddle-bot-old · 2022-01-15T16:27:07Z

Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

limin2021 · 2022-01-17T04:39:15Z

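Could you provide more detailed performance test results in the PR description? (1) Cover a wider range of values for 56*seq_len (for example, seq_len_in_batch values in the 30s and 40s also occur in real data). (2) Add a comparison with the jit gelu in NV MLPerf 1.1.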

sneaxiy force-pushed the speedup_gelu branch from b305d11 to 7bb61a1 Compare January 15, 2022 16:28

sneaxiy force-pushed the speedup_gelu branch from 7bb61a1 to 3839330 Compare January 15, 2022 16:57

sneaxiy force-pushed the speedup_gelu branch from 3839330 to 4ee18cb Compare January 16, 2022 13:52

speedup gelu using fast math

665a24c

sneaxiy force-pushed the speedup_gelu branch from 4ee18cb to 665a24c Compare January 16, 2022 14:04

sneaxiy requested review from limin2021 and lanxianghit January 17, 2022 02:12

add bwd part

2842e56

limin2021 approved these changes Jan 17, 2022

lanxianghit approved these changes Jan 18, 2022

sneaxiy merged commit 8c20d66 into PaddlePaddle:develop Jan 18, 2022

sneaxiy deleted the speedup_gelu branch January 18, 2022 02:20

sneaxiy mentioned this pull request Jan 19, 2022

Fix gelu op compilation error on CUDA 10 #39045

Merged
