Add fp8_transpose fast_path #74911
Your PR was submitted successfully. Thank you for contributing to the open-source project!
/re-run all-failed
LGTM for unittest.skipIf. Reason: because this change involves fp8 features, the unit test uses unittest.skipIf to skip devices older than the Hopper architecture.
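As a rough illustration of the skip pattern the reviewer describes, the sketch below gates a test class on the GPU's compute capability. It is not the PR's actual test: `get_compute_capability`, `is_below_hopper`, and `TestFP8Transpose` are hypothetical names, and the sketch assumes Hopper corresponds to compute capability 9.x and that `paddle.device.cuda.get_device_capability()` is available in the installed Paddle build.

```python
import unittest

HOPPER_MAJOR = 9  # assumption: Hopper GPUs report compute capability 9.x


def get_compute_capability():
    """Return (major, minor) for the current GPU, or (0, 0) if unavailable.

    Hypothetical helper: falls back to (0, 0) when Paddle is missing,
    was built without CUDA, or no GPU can be queried.
    """
    try:
        import paddle

        if not paddle.is_compiled_with_cuda():
            return (0, 0)
        return tuple(paddle.device.cuda.get_device_capability())
    except Exception:
        return (0, 0)


def is_below_hopper(capability):
    """True when the device is older than Hopper."""
    return capability[0] < HOPPER_MAJOR


@unittest.skipIf(
    is_below_hopper(get_compute_capability()),
    "fp8 transpose requires Hopper or newer",
)
class TestFP8Transpose(unittest.TestCase):  # hypothetical test class
    def test_placeholder(self):
        # The real fp8 transpose checks would go here.
        self.assertTrue(True)
```

With this guard, the whole class is reported as skipped rather than failed on pre-Hopper or CPU-only machines.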
PR Category
Operator Mechanism
PR Types
Performance
Description
Add fp8_transpose fast_path
Benchmark results:
[Fused]Transpose 2D(7168,16384) with paddle.float8_e4m3fn
Average time over 1000 runs: 0.0883 ms
Throughput: 2477.46 GB/s
[Fused]Transpose 3D(8,7168,4096) with paddle.float8_e4m3fn
Average time over 1000 runs: 0.1660 ms
Throughput: 2634.87 GB/s
[Fused]Transpose 3D(8,2048,7168) with paddle.float8_e4m3fn
Average time over 1000 runs: 0.0849 ms
Throughput: 2578.05 GB/s
[Framework] Transpose 2D(7168,16384) with paddle.float8_e4m3fn
Average time over 1000 runs: 0.2231 ms
Throughput: 980.63 GB/s
[Framework] Transpose 3D(8,7168,4096) with paddle.float8_e4m3fn
Average time over 1000 runs: 0.4424 ms
Throughput: 988.88 GB/s
[Framework] Transpose 3D(8,2048,7168) with paddle.float8_e4m3fn
Average time over 1000 runs: 0.2230 ms
Throughput: 981.06 GB/s
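The benchmark script is not included in the PR text, so the following is only a sketch of how the throughput figures can be reproduced from the reported times. It assumes each fp8 element is 1 byte, that a transpose reads and writes every element once, and that the reported "GB/s" is computed with 1024^3 bytes per GB. Under those assumptions the recomputed values land within about 0.1% of the log, and the [Fused] path comes out roughly 2.5x to 2.7x faster than [Framework]. `transpose_throughput` and `speedup` are hypothetical helper names.

```python
GIB = 1024 ** 3  # assumption: the log's "GB" is 2**30 bytes


def transpose_throughput(shape, avg_ms, bytes_per_elem=1):
    """Estimated throughput for one transpose: read + write of every element."""
    numel = 1
    for dim in shape:
        numel *= dim
    bytes_moved = 2 * numel * bytes_per_elem  # one read plus one write
    return bytes_moved / (avg_ms * 1e-3) / GIB


def speedup(framework_ms, fused_ms):
    """How many times faster the fused path is than the framework path."""
    return framework_ms / fused_ms


# shape -> (fused avg ms, framework avg ms), copied from the benchmark log
results = {
    (7168, 16384): (0.0883, 0.2231),
    (8, 7168, 4096): (0.1660, 0.4424),
    (8, 2048, 7168): (0.0849, 0.2230),
}

for shape, (fused_ms, framework_ms) in results.items():
    print(
        shape,
        f"fused={transpose_throughput(shape, fused_ms):.1f}",
        f"framework={transpose_throughput(shape, framework_ms):.1f}",
        f"speedup={speedup(framework_ms, fused_ms):.2f}x",
    )
```

Small differences from the logged throughput are expected because the average times in the log are rounded to four decimal places.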
pcard-91067