fuse vit attention for faster-rcnn on BML #54139
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
❌ This PR was not created using the PR template. You can refer to this Demo.
Force-pushed from 5c3c75b to 3cb4b1f
Force-pushed from 25cb7d8 to 2f4146d
LGTM
@YuanRisheng Rebased. Please help check it again. Thanks!
LGTM
- Considering that many operators in the operator library and OneDNN have already been migrated to Phi, and some fused operators have been migrated to Phi as well, we would prefer to implement this directly in Phi.
- Please also add the performance improvement data to the PR description.
Sorry, after repeated discussion we have concluded that this PR does not yet meet the merge criteria. Please read the PaddlePaddle native operator development specification; you are welcome to submit a new PR. We will close this PR for now. Thank you for your contribution.
@qingqing01 Hello. 1. Regarding the first issue, the CI requires review approval from @risheng and others. We discussed this with the relevant people and reviewers, and after the meeting @YuanRisheng made a further assessment and confirmed that the current approach of adding the op is acceptable (LGTM). 2. The performance improvement data has been added to the PR description.
@abenmao Hello, after paddle/fluid/operators/fused/self_dp_attention_op.h and self_dp_attention_op.cc were introduced, gcc 12 reports an uninitialized-variable error. Could you please fix it?
Hello @risemeup1, we have received the issue and will get back to you promptly once we have confirmed the situation.
OK
@risemeup1 Hello, we investigated this under gcc 12.1 and found that it is a known issue in that gcc release; see the bug report at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593. If we want to work around the problem, we can add a flag in the relevant place to suppress the warning. The specific approach is as follows:
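A minimal sketch of one possible workaround, assuming the false positive comes from gcc 12's -Wmaybe-uninitialized diagnostic firing inside libstdc++ code (the exact flag and placement used in the follow-up PR may differ):

```cpp
// Hypothetical sketch (names and placement are assumptions): wrap the code in
// self_dp_attention_op.h that triggers gcc 12.1's spurious -Wmaybe-uninitialized
// warning (GCC bug 105593) with diagnostic pragmas, so the false positive is
// suppressed only for that compiler and only around the affected code.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 12
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
#endif

// ... the code that hits the false positive inside libstdc++ ...

#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 12
#pragma GCC diagnostic pop
#endif
```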
Could you please submit a PR to work around it? I think this change is fine.
Actually, as I understand it, couldn't this bug also be resolved by initializing the variable with an assignment?
@risemeup1 The uninitialized variable is inside GNU's own code, which does not initialize it, so for now we can only work around it.
@risemeup1 I was just about to submit a PR, but after rebasing I found you had already submitted one~
Yes, it has already been merged. It was rather urgent, so I submitted it over the weekend. Thanks for your effort.
PR types
Performance optimization
PR changes
Others
Description
For the Vision Transformer in the Faster-RCNN model, this PR adds a fusion that replaces the scaled_dp_attention pattern in the self-attention with a single fused op.
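As a rough illustration, the pattern that the single fused op replaces is standard scaled dot-product attention, out = softmax(Q·Kᵀ / sqrt(d))·V. Below is a minimal single-head reference sketch in C++ with naive loops; names and shapes are illustrative assumptions, and this is not the actual fused kernel, which also handles batching, multiple heads, and vectorization.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Reference (unfused) scaled dot-product attention for one head:
//   out = softmax(Q * K^T / sqrt(d)) * V
// Q, K, V, out are row-major [seq, d] buffers. Illustrative only.
void ScaledDotProductAttentionRef(const std::vector<float>& Q,
                                  const std::vector<float>& K,
                                  const std::vector<float>& V,
                                  std::vector<float>* out,
                                  int seq, int d) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  std::vector<float> row(seq);
  out->assign(static_cast<size_t>(seq) * d, 0.0f);
  for (int i = 0; i < seq; ++i) {
    // scores[i][j] = (Q[i] . K[j]) * scale
    float max_s = -std::numeric_limits<float>::infinity();
    for (int j = 0; j < seq; ++j) {
      float s = 0.0f;
      for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
      row[j] = s * scale;
      max_s = std::max(max_s, row[j]);
    }
    // Numerically stable softmax over scores[i][:].
    float sum = 0.0f;
    for (int j = 0; j < seq; ++j) {
      row[j] = std::exp(row[j] - max_s);
      sum += row[j];
    }
    // out[i] = softmax(scores[i]) * V
    for (int j = 0; j < seq; ++j) {
      const float w = row[j] / sum;
      for (int k = 0; k < d; ++k) (*out)[i * d + k] += w * V[j * d + k];
    }
  }
}
```

Fusing this pattern into one op avoids materializing the intermediate score and softmax tensors across separate operators, which is presumably where the reduced memory traffic comes from.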
On Cascade Lake, this optimization yields a 1.67X speedup in the VisionTransformer block and a 1.46X improvement (reducing latency from 3.8s to 2.6s) in the end-to-end inference time of the Faster-RCNN model (from BML).
On Sapphire Rapids, the optimization also has the potential to speed up bf16/int8 inference by reducing the amount of memory access.