Support optional residual add in fused_attention and fused_feedforward. #43474
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
LGTM
LGTM for fused_attention.
@@ -454,6 +459,7 @@ def fused_multi_head_attention(x,
     - train: out = input * mask
     - inference: out = input * (1.0 - p)
     ring_id (int, optional): For distributed forward in mp, only support NCCL and forward. Default is -1, means not using mp
+    add_residual (bool, optional): Whether add residual at the end. Default is True.
- This API is currently used in large-model inference; adding the attr has no impact on inference.
- Please also update the formulas in the `code-block:: python` section of the docs above. For the original behavior, is the default True?
> For the original behavior, is the default True?

Yes.

> Please also update the formulas in the `code-block:: python` section of the docs above.

I'll make that change in a follow-up PR.
@Xreki In the next PR, please add … for the changes to the operator's parameters.
@Shixiaowei02 Confirmed with @cyj1986: the newly added …
…d. (PaddlePaddle#43474) * Support optional residual add in fused_attention and fused_feedforward. * Add checkpoint and add the check of add_residual when pre_layer_norm is false. * Add TODO and change the python api to add add_residual argument.
PR types
Function optimization
PR changes
OPs
Describe
The small-op composition equivalent to the fused_attention op on develop is as follows:
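A minimal sketch of that composition, assuming the `pre_layer_norm` branch and merged QKV weights (names such as `qkv_weight` and `out_weight` are illustrative, not the op's exact parameter names):

```python
import paddle
import paddle.nn.functional as F

def attention_reference(x, qkv_weight, qkv_bias, out_weight, out_bias,
                        ln_scale, ln_bias, num_heads, dropout_p=0.0):
    """Small-op equivalent of fused_attention (pre_layer_norm branch)."""
    residual = x
    # 1. Pre-LayerNorm.
    ln_out = F.layer_norm(x, x.shape[-1:], ln_scale, ln_bias)
    # 2. Merged QKV projection: [batch, seq, 3 * num_heads * head_dim].
    qkv = paddle.matmul(ln_out, qkv_weight) + qkv_bias
    b, s, _ = qkv.shape
    qkv = qkv.reshape([b, s, 3, num_heads, -1]).transpose([2, 0, 3, 1, 4])
    q, k, v = qkv[0], qkv[1], qkv[2]
    # 3. Scaled dot-product attention.
    scale = q.shape[-1] ** -0.5
    attn = F.softmax(paddle.matmul(q, k, transpose_y=True) * scale, axis=-1)
    attn = F.dropout(attn, dropout_p)
    # 4. Merge heads, output projection, dropout.
    out = paddle.matmul(attn, v).transpose([0, 2, 1, 3]).reshape([b, s, -1])
    out = F.dropout(paddle.matmul(out, out_weight) + out_bias, dropout_p)
    # 5. The final residual add -- the step this PR makes optional.
    return residual + out
```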
The Attention usage in the CAE model differs slightly. Its Attention structure mainly involves the following points:

1. `head_dim = dim // num_heads if attn_head_dim is None else attn_head_dim`, so `attn_head_dim` may be set to another value. The model uses the default.
2. `self.scale = qk_scale or head_dim ** -0.5`, the scale factor applied to the qk result, which can also be overridden. The model uses the default.
3. There is a `q_bias` and a `v_bias` but no `k_bias`. The model handles this by setting `k_bias` to zero before every merged QKV matmul, as in the sketch below.

The place where Attention is used also differs; see the reconstruction after this list:
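A hedged reconstruction of both pieces, modeled on the CAE-style `Attention` and `Block` described above (`__init__` is omitted, attributes such as `self.qkv`, `self.scale`, `self.num_heads` are assumed, and exact details may differ from the real model code):

```python
import paddle
import paddle.nn.functional as F

class Attention(paddle.nn.Layer):
    def forward(self, x, bool_masked_pos=None):
        # k has no bias: a zero tensor is concatenated between q_bias and
        # v_bias before every merged QKV matmul.
        qkv_bias = paddle.concat(
            (self.q_bias, paddle.zeros_like(self.v_bias), self.v_bias))
        qkv = F.linear(x, self.qkv.weight, qkv_bias)
        b, s, _ = qkv.shape
        qkv = qkv.reshape([b, s, 3, self.num_heads, -1]) \
                 .transpose([2, 0, 3, 1, 4])
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = F.softmax(
            paddle.matmul(q, k, transpose_y=True) * self.scale, axis=-1)
        out = paddle.matmul(attn, v).transpose([0, 2, 1, 3]) \
                    .reshape([b, s, -1])
        return self.proj(out)

class Block(paddle.nn.Layer):
    def forward(self, x, bool_masked_pos=None):
        if self.gamma_1 is None:
            # Plain pre-LN residual form: maps directly onto fused_attention
            # / fused_feedforward when drop_path's dropout_prob is 0.
            x = x + self.drop_path(self.attn(self.norm1(x), bool_masked_pos))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
        else:
            # CAE form: the branch output is scaled by a learnable gamma
            # before the residual add, which the fused ops (before this PR)
            # cannot express.
            x = x + self.drop_path(
                self.gamma_1 * self.attn(self.norm1(x), bool_masked_pos))
            x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
        return x
```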
If the `dropout_prob` of `self.drop_path` is 0 and `self.gamma_1` and `self.gamma_2` are None, then `x = x + self.attn(self.norm1(x), bool_masked_pos)` and `x = x + self.mlp(self.norm2(x))` can call the `fused_attention` and `fused_feedforward` fused ops directly.

In the actual model, the `dropout_prob` in `self.drop_path` is 0, but `self.gamma_1` and `self.gamma_2` are both non-None. To use the `fused_attention` and `fused_feedforward` fused ops, the op functionality has to change. There are two options:

1. Add the multiply-by-`gamma` computation inside `fused_attention` and `fused_feedforward`. The benefit is a larger fusion granularity; the downsides are that it makes the fused ops more complex, and that once the `dropout_prob` in `self.drop_path` is changed the fused ops still cannot be used, so this option is less general.
2. Add an `add_residual` attribute to `fused_attention` and `fused_feedforward` to control whether the final residual add is performed.

This PR adopts Option 2. It has been verified on the CAE model, improving model performance by 7%. A usage sketch follows.
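A sketch of how a CAE block can use the new attribute via `paddle.incubate.nn.functional.fused_multi_head_attention` (the weight, bias, and layer-norm arguments the op expects are abbreviated here as `**attn_params`):

```python
import paddle
import paddle.incubate.nn.functional as incubate_f

def cae_attention_block(x, gamma_1, **attn_params):
    # attn_params stands in for the qkv/linear weights, biases, and
    # layer-norm parameters of the fused op.
    out = incubate_f.fused_multi_head_attention(
        x,
        pre_layer_norm=True,
        add_residual=False,  # new attr: skip the built-in residual add
        **attn_params,
    )
    # Apply the learnable per-channel scale, then add the residual manually.
    return x + gamma_1 * out
```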