
[Paddle Inference] Add masked multihead attention kernel and export API. #55344

Merged
28 commits merged into PaddlePaddle:develop on Aug 15, 2023

Conversation

Contributor
@xiaoxiaohehe001 commented Jul 11, 2023

PR types

Others

PR changes

OPs

Description

  • Support masked multihead attention for the transformer decoder stage.
  • Support dtypes: float, float16, and bfloat16.
  • Support int8 qkv out scale and out-linear in scale.
  • Export Python API: from paddle.incubate.nn.functional import masked_multihead_attention (a usage sketch follows below).

Pcard-71502
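
A minimal usage sketch of the exported API, assuming the tensor shapes documented in this PR; the keyword names and return packing follow my reading of the op definition and may differ in detail:

import paddle
from paddle.incubate.nn.functional import masked_multihead_attention

# Illustrative single-step decoding shapes (values made up for the sketch).
# Requires a CUDA build of Paddle.
batch_size, num_head, head_dim, max_seq_len = 2, 6, 32, 6

# x packs the fused qkv output of one decode step: [bsz, 3 * num_head * head_dim].
x = paddle.randn([batch_size, 3 * num_head * head_dim]).astype("float16")
# cache_kv holds past keys/values: [2, bsz, num_head, max_seq_len, head_dim].
cache_kv = paddle.zeros(
    [2, batch_size, num_head, max_seq_len, head_dim], dtype="float16"
)
# src_mask: [bsz, 1, 1, sequence_length].
src_mask = paddle.zeros([batch_size, 1, 1, max_seq_len], dtype="float16")

# Per the op definition, this yields out (and cache_kv_out, with cache_kv
# updated in place per the inplace spec); the exact return packing is assumed.
outputs = masked_multihead_attention(x, cache_kv=cache_kv, src_mask=src_mask)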

@paddle-bot

paddle-bot bot commented Jul 11, 2023

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please watch for the upcoming CI results. See the Paddle CI Manual for details.

@xiaoxiaohehe001 xiaoxiaohehe001 changed the title [Paddle Inference] support_mmha for inference. [Paddle Inference] Add MMHAKernel for inference and export mmha API. Jul 13, 2023
@xiaoxiaohehe001 xiaoxiaohehe001 changed the title [Paddle Inference] Add MMHAKernel for inference and export mmha API. [Paddle Inference] Add MMHAKernel and export mmha API. Jul 13, 2023
@xiaoxiaohehe001 xiaoxiaohehe001 changed the title [Paddle Inference] Add MMHAKernel and export mmha API. [Paddle Inference] Add masked multihead attention kernel and export API. Jul 13, 2023
@@ -1541,6 +1541,17 @@
data_type : logits
backward : margin_cross_entropy_grad

- op : masked_multihead_attention_
Contributor

Naming it decoder_masked_multihead_attention_ feels more appropriate.

Contributor

Because it integrates rotary_embedding, attention, etc., it should be named fused_masked_multihead_attention according to the naming regulations.

out->set_dtype(x_dtype);
}

PADDLE_ENFORCE_EQ(
Contributor

I think we should check that the seq_len dimension of x must be 1.

@@ -1541,6 +1541,17 @@
data_type : logits
backward : margin_cross_entropy_grad

- op : masked_multihead_attention_
args : (Tensor x, Tensor bias, Tensor src_mask, Tensor sequence_lengths, Tensor rotary_tensor, Tensor beam_cache_offset, Tensor cache_kv, Tensor qkv_out_scale, Tensor out_linear_shift, Tensor out_linear_smooth, int beam_size, int rotary_emb_dims, bool mask_broadcast_num_heads=true, bool compute_bias=false, bool use_neox_rotary_style=false, float out_linear_in_scale=-1, int quant_round_type=1, float quant_max_bound=127.0, float quant_min_bound=-127.0)
output : Tensor(out), Tensor(cache_kv_out), Tensor(beam_cache_offset_out)
Contributor

out_linear_in_scale, out_linear_shift, and out_linear_smooth are all parts fused on top of standard attention. These inputs need documentation, or consider renaming them.

const float quant_min_bound,
MetaTensor* out,
MetaTensor* cache_kv_out,
MetaTensor* beam_cache_offset_out) {
Contributor

The meaning of the beam_cache_offset output is also unclear.

src_mask (Tensor): The src_mask tensor. the shape is `[batch\_size, 1, 1, sequence\_length]`.
sequence_lengths (Tensor, optional): The sequence_lengths tensor. the shape is `[batch\_size, 1]`.
rotary_tensor (Tensor, optional): The rotary_tensor tensor. the shape is `[batch\_size, 1]`.
beam_cache_offset (Tensor, optional): The rotary_tensor tensor. the shape is `[batch\_size, beam\_size, max\_seq\_len + max\_dec\_len]`.
Contributor

The docstring here looks wrong (beam_cache_offset is described as "The rotary_tensor tensor").

rotary_tensor (Tensor, optional): The rotary_tensor tensor. the shape is `[batch\_size, 1]`.
beam_cache_offset (Tensor, optional): The rotary_tensor tensor. the shape is `[batch\_size, beam\_size, max\_seq\_len + max\_dec\_len]`.
cache_kvs (list(Tensor)|tuple(Tensor)): The cache structure tensors for the generation model. The shape is `[2, bsz, num\_head, max\_seq\_len, head\_dim]`.
rotary_tensor (Tensor, optional): The rotary_tensor tensor. the shape is `[batch\_size, 1, 1, sequence\_length, dim_head]`.
Contributor

This is duplicated (rotary_tensor is documented twice).

bias (Tensor, optional): The bias tensor of qkv, the shape is `[3, num\_head, dim\_head]`.
src_mask (Tensor): The src_mask tensor. the shape is `[batch\_size, 1, 1, sequence\_length]`.
sequence_lengths (Tensor, optional): The sequence_lengths tensor. the shape is `[batch\_size, 1]`.
rotary_tensor (Tensor, optional): The rotary_tensor tensor. the shape is `[batch\_size, 1]`.
Contributor

The dtype of rotary_tensor is not restricted here, but the kernel restricts it to float.
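
A tiny guard in the direction this comment suggests, assuming the kernel only accepts float32 rotary tensors (the shape values are illustrative):

import paddle

# Hypothetical rotary tensor built for illustration only.
rotary_tensor = paddle.randn([2, 1, 1, 6, 32]).astype("float16")
# Cast up front, since the kernel is said to restrict rotary_tensor to float.
if rotary_tensor.dtype != paddle.float32:
    rotary_tensor = paddle.cast(rotary_tensor, "float32")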

self.num_head = 6
self.dim_head = 32
self.beam_size = 1
self.max_seq_len = 6
Contributor

The seq_len in this unit test is too small to cover realistic model inputs.

np.testing.assert_allclose(
paddle_mmha_out[0].numpy(),
paddle_naive_rmsnorm[0].numpy(),
rtol=5e-2,
Contributor

Isn't a relative tolerance of 5e-2 too large?

@qingqing01 qingqing01 self-requested a review July 17, 2023 11:42
@paddle-ci-bot

paddle-ci-bot bot commented Jul 30, 2023

Sorry to inform you that 8041fad's CIs passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

#include "paddle/fluid/operators/fused/fmha_ref.h"
#include "paddle/fluid/operators/fused/fused_dropout_helper.h"
#include "paddle/fluid/platform/device/gpu/gpu_dnn.h"
#include "paddle/fluid/platform/dynload/cublasLt.h"
Contributor

  1. Not all of these includes are actually used, right? At least op_registry.h seems unused.
  2. Please watch for the same issue in other files.

Contributor Author

Done

#include "paddle/fluid/distributed/collective/process_group.h"
#include "paddle/fluid/platform/collective_helper.h"
#include "paddle/fluid/platform/device/gpu/nccl_helper.h"
#endif
Contributor

nccl is probably not used either; please clean up the unused includes.

Contributor Author

Done

const int quant_round_type = 1,
const float quant_max_bound = 127.0f,
const float quant_min_bound = -127.0f) {
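// Assumed reading of this branch (not stated in the PR): a non-null
// dequant_qkv_scales means the qkv input is int8 and must be dequantized,
// and quant_fmha_out_scale > 0 means the attention output is re-quantized.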
if (dequant_qkv_scales != nullptr && quant_fmha_out_scale > 0) {
Contributor

Better to add comments for the different branches.

# limitations under the License.

from paddle import _C_ops
from paddle.fluid.layer_helper import LayerHelper
Contributor

from paddle.framework import LayerHelper

Contributor Author

Done

quant_min_bound=-127.0,
):
r"""
Multi-head attention for text summarization.
Contributor

text generation

qkv_out_scale=None,
out_linear_shift=None,
out_linear_smooth=None,
seq_len=1,
Contributor

The meaning of seq_len is not explained below.

Contributor Author

Done~

rotary_emb_dims (int, optional): The rotary_emb_dims. Default 0.
use_neox_rotary_style (bool, optional): A flag indicating whether neox_rotary_style is needed or not. Default False.
out_linear_in_scale (float, optional): The out_linear_in_scale.
quant_round_type (int, optional): The quant_round_type. Default 1.
Contributor

Which round types are available?
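
For reference, a hedged numpy sketch of the two rounding conventions such quant parameters are commonly taken to select; this 0/1 mapping is my assumption, not confirmed by the PR:

import numpy as np

def quant_round(x, quant_round_type=1):
    # Assumed mapping: 0 -> round half to even, 1 -> round half away from zero.
    if quant_round_type == 0:
        return np.rint(x)  # ties go to the nearest even integer
    return np.sign(x) * np.floor(np.abs(x) + 0.5)  # ties go away from zero

print(quant_round(np.array([0.5, 1.5, 2.5]), 0))  # [0. 2. 2.]
print(quant_round(np.array([0.5, 1.5, 2.5]), 1))  # [1. 2. 3.]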


import paddle
from paddle.fluid import core
from paddle.fluid.layer_helper import LayerHelper
Contributor

Same as above: avoid fluid APIs unless necessary.

Contributor Author

Done~

from paddle.framework import in_dynamic_mode


def mmha_wrapper(
Contributor

The interface is already wrapped above; this extra wrapper seems unnecessary.

Contributor Author

Done~

@@ -0,0 +1,552 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
Contributor

Why is this placed in the legacy_test folder?

Aurelius84 previously approved these changes Aug 14, 2023
quant_max_bound=127.0,
quant_min_bound=-127.0,
):
r"""
Contributor

A newly added API needs corresponding Chinese documentation under docs~

Contributor Author

OK, it will be added later.


Args:
x (Tensor): The input tensor could be 2-D tensor. Its shape is [batch_size, 3 * num_head * head_dim].
cache_kvs (list(Tensor)|tuple(Tensor)): The cache structure tensors for the generation model. Its shape is [2, batch_size, num_head, max_seq_len, head_dim].
Contributor

Suggested change
cache_kvs (list(Tensor)|tuple(Tensor)): The cache structure tensors for the generation model. Its shape is [2, batch_size, num_head, max_seq_len, head_dim].
cache_kvs (list(Tensor)|tuple(Tensor), optional): The cache structure tensors for the generation model. Its shape is [2, batch_size, num_head, max_seq_len, head_dim].

Contributor Author

cache_kvs is not an optional input.

Contributor

But it has a default value; cache_kv=None is written above.

Comment on lines 70 to 72
namespace plat = paddle::platform;
using float16 = plat::float16;
using bfloat16 = plat::bfloat16;
Contributor

There should be no paddle::platform namespace under phi; phi::dtype can be used here instead.

Contributor Author

Done~

Comment on lines 17 to 18
#include "paddle/fluid/memory/memcpy.h"
#include "paddle/fluid/platform/profiler.h"
Contributor

After phi is built as a standalone shared library, it is no longer allowed to include headers from the fluid directory.

Contributor Author
@xiaoxiaohehe001 Aug 14, 2023

Got it, Done~

Comment on lines +25 to +26
template <typename T, typename Context>
void MMHAKernel(const Context& dev_ctx,
Contributor

Fusion kernels should not declare themselves in a header file, to avoid being used everywhere.

Contributor Author

OK

Contributor

This looks like a pure GPU implementation; it doesn't need to live in the impl directory. Just write it directly in the GPU kernel's .cu file.

Contributor Author

OK

qingqing01 previously approved these changes Aug 14, 2023
@xiaoxiaohehe001 xiaoxiaohehe001 dismissed stale reviews from qingqing01 and Aurelius84 via 5470349 August 14, 2023 13:57
#pragma once

#include "paddle/phi/core/dense_tensor.h"
#include "paddle/phi/kernels/fusion/gpu/masked_multihead_attention_utils.h"
Contributor

Is this header unnecessary now? It pulls in the glog header, which gets blocked.

Contributor Author

This will be handled in a follow-up PR.

@@ -45,4 +46,5 @@
'variable_length_memory_efficient_attention',
"fused_rms_norm",
"fused_layer_norm",
"masked_multihead_attention",
Contributor

Because it integrates rotary_embedding, attention, etc., it should be named fused_masked_multihead_attention according to the naming regulations.

func : masked_multihead_attention
data_type : cache_kv
optional : src_mask, cum_offsets, sequence_lengths, rotary_tensor, beam_cache_offset, qkv_out_scale, out_shift, out_smooth
inplace : (cache_kv -> cache_kv_out), (beam_cache_offset -> beam_cache_offset_out)
Contributor

A backward op (e.g., backward: fused_masked_multihead_attention_grad) should be added to compute gradients according to the regulations; otherwise it cannot be used for training.

Contributor Author
@xiaoxiaohehe001 Aug 15, 2023

masked_multihead_attention is currently used only for inference; whether to add the backward pass later needs further discussion.

Contributor
@jiahy0825 left a comment

LGTM for including "logging.h" in paddle/phi/kernels/fusion/gpu/masked_multihead_attention_utils.h temporarily.
Please create another PR to remove this line later.

Collaborator
@raindrops2sea left a comment

LGTM

@qingqing01 qingqing01 merged commit 989c5e8 into PaddlePaddle:develop Aug 15, 2023
@unittest.skipIf(
not core.is_compiled_with_cuda(), "core is not compiled with CUDA"
)
class TestLayerNormStaticInt8Op(unittest.TestCase):
Contributor

Is this class name wrong? (TestLayerNormStaticInt8Op in an MMHA test)
