
[LLM INFER] Append attn #9244

Merged: 51 commits into PaddlePaddle:develop on Oct 23, 2024

Conversation

yuanlehome (Collaborator) commented Oct 11, 2024

PR types

New features

PR changes

Others

Description

Refactors the attention network construction for large-model inference. The new append_attn scheme delivers a 10% to 90% performance improvement over the old scheme.

Inference is currently supported for llama/qwen/qwen-moe/mixtral.

Usage: in the original inference script, simply replace the --block_attn option with --append_attn.
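
For reference, a minimal sketch of the flag switch. This is illustrative only: the tiny parser below is not the PR's actual predictor script, and the real script's other arguments are omitted.

# Illustrative sketch only: the real entry point is PaddleNLP's llm inference
# script; this stub just shows the old flag (--block_attn) being replaced by
# the new one (--append_attn).
import argparse

parser = argparse.ArgumentParser("llm_infer_sketch")
parser.add_argument("--block_attn", action="store_true", help="previous block-attention path")
parser.add_argument("--append_attn", action="store_true", help="new append-attention path added in this PR")

args = parser.parse_args(["--append_attn"])  # previously: parser.parse_args(["--block_attn"])
print("append_attn enabled:", args.append_attn)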

TODO:

  • Adapt fp8 inference
  • Supplement performance data; see the llm docs later

paddle-bot bot commented Oct 11, 2024

Thanks for your contribution!

codecov bot commented Oct 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 60 lines in your changes missing coverage. Please review.

Project coverage is 52.74%. Comparing base (fe8b527) to head (84a6864).
Report is 264 commits behind head on develop.

Files with missing lines                                   Patch %   Lines
...erimental/transformers/fused_transformer_layers.py       0.00%    38 Missing ⚠️
...dlenlp/experimental/transformers/qwen2/modeling.py       0.00%     8 Missing ⚠️
...dlenlp/experimental/transformers/llama/modeling.py       0.00%     7 Missing ⚠️
...enlp/experimental/transformers/mixtral/modeling.py       0.00%     5 Missing ⚠️
...lp/experimental/transformers/qwen2_moe/modeling.py       0.00%     1 Missing ⚠️
paddlenlp/experimental/transformers/utils.py                0.00%     1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #9244   +/-   ##
========================================
  Coverage    52.73%   52.74%           
========================================
  Files          661      661           
  Lines       107422   107371   -51     
========================================
- Hits         56653    56630   -23     
+ Misses       50769    50741   -28     


static_cast<uint8_t>(quant_value2 + 128.0f);
}
// write k
// 大分块 lane_id / 4 / 2

Contributor:

Please clean up the Chinese comments.

Collaborator Author:

There are too many; keeping them does no real harm.
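
For context on the quoted kernel line, here is a small NumPy sketch of the storage pattern it reflects: a quantized value in the signed int8 range is shifted by +128 so it can be written as uint8 into the cache. The scale handling is illustrative and not taken from the kernel.

import numpy as np

def quantize_to_uint8(x: np.ndarray, scale: float) -> np.ndarray:
    """Quantize float values to uint8 with a +128 zero-point offset."""
    q = np.round(x / scale)              # map to the signed int8 range
    q = np.clip(q, -128, 127)            # saturate
    return (q + 128.0).astype(np.uint8)  # shift into [0, 255] for storage

def dequantize_from_uint8(q: np.ndarray, scale: float) -> np.ndarray:
    """Undo the storage offset and rescale back to float."""
    return (q.astype(np.float32) - 128.0) * scale

x = np.array([-1.0, 0.0, 0.5], dtype=np.float32)
q = quantize_to_uint8(x, scale=1.0 / 127)
print(q, dequantize_from_uint8(q, scale=1.0 / 127))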

@@ -871,7 +873,7 @@ def set_state_dict(self, state_dict):
weight_scales_loader = EmptyWeightScale(
weight_scale_map_dict,
num_of_layers=self.config.num_hidden_layers,
num_head=self.num_attention_heads,

Collaborator:

This has never taken effect; what problems would that cause?

Collaborator Author:

This was changed during debugging; it only renames the parameter and has no impact.

@@ -835,7 +838,10 @@ def set_state_dict(self, state_dict):

for k, v in cache_scales_loader.scale.items():
for i_layer, weight_scale in enumerate(v):
weight_scale = weight_scale.astype("float32")
if self.config.append_attn:
weight_scale = paddle.to_tensor(weight_scale).cast(paddle.get_default_dtype())

Collaborator:

Why does append_attn not need to use fp32 here?

Collaborator Author:

Because the kernel implementation requires half precision; the memory-access pattern is different.
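
A small sketch of the branch under discussion, assuming the usual inference setup where paddle.get_default_dtype() is a half-precision type. Only the two lines inside the if mirror the quoted hunk; the wrapper function and the example call are illustrative.

import numpy as np
import paddle

def prepare_cache_scale(weight_scale, append_attn: bool):
    # The legacy path keeps the scale in fp32; the append_attn kernels load
    # the scale with half-precision accesses, so it is cast to the default
    # dtype (typically float16/bfloat16 during inference), as in the hunk.
    weight_scale = weight_scale.astype("float32")
    if append_attn:
        weight_scale = paddle.to_tensor(weight_scale).cast(paddle.get_default_dtype())
    return weight_scale

# Example call; the default dtype here is whatever the current session uses.
scale = prepare_cache_scale(np.array([0.02], dtype="float64"), append_attn=True)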

@@ -1684,8 +1692,8 @@ def benchmark(predictor, predictor_args, model_args):
batch_benchmark_texts = batchfy_text(benchmark_texts, predictor_args.batch_size)
print("***********Start Benchmark**********")

warmup_time = 10
test_time = 100

Collaborator:

What is this change for?

Collaborator Author:

This change has no real impact; I didn't notice it had been included in the commit.

Collaborator:

Please remember to restore it in the next PR.
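
For readers unfamiliar with the two constants in the hunk above, here is a generic sketch of how warmup_time and test_time are typically used: warmup runs are executed but not timed, then the timed runs are averaged. The run_once callable is a placeholder for a predictor call; this is not the PR's benchmark code.

import time

def benchmark(run_once, warmup_time: int = 10, test_time: int = 100) -> float:
    for _ in range(warmup_time):      # warm up kernels/caches, not timed
        run_once()
    start = time.perf_counter()
    for _ in range(test_time):        # timed iterations
        run_once()
    return (time.perf_counter() - start) / test_time  # average seconds per run

avg = benchmark(lambda: sum(range(1000)), warmup_time=2, test_time=5)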

super().__init__(config)
self.max_seq_len = config.max_seq_len
self.block_size = config.block_size

def set_transformer_block(self, transformer_config):
if self.use_weight_only:
self.transformer_block = FusedBlockMultiTransformerWeightOnly(transformer_config)
elif "a8w8" in self.quant_type:

Collaborator:

Why was this removed?

Collaborator Author:

Because it was never actually supported; it was copied over along with other code someone referenced earlier, so I removed it here while I was at it.

ZHUI merged commit 31c6b9a into PaddlePaddle:develop on Oct 23, 2024
9 of 12 checks passed
lvdongyi pushed a commit to lvdongyi/PaddleNLP that referenced this pull request Oct 23, 2024
* refine paddle::empty(), fix memory error, support multi_stream for attention

* fix and rename attention as append_attention

* rename file
---------

Co-authored-by: lizhenyun <lizhenyun@baidu.com>
Co-authored-by: lizhenyun01 <1500424927@qq.com>