[LLM INFER] Append attn #9244
Conversation
Change-Id: Ibe8920ba41ea9775e676b05b12dc01cb9da95b5e
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9244     +/-   ##
==========================================
  Coverage    52.73%   52.74%
==========================================
  Files          661      661
  Lines       107422   107371     -51
==========================================
- Hits         56653    56630     -23
+ Misses       50769    50741     -28

☔ View full report in Codecov by Sentry.
force-pushed from 4d69d01 to 4a4a4b4
        static_cast<uint8_t>(quant_value2 + 128.0f);
}
// write k
// 大分块 lane_id / 4 / 2  (i.e. "large chunk": lane_id / 4 / 2)
Please trim the Chinese comments.
There are too many of them; keeping them does no real harm.
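For readers following the kernel, the `static_cast` line above implements the usual clamp-and-shift int8-to-uint8 store. A rough NumPy sketch of that step (an illustration with assumed clamp bounds and rounding, not the actual CUDA code):

import numpy as np

# Illustrative only: scale a float into the signed int8 range, clamp, then
# shift by 128 so the result fits in an unsigned byte for the KV cache.
def quantize_to_uint8(values: np.ndarray, scale: float) -> np.ndarray:
    quant = np.clip(np.round(values * scale), -128.0, 127.0)
    return (quant + 128.0).astype(np.uint8)

print(quantize_to_uint8(np.array([0.5, -1.2, 0.03]), scale=100.0))  # [178   8 131]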
force-pushed from 50a48de to 3789175
force-pushed from ed9da7a to 84a6864
@@ -871,7 +873,7 @@ def set_state_dict(self, state_dict):
    weight_scales_loader = EmptyWeightScale(
        weight_scale_map_dict,
        num_of_layers=self.config.num_hidden_layers,
        num_head=self.num_attention_heads,
This has never taken effect. What problem does that cause?
This was changed while debugging; it is just a rename and has no effect.
@@ -835,7 +838,10 @@ def set_state_dict(self, state_dict):

    for k, v in cache_scales_loader.scale.items():
        for i_layer, weight_scale in enumerate(v):
            weight_scale = weight_scale.astype("float32")
            if self.config.append_attn:
                weight_scale = paddle.to_tensor(weight_scale).cast(paddle.get_default_dtype())
Why can append_attn avoid using fp32 here?
Because the kernel implementation requires half precision; the memory-access pattern is different.
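To make the dtype flow concrete: a minimal sketch, assuming the default dtype has been set to a half-precision type (this mirrors but simplifies the diff above; it is not the exact PaddleNLP code path):

import numpy as np
import paddle

# Cache scales arrive as float32, but the append_attn kernels read them in
# half precision, so they are cast to the current default dtype.
paddle.set_default_dtype("float16")
weight_scale = np.array([0.0123, 0.0456], dtype="float32")
weight_scale = paddle.to_tensor(weight_scale).cast(paddle.get_default_dtype())
print(weight_scale.dtype)  # paddle.float16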
@@ -1684,8 +1692,8 @@ def benchmark(predictor, predictor_args, model_args):
    batch_benchmark_texts = batchfy_text(benchmark_texts, predictor_args.batch_size)
    print("***********Start Benchmark**********")

    warmup_time = 10
    test_time = 100
What is this change for?
This change has no real effect; I didn't notice it had slipped into the commit.
Please remember to revert it in the next PR.
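For context, `warmup_time` and `test_time` drive the standard warmup-then-measure loop. A generic sketch with placeholder names (`predictor.predict` and `inputs` are assumptions, not the PR's harness):

import time

def run_benchmark(predictor, inputs, warmup_time=10, test_time=100):
    for _ in range(warmup_time):  # warm up: amortize graph/caching overhead
        predictor.predict(inputs)
    start = time.perf_counter()
    for _ in range(test_time):    # measure steady-state latency
        predictor.predict(inputs)
    return (time.perf_counter() - start) / test_time  # avg seconds per run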
    super().__init__(config)
    self.max_seq_len = config.max_seq_len
    self.block_size = config.block_size

def set_transformer_block(self, transformer_config):
    if self.use_weight_only:
        self.transformer_block = FusedBlockMultiTransformerWeightOnly(transformer_config)
    elif "a8w8" in self.quant_type:
Why was this deleted?
Because it was never actually supported; a colleague copied it in along with other reference code, so I removed it here while I was at it.
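One possible shape of the dispatch after that removal, sketched here as an illustration (the PR's actual fallback may differ; raising explicitly just keeps an unsupported quant_type from failing silently):

def set_transformer_block(self, transformer_config):
    if self.use_weight_only:
        self.transformer_block = FusedBlockMultiTransformerWeightOnly(transformer_config)
    else:
        # Assumed guard, not the PR's code: surface unsupported quant types early.
        raise NotImplementedError(f"quant_type {self.quant_type!r} is not supported")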
* refine paddle::empty(), fix memory error, support multi_stream for attention
* fix and rename attention as append_attention
* rename file

Co-authored-by: lizhenyun <lizhenyun@baidu.com>
Co-authored-by: lizhenyun01 <1500424927@qq.com>
PR types
New features
PR changes
Others
Description
Refactors the attention computation graph for LLM inference. The new append_attn scheme delivers a 10% to 90% performance improvement over the old one.
Inference is currently supported for llama/qwen/qwen-moe/mixtral.
Usage: replace the --block_attn option in the original inference script with --append_attn.
TODO: