[Inference] Add the quant_linear_fuse_pass #58637

Wanglongzhi2001 · 2023-11-03T02:05:39Z

PR types

New features

PR changes

APIs

Description

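This PR fuses the quant, dequant, weight dequant, matmul_v2, and elementwise_add ops in the computation graph shown in the figure below into a single quant_linear op, so that inference can run in int8.

Result after fusion:

Current implementation:
First, directly reuse the logic of delete_quant_dequant_linear_op_pass and delete_weight_dequant_linear_op_pass to remove the quant, dequant, and weight_dequant ops; then run my own implementation to fuse the remaining matmul_v2 and elementwise_add ops into quant_linear.

Questions I would like to ask:

In my current implementation, all quant/dequant ops are deleted (including those that do not precede a matmul_v2) before the fusion runs. Does this cause any problems?
My current fusion of matmul_v2 and elementwise_add targets the matmul_v2 op only and does not consider the mul op. I noticed that some other passes convert matmul_v2 into mul, but as far as I can tell only matmul_v2 currently has an int8 implementation. Does this cause any problems?
After the pass runs, my quant_linear node currently has only the following single attribute:

I am not sure which attributes are required. The quant_linear op implementation defines these attributes; do all of them need to be added?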

paddle-bot · 2023-11-03T02:05:44Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Wanglongzhi2001 · 2023-11-03T02:19:27Z

@RichardWooSJTU could you please take a look?

RichardWooSJTU · 2023-11-07T11:57:49Z

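In my current implementation, all quant/dequant ops are deleted (including those that do not precede a matmul_v2) before the fusion runs. Does this cause any problems?

A: Yes, it does. For models that contain conv ops, the quant/dequant ops are removed as well, but no other pass performs the follow-up handling for them.

My current fusion of matmul_v2 and elementwise_add targets the matmul_v2 op only and does not consider the mul op. I noticed that some other passes convert matmul_v2 into mul, but as far as I can tell only matmul_v2 currently has an int8 implementation. Does this cause any problems?

A: Your pass has already removed matmul_v2 and fused it into quant_linear.

I am not sure which attributes are required.

A: Every one of them is required; they just have default values. You need to understand the meaning and purpose of each attr.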

Wanglongzhi2001 · 2023-11-07T12:00:08Z

OK, understood.

RichardWooSJTU · 2023-11-07T12:02:05Z

paddle/fluid/framework/ir/graph_pattern_detector.cc

+    return elementwise_add_out_var;
+  }
+}
+


I would still recommend matching the complete pattern here, including the quant/dequant ops.
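As a rough illustration of why matching the complete pattern matters, the sketch below models the graph as a flat list of op names and replaces only the full quant → dequant → matmul_v2 → elementwise_add chain with quant_linear. The list representation, the function name `fuse_full_pattern`, and the simplified op names are assumptions made for illustration; the real pass operates on Paddle's IR graph through the pattern detector.

```python
from typing import List

# Activation-side chain named in the PR description, flattened into a list.
# These op names are simplified placeholders, not Paddle's registered op types.
PATTERN = ["quant", "dequant", "matmul_v2", "elementwise_add"]


def fuse_full_pattern(ops: List[str]) -> List[str]:
    """Replace every occurrence of PATTERN with a single quant_linear op."""
    fused: List[str] = []
    i = 0
    while i < len(ops):
        if ops[i:i + len(PATTERN)] == PATTERN:
            # The whole chain matched, so it collapses into one fused op.
            fused.append("quant_linear")
            i += len(PATTERN)
        else:
            # Anything else, including quant/dequant in front of conv2d, is kept.
            fused.append(ops[i])
            i += 1
    return fused


ops = ["quant", "dequant", "conv2d",
       "quant", "dequant", "matmul_v2", "elementwise_add"]
print(fuse_full_pattern(ops))  # ['quant', 'dequant', 'conv2d', 'quant_linear']
```

Because only the full chain is rewritten, the quant/dequant pair in front of conv2d survives, which is the case raised in the earlier answer about models containing conv ops.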

vivienfanghuagood · 2023-11-07T12:05:05Z

paddle/fluid/framework/ir/quant_linear_fuse_pass.cc

+
+
+QuantLinearFusePass::QuantLinearFusePass() {
+  AddOpCompat(OpCompat("matmul_v2"))


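Should this match matmul or matmul_v2?

The op shown in the computation graph is matmul_v2, and the result shows that it was matched successfully, so it should be matmul_v2, right?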

Wanglongzhi2001 · 2023-11-10T01:47:23Z

@RichardWooSJTU could you please review this again?

zhoutianzi666 · 2023-11-21T03:06:46Z

Hi, have you measured on a concrete model how much performance improvement this pass brings?

Wanglongzhi2001 · 2023-11-21T03:21:29Z

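Yes. On a BERT model I compared native fp32 inference under Paddle Inference (no passes run; input shape (batch_size, 512); method: three warm-up runs, then the average over ten runs) against inference with only this pass enabled. The running times compare as follows: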

Wanglongzhi2001 · 2023-11-21T03:26:27Z

paddle/fluid/framework/ir/quant_linear_fuse_pass.cc

+
+    for (int i = 0; i < weight_tensor->dims()[1]; ++i) {
+      scale_weights[i] = 1.0f / weight_scale_data[i];
+    }


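I think using a for loop here to obtain the weight's quant scale will slow down the pass, but since only the weight's dequant scale is available at this point, this seems to be the only way to get the quant scale. @zhoutianzi66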
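The relationship used in the loop above, where each quant scale is the reciprocal of the corresponding dequant scale, can be sketched in a few lines. The function name `quant_scales_from_dequant` and the zero check are illustrative assumptions and are not part of the PR.

```python
from typing import List


def quant_scales_from_dequant(dequant_scales: List[float]) -> List[float]:
    """Invert per-channel dequant scales to obtain quant scales."""
    # Guard added for this sketch only: a zero scale cannot be inverted.
    if any(s == 0.0 for s in dequant_scales):
        raise ValueError("dequant scale must be non-zero")
    return [1.0 / s for s in dequant_scales]


print(quant_scales_from_dequant([0.5, 0.25, 2.0]))  # [2.0, 4.0, 0.5]
```

The work is linear in the number of output channels, one division per channel, matching the loop bound `weight_tensor->dims()[1]` in the diff.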

RichardWooSJTU · 2023-11-21T03:31:07Z

The question above was about inference time, not the running time of the pass; time spent in the pass is not counted.

Wanglongzhi2001 · 2023-11-21T03:37:56Z

I believe what I measured is indeed inference time: I start the timer right before predictor.run() and stop it right after predictor.run() returns. The comparison is between inference time with no passes at all and inference time with only this pass applied.
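The measurement procedure described in this thread (three warm-up runs, then the average over ten timed runs, with the timer wrapped around the inference call) can be sketched as a small helper. The name `average_latency` and the generic `run` callable are assumptions for illustration; in the setup described here, `run` would wrap predictor.run().

```python
import time
from typing import Callable


def average_latency(run: Callable[[], object],
                    warmup: int = 3,
                    repeats: int = 10) -> float:
    """Return the mean wall-clock seconds per call of `run`, excluding warm-up."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        run()
    start = time.perf_counter()
    for _ in range(repeats):
        run()
    return (time.perf_counter() - start) / repeats


# Stand-in workload; replace the lambda with a call that runs inference.
latency = average_latency(lambda: sum(range(100_000)))
print(f"{latency * 1e3:.3f} ms per run")
```

Running the same helper once with no passes enabled and once with only this pass enabled gives the two numbers being compared.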

qili93

LGTM for @unittest.skipIf

RichardWooSJTU · 2023-11-21T03:48:27Z

Right, so this has nothing to do with the implementation of the pass; performance depends only on the operator. It looks like at large batch sizes it is even slower than fake quantization? You may need to enable kernel selection (autotune) to speed up the operator.

Add the following in Python:
paddle.base.core.enable_autotune()
paddle.base.core.update_autotune_status()

Add the following environment variables:
export FLAGS_use_autotune=1
export FLAGS_cublaslt_exhaustive_search_times=10
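For a benchmark script that should carry these settings itself, one option is to set the flags from Python and then call the two functions named above. This is a sketch under the assumption that flags written to os.environ before Paddle is imported behave like the exported shell variables; the helper names `set_autotune_flags` and `enable_paddle_autotune` are hypothetical.

```python
import os
from typing import MutableMapping

# Flag names and values are taken from the comment above.
AUTOTUNE_FLAGS = {
    "FLAGS_use_autotune": "1",
    "FLAGS_cublaslt_exhaustive_search_times": "10",
}


def set_autotune_flags(env: MutableMapping[str, str] = os.environ) -> None:
    """Write the autotune flags into `env` (the process environment by default)."""
    env.update(AUTOTUNE_FLAGS)


def enable_paddle_autotune() -> None:
    """Call the two functions named in the comment above; requires Paddle."""
    import paddle  # imported here so this file loads without Paddle installed

    paddle.base.core.enable_autotune()
    paddle.base.core.update_autotune_status()
```

Calling `set_autotune_flags()` at the top of the script and `enable_paddle_autotune()` before creating the predictor would mirror the order given in the comment.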

Wanglongzhi2001 · 2023-11-21T04:14:25Z

OK.

qili93

LGTM for @unittest.skipIf

Wanglongzhi2001 · 2024-02-05T04:35:28Z

After adding the environment variables, the test results for native fp32 and native int8 are as follows:
Model: macbert-large-chinese (335 million parameters)
Test environment: A100

Some operator-fusion passes that further speed up native fp32 inference, such as multihead_matmul_fuse, are not supported once this int8 fusion has been applied. The current int8 speedup therefore comes purely from faster operator computation, and there is still considerable room for acceleration at the pass level.

paddle-bot bot added the contributor External developers label Nov 3, 2023

add the quant_linear_fuse_pass

82c1f0d

RichardWooSJTU reviewed Nov 7, 2023

vivienfanghuagood reviewed Nov 7, 2023

Wanglongzhi2001 mentioned this pull request Nov 8, 2023

[WeeklyReports] 2023.10.25~2023.11.07 Weekly Report Summary PFCCLab/Camp#54

Closed

22 tasks

Wanglongzhi2001 force-pushed the develop branch from 63409ad to 82c1f0d Compare November 9, 2023 13:24

refactor: refactor quant_linear_fuse_pass

4bd765f

Wanglongzhi2001 added 12 commits November 13, 2023 07:08

refactor: refactor quant_linear_fuse_pass

4fab642

add the test of quant_linear_fuse_pass

f9e2837

fix: fix quant_linear_fuse_pass

e963c82

refactor: refactor pass and test

bdd2cc4

fix: fix test

7fc4b3b

fix: fix the bug of auto_scan_test

efba939

refactor the number of test case and auto_scan_test

67e22a0

refactor: refactor test and auto_scan_test

6845b7f

refactor: reduce the execute time of test

a565e5d

refactor: reduce test exec time

73a3292

fix: fix test

0f3f6c0

revise the timeout of the pass

3a4606f

Wanglongzhi2001 force-pushed the develop branch from 7160228 to 3a4606f Compare November 17, 2023 15:54

Wanglongzhi2001 added 2 commits November 20, 2023 07:08

refactor: refactor test

fddae09

reduce the max_examples

b2b454b

Wanglongzhi2001 commented Nov 21, 2023

qili93 previously approved these changes Nov 21, 2023

refactor: refactor the error message

9a05406

Wanglongzhi2001 dismissed qili93’s stale review via 9a05406 November 21, 2023 07:17

fix ci: replace the deprecated mutable_data

ece1e49

qili93 approved these changes Nov 22, 2023

yuanlehome merged commit 6747af6 into PaddlePaddle:develop Nov 22, 2023
28 checks passed

yuanlehome changed the title ~~add the quant_linear_fuse_pass~~ [Inference] Add the quant_linear_fuse_pass Nov 22, 2023

Wanglongzhi2001 mentioned this pull request Nov 22, 2023

[WeeklyReports] 2023.11.08~2023.11.21 Weekly Report Summary PFCCLab/Camp#77

Closed

21 tasks

SecretXV pushed a commit to SecretXV/Paddle that referenced this pull request Nov 28, 2023

[Inference] Add the quant_linear_fuse_pass (PaddlePaddle#58637)

0cf16dd

* add the quant_linear_fuse_pass

vivienfanghuagood mentioned this pull request Feb 5, 2024

How can Custom Device support quantization? #61516

Closed
