
Quantized inference accuracy of the Bert model is wrong on Intel CPU #36962

Closed

yghstill opened this issue Nov 3, 2021 · 20 comments
@yghstill (Contributor) commented Nov 3, 2021

  • Version and environment info:
       1) PaddlePaddle version: please provide your PaddlePaddle version number (e.g. 1.1) or commit ID
       2) CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
       3) GPU: -
       4) System environment: Ubuntu 16, Python 3.7

  • Reproduction info (for errors, please provide the reproduction environment and steps):
    (1) After obtaining the post-training quantized model, use save_quant_model.py from the attachment to convert and optimize it and export a quantized model that can run on Intel CPU.
    (2) Deploy the quantized model with Paddle Inference: run cpu_infer.py to execute prediction (note: adjust the model paths set in the code). The resulting accuracy is shown in the screenshot below; the correct acc should be above 0.5, so this does not meet expectations.

  • Problem description: While adding support for the Bert model in PaddlePaddle, we converted the model from dynamic to static graph and then applied post-training quantization. Accuracy is normal with TRT int8, and on Intel CPU the inference pipeline runs, but the accuracy is wrong.
    [screenshot: accuracy results]

The code to run is in code.zip.

code.zip

@paddle-bot-old bot commented Nov 3, 2021

Hi! We've received your issue; please be patient while waiting for a response. We will arrange technicians to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for an answer in the official API documentation, FAQ, historical issues, and the AI community. Have a nice day!

@lidanqing-intel (Contributor) commented:

@yghstill Received.

@wozna (Contributor) commented Nov 3, 2021

@yghstill In cpu_infer.py, the module task_distill_zh is imported, but it is not available. Where can I find it?

@lidanqing-intel (Contributor) commented Nov 4, 2021

@wanghaoshuang
We cannot reproduce the issue; it fails with ModuleNotFoundError: No module named 'task_distill_zh'. Could you please help? Thanks!

(py3.7) li@c3gpo:~/repo/Issue_Bert_36962$ python cpu_infer.py --model_path=$PWD/tnews_quant_models/model
Traceback (most recent call last):
  File "cpu_infer.py", line 26, in <module>
    from task_distill_zh import convert_example, METRIC_CLASSES, MODEL_CLASSES
ModuleNotFoundError: No module named 'task_distill_zh'
(py3.7) li@c3gpo:~/repo/Issue_Bert_36962$ pip install task_distill_zh
ERROR: Could not find a version that satisfies the requirement task_distill_zh
ERROR: No matching distribution found for task_distill_zh

@yghstill (Contributor, Author) commented Nov 4, 2021

@lidanqing-intel (Contributor) commented Nov 4, 2021

@yghstill
I followed LiuChiachi's repo, but the following link is still not accessible to us from outside Baidu: https://paddlenlp.bj.bcebos.com/models/transformers/community/./quant_models/vocab.txt

INFO:paddle.utils.download:unique_endpoints {''}
[2021-11-04 03:03:37,987] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/community/./quant_models/vocab.txt and saved to /home/li/.paddlenlp/models/./quant_models
[2021-11-04 03:03:37,987] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/community/./quant_models/vocab.txt
[2021-11-04 03:03:40,990] [   ERROR] - Downloading from https://paddlenlp.bj.bcebos.com/models/transformers/community/./quant_models/vocab.txt failed with code 404!
Traceback (most recent call last):

wget also could not download it.

wget https://paddlenlp.bj.bcebos.com/models/transformers/community/./quant_models/vocab.txt
2021-11-04 08:43:29 ERROR 404: Not Found.

but wget could download this link

wget https://paddle-inference-lib.bj.bcebos.com/2.1.0-cpu-avx-mkl/paddle_inference.tgz

Any suggestions? Maybe consult the PaddlePaddle framework team; we have had more conversations with them, so maybe they know how to deal with it.

@yghstill (Contributor, Author) commented Nov 4, 2021

@lidanqing-intel Please modify quant_models to tnewst_quant_models in cpu_infer.py

@lidanqing-intel (Contributor) commented Nov 5, 2021

@yghstill Reproduced the issue.
@wozna

python save_quant_model.py --quant_model_path=$PWD/tnewst_quant_models/ --int8_model_save_path=INT8_PATH --ops_to_quantize="matmul,reshape,transpose"
matmul,reshape,transpose

In the IR passes, only this pattern gets fused, so the error can only be here.

Fused 12 ReshapeTransposeMatmulMkldnn patterns with reshape's xshape with transpose's xshape

Inference results

acc: 0.1053, time:  44.53691840171814

Maybe:

  1. the scales are collected and calculated wrongly
  2. [Check first] whether the original model with mkldnn ON also has the same wrong accuracy; if so, the fuse pass is wrong
  3. matmul_v2 is also quantized, but we haven't really verified its accuracy yet

lidanqing-intel added this to the Q4 milestone Nov 5, 2021
@lidanqing-intel
Copy link
Contributor

lidanqing-intel commented Nov 8, 2021

Update:
Tested FP32 with oneDNN ON: it works with good accuracy and performance, so it is the step of collecting and propagating int8 scales for matmul_v2 that goes wrong.
Note: turning oneDNN on requires ir_optim to be turned on as well, otherwise the fuse passes are not executed; the demo config sets it to false, which it should not.
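
In other words, a minimal sketch of the required combination using the Paddle Inference Python API (the model paths are placeholders):

```python
import paddle.inference as paddle_infer

config = paddle_infer.Config("model.pdmodel", "model.pdiparams")  # placeholder paths
config.enable_mkldnn()        # turn oneDNN on ...
config.switch_ir_optim(True)  # ... and keep IR optimization on, or the oneDNN fuse passes never run
```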

@sfraczek (Contributor) commented:

We have found that we need to add dequantization of those matmul_v2 weights in quant2_int8_mkldnn_pass.

Can we assume that the matmul_v2 weights are always assigned to input Y and never to input X? Or should we assume that either X or Y can be weights?
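
As an illustration of the weight dequantization mentioned above, here is a generic NumPy sketch, not the actual quant2_int8_mkldnn_pass code; the symmetric [-127, 127] scheme and per-output-channel scale layout are assumptions:

```python
import numpy as np

def dequantize_weights(quantized_weights: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map weights stored in the int8 range back to fp32.

    Assumes symmetric quantization where the stored values lie in [-127, 127]
    and `scales` holds one abs-max value per output channel (last axis), so
    fp32_weight = stored_value * scale / 127.
    """
    return quantized_weights.astype(np.float32) * scales / 127.0
```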

@lidanqing-intel (Contributor) commented:

@yghstill Please answer sfraczek's question above, because we may need to deliver the model before the 20th.

@lidanqing-intel (Contributor) commented:

@yghstill We also need a quant model that was quant-aware trained with elementwise_add included in quantize_op_types. We will update this page and add elementwise_add to the README.md:
https://github.com/PaddlePaddle/PaddleSlim/tree/develop/demo/mkldnn_quant

[screenshot of the current README sentence]
The current sentence is wrong (inadequate); it should read:

"So, when using PaddleSlim quantization-aware training, only depthwise_conv2d, conv2d, mul, matmul, and elementwise_add need to be quantized; other ops are not supported."
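
A minimal sketch of what that looks like in a quant-aware training config (the config keys are the standard PaddleSlim ones; the program and place are placeholders for the real BERT training setup):

```python
import paddle
import paddleslim

paddle.enable_static()
place = paddle.CPUPlace()
train_program = paddle.static.Program()  # stands in for the real BERT training program

# Quantize exactly the op types listed in the corrected README sentence.
quant_config = {
    'weight_quantize_type': 'channel_wise_abs_max',
    'activation_quantize_type': 'moving_average_abs_max',
    'quantize_op_types': ['depthwise_conv2d', 'conv2d', 'mul', 'matmul', 'elementwise_add'],
}
quant_program = paddleslim.quant.quant_aware(train_program, place, quant_config)
```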

@lidanqing-intel (Contributor) commented:

@sfraczek
matmul_v2's Y input can be either weights or the output of another op. X is always another op's output.
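
A hypothetical sketch of the branching this answer implies (the helper names are illustrative stand-ins, not the actual pass API):

```python
def matmul_v2_weight_input(op, is_persistable_param):
    """Return which matmul_v2 input holds weights, if any.

    `op` is assumed to expose input("X")/input("Y") like a Paddle OpDesc, and
    `is_persistable_param(name)` is a hypothetical predicate that is True for
    persistable parameters (weights) and False for activations.
    """
    # X is always another op's output, so only Y can be a weight tensor
    # that needs dequantization.
    y_name = op.input("Y")[0]
    return "Y" if is_persistable_param(y_name) else None
```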

@lidanqing-intel (Contributor) commented:

  1. Should we enable all the fuses of mul, matmul_v1, and fc for matmul_v2?
  2. Or could we keep matmul_v2 as the only interface and map it to different ops (mul, matmul_v1, fc) according to the situation?

@lidanqing-intel (Contributor) commented Dec 1, 2021

Will those mappings, like matmul_v2->matmul_v1, matmul_v2->mul, matmul_v2->fc, remain?
Yes.

@lidanqing-intel (Contributor) commented Dec 1, 2021

matmul_v2 only needs to consider all of matmul_v1's passes, not those of mul and the like. The matmul_v1 operator will not be deprecated, but the matmul_v2->matmul_v1 pass was going to be deprecated.
Final solution: whatever can go through matmul_v1 goes through v1 directly; whatever cannot goes through matmul_v2. The matmul_v2->matmul_v1 pass will be kept.

@sfraczek (Contributor) commented Dec 9, 2021

Intel i9 accuracy and performance
tnewst_quant_models: acc: 0.5351, time: 71.88971424102783
bert_fp32: acc: 0.5467, time: 46.44249224662781
bert_int8 acc: 0.5383, time: 36.17305397987366
fc_gelu acc: 0.5383, time: 34.68397045135498 (Perf on i9, should be faster on 6271)

@lidanqing-intel (Contributor) commented Dec 17, 2021

6 transformers
Bert FP32/INT8 tested on Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz

Performance (fps) on 6271 | Native FP32 | oneDNN FP32 | oneDNN INT8 | oneDNN INT8 / oneDNN FP32 | oneDNN INT8 / Native FP32
--- | --- | --- | --- | --- | ---
thread 1 | 313.53 | 280.13 | 128.82 | 2.17X | 2.43X
thread 6 | 98.05 | 55.25 | 30.16 | 1.83X | 3.25X

Accuracy on 6271 | tnewst_quant_model | native without oneDNN FP32 | oneDNN FP32 | oneDNN INT8
--- | --- | --- | --- | ---
thread 1/6 | 0.5351 | 0.5567 | 0.5567 | 0.5376
  • Paddle commit b0d12d9931f34294cdb50e9893287dc3deea5c60
  • config.switch_ir_optim(True) // Note: in the original cpu_infer.py this is False, which is wrong.
  • How to save the int8 model (these ops will be added to Paddle as defaults in the next PR):
python3.7 save_quant_model.py --quant_model_path=$PWD/tnewst_quant_models/ --int8_model_save_path=INT8_PATH --ops_to_quantize="matmul,reshape,transpose,fc,matmul_v2"
  • oneDNN config settings in cpu_infer.py
    def create_predictor(cls, args):
        config = paddle.inference.Config(args.model_path + ".pdmodel",
                                         args.model_path + ".pdiparams")
        # config = paddle.inference.Config(args.model_path + "__model__",
        #                                  args.model_path + "__params__")
        # set CPU configs accordingly,
        # such as enable_mkldnn, set_cpu_math_library_num_threads
        print("--------------------> choose cpu mode.")
        config.disable_gpu()
        config.enable_mkldnn()
        config.set_cpu_math_library_num_threads(6)
        # Note: this is True here, not False.
        config.switch_ir_optim(True)
        config.enable_memory_optim()
        config.switch_use_feed_fetch_ops(False)
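
For completeness, a sketch of how such a config is used for one prediction; the model path, input names, dtypes, and shapes below are placeholders, not values taken from the actual BERT model:

```python
import numpy as np
import paddle.inference as paddle_infer

# Build the config as above (placeholder paths) and create the predictor.
config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.disable_gpu()
config.enable_mkldnn()
config.set_cpu_math_library_num_threads(6)
config.switch_ir_optim(True)
predictor = paddle_infer.create_predictor(config)

# Feed dummy int64 token ids; the real names and shapes come from the exported model.
for name in predictor.get_input_names():
    handle = predictor.get_input_handle(name)
    handle.copy_from_cpu(np.zeros((1, 128), dtype="int64"))

predictor.run()

# Fetch the first output (e.g. the classification logits used to compute acc).
out = predictor.get_output_handle(predictor.get_output_names()[0])
logits = out.copy_to_cpu()
```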

@lidanqing-intel (Contributor) commented Dec 17, 2021

@yghstill @sfraczek Hi, performance and accuracy have been benchmarked; @yghstill, you can verify.

Because we are approaching the end of Q4, any further requirements will be handled next quarter. Some possible directions:

  1. fc + elementwise_add
  2. further checks

@yghstill (Contributor, Author) commented Dec 23, 2021

@lidanqing-intel @sfraczek Verified. Thanks for the above optimization.
