[NPU] Add mixed_precision for Qwen2 7B #12098

Merged

Conversation

@Oscilloscope98 (Contributor) commented on Sep 20, 2024

Description

https://github.com/analytics-zoo/nano/issues/1633#issuecomment-2363009566

Support mixed_precision in the from_pretrained function for NPU (see the usage sketch after this list):

  • If mixed_precision=True and load_in_low_bit='sym_int4', Qwen2 7B will use INT8 for lm_head
  • A model saved with mixed_precision=True/False will keep the same option when the saved model is loaded with load_low_bit
  • Disable the lm_head split when load_in_low_bit='sym_int8'
  • Update the example accordingly
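A minimal usage sketch of the new option (the Qwen2 checkpoint path, save directory, and extra keyword arguments are illustrative and not taken from this PR; mixed_precision, load_in_low_bit, and the save_low_bit/load_low_bit round trip are the pieces this change touches):

```python
# Sketch only: model path, save directory, and extra kwargs are illustrative.
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model_path = "Qwen/Qwen2-7B-Instruct"  # assumed example checkpoint

# With load_in_low_bit="sym_int4" and mixed_precision=True,
# the lm_head of Qwen2 7B is quantized to INT8 instead of INT4.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",
    mixed_precision=True,
    trust_remote_code=True,
)

# The mixed_precision choice is recorded alongside the saved model,
# so load_low_bit reuses the same option without re-specifying it.
model.save_low_bit("./qwen2-7b-npu-sym-int4-mixed")
model = AutoModelForCausalLM.load_low_bit("./qwen2-7b-npu-sym-int4-mixed")
```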


@jason-dai (Contributor) left a comment


LGTM


@rnwang04 (Contributor) left a comment


Others LGTM

@Oscilloscope98 merged commit 828fa01 into intel-analytics:main on Sep 20, 2024
1 check passed