Following the request in pytorch/pytorch#153019, this PR enables awq-uint4 for Intel GPU in pytorch/ao, now that RTN support is ready.
How to run the AWQ quantization example:
```shell
cd torchao/prototype/awq
python example.py --device xpu <huggingface-model> awq-uint4-128  # e.g. meta-llama/Llama-3.1-8B-Instruct
```
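For background on what the example runs: AWQ (activation-aware weight quantization) rescales weight input channels by a factor derived from activation magnitudes before round-to-nearest uint4 quantization, grid-searching the scaling exponent so that salient channels lose less precision. The sketch below illustrates that core idea in plain NumPy; it is not the torchao implementation, and all function names here are illustrative.

```python
import numpy as np

def quantize_uint4_groupwise(w, group_size=128):
    """Round-to-nearest affine uint4 quantization over column groups.

    Returns the dequantized weights so callers can measure quantization error.
    """
    out = np.empty_like(w, dtype=np.float64)
    for start in range(0, w.shape[1], group_size):
        g = w[:, start:start + group_size]
        lo = g.min(axis=1, keepdims=True)
        hi = g.max(axis=1, keepdims=True)
        scale = np.maximum(hi - lo, 1e-8) / 15.0  # uint4 range is 0..15
        q = np.clip(np.round((g - lo) / scale), 0, 15)
        out[:, start:start + group_size] = q * scale + lo
    return out

def awq_search_scale(w, x, group_size=128, n_grid=20):
    """Grid-search a per-input-channel scale s = mean(|x|)**alpha that
    minimizes the quantized layer's output error (the core AWQ idea).

    w: (out_features, in_features) weight; x: (batch, in_features) sample.
    """
    act_mag = np.abs(x).mean(axis=0) + 1e-8  # per-input-channel magnitude
    y_ref = x @ w.T                          # full-precision reference output
    best_err, best_s = np.inf, np.ones(w.shape[1])
    for i in range(n_grid):
        alpha = i / n_grid                   # alpha = 0 recovers plain RTN
        s = act_mag ** alpha
        wq = quantize_uint4_groupwise(w * s, group_size) / s  # scale, quantize, undo
        err = np.mean((x @ wq.T - y_ref) ** 2)
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err
```

Because alpha = 0 is in the search grid, the searched error is never worse than plain round-to-nearest quantization on the same data.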
Results of meta-llama/Llama-3.1-8B-Instruct on Intel GPU:
```
{'perplexity': {'perplexity': 10.099576950073242, 'prediction_time': 0.20489671968780787}}
```
Results of meta-llama/Llama-3.1-8B-Instruct on NVIDIA A100 GPU:
```
{'perplexity': {'perplexity': 10.160041809082031, 'prediction_time': 0.4466673863672577}}
```
Pull Request resolved: #2248
Approved by: https://github.com/liangan1, https://github.com/jerryzh168
Changes in `torchao/prototype/awq/api.py` (12 additions, 3 deletions):
```diff
@@ -5,12 +5,15 @@
 # LICENSE file in the root directory of this source tree.
 import types
 from dataclasses import dataclass
+from typing import Optional
 
 import torch
 
 import torchao
 from torchao.core.config import AOBaseConfig
 from torchao.dtypes import (
+    Int4XPULayout,
+    Layout,
     TensorCoreTiledLayout,
     to_affine_quantized_intx,
 )
@@ -105,12 +108,14 @@ class AWQUIntXConfig(AOBaseConfig):
 
     Args:
         quant_dtype: The data type of the quantized weights. Currently only torch.uint4 is intended to be used but can be used with torch.uint1 -> torch.uint8
+        `layout`: layout type for quantized tensor, default is `TensorCoreTiledLayout(inner_k_tiles=8)`
         group_size: Quantization granularity. Use -1 for channel wise quantization
         weight_quant_fn: The quantization function to be used, which takes in the weight and returns the quantized weight. If None, then affine uint4 quantization is used
         set_inductor_config: if True, adjusts `torchinductor` settings to recommended values.
```