Replace implementation for int8 dynamic quantization with call to `quantize`

Summary:
Previously we added `quantize` as a general API (pytorch#256) for the affine quantized tensor subclass, and for tensor-subclass-based dtype conversion in general.

The plan is to use it to replace the existing quant APIs, including int4 weight-only, int8 weight-only, int8 dynamic quant,
and 8da4w (for ExecuTorch).

In this PR we start replacing the implementation of the int8 dynamic quant API with the `quantize` API backed by the affine quantized
tensor subclass. We verify that performance does not regress for the ViT model.
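
As context for what this transform produces, below is a minimal numeric sketch of what int8 dynamic quantization of a linear layer computes (per-token activation scales, per-channel weight scales, integer matmul, then rescale). It is illustrative only; the actual implementation in this PR routes through `quantize`, the affine quantized tensor subclass, and `LinearActQuantizedTensor`.

# Illustrative only: a hand-rolled int8 dynamic linear, NOT the code added in this
# PR (which goes through the `quantize` API and tensor subclasses).
import torch

def int8_dynamic_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Per-token (row-wise) dynamic quantization of activations to int8.
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.clamp(torch.round(x / x_scale), -127, 127).to(torch.int8)

    # Per-output-channel quantization of the weight to int8.
    w_scale = w.abs().amax(dim=-1, keepdim=True) / 127.0
    w_int8 = torch.clamp(torch.round(w / w_scale), -127, 127).to(torch.int8)

    # The real kernel does an int8 x int8 -> int32 matmul (and inductor can fuse the
    # rescale multiply into it); float is used here only to keep the sketch portable.
    acc = x_int8.to(torch.float32) @ w_int8.to(torch.float32).t()
    return acc * x_scale * w_scale.t()

x = torch.randn(4, 16)
w = torch.randn(8, 16)
print((int8_dynamic_linear(x, w) - x @ w.t()).abs().max())  # small quantization error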

Test Plan:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

reference: elapsed_time:  1.4821058654785155  milliseconds
after refactor: elapsed_time:  1.4804757690429688  milliseconds
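
The timings above come from the tutorial script; for reference, a latency like this is typically measured with CUDA events along the following lines (a sketch assuming a CUDA-capable setup; the tutorial's own benchmark helper is not reproduced here).

import torch

def benchmark_ms(fn, *args, warmup=5, iters=20):
    # Warm-up runs so compilation/autotuning does not count toward the measurement.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per iteration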

generated code diff: https://gist.github.com/jerryzh168/90c71107a5aaaa5d8dd2170c573e076d

Reviewers:

Subscribers:

Tasks:

Tags:
jerryzh168 committed May 31, 2024
1 parent 68ce5b8 commit cd1ebc8
Showing 2 changed files with 2 additions and 1 deletion.
1 change: 1 addition & 0 deletions torchao/quantization/quant_api.py
@@ -37,6 +37,7 @@
     Int8WeightOnlyQuantizedLinearWeight,
     QuantizedLinearWeightBase,
     to_laq,
+    LinearActQuantizedTensor,
 )

 from .quant_primitives import (
2 changes: 1 addition & 1 deletion tutorials/quantize_vit/run_vit_b_quant.py
@@ -19,7 +19,7 @@
 inductorconfig.force_fuse_int_mm_with_mul = True
 ## Quantization code - end

-model = torch.compile(model, mode='max-autotune')
+model = torch.compile(model, mode='max-autotune', fullgraph=True)

 # Must run with no_grad when optimizing for inference
 with torch.no_grad():
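
Put together, the relevant portion of the tutorial now looks roughly like the sketch below; the ViT model and input setup are assumptions for illustration, not lines copied from the file.

import torch
import torch._inductor.config as inductorconfig
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Fuse the int8 matmul with the following rescale multiply in inductor-generated code.
inductorconfig.force_fuse_int_mm_with_mul = True

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval().cuda()
example_input = torch.randn(1, 3, 224, 224, device="cuda")

# ... int8 dynamic quantization is applied to `model` here via the `quantize` API ...

# fullgraph=True asserts that the quantized model compiles as a single graph (no graph
# breaks), which the tensor-subclass-based implementation must support.
model = torch.compile(model, mode='max-autotune', fullgraph=True)

# Must run with no_grad when optimizing for inference
with torch.no_grad():
    model(example_input)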
