Replace implementation for int8 dynamic quantization with call to `quantize`

Summary:
Previously we added `quantize` as a general API (pytorch#256) for the affine quantized tensor subclass, and for tensor-subclass-based dtype conversion in general.

The plan is to use it to replace the existing quant APIs, including int4 weight-only, int8 weight-only, int8 dynamic quant,
and 8da4w (for ExecuTorch).

In this PR we start replacing the implementation of the int8 dynamic quant API with the `quantize` API backed by the affine quantized
tensor subclass. We verify that performance does not regress for the ViT model.
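
As context for what this transform produces, below is a minimal numeric sketch of what int8 dynamic quantization of a linear layer computes (per-token activation scales, per-channel weight scales, integer matmul, then rescale). It is illustrative only; the actual implementation in this PR routes through `quantize`, the affine quantized tensor subclass, and `LinearActQuantizedTensor`.

# Illustrative only: a hand-rolled int8 dynamic linear, NOT the code added in this
# PR (which goes through the `quantize` API and tensor subclasses).
import torch

def int8_dynamic_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Per-token (row-wise) dynamic quantization of activations to int8.
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.clamp(torch.round(x / x_scale), -127, 127).to(torch.int8)

    # Per-output-channel quantization of the weight to int8.
    w_scale = w.abs().amax(dim=-1, keepdim=True) / 127.0
    w_int8 = torch.clamp(torch.round(w / w_scale), -127, 127).to(torch.int8)

    # The real kernel does an int8 x int8 -> int32 matmul (and inductor can fuse the
    # rescale multiply into it); float is used here only to keep the sketch portable.
    acc = x_int8.to(torch.float32) @ w_int8.to(torch.float32).t()
    return acc * x_scale * w_scale.t()

x = torch.randn(4, 16)
w = torch.randn(8, 16)
print((int8_dynamic_linear(x, w) - x @ w.t()).abs().max())  # small quantization error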

Test Plan:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py

reference: elapsed_time:  1.4821058654785155  milliseconds
after refactor: elapsed_time:  1.4804757690429688  milliseconds
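
The timings above come from the tutorial script; for reference, a latency like this is typically measured with CUDA events along the following lines (a sketch assuming a CUDA-capable setup; the tutorial's own benchmark helper is not reproduced here).

import torch

def benchmark_ms(fn, *args, warmup=5, iters=20):
    # Warm-up runs so compilation/autotuning does not count toward the measurement.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per iteration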

generated code diff: https://gist.github.com/jerryzh168/90c71107a5aaaa5d8dd2170c573e076d

Reviewers:

Subscribers:

Tasks:

Tags:
jerryzh168 committed May 31, 2024
1 parent 68ce5b8 commit cd1ebc8
Showing 2 changed files with 2 additions and 1 deletion.
1 change: 1 addition & 0 deletions torchao/quantization/quant_api.py
@@ -37,6 +37,7 @@
     Int8WeightOnlyQuantizedLinearWeight,
     QuantizedLinearWeightBase,
     to_laq,
+    LinearActQuantizedTensor,
 )

 from .quant_primitives import (
2 changes: 1 addition & 1 deletion tutorials/quantize_vit/run_vit_b_quant.py
@@ -19,7 +19,7 @@
 inductorconfig.force_fuse_int_mm_with_mul = True
 ## Quantization code - end

-model = torch.compile(model, mode='max-autotune')
+model = torch.compile(model, mode='max-autotune', fullgraph=True)

 # Must run with no_grad when optimizing for inference
 with torch.no_grad():
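
Put together, the relevant portion of the tutorial now looks roughly like the sketch below; the ViT model and input setup are assumptions for illustration, not lines copied from the file.

import torch
import torch._inductor.config as inductorconfig
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Fuse the int8 matmul with the following rescale multiply in inductor-generated code.
inductorconfig.force_fuse_int_mm_with_mul = True

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval().cuda()
example_input = torch.randn(1, 3, 224, 224, device="cuda")

# ... int8 dynamic quantization is applied to `model` here via the `quantize` API ...

# fullgraph=True asserts that the quantized model compiles as a single graph (no graph
# breaks), which the tensor-subclass-based implementation must support.
model = torch.compile(model, mode='max-autotune', fullgraph=True)

# Must run with no_grad when optimizing for inference
with torch.no_grad():
    model(example_input)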
