Static quant support for SmoothQuant #3089
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3089
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures — As of commit abfce41 with merge base 4013764, the following jobs have failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@Xia-Weiwen I think it's better to wait until the Int8Tensor migration is done.
Thanks for the info.
```diff
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = (
-     AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
+     AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
nit: torch_dtype is deprecated; please check #2982 for more info
```diff
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = (
-     AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
+     AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
Same here.
```diff
  torch.manual_seed(34)
  w8a8_model = (
-     AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
+     AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
and here :)
```python
    model_save_path: str,
    model_save_hf_hub_path: str,
    static_quant_act: bool,
    compile: bool,
```
Could you share results for torch.compile with static quant? I'm not sure of the reason, but it decreased tokens/sec with dynamic quant, and we discussed removing it in #2728 (comment).
This PR is out of date. We need to use the new int8 tensor API.
Summary

This PR adds static quant support for SmoothQuant by adding a new `Int8StaticActivationInt8WeightConfig` configuration. Static quantization will generally have better latency and throughput than dynamic quant, as it saves the overhead of runtime qparam selection.

In the implementation:
- `SmoothQuantObserver` returns the activation scale along with the smoothing factor.
- `Int8StaticActivationInt8WeightConfig` is used for the transformation of each linear layer.
- `Int8StaticActivationInt8WeightConfig` is not suitable for general static quantization (although it works); users should use PT2E in that case. This is because the activation scale for the config is global instead of per-linear-layer, the same as `Float8StaticActivationFloat8WeightConfig`.

Test plan

This PR also updates the test cases for SmoothQuant:
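As background to the summary above, the smoothing step that `SmoothQuantObserver` supports can be sketched in a few lines. This is an illustrative, self-contained sketch only, not the torchao implementation; all function names here are hypothetical. The idea is that quantization difficulty is shifted from activations to weights by a per-input-channel factor s, using the identity (x / s) @ diag(s) W = x @ W.

```python
# Hedged sketch of the SmoothQuant smoothing idea (hypothetical names,
# not torchao code): per-input-channel factors move outlier magnitude
# from activations into the weights without changing the matmul result.

def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    # s_j = max|x_j|^alpha / max|w_j|^(1 - alpha) for input channel j.
    return [a ** alpha / w ** (1.0 - alpha)
            for a, w in zip(act_absmax, weight_absmax)]

def smooth(x, w_rows, s):
    # Scale each activation channel down and the matching weight row up.
    x_s = [xj / sj for xj, sj in zip(x, s)]
    w_s = [[sj * wij for wij in row] for sj, row in zip(s, w_rows)]
    return x_s, w_s

def matvec(x, w_rows):
    # y_i = sum_j x_j * w[j][i]; w_rows is indexed by input channel.
    n = len(w_rows[0])
    return [sum(x[j] * w_rows[j][i] for j in range(len(x))) for i in range(n)]

x = [8.0, 0.5]                     # channel 0 is an activation outlier
w = [[0.2, -0.1], [1.0, 0.5]]      # weight rows, indexed by input channel
s = smoothing_factors([8.0, 0.5], [0.2, 1.0])
x_s, w_s = smooth(x, w, s)
# After smoothing, max|x_s| is much smaller than max|x| (easier to
# quantize), while matvec(x_s, w_s) equals matvec(x, w) up to rounding.
```

The per-channel factors are folded into the weights offline, so only the cheap activation division remains at runtime; with static quant, the activation scale observed during calibration is reused as well.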