autoquant using aqt #609
Conversation
force-pushed d4d5abe to 2623bab
# return weight

# avoid circular dep
from torchao.dtypes import to_affine_quantized
can we reuse the code from quant_api.py? Also, we need to refactor input_quant_func into a normal function (not a lambda) for serialization to work, I think; it might help to do that refactor at the same time.
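For context, a minimal sketch (not torchao code) of why a lambda-based input_quant_func blocks serialization while a module-level function does not; the placeholder function bodies are assumptions:

import pickle

# Lambdas are pickled by qualified name, which fails for "<lambda>", so a tensor
# subclass holding one cannot round-trip through pickle/torch.save.
input_quant_func = lambda x: x * 0.5  # placeholder for the real activation-quant step
try:
    pickle.dumps(input_quant_func)
except pickle.PicklingError as e:
    print("lambda does not serialize:", e)

# A normal top-level function serializes by name without issue.
def quantize_input(x):
    return x * 0.5  # placeholder: real code would quantize the activation

pickle.dumps(quantize_input)  # works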
the issue is that we need to call from_float with super in order to get this to work correctly. The code in quant_api would generate an aqt, not an autoquant class inheriting from an aqt. (I tried this approach initially since that's how it worked with the subclass.)
If we want to reuse the code, it may make sense to make a function that prepares all the variables needed to go into from_float:
def int8_weight_only_kwargs(weight):
    # ...do the code up to to_affine_quantized and put it all together
    return a_bunch_of_kwargs

then you could have the quant_api code be like

def int8_weight_only():
    def apply_int8wo_quant(weight):
        kwargs = int8_weight_only_kwargs(weight)
        return to_affine_quantized(**kwargs)
    return _get_linear_subclass_inserter(apply_int8wo_quant)

then in autoquant we could do something similar

def from_float(weight):
    kwargs = int8_weight_only_kwargs(weight)
    return super().from_float(**kwargs)
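A runnable toy version of that split (class and helper names are stand-ins, not the actual torchao API): preparing the kwargs in one helper lets both the quant_api path and the autoquant subclass construct through the class, so super()-style construction returns the subclass rather than a plain aqt.

import torch

def int8_weight_only_kwargs(weight):
    # stand-in for the shared preparation step: per-row scales and int8 data
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / 127.0
    int_data = torch.round(weight / scale).to(torch.int8)
    return {"int_data": int_data, "scale": scale}

class AffineQuantizedTensorSketch:  # stand-in for the real aqt class
    def __init__(self, int_data, scale):
        self.int_data, self.scale = int_data, scale

    @classmethod
    def from_float(cls, weight):
        # constructing through `cls` is what makes super().from_float(...) in a
        # subclass return the subclass, not the base aqt
        return cls(**int8_weight_only_kwargs(weight))

class AutoQuantWeightSketch(AffineQuantizedTensorSketch):  # stand-in for the autoquant class
    pass

w = torch.randn(4, 32)
print(type(AutoQuantWeightSketch.from_float(w)).__name__)  # AutoQuantWeightSketch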
does moving the apply_int8wo_quant function to top level help? I'm doing it here: https://github.com/pytorch/ao/pull/630/files; it seems like you call super().from_float after the aqt is produced, right?
we can move the apply-quant-to-weight-tensor function to top level as well, I think
thanks, the changes look good to me, please fix CI before landing
# in_features = weight.shape[1]
# int8 dynamic quantization only has benefit when in_features > 16
# if in_features <= 16:
#     return weight
what are these? should these be enabled or removed?
added a todo
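For reference, a sketch of what the guard behind that TODO could look like if re-enabled; the threshold comes from the commented-out lines above, and the wrapper name and quantize_fn parameter are hypothetical.

# Assumed threshold from the commented-out code: int8 dynamic quantization is
# only expected to help once in_features exceeds 16.
MIN_IN_FEATURES = 16

def maybe_int8_dynamic_quant(weight, quantize_fn):
    # hypothetical wrapper: skip quantization for very small layers
    in_features = weight.shape[1]
    if in_features <= MIN_IN_FEATURES:
        return weight
    return quantize_fn(weight)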
force-pushed 56b44aa to 5832bf8
Summary:
changed autoquant to use aqt instead of the old subclass subtensors. aqt now first dispatches to a static _quantized_linear_op, which then dispatches to the normal kernel function. This gives autoquant an extension point to modify the kernel functions for the various quantization modes without editing the main kernel function of every class. linear_activation_quantized_tensor got the same treatment.
Some transposes were found in the aqt kernels that are not present in the subclass kernels, but they do not seem to affect performance (see benchmark_results.txt for an autoquant perf run).
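A minimal sketch of that dispatch indirection (class names and the plain F.linear placeholders are illustrative, not the real torchao kernels): the base weight wrapper routes linear through a static _quantized_linear_op, so the autoquant variant only overrides that one hook.

import torch
import torch.nn.functional as F

class QuantizedWeightSketch:
    """Stand-in for an aqt-style weight wrapper; not the real torchao class."""
    def __init__(self, weight):
        self.weight = weight

    @staticmethod
    def _quantized_linear_op(input_tensor, weight_tensor, bias):
        # the "normal" kernel path; a real implementation would call a quantized kernel
        return F.linear(input_tensor, weight_tensor.weight, bias)

    def linear(self, input_tensor, bias=None):
        # dispatch always goes through the static hook, so subclasses swap kernels
        # without touching this method
        return type(self)._quantized_linear_op(input_tensor, self, bias)

class AutoQuantWeightSketch(QuantizedWeightSketch):
    @staticmethod
    def _quantized_linear_op(input_tensor, weight_tensor, bias):
        # autoquant's extension point: benchmark or substitute kernels here
        return F.linear(input_tensor, weight_tensor.weight, bias)

x, w = torch.randn(2, 16), torch.randn(8, 16)
print(AutoQuantWeightSketch(w).linear(x).shape)  # torch.Size([2, 8])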
Test Plan:
sh benchmarks.sh
python test_integration.py