Hello,

I am a Ph.D. student currently working on efficient deep learning inference, and I have been reviewing the quantization strategies in Optimum Quanto, specifically in the context of the generation benchmark. I have a few questions to confirm the implementation details:
Are weights always quantized per-channel (e.g., along the first dimension of a layer's weight)?
Are activations quantized per-tensor, applying a single scale across the entire tensor?
Are these settings consistent with the benchmarks mentioned above, or are there exceptions or additional considerations (e.g., support for other granularity levels)?
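To make the granularity question concrete, here is my current understanding expressed in plain PyTorch. The helper names and details below are mine, not Optimum Quanto's API; I am only trying to confirm whether this matches what the library does internally.

```python
import torch

def quantize_per_channel(weight: torch.Tensor, axis: int = 0):
    # One scale per slice along `axis` (e.g., per output channel), symmetric int8.
    reduce_dims = tuple(d for d in range(weight.ndim) if d != axis)
    scale = weight.abs().amax(dim=reduce_dims, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def quantize_per_tensor(activation: torch.Tensor):
    # A single scale shared by the whole tensor, symmetric int8.
    scale = activation.abs().amax() / 127.0
    q = torch.clamp(torch.round(activation / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(64, 128)               # e.g. a Linear weight (out_features, in_features)
x = torch.randn(8, 128)                # a batch of activations
qw, w_scale = quantize_per_channel(w)  # w_scale has shape (64, 1): one scale per row
qx, x_scale = quantize_per_tensor(x)   # x_scale is a scalar
```

Is this per-channel-weights / per-tensor-activations scheme what the generation benchmark uses?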
Additionally, I’d like to clarify how static linear quantization is applied within Optimum Quanto (again, in the context of the generation benchmark):
Does it statically determine the scales and zero-points for weights and activations during calibration?
Are there any dynamic adjustments post-calibration, or does the quantization remain static throughout inference?
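For reference, here is roughly the workflow I have been following, based on the documented quantize / Calibration / freeze API. The model choice is just an example and the exact steps may not match the benchmark script, so please correct me if I have misunderstood the intended flow:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint8, Calibration

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tag weights and activations for int8 quantization.
quantize(model, weights=qint8, activations=qint8)

# Calibration pass: activation scales are observed on sample inputs.
# My question is whether the scales recorded here stay fixed afterwards.
with torch.no_grad(), Calibration():
    inputs = tokenizer("A short calibration sample.", return_tensors="pt")
    model(**inputs)

# Freeze converts the quantized weights to their integer representation.
freeze(model)
```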
Thank you for your work on this project! I appreciate any insights you can provide to better understand these implementations.