
Clarification on Per-Channel vs. Per-Tensor Quantization for Weights and Activations #356

Open
kirkdort44 opened this issue Nov 24, 2024 · 0 comments

kirkdort44 commented Nov 24, 2024

Hello,

I am a Ph.D. student working on efficient deep-learning inference, and I have been reviewing the quantization strategies in Optimum Quanto, specifically in the context of the generation benchmark. I have a few questions to confirm the implementation details:

  • Are weights always quantized per-channel (e.g., along the first dimension of each layer's weight tensor)?
  • Are activations quantized per-tensor, i.e., with a single scale applied across the entire tensor? (A small sketch of what I mean by these two granularities follows this list.)
  • Are these settings consistent with the benchmarks mentioned above, or are there exceptions or additional considerations (e.g., support for other granularity levels)?
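
To make the first two questions concrete, here is a minimal sketch (plain PyTorch, not Optimum Quanto's internals) of the two granularities I am asking about; the function names are purely illustrative:

```python
import torch

def quantize_per_channel(weight: torch.Tensor, bits: int = 8):
    """One symmetric scale per output channel (dim 0) of the weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    # Reduce over every dimension except the first, so each output channel gets its own scale.
    scale = weight.abs().amax(dim=tuple(range(1, weight.ndim)), keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def quantize_per_tensor(x: torch.Tensor, bits: int = 8):
    """A single symmetric scale shared by the whole tensor (what I assume is used for activations)."""
    qmax = 2 ** (bits - 1) - 1
    scale = (x.abs().amax() / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale
```

My question is essentially whether the generation benchmark follows this split (per-channel weights, per-tensor activations) or uses a different granularity.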

Additionally, I’d like to clarify how static linear quantization is applied within Optimum Quanto (again, in the context of the generation benchmark); a sketch of the calibration flow I have in mind follows these questions:

  • Does it statically determine the scales and zero-points for weights and activations during calibration?
  • Are there any dynamic adjustments post-calibration, or does the quantization remain static throughout inference?
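
For reference, this is the static-quantization flow I am assuming, pieced together from the README; the model and calibration text below are placeholders rather than the benchmark's actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import Calibration, freeze, qint8, quantize

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Attach quantized weights and activation observers.
quantize(model, weights=qint8, activations=qint8)

# Calibration pass: activation scales are (I assume) recorded while running representative samples.
with torch.no_grad(), Calibration():
    inputs = tokenizer("A short calibration sample.", return_tensors="pt")
    model(**inputs)

# Freeze the quantized weights. My question is whether the activation scales collected above
# then stay fixed (fully static) for the rest of inference, or are still adjusted dynamically.
freeze(model)
```

If the generation benchmark deviates from this flow, a pointer to the relevant code would be very helpful.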

Thank you for your work on this project! I appreciate any insights you can provide to help me better understand these implementation details.
