
Clarification on Per-Channel vs. Per-Tensor Quantization for Weights and Activations #356

Open
kirkdort44 opened this issue Nov 24, 2024 · 0 comments

kirkdort44 commented Nov 24, 2024

Hello,

I am a Ph.D. student working on efficient deep-learning inference, and I have been reviewing the quantization strategies in Optimum Quanto, specifically in the context of the generation benchmark. I have a few questions to confirm the implementation details:

  • Are weights always quantized per-channel (e.g., along the first dimension of each layer's weight tensor)?
  • Are activations quantized per-tensor, i.e., with a single scale applied across the entire tensor? (A small sketch of what I mean by these two granularities follows this list.)
  • Are these settings consistent with the benchmarks mentioned above, or are there exceptions or additional considerations (e.g., support for other granularity levels)?
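
To make the first two questions concrete, here is a minimal sketch (plain PyTorch, not Optimum Quanto's internals) of the two granularities I am asking about; the function names are purely illustrative:

```python
import torch

def quantize_per_channel(weight: torch.Tensor, bits: int = 8):
    """One symmetric scale per output channel (dim 0) of the weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    # Reduce over every dimension except the first, so each output channel gets its own scale.
    scale = weight.abs().amax(dim=tuple(range(1, weight.ndim)), keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def quantize_per_tensor(x: torch.Tensor, bits: int = 8):
    """A single symmetric scale shared by the whole tensor (what I assume is used for activations)."""
    qmax = 2 ** (bits - 1) - 1
    scale = (x.abs().amax() / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale
```

My question is essentially whether the generation benchmark follows this split (per-channel weights, per-tensor activations) or uses a different granularity.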

Additionally, I’d like to clarify how static linear quantization is applied within Optimum Quanto (again, in the context of the generation benchmark); a sketch of the calibration flow I have in mind follows these questions:

  • Does it statically determine the scales and zero-points for weights and activations during calibration?
  • Are there any dynamic adjustments post-calibration, or does the quantization remain static throughout inference?
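
For reference, this is the static-quantization flow I am assuming, pieced together from the README; the model and calibration text below are placeholders rather than the benchmark's actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import Calibration, freeze, qint8, quantize

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Attach quantized weights and activation observers.
quantize(model, weights=qint8, activations=qint8)

# Calibration pass: activation scales are (I assume) recorded while running representative samples.
with torch.no_grad(), Calibration():
    inputs = tokenizer("A short calibration sample.", return_tensors="pt")
    model(**inputs)

# Freeze the quantized weights. My question is whether the activation scales collected above
# then stay fixed (fully static) for the rest of inference, or are still adjusted dynamically.
freeze(model)
```

If the generation benchmark deviates from this flow, a pointer to the relevant code would be very helpful.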

Thank you for your work on this project! I appreciate any insights you can provide to help me better understand these implementation details.
