[ Docs ] Conceptual Guides #18
base: main
Conversation
## Theory
Performing quantization to go from `float16` to `int8` (or lower) is tricky. Only 256 values can be represented in `int8`, while `float16` can represent a very wide range of values. The idea is to find the best way to project the range `[a, b]` of our floating-point values onto the `int8` space.
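Below is a minimal NumPy sketch of one common way to do this projection, an affine (scale and zero-point) mapping. The `quantize`/`dequantize` helpers and their names are illustrative, not this project's API.

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 8):
    # Observed range [a, b] of the float values (assumes a < b).
    a, b = float(x.min()), float(x.max())
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # [-128, 127]

    # Project [a, b] onto the integer grid: x_q = round(x / scale) + zero_point.
    scale = (b - a) / (qmax - qmin)
    zero_point = int(round(qmin - a / scale))

    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point

def dequantize(x_q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Recover an approximation of the original float values.
    return (x_q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
x_q, scale, zp = quantize(x)
print(np.abs(x - dequantize(x_q, scale, zp)).max())  # quantization error
```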
good space for a diagram (bucketing the weight distribution to 256 buckets)
* **Static quantization**: the range for each activation is computed in advance at quantization-time, typically by passing representative "calibration" data through the model and recording the activation values. In practice, we run a number of forward passes on a calibration dataset and compute the ranges from the observed activation values (see the sketch below).
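Below is a minimal PyTorch sketch of this calibration step; the tiny model, the random "calibration" batches, and the min/max hook are hypothetical placeholders, not this project's API.

```python
import torch

# Hypothetical tiny model and random calibration batches, for illustration only.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8)
)
calibration_data = [torch.randn(4, 16) for _ in range(8)]

observed = {}  # module name -> (running min, running max) of its activations

def make_observer(name):
    def hook(module, inputs, output):
        lo, hi = output.min().item(), output.max().item()
        old_lo, old_hi = observed.get(name, (float("inf"), float("-inf")))
        observed[name] = (min(old_lo, lo), max(old_hi, hi))
    return hook

handles = [
    module.register_forward_hook(make_observer(name))
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
]

# Forward passes over the calibration set; the hooks record activation ranges.
with torch.no_grad():
    for batch in calibration_data:
        model(batch)

for handle in handles:
    handle.remove()

# Each observed [a, b] range can then be mapped to a scale/zero-point
# as in the affine-quantization sketch above.
print(observed)
```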
In general, it is best practice to start your experiments with:
why?
Why is it best practice?
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
* test forward (vllm-project#16)
* test frozen (vllm-project#17)
* test frozen
* rename
* lifecycle conftest (vllm-project#21)
* test initalize (vllm-project#18)
* test initalize
* newline
* parametrize weights and inp_act
* remove dup
* test lifecycle (vllm-project#19)
* test lifecycle
* comments
* comments
* add quantization test
* Lifecycle/min max obs (vllm-project#20)
* min max test
* add minmax obs
* test scale range and min_max update
* rebase
* rebase
* fix
* fix