
[ Docs ] Conceptual Guides #18

Open

wants to merge 22 commits into main

Conversation

robertgshaw2-neuralmagic (Collaborator) commented Jul 8, 2024

SUMMARY:

  • explanation of why quantization is useful
  • explanation of various quantization schemes
  • benchmarking utilities

TEST PLAN:

  • N/A

robertgshaw2-neuralmagic changed the title from "Rs/concepts" to "[ Docs ] Conceptual Guides - Inference Acceleration from Quantization" on Jul 8, 2024
robertgshaw2-neuralmagic changed the title from "[ Docs ] Conceptual Guides - Inference Acceleration from Quantization" to "[ Docs ] Conceptual Guides" on Jul 8, 2024

## Theory

Quantizing from `float16` to `int8` (or lower) is tricky: `int8` can represent only 256 distinct values, while `float16` covers a very wide range. The core idea is to find the best way to project the range `[a, b]` of observed `float16` values onto the `int8` space.
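
A minimal sketch of what that projection can look like (not part of the PR docs; the function names and the asymmetric scale/zero-point scheme shown here are illustrative assumptions):

```python
import numpy as np

def quantize_affine(x: np.ndarray, num_bits: int = 8):
    """Map the observed float range [a, b] onto the signed integer grid
    using a scale and zero point (asymmetric / affine quantization)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128, 127 for int8
    a, b = float(x.min()), float(x.max())
    scale = (b - a) / (qmax - qmin)             # width of one int8 "bucket"
    zero_point = int(round(qmin - a / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Reconstruct approximate floats; the gap vs. the original is the quantization error.
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 8).astype(np.float32)
q, scale, zp = quantize_affine(weights)
max_error = np.abs(dequantize(q, scale, zp) - weights).max()  # roughly bounded by scale / 2
```

With this mapping, `a` lands on the smallest int8 value and `b` on the largest, so the per-element rounding error is bounded by about half a bucket width.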

Contributor commented:

good space for a diagram (bucketing the weight distribution to 256 buckets)


* **Static quantization**: the range for each activation is computed in advance, at quantization time, typically by passing representative "calibration" data through the model and recording the activation values. In practice, we run a number of forward passes on a calibration dataset and compute the ranges from the observed activations (see the sketch below).
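
A minimal sketch of that calibration step (not part of the PR; the observer class and the stand-in calibration data are illustrative assumptions): track per-tensor min/max over a few calibration batches, then freeze the range and derive the quantization parameters from it.

```python
import numpy as np

class MinMaxObserver:
    """Track the running min/max of every tensor passed through it."""
    def __init__(self):
        self.min_val, self.max_val = float("inf"), float("-inf")

    def observe(self, x: np.ndarray) -> None:
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

    def quantization_params(self, qmin: int = -128, qmax: int = 127):
        # Derive scale / zero point from the observed range, as in the affine sketch above.
        scale = (self.max_val - self.min_val) / (qmax - qmin)
        zero_point = int(round(qmin - self.min_val / scale))
        return scale, zero_point

# Stand-in for one layer's activations over a calibration dataset (illustrative only).
rng = np.random.default_rng(0)
calibration_batches = [rng.standard_normal((16, 512)).astype(np.float32) for _ in range(8)]

observer = MinMaxObserver()
for batch in calibration_batches:
    # In a real model this would be the activation produced by a forward pass on `batch`.
    observer.observe(batch)

# The range is frozen after calibration; these parameters are reused for every inference.
scale, zero_point = observer.quantization_params()
```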

In general, it is best practice to start your experiments with:
Contributor commented:

why?

Collaborator (Author) commented:

Why is it best practice?

robertgshaw2-neuralmagic and others added 5 commits July 15, 2024 17:55
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024
* test forward (vllm-project#16)

* test frozen (vllm-project#17)

* test frozen

* rename

* lifecycle conftest (vllm-project#21)

* test initalize (vllm-project#18)

* test initalize

* newline

* parametrize weights and inp_act

* remove dup

* test lifecycle (vllm-project#19)

* test lifecycle

* comments

* comments

* add quantization test

* Lifecycle/min max obs (vllm-project#20)

* min max test

* add minmax obs

* test scale range and min_max update

* rebase

* rebase

* fix

* fix