[ Docs ] Conceptual Guides #18
base: main
Conversation
## Theory
Performing quantization to go from `float16` to `int8` (or lower) is tricky. Only 256 values can be represented in `int8`, while `float16` can represent a very wide range of values. The idea is to find the best way to project the range `[a, b]` of our floating-point values onto the `int8` space.
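Below is a minimal NumPy sketch of one common way to do this projection, an affine (scale and zero-point) mapping. The `quantize`/`dequantize` helpers and their names are illustrative, not this project's API.

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 8):
    # Observed range [a, b] of the float values (assumes a < b).
    a, b = float(x.min()), float(x.max())
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # [-128, 127]

    # Project [a, b] onto the integer grid: x_q = round(x / scale) + zero_point.
    scale = (b - a) / (qmax - qmin)
    zero_point = int(round(qmin - a / scale))

    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point

def dequantize(x_q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Recover an approximation of the original float values.
    return (x_q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
x_q, scale, zp = quantize(x)
print(np.abs(x - dequantize(x_q, scale, zp)).max())  # quantization error
```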
good space for a diagram (bucketing the weight distribution to 256 buckets)
* **Static quantization**: the range for each activation is computed in advance at quantization-time, typically by passing representative "calibration" data through the model and recording the activation values. In practice, we run a number of forward passes on a calibration dataset and compute the ranges from the observed activation values (see the sketch below).
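Below is a minimal PyTorch sketch of this calibration step; the tiny model, the random "calibration" batches, and the min/max hook are hypothetical placeholders, not this project's API.

```python
import torch

# Hypothetical tiny model and random calibration batches, for illustration only.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8)
)
calibration_data = [torch.randn(4, 16) for _ in range(8)]

observed = {}  # module name -> (running min, running max) of its activations

def make_observer(name):
    def hook(module, inputs, output):
        lo, hi = output.min().item(), output.max().item()
        old_lo, old_hi = observed.get(name, (float("inf"), float("-inf")))
        observed[name] = (min(old_lo, lo), max(old_hi, hi))
    return hook

handles = [
    module.register_forward_hook(make_observer(name))
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
]

# Forward passes over the calibration set; the hooks record activation ranges.
with torch.no_grad():
    for batch in calibration_data:
        model(batch)

for handle in handles:
    handle.remove()

# Each observed [a, b] range can then be mapped to a scale/zero-point
# as in the affine-quantization sketch above.
print(observed)
```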
In general, it is best practice to start your experiments with:
why?
Why is it best practice?
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
* test forward (vllm-project#16)
* test frozen (vllm-project#17)
* test frozen
* rename
* lifecycle conftest (vllm-project#21)
* test initalize (vllm-project#18)
* test initalize
* newline
* parametrize weights and inp_act
* remove dup
* test lifecycle (vllm-project#19)
* test lifecycle
* comments
* comments
* add quantization test
* Lifecycle/min max obs (vllm-project#20)
* min max test
* add minmax obs
* test scale range and min_max update
* rebase
* rebase
* fix
* fix