
Commit 02a108a

Describe weights compression option in the documentation
1 parent 786c6e5 commit 02a108a

File tree

1 file changed: 9 additions, 1 deletion


use_with_openvino.md

Lines changed: 9 additions & 1 deletion
````diff
@@ -52,7 +52,7 @@ python3 benchmark_serving.py --backend openai --endpoint /v1/completions --port
 ```
 
 
-## Use vLLM offline
+## Use vLLM offline
 
 
 _All below steps assume you are in `vllm` root directory._
 
@@ -82,3 +82,11 @@ docker run --rm -it --entrypoint python3 -v $HOME/.cache/huggingface:/root/.cach
 # --num-prompts <number of requests to send> (default: 1000)
 # --swap-space <GiB for KV cache> (default: 50)
 ```
+
+## Use Int-8 Weights Compression
+
+Int-8 weights compression is disabled by default. For better performance and lower memory consumption, enable it by setting the environment variable `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1`.
+To pass the variable in docker, add `-e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1` as an additional argument to the `docker run` command in the examples above.
+
+The variable enables the weights compression logic described in [optimum-intel 8-bit weights quantization](https://huggingface.co/docs/optimum/intel/optimization_ov#8-bit).
+Even when the variable is set, compression is applied only to models above a certain size; smaller models are left uncompressed to avoid a significant accuracy drop.
````
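The environment-variable switch described in the added section can be sketched as follows. This is a minimal illustration, not from the commit itself; the docker image name in the comment is a hypothetical placeholder.

```shell
# Enable int-8 weights compression for vLLM with OpenVINO
# (disabled by default, per the documentation added above).
export VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1

# When running in docker, pass the variable with -e instead; the image
# name below is an illustrative placeholder:
#   docker run --rm -e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1 <vllm-openvino-image> ...

# Confirm the variable is visible to child processes such as vLLM.
echo "$VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS"
```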

0 commit comments
