Commit f35263f

Merge pull request vllm-project#25 from slyalin/disable_int8
Disable weight compression on optimum-intel conversion path
2 parents f73cfd2 + 02a108a

2 files changed: 11 additions, 1 deletion

use_with_openvino.md

Lines changed: 9 additions & 1 deletion
````diff
@@ -52,7 +52,7 @@ python3 benchmark_serving.py --backend openai --endpoint /v1/completions --port
 ```
 
 
-## Use vLLM offline
+## Use vLLM offline
 
 _All below steps assume you are in `vllm` root directory._
 
@@ -82,3 +82,11 @@ docker run --rm -it --entrypoint python3 -v $HOME/.cache/huggingface:/root/.cach
 # --num-prompts <number of requests to send> (default: 1000)
 # --swap-space <GiB for KV cache> (default: 50)
 ```
+
+## Use Int-8 Weights Compression
+
+Int-8 weight compression is disabled by default. For better performance and lower memory consumption, weight compression can be enabled by setting the environment variable `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1`.
+To pass the variable in Docker, add `-e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1` as an additional argument to the `docker run` commands in the examples above.
+
+The variable enables the weight compression logic described in [optimum-intel 8-bit weights quantization](https://huggingface.co/docs/optimum/intel/optimization_ov#8-bit).
+Hence, even when the variable is set, compression is applied only to models above a certain size; smaller models are left uncompressed to avoid a significant accuracy drop.
````
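For context on how this doc change is meant to be used offline: below is a minimal sketch, assuming a vLLM build with the OpenVINO backend as described in `use_with_openvino.md`; the model id and generation settings are illustrative and not part of this commit.

```python
import os

# The variable must be set before the model is loaded, because the
# compression decision is made during optimum-intel model conversion.
os.environ["VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS"] = "1"

from vllm import LLM, SamplingParams  # assumes the OpenVINO-enabled build

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # illustrative model id
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```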

vllm/model_executor/openvino_model_loader.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -599,10 +599,12 @@ def get_model(model_config: ModelConfig,
     else:
         print(f'[ INFO ] OpenVINO IR is available for provided model id {model_config.model}. '
               'This IR will be used for inference as-is, all possible options that may affect model conversion are ignored.')
+    load_in_8bit = None if os.environ.get('VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS', '0') == '1' else False
     pt_model = OVModelForCausalLM.from_pretrained(
         model_config.model,
         export=export,
         compile=False,
+        load_in_8bit=load_in_8bit,
         trust_remote_code=model_config.trust_remote_code
     )
     patch_stateful_model(pt_model.model, kv_cache_dtype, device_config.device.type == "cpu")
```
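Note the tri-state here: `load_in_8bit=None` defers to optimum-intel's own default, which compresses only models above its internal size threshold, while `False` disables compression unconditionally; that is why the enabled variable maps to `None` rather than `True`. A standalone sketch of the equivalent optimum-intel call, with a placeholder model id:

```python
from optimum.intel import OVModelForCausalLM

# None  -> optimum-intel decides: int-8 weight compression is applied
#          only to models above its internal size threshold.
# False -> compression is disabled outright (the default in this commit).
model = OVModelForCausalLM.from_pretrained(
    "gpt2",             # placeholder model id
    export=True,        # convert from the original checkpoint
    compile=False,
    load_in_8bit=None,  # as when VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1
)
```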
