
Running vLLM service benchmark (1x ARC770) with Qwen1.5-14B-Chat model failed (compression weight: SYM_INT4) #12087

Open
dukelee111 opened this issue Sep 18, 2024 · 3 comments

Comments

@dukelee111

Environment:
Platform: 6548N + 1x ARC770
Docker Image: [screenshot]
Serving script: [screenshot]

Error info:
1. Fails with compression weight SYM_INT4.
2. Tried "gpu-memory-utilization" values from 0.65 to 0.95 in steps of 0.05; none of them worked (a sketch of this sweep follows the error log).

Error log:
1. Serving-side error log: [screenshot]
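For reference, a minimal sketch of the sweep described in item 2, assuming the same serve flags as the reproduction script later in this thread; the wait time, test prompt, and max_tokens are placeholders, not part of the original report:

#!/bin/bash
# Restart the server with each gpu-memory-utilization value and probe it with a
# single completion request. Flags mirror the serve script below; the sleep
# duration and probe payload are assumptions.
model="/llm/models/Qwen1.5-14B-Chat/"
for util in 0.65 0.70 0.75 0.80 0.85 0.90 0.95; do
  echo "=== gpu-memory-utilization=$util ==="
  python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    --served-model-name Qwen1.5-14B-Chat \
    --port 8001 \
    --model $model \
    --trust-remote-code \
    --gpu-memory-utilization $util \
    --device xpu \
    --dtype float16 \
    --enforce-eager \
    --load-in-low-bit sym_int4 \
    --max-model-len 2048 &
  server_pid=$!
  sleep 180   # give the weights time to load and convert
  # an HTTP 200 here means the server is up and serving at this utilization value
  curl -s -o /dev/null -w "completion HTTP %{http_code}\n" \
    http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen1.5-14B-Chat", "prompt": "San Francisco is a", "max_tokens": 32}'
  kill $server_pid
  wait $server_pid 2>/dev/null
done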

@hzjane hzjane self-assigned this Sep 18, 2024
@hzjane
Contributor

hzjane commented Sep 18, 2024

I could not reproduce this error. Did you encounter it when starting vLLM, or while running the benchmark?

@dukelee111
Author

Starting vLLM succeeds; the error shows up when running the benchmark.
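The benchmark itself is not shown in this issue, so as a rough stand-in, a concurrent-request loop like the following can be used to put the running server under load and surface request-time failures; the concurrency level, prompt, and max_tokens are arbitrary assumptions, and the endpoint and payload mirror the curl example later in this thread:

#!/bin/bash
# Hypothetical minimal load test: fire CONCURRENCY completion requests in parallel
# against the running server. This is not the reporter's actual benchmark script.
CONCURRENCY=16
for i in $(seq 1 $CONCURRENCY); do
  curl -s http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen1.5-14B-Chat", "prompt": "San Francisco is a", "max_tokens": 128}' \
    > /dev/null &
done
wait
echo "all $CONCURRENCY requests completed"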

@ACupofAir
Contributor

Cannot reproduce
Steps:

  1. Start the Docker container:
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
export CONTAINER_NAME=junwang-vllm54-issue220

docker rm -f $CONTAINER_NAME
sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --name=$CONTAINER_NAME \
        -v /home/intel/LLM:/llm/models/ \
        -v /home/intel/junwang:/workspace \
        -e no_proxy=localhost,127.0.0.1 \
        --shm-size="16g" \
        $DOCKER_IMAGE
  2. Start the server:
#!/bin/bash
model="/llm/models/Qwen1.5-14B-Chat/"
served_model_name="Qwen1.5-14B-Chat"

export no_proxy=localhost,127.0.0.1

source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8001 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4096 \
  -tp 1 \
  --max-num-seqs 64
  #-tp 2 #--enable-prefix-caching --enable-chunked-prefill #--tokenizer-pool-size 8 --swap-space 8
  3. Curl request:
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen1.5-14B-Chat",
        "prompt": "San Francisco is a",
        "max_tokens": 128
      }'
  4. Result
    1. Offline: [screenshot]
    2. Online: [screenshot]
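When one environment hits the failure and another does not, a quick check that the Arc GPU and XPU stack are visible inside the container can help narrow the difference. A minimal sketch, assuming the oneAPI tools set up by setvars.sh (as in the serve script above) are available; clinfo may not be present in every image:

#!/bin/bash
# List the devices the SYCL runtime sees inside the container and print the
# installed torch / intel_extension_for_pytorch versions for comparison.
source /opt/intel/oneapi/setvars.sh
sycl-ls
clinfo -l 2>/dev/null || echo "clinfo not installed"
python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"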
