
Running vLLM service benchmark (1x ARC770) with Qwen1.5-14B-Chat model failed (compression weight: SYM_INT4) #12087

Open
dukelee111 opened this issue Sep 18, 2024 · 3 comments

Comments

@dukelee111

Environment:
Platform: 6548N + 1x ARC770
Docker Image: [screenshot]
Serving script: [screenshot]

Error info:
1. Fails with compression weight SYM_INT4.
2. Tried "gpu-memory-utilization" values from 0.65 to 0.95 in steps of 0.05; none of them worked (a sketch of this sweep follows the error log).

Error log:
1. Serving-side error log: [screenshot]
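For reference, a minimal sketch of the sweep described in item 2, assuming the same serve flags as the reproduction script later in this thread; the wait time, test prompt, and max_tokens are placeholders, not part of the original report:

#!/bin/bash
# Restart the server with each gpu-memory-utilization value and probe it with a
# single completion request. Flags mirror the serve script below; the sleep
# duration and probe payload are assumptions.
model="/llm/models/Qwen1.5-14B-Chat/"
for util in 0.65 0.70 0.75 0.80 0.85 0.90 0.95; do
  echo "=== gpu-memory-utilization=$util ==="
  python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    --served-model-name Qwen1.5-14B-Chat \
    --port 8001 \
    --model $model \
    --trust-remote-code \
    --gpu-memory-utilization $util \
    --device xpu \
    --dtype float16 \
    --enforce-eager \
    --load-in-low-bit sym_int4 \
    --max-model-len 2048 &
  server_pid=$!
  sleep 180   # give the weights time to load and convert
  # an HTTP 200 here means the server is up and serving at this utilization value
  curl -s -o /dev/null -w "completion HTTP %{http_code}\n" \
    http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen1.5-14B-Chat", "prompt": "San Francisco is a", "max_tokens": 32}'
  kill $server_pid
  wait $server_pid 2>/dev/null
done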

@hzjane hzjane self-assigned this Sep 18, 2024
@hzjane
Contributor

hzjane commented Sep 18, 2024

I could not reproduce this error. Did you encounter it when starting vLLM, or while running the benchmark?

@dukelee111
Author

Starting vLLM succeeds; the error shows up when running the benchmark.
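The benchmark itself is not shown in this issue, so as a rough stand-in, a concurrent-request loop like the following can be used to put the running server under load and surface request-time failures; the concurrency level, prompt, and max_tokens are arbitrary assumptions, and the endpoint and payload mirror the curl example later in this thread:

#!/bin/bash
# Hypothetical minimal load test: fire CONCURRENCY completion requests in parallel
# against the running server. This is not the reporter's actual benchmark script.
CONCURRENCY=16
for i in $(seq 1 $CONCURRENCY); do
  curl -s http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen1.5-14B-Chat", "prompt": "San Francisco is a", "max_tokens": 128}' \
    > /dev/null &
done
wait
echo "all $CONCURRENCY requests completed"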

@ACupofAir
Contributor

Cannot reproduce
Steps:

  1. Start the Docker container:
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
export CONTAINER_NAME=junwang-vllm54-issue220

docker rm -f $CONTAINER_NAME
sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --name=$CONTAINER_NAME \
        -v /home/intel/LLM:/llm/models/ \
        -v /home/intel/junwang:/workspace \
        -e no_proxy=localhost,127.0.0.1 \
        --shm-size="16g" \
        $DOCKER_IMAGE
  2. Start the server:
#!/bin/bash
model="/llm/models/Qwen1.5-14B-Chat/"
served_model_name="Qwen1.5-14B-Chat"

export no_proxy=localhost,127.0.0.1

source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8001 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4096 \
  -tp 1 \
  --max-num-seqs 64
  #-tp 2 #--enable-prefix-caching --enable-chunked-prefill #--tokenizer-pool-size 8 --swap-space 8
  3. Curl request:
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen1.5-14B-Chat",
        "prompt": "San Francisco is a",
        "max_tokens": 128
      }'
  4. Result
    1. Offline: [screenshot]
    2. Online: [screenshot]
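When one environment hits the failure and another does not, a quick check that the Arc GPU and XPU stack are visible inside the container can help narrow the difference. A minimal sketch, assuming the oneAPI tools set up by setvars.sh (as in the serve script above) are available; clinfo may not be present in every image:

#!/bin/bash
# List the devices the SYCL runtime sees inside the container and print the
# installed torch / intel_extension_for_pytorch versions for comparison.
source /opt/intel/oneapi/setvars.sh
sycl-ls
clinfo -l 2>/dev/null || echo "clinfo not installed"
python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"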
