LLaMA-33B failure with vLLM 0.5.4 Docker on 4 ARC GPUs #12079

Open
oldmikeyang opened this issue Sep 14, 2024 · 0 comments
oldmikeyang commented Sep 14, 2024

The vLLM Docker image is:
intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1

The vLLM start command is:

```bash
model="/llm/models/meta-llama/LLaMA-33B-HF/"
served_model_name="LLaMA-33B-HF"

source /opt/intel/1ccl-wks/setvars.sh

export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.8 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 2048 \
  --max-num-batched-tokens 3000 \
  --max-num-seqs 16 \
  -tp 4 --disable-log-requests
```
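For context, the `POST /v1/completions` requests in the log below were sent against the server's OpenAI-compatible endpoint. A minimal client sketch (the prompt and `max_tokens` value are illustrative, not the exact requests that triggered the failure):

```python
# Build a completion request against the OpenAI-compatible endpoint
# exposed by the api_server started above. Payload values are illustrative.
import json
import urllib.request

payload = {
    "model": "LLaMA-33B-HF",       # must match --served-model-name
    "prompt": "Hello, my name is",
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(req.get_method(), req.full_url)

# Uncomment once the server is up to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```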

The error information is:

```
INFO 09-14 10:32:15 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%.
INFO 09-14 10:32:41 metrics.py:406] Avg prompt throughput: 38.5 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.9%, CPU KV cache usage: 0.0%.
INFO: 127.0.0.1:51180 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50250 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50258 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50266 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50270 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50278 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50280 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50286 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50292 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 09-14 10:32:47 metrics.py:406] Avg prompt throughput: 242.2 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
(WrapperWithLoadBit pid=3547) GPU-Xeon4410Y-ARC770:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen
(WrapperWithLoadBit pid=3547) 2024-09-14 10:25:54,116 - INFO - Loading model weights took 4.1510 GB [repeated 2x across cluster]
(WrapperWithLoadBit pid=3547) [1726281167.063876801] GPU-Xeon4410Y-ARC770:rank1.perWithLoadBit.execute_method: Reading from remote process' memory failed. Disabling CMA support
(WrapperWithLoadBit pid=3989) WARNING 09-14 10:26:11 utils.py:564] Pin memory is not supported on XPU. [repeated 2x across cluster]
INFO 09-14 10:33:01 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff6f90abdfc17b3821298007f801000000 Worker ID: f3c6ea8e49dbf827e58908a85281342ea5f7a9646c64d71ddeca2031 Node ID: bda0a76065dd020d4eefe01b3bb9e7d4de06e10d3749bee600090abf Worker IP address: 10.240.108.91 Worker port: 46219 Worker PID: 3768 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(WrapperWithLoadBit pid=3989) [1726281167.063911306] GPU-Xeon4410Y-ARC770:rank3.perWithLoadBit.execute_method: Reading from remote process' memory failed. Disabling CMA support [repeated 2x across cluster]
INFO 09-14 10:33:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffb7844226a5b8a8d518279add01000000 Worker ID: 03229cf0580d5eeb7de0700c8093ec5229d30d4a8f2429b7091fdb6c Node ID: bda0a76065dd020d4eefe01b3bb9e7d4de06e10d3749bee600090abf Worker IP address: 10.240.108.91 Worker port: 37889 Worker PID: 3547 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
INFO 09-14 10:33:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff6fd757d1bddcdf4323576bee01000000 Worker ID: af76f74572b06d5def6df66c2a6b9c106a4cf0412259550313948806 Node ID: bda0a76065dd020d4eefe01b3bb9e7d4de06e10d3749bee600090abf Worker IP address: 10.240.108.91 Worker port: 38539 Worker PID: 3989 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
INFO 09-14 10:33:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
^CProcess ForkProcess-58:
```
