LLaMA-33B failure with vLLM 0.5.4 Docker on 4 ARC GPUs #12079

Open
oldmikeyang opened this issue Sep 14, 2024 · 0 comments
oldmikeyang commented Sep 14, 2024

The vLLM Docker image is:
intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1

The vLLM start command is:

```bash
model="/llm/models/meta-llama/LLaMA-33B-HF/"
served_model_name="LLaMA-33B-HF"

source /opt/intel/1ccl-wks/setvars.sh

export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.8 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 2048 \
  --max-num-batched-tokens 3000 \
  --max-num-seqs 16 \
  -tp 4 --disable-log-requests
```
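For context, the `POST /v1/completions` requests in the log below were sent against the server's OpenAI-compatible endpoint. A minimal client sketch (the prompt and `max_tokens` value are illustrative, not the exact requests that triggered the failure):

```python
# Build a completion request against the OpenAI-compatible endpoint
# exposed by the api_server started above. Payload values are illustrative.
import json
import urllib.request

payload = {
    "model": "LLaMA-33B-HF",       # must match --served-model-name
    "prompt": "Hello, my name is",
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(req.get_method(), req.full_url)

# Uncomment once the server is up to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```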

The error information is:

```
INFO 09-14 10:32:15 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%.
INFO 09-14 10:32:41 metrics.py:406] Avg prompt throughput: 38.5 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.9%, CPU KV cache usage: 0.0%.
INFO: 127.0.0.1:51180 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50250 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50258 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50266 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50270 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50278 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50280 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50286 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:50292 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 09-14 10:32:47 metrics.py:406] Avg prompt throughput: 242.2 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
(WrapperWithLoadBit pid=3547) GPU-Xeon4410Y-ARC770:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen
(WrapperWithLoadBit pid=3547) 2024-09-14 10:25:54,116 - INFO - Loading model weights took 4.1510 GB [repeated 2x across cluster]
(WrapperWithLoadBit pid=3547) [1726281167.063876801] GPU-Xeon4410Y-ARC770:rank1.perWithLoadBit.execute_method: Reading from remote process' memory failed. Disabling CMA support
(WrapperWithLoadBit pid=3989) WARNING 09-14 10:26:11 utils.py:564] Pin memory is not supported on XPU. [repeated 2x across cluster]
INFO 09-14 10:33:01 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff6f90abdfc17b3821298007f801000000 Worker ID: f3c6ea8e49dbf827e58908a85281342ea5f7a9646c64d71ddeca2031 Node ID: bda0a76065dd020d4eefe01b3bb9e7d4de06e10d3749bee600090abf Worker IP address: 10.240.108.91 Worker port: 46219 Worker PID: 3768 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(WrapperWithLoadBit pid=3989) [1726281167.063911306] GPU-Xeon4410Y-ARC770:rank3.perWithLoadBit.execute_method: Reading from remote process' memory failed. Disabling CMA support [repeated 2x across cluster]
INFO 09-14 10:33:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffb7844226a5b8a8d518279add01000000 Worker ID: 03229cf0580d5eeb7de0700c8093ec5229d30d4a8f2429b7091fdb6c Node ID: bda0a76065dd020d4eefe01b3bb9e7d4de06e10d3749bee600090abf Worker IP address: 10.240.108.91 Worker port: 37889 Worker PID: 3547 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
INFO 09-14 10:33:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff6fd757d1bddcdf4323576bee01000000 Worker ID: af76f74572b06d5def6df66c2a6b9c106a4cf0412259550313948806 Node ID: bda0a76065dd020d4eefe01b3bb9e7d4de06e10d3749bee600090abf Worker IP address: 10.240.108.91 Worker port: 38539 Worker PID: 3989 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
INFO 09-14 10:33:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
^CProcess ForkProcess-58:
```
