export CUDA_HOME=/usr/local/cuda-12.3
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu123"
pip install git+https://github.com/vllm-project/vllm.git --upgrade
export CUDA_VISIBLE_DEVICES=1
python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ --quantization awq --dtype auto --seed 1234 --tensor-parallel-size=1 --max-num-batched-tokens=66560 --max-log-len=100
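The server is then exercised through the OpenAI-compatible completions endpoint (the log below shows POST /v1/completions with temperature=0.0 and presence_penalty=0.14). A minimal client call along those lines, sketched here with an assumed prompt, max_tokens value, and stream flag (streaming is an assumption based on the StreamingResponse disconnect handling visible in the traceback), would be:

# hypothetical reproduction call; prompt and max_tokens are placeholders
curl http://localhost:5002/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
        "prompt": "Say hello.",
        "max_tokens": 128,
        "temperature": 0.0,
        "presence_penalty": 0.14,
        "stream": true
      }'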
Any request, even a simple one, leads to:
INFO 01-27 01:15:31 api_server.py:209] args: Namespace(host='0.0.0.0', port=5002, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='TheBloke/Mixtral-8x7B-Instru>
WARNING 01-27 01:15:31 config.py:176] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 01-27 01:15:31 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=>
INFO 01-27 01:15:33 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 01-27 01:17:50 llm_engine.py:316] # GPU blocks: 12486, # CPU blocks: 2048
INFO 01-27 01:17:51 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-27 01:17:51 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-27 01:18:04 model_runner.py:689] Graph capturing finished in 13 secs.
INFO 01-27 01:18:04 serving_chat.py:260] Using default chat template:
INFO 01-27 01:18:04 serving_chat.py:260] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'ass>
INFO: Started server process [276444]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5002 (Press CTRL+C to quit)
INFO 01-27 01:18:14 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:24 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:34 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:41 async_llm_engine.py:433] Received request cmpl-cd9d75c607614e7db704b01164bc0c83-0: prompt: None, prefix_pos: None,sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.14000000000000012, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early>
INFO: 52.0.25.199:43684 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-27 01:18:41 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 261, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 257, in wrap
await func()
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 234, in listen_for_disconnect
message = await receive()
File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 580, in receive
await self.message_event.wait()
File "/ephemeral/vllm/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fd214489990
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/aioprometheus/asgi/middleware.py", line 184, in __call__
await self.asgi_callable(scope, receive, wrapped_send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 762, in __call__
await self.middleware_stack(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 782, in app
await route.handle(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 254, in __call__
async with anyio.create_task_group() as task_group:
File "/ephemeral/vllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 01-27 01:18:46 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:51 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:56 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:01 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:06 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:11 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:16 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:21 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:26 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:31 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:33 async_llm_engine.py:112] Finished request cmpl-cd9d75c607614e7db704b01164bc0c83-0.
INFO 01-27 01:19:44 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:54 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:20:04 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:20:14 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:20:24 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%