
[Usage]: How to run FP8 inference #453

@warlock135


Your current environment

Version: v0.5.3.post1+Gaudi-1.18.0
Models: [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
Hardware: 8xHL-225

How would you like to use vllm

I'm trying to run FP8 inference on Meta-Llama-3-70B-Instruct using vLLM with INC (Intel Neural Compressor) quantization. The server launched successfully with the following command:

QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu
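
For reference, inference here goes through the server's OpenAI-compatible endpoint; a request of roughly the following shape (the prompt and sampling parameters are illustrative, not the exact request used) is enough to drive an engine iteration:

curl http://localhost:9002/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Meta-Llama-3-70B-Instruct", "prompt": "Hello, my name is", "max_tokens": 64}'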

However, as soon as inference started, vLLM reported the following error:

ERROR 11-03 05:27:49 async_llm_engine.py:671] Engine iteration timed out. This should never happen!
ERROR 11-03 05:27:49 async_llm_engine.py:56] Engine background task failed
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56]     done, _ = await asyncio.wait(
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 11-03 05:27:49 async_llm_engine.py:56]     return await _wait(fs, timeout, return_when, loop)
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 11-03 05:27:49 async_llm_engine.py:56]     await waiter
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 11-03 05:27:49 async_llm_engine.py:56]     return_value = task.result()
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 11-03 05:27:49 async_llm_engine.py:56]     self._do_exit(exc_type)
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 11-03 05:27:49 async_llm_engine.py:56]     raise asyncio.TimeoutError
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause

In addition, the warm-up phase with this setup took about 10 hours to complete.
What is the correct way to run FP8 inference with this vLLM fork?
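
For context on the warm-up time: the Gaudi fork compiles an HPU graph per prompt/decode bucket shape during warm-up, so the bucket ranges set above largely determine how long it takes. A hedged sketch of narrower bucket settings, using the environment variables documented in the fork's README (names and values should be verified against the installed version; the values below are illustrative):

# Illustrative only: narrow the bucket ranges so fewer HPU graph shapes
# are compiled during warm-up.
export VLLM_PROMPT_SEQ_BUCKET_MIN=1024
export VLLM_PROMPT_SEQ_BUCKET_STEP=1024
export VLLM_PROMPT_SEQ_BUCKET_MAX=6144
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024
export VLLM_DECODE_BLOCK_BUCKET_MAX=6144
# Debugging escape hatch documented by the fork: skip warm-up entirely
# (the first requests then pay the graph-compilation cost at serving time).
export VLLM_SKIP_WARMUP=true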

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
