
[Usage]: How to run FP8 inference #453

@warlock135


Your current environment

Version: v0.5.3.post1+Gaudi-1.18.0
Models: [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
Hardware: 8xHL-225

How would you like to use vllm

I'm trying to run FP8 inference on Meta-Llama-3-70B-Instruct using vLLM with INC (Intel Neural Compressor) quantization. The server launched successfully with the following command:

QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu
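
For reference, inference here goes through the server's OpenAI-compatible endpoint; a request of roughly the following shape (the prompt and sampling parameters are illustrative, not the exact request used) is enough to drive an engine iteration:

curl http://localhost:9002/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Meta-Llama-3-70B-Instruct", "prompt": "Hello, my name is", "max_tokens": 64}'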

However, as soon as inference started, vLLM reported the following error:

ERROR 11-03 05:27:49 async_llm_engine.py:671] Engine iteration timed out. This should never happen!
ERROR 11-03 05:27:49 async_llm_engine.py:56] Engine background task failed
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56]     done, _ = await asyncio.wait(
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 11-03 05:27:49 async_llm_engine.py:56]     return await _wait(fs, timeout, return_when, loop)
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 11-03 05:27:49 async_llm_engine.py:56]     await waiter
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 11-03 05:27:49 async_llm_engine.py:56]     return_value = task.result()
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 11-03 05:27:49 async_llm_engine.py:56]     self._do_exit(exc_type)
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 11-03 05:27:49 async_llm_engine.py:56]     raise asyncio.TimeoutError
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause

In addition, the warm-up phase with this setup took about 10 hours to complete.
What is the correct way to run FP8 inference with this vLLM fork?
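
For context on the warm-up time: the Gaudi fork compiles an HPU graph per prompt/decode bucket shape during warm-up, so the bucket ranges set above largely determine how long it takes. A hedged sketch of narrower bucket settings, using the environment variables documented in the fork's README (names and values should be verified against the installed version; the values below are illustrative):

# Illustrative only: narrow the bucket ranges so fewer HPU graph shapes
# are compiled during warm-up.
export VLLM_PROMPT_SEQ_BUCKET_MIN=1024
export VLLM_PROMPT_SEQ_BUCKET_STEP=1024
export VLLM_PROMPT_SEQ_BUCKET_MAX=6144
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024
export VLLM_DECODE_BLOCK_BUCKET_MAX=6144
# Debugging escape hatch documented by the fork: skip warm-up entirely
# (the first requests then pay the graph-compilation cost at serving time).
export VLLM_SKIP_WARMUP=true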

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
