
Commit a877540

wangxiyuan authored and afeldman-nm committed
[doc] format fix (vllm-project#10789)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
1 parent 7831672 commit a877540

2 files changed: +19 -19 lines changed

docs/source/automatic_prefix_caching/details.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ With this mapping, we can add another indirection in vLLM’s KV cache managemen
 This design achieves automatic prefix caching without the need of maintaining a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed by itself, which enables us to manages the KV cache as ordinary caches in operating system.


-# Generalized Caching Policy
+## Generalized Caching Policy

 Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
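As a rough sketch of the scheme this excerpt describes (the class and method names below are invented for illustration and are not vLLM's actual API): KV blocks live in a flat hash-keyed table, any block can be inserted or evicted independently, and eviction works like an ordinary OS cache.

```python
# Minimal sketch, not vLLM's implementation: KV blocks keyed by a hash of their
# token prefix, stored flat (no tree), evicted LRU-style when space runs out.
from collections import OrderedDict
from typing import Optional, Tuple


class KVBlockCache:
    """Hypothetical flat cache of KV blocks with LRU-style eviction."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self.blocks: "OrderedDict[int, bytes]" = OrderedDict()  # hash -> block data

    @staticmethod
    def block_hash(prefix_tokens: Tuple[int, ...]) -> int:
        # The key covers all tokens up to and including this block, so two
        # requests sharing a prefix (e.g. a system prompt) map to the same entry.
        return hash(prefix_tokens)

    def lookup(self, prefix_tokens: Tuple[int, ...]) -> Optional[bytes]:
        key = self.block_hash(prefix_tokens)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as recently used
            return self.blocks[key]
        return None

    def insert(self, prefix_tokens: Tuple[int, ...], block: bytes) -> None:
        key = self.block_hash(prefix_tokens)
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used block
```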

docs/source/getting_started/gaudi-installation.rst

Lines changed: 18 additions & 18 deletions
@@ -4,7 +4,7 @@ Installation with Intel® Gaudi® AI Accelerators
 This README provides instructions on running vLLM with Intel Gaudi devices.

 Requirements and Installation
-=============================
+-----------------------------

 Please follow the instructions provided in the `Gaudi Installation
 Guide <https://docs.habana.ai/en/latest/Installation_Guide/index.html>`__
@@ -13,7 +13,7 @@ please follow the methods outlined in the `Optimizing Training Platform
 Guide <https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html>`__.

 Requirements
-------------
+~~~~~~~~~~~~

 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
@@ -22,7 +22,7 @@ Requirements


 Quick start using Dockerfile
-----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. code:: console

 $ docker build -f Dockerfile.hpu -t vllm-hpu-env .
@@ -34,10 +34,10 @@ Quick start using Dockerfile


 Build from source
------------------
+~~~~~~~~~~~~~~~~~

 Environment verification
-~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^

 To verify that the Intel Gaudi software was correctly installed, run:

@@ -53,7 +53,7 @@ Verification <https://docs.habana.ai/en/latest/Installation_Guide/SW_Verificatio
 for more details.

 Run Docker Image
-~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^

 It is highly recommended to use the latest Docker image from Intel Gaudi
 vault. Refer to the `Intel Gaudi
@@ -68,7 +68,7 @@ Use the following commands to run a Docker image:
 $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

 Build and Install vLLM
-~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^

 To build and install vLLM from source, run:

@@ -90,7 +90,7 @@ Currently, the latest features and performance optimizations are developed in Ga


 Supported Features
-==================
+------------------

 - `Offline batched
 inference <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference>`__
@@ -107,15 +107,15 @@ Supported Features
 - Attention with Linear Biases (ALiBi)

 Unsupported Features
-====================
+--------------------

 - Beam search
 - LoRA adapters
 - Quantization
 - Prefill chunking (mixed-batch inferencing)

 Supported Configurations
-========================
+------------------------

 The following configurations have been validated to be function with
 Gaudi2 devices. Configurations that are not listed may or may not work.
@@ -152,10 +152,10 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
 with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling

 Performance Tuning
-==================
+------------------

 Execution modes
----------------
+~~~~~~~~~~~~~~~

 Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via ``PT_HPU_LAZY_MODE`` environment variable), and ``--enforce-eager`` flag.

@@ -184,7 +184,7 @@ Currently in vLLM for HPU we support four execution modes, depending on selected


 Bucketing mechanism
--------------------
+~~~~~~~~~~~~~~~~~~~

 Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. `Intel Gaudi Graph Compiler <https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime>`__ is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
 In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently it is achieved by "bucketing" model's forward pass across two dimensions - ``batch_size`` and ``sequence_length``.
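As a toy illustration of the bucketing behaviour described in this hunk (the bucket values below are made up and are not vLLM's defaults; the real boundaries are configurable), a request's batch size and sequence length are simply rounded up to the nearest bucket:

```python
# Toy sketch of bucketing: pad batch_size and sequence_length up to the nearest
# bucket so the compiled graphs only ever see a fixed set of shapes.
# Bucket lists below are illustrative, not vLLM's actual configuration.
from bisect import bisect_left

BATCH_BUCKETS = [1, 2, 4, 8, 16, 32, 64]
SEQ_LEN_BUCKETS = [128, 256, 384, 512, 1024, 2048]


def pad_to_bucket(value: int, buckets: list[int]) -> int:
    """Return the smallest bucket >= value (or the largest bucket if exceeded)."""
    idx = bisect_left(buckets, value)
    return buckets[min(idx, len(buckets) - 1)]


# A request of 3 sequences with max sequence length 412 (the example in the
# next hunk) would be padded to the (batch_size=4, sequence_length=512) bucket.
print(pad_to_bucket(3, BATCH_BUCKETS), pad_to_bucket(412, SEQ_LEN_BUCKETS))
```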
@@ -233,7 +233,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
 Bucketing is transparent to a client - padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.

 Warmup
-------
+~~~~~~

 Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:

@@ -257,7 +257,7 @@ This example uses the same buckets as in *Bucketing mechanism* section. Each out
 Compiling all the buckets might take some time and can be turned off with ``VLLM_SKIP_WARMUP=true`` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.

 HPU Graph capture
------------------
+~~~~~~~~~~~~~~~~~

 `HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__ are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.

@@ -321,7 +321,7 @@ Each described step is logged by vLLM server, as follows (negative values corres


 Recommended vLLM Parameters
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - We recommend running inference on Gaudi 2 with ``block_size`` of 128
 for BF16 data type. Using default values (16, 32) might lead to
@@ -333,7 +333,7 @@ Recommended vLLM Parameters
 If you encounter out-of-memory issues, see troubleshooting section.

 Environment variables
----------------------
+~~~~~~~~~~~~~~~~~~~~~

 **Diagnostic and profiling knobs:**

@@ -380,7 +380,7 @@ Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM
 - ``PT_HPU_ENABLE_LAZY_COLLECTIVES``: required to be ``true`` for tensor parallel inference with HPU Graphs

 Troubleshooting: Tweaking HPU Graphs
-====================================
+------------------------------------

 If you experience device out-of-memory issues or want to attempt
 inference at higher batch sizes, try tweaking HPU Graphs by following

0 commit comments
