docs/source/automatic_prefix_caching/details.md
Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ With this mapping, we can add another indirection in vLLM’s KV cache managemen
This design achieves automatic prefix caching without the need to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed individually, which enables us to manage the KV cache like an ordinary cache in an operating system.

-# Generalized Caching Policy
+## Generalized Caching Policy

Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
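As a rough illustration of the idea above, the following Python sketch keys each KV block by the full token prefix that ends at that block and keeps all blocks in one table. The block size, the `BlockCache` class, and the least-recently-used eviction rule are assumptions made for this example; they are not taken from vLLM's implementation.

```python
from collections import OrderedDict
from typing import List, Sequence, Tuple

BLOCK_SIZE = 4  # hypothetical number of tokens per KV block


def block_keys(tokens: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return one key per full block: every token up to and including that block."""
    return [
        tuple(tokens[:end])
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)
    ]


class BlockCache:
    """Toy table of KV blocks with LRU eviction (an assumed policy)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Maps a block key to a placeholder standing in for the KV data.
        self.blocks: "OrderedDict[Tuple[int, ...], object]" = OrderedDict()

    def lookup_or_allocate(self, key: Tuple[int, ...]) -> bool:
        """Return True on a cache hit, False if the block had to be allocated."""
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[key] = object()
        return False


system_prompt = list(range(8))  # two full blocks shared by both requests
cache = BlockCache(capacity=16)

first = [cache.lookup_or_allocate(k)
         for k in block_keys(system_prompt + [100, 101, 102, 103])]
second = [cache.lookup_or_allocate(k)
          for k in block_keys(system_prompt + [200, 201, 202, 203])]

print(first)   # [False, False, False]
print(second)  # [True, True, False]
```

The second request reuses the two blocks that cover the shared system prompt and only allocates a block for its own suffix. Because each key depends only on the tokens it covers, blocks can be inserted and evicted one at a time without tracking parent or child blocks.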
The following configurations have been validated to be functional with
Gaudi2 devices. Configurations that are not listed may or may not work.
@@ -152,10 +152,10 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling

Performance Tuning
-==================
+------------------

Execution modes
----------------
+~~~~~~~~~~~~~~~

Currently in vLLM for HPU we support four execution modes, depending on the selected HPU PyTorch Bridge backend (via the ``PT_HPU_LAZY_MODE`` environment variable) and the ``--enforce-eager`` flag.
@@ -184,7 +184,7 @@ Currently in vLLM for HPU we support four execution modes, depending on selected
Bucketing mechanism
--------------------
+~~~~~~~~~~~~~~~~~~~

Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. `Intel Gaudi Graph Compiler <https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime>`__ is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently this is achieved by "bucketing" the model's forward pass across two dimensions: ``batch_size`` and ``sequence_length``.
@@ -233,7 +233,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
Bucketing is transparent to a client: padding in the sequence length dimension is never returned to the client, and padding in the batch dimension does not create new requests.
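The padding step can be sketched in Python as follows. The bucket boundaries below are hypothetical values chosen for illustration, and `pick_bucket` and `pad_to_bucket` are helper names invented for this example rather than vLLM functions; only the request shape (3 sequences, maximum length 412) comes from the example in the text.

```python
import bisect
from typing import List, Tuple

# Hypothetical bucket boundaries; real values depend on the vLLM configuration.
BATCH_BUCKETS: List[int] = [1, 2, 4, 8]
SEQ_BUCKETS: List[int] = [128, 256, 384, 512, 640]


def pick_bucket(value: int, buckets: List[int]) -> int:
    """Return the smallest bucket that is greater than or equal to value."""
    idx = bisect.bisect_left(buckets, value)
    if idx == len(buckets):
        raise ValueError(f"{value} exceeds the largest bucket {buckets[-1]}")
    return buckets[idx]


def pad_to_bucket(batch_size: int, seq_len: int) -> Tuple[int, int]:
    """Map a request shape to the (batch, sequence) bucket it is padded to."""
    return pick_bucket(batch_size, BATCH_BUCKETS), pick_bucket(seq_len, SEQ_BUCKETS)


padded = pad_to_bucket(3, 412)
print(padded)  # (4, 512)
```

With these assumed boundaries, a request of 3 sequences with a maximum length of 412 runs in the (4, 512) bucket, so only shapes drawn from the bucket lists ever reach the graph compiler.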

Warmup
-------
+~~~~~~

Warmup is an optional but highly recommended step that occurs before the vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
@@ -257,7 +257,7 @@ This example uses the same buckets as in *Bucketing mechanism* section. Each out
Compiling all the buckets might take some time and can be turned off with the ``VLLM_SKIP_WARMUP=true`` environment variable. Keep in mind that if you do that, you may face graph compilations when executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
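A minimal Python sketch of the warmup idea is shown below. It assumes that every combination of batch and sequence bucket is visited once and that the skip flag is compared case-insensitively against ``true``; both are assumptions for illustration, and `run_warmup`, `warmup_disabled`, and `fake_forward` are hypothetical names, not vLLM APIs.

```python
import itertools
from typing import Callable, List, Mapping, Sequence, Tuple


def warmup_disabled(env: Mapping[str, str]) -> bool:
    """Assumed parsing rule: any capitalization of 'true' disables warmup."""
    return env.get("VLLM_SKIP_WARMUP", "false").lower() == "true"


def run_warmup(
    batch_buckets: Sequence[int],
    seq_buckets: Sequence[int],
    forward: Callable[[int, int], None],
    env: Mapping[str, str],
) -> List[Tuple[int, int]]:
    """Run one dummy forward pass per bucket and return the buckets visited."""
    if warmup_disabled(env):
        return []
    warmed = []
    for batch_size, seq_len in itertools.product(batch_buckets, seq_buckets):
        forward(batch_size, seq_len)  # would trigger graph compilation on a real device
        warmed.append((batch_size, seq_len))
    return warmed


def fake_forward(batch_size: int, seq_len: int) -> None:
    """Stand-in for a real forward pass on dummy data."""


warmed = run_warmup([1, 2], [128, 256], fake_forward, env={})
skipped = run_warmup([1, 2], [128, 256], fake_forward,
                     env={"VLLM_SKIP_WARMUP": "true"})
print(warmed)   # [(1, 128), (1, 256), (2, 128), (2, 256)]
print(skipped)  # []
```

When warmup is skipped, the first real request that lands in each bucket pays the compilation cost instead, which is why disabling it suits development better than deployment.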

HPU Graph capture
------------------
+~~~~~~~~~~~~~~~~~

`HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__ are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.
@@ -321,7 +321,7 @@ Each described step is logged by vLLM server, as follows (negative values corres
Recommended vLLM Parameters
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

- We recommend running inference on Gaudi 2 with ``block_size`` of 128
for BF16 data type. Using default values (16, 32) might lead to
@@ -333,7 +333,7 @@ Recommended vLLM Parameters
If you encounter out-of-memory issues, see the troubleshooting section.

Environment variables
----------------------
+~~~~~~~~~~~~~~~~~~~~~

**Diagnostic and profiling knobs:**
@@ -380,7 +380,7 @@ Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM
- ``PT_HPU_ENABLE_LAZY_COLLECTIVES``: required to be ``true`` for tensor parallel inference with HPU Graphs

Troubleshooting: Tweaking HPU Graphs
-====================================
+------------------------------------

If you experience device out-of-memory issues or want to attempt
inference at higher batch sizes, try tweaking HPU Graphs by following