+
+vLLM is a fast and easy-to-use library for LLM inference and serving.
+
+Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
+
+vLLM is fast with:
+
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
+- Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
+- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
+- Speculative decoding
+- Chunked prefill
+
+vLLM is flexible and easy to use with:
+
+- Seamless integration with popular HuggingFace models
+- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
+- Tensor parallelism and pipeline parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudiยฎ accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
+- Prefix caching support
+- Multi-lora support
+
+For more information, check out the following:
+
+- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
+- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
+- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
+- [vLLM Meetups][meetups]
diff --git a/docs/api/README.md b/docs/api/README.md
new file mode 100644
index 000000000000..5c7b2ca79ee2
--- /dev/null
+++ b/docs/api/README.md
@@ -0,0 +1,107 @@
+# Summary
+
+[](){ #configuration }
+
+## Configuration
+
+API documentation for vLLM's configuration classes.
+
+- [vllm.config.ModelConfig][]
+- [vllm.config.CacheConfig][]
+- [vllm.config.TokenizerPoolConfig][]
+- [vllm.config.LoadConfig][]
+- [vllm.config.ParallelConfig][]
+- [vllm.config.SchedulerConfig][]
+- [vllm.config.DeviceConfig][]
+- [vllm.config.SpeculativeConfig][]
+- [vllm.config.LoRAConfig][]
+- [vllm.config.PromptAdapterConfig][]
+- [vllm.config.MultiModalConfig][]
+- [vllm.config.PoolerConfig][]
+- [vllm.config.DecodingConfig][]
+- [vllm.config.ObservabilityConfig][]
+- [vllm.config.KVTransferConfig][]
+- [vllm.config.CompilationConfig][]
+- [vllm.config.VllmConfig][]
+
+[](){ #offline-inference-api }
+
+## Offline Inference
+
+LLM Class.
+
+- [vllm.LLM][]
+
+LLM Inputs.
+
+- [vllm.inputs.PromptType][]
+- [vllm.inputs.TextPrompt][]
+- [vllm.inputs.TokensPrompt][]
+
+## vLLM Engines
+
+Engine classes for offline and online inference.
+
+- [vllm.LLMEngine][]
+- [vllm.AsyncLLMEngine][]
+
+## Inference Parameters
+
+Inference parameters for vLLM APIs.
+
+[](){ #sampling-params }
+[](){ #pooling-params }
+
+- [vllm.SamplingParams][]
+- [vllm.PoolingParams][]
+
+[](){ #multi-modality }
+
+## Multi-Modality
+
+vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
+
+Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
+via the `multi_modal_data` field in [vllm.inputs.PromptType][].
+
+Looking to add your own multi-modal model? Please follow the instructions listed [here][supports-multimodal].
+
+- [vllm.multimodal.MULTIMODAL_REGISTRY][]
+
+### Inputs
+
+User-facing inputs.
+
+- [vllm.multimodal.inputs.MultiModalDataDict][]
+
+Internal data structures.
+
+- [vllm.multimodal.inputs.PlaceholderRange][]
+- [vllm.multimodal.inputs.NestedTensors][]
+- [vllm.multimodal.inputs.MultiModalFieldElem][]
+- [vllm.multimodal.inputs.MultiModalFieldConfig][]
+- [vllm.multimodal.inputs.MultiModalKwargsItem][]
+- [vllm.multimodal.inputs.MultiModalKwargs][]
+- [vllm.multimodal.inputs.MultiModalInputs][]
+
+### Data Parsing
+
+- [vllm.multimodal.parse][]
+
+### Data Processing
+
+- [vllm.multimodal.processing][]
+
+### Memory Profiling
+
+- [vllm.multimodal.profiling][]
+
+### Registry
+
+- [vllm.multimodal.registry][]
+
+## Model Development
+
+- [vllm.model_executor.models.interfaces_base][]
+- [vllm.model_executor.models.interfaces][]
+- [vllm.model_executor.models.adapters][]
diff --git a/docs/api/vllm/.meta.yml b/docs/api/vllm/.meta.yml
new file mode 100644
index 000000000000..c15adfec644c
--- /dev/null
+++ b/docs/api/vllm/.meta.yml
@@ -0,0 +1,2 @@
+search:
+ boost: 0.5
diff --git a/docs/source/assets/contributing/dockerfile-stages-dependency.png b/docs/assets/contributing/dockerfile-stages-dependency.png
similarity index 100%
rename from docs/source/assets/contributing/dockerfile-stages-dependency.png
rename to docs/assets/contributing/dockerfile-stages-dependency.png
diff --git a/docs/source/assets/deployment/anything-llm-chat-with-doc.png b/docs/assets/deployment/anything-llm-chat-with-doc.png
similarity index 100%
rename from docs/source/assets/deployment/anything-llm-chat-with-doc.png
rename to docs/assets/deployment/anything-llm-chat-with-doc.png
diff --git a/docs/source/assets/deployment/anything-llm-chat-without-doc.png b/docs/assets/deployment/anything-llm-chat-without-doc.png
similarity index 100%
rename from docs/source/assets/deployment/anything-llm-chat-without-doc.png
rename to docs/assets/deployment/anything-llm-chat-without-doc.png
diff --git a/docs/source/assets/deployment/anything-llm-provider.png b/docs/assets/deployment/anything-llm-provider.png
similarity index 100%
rename from docs/source/assets/deployment/anything-llm-provider.png
rename to docs/assets/deployment/anything-llm-provider.png
diff --git a/docs/source/assets/deployment/anything-llm-upload-doc.png b/docs/assets/deployment/anything-llm-upload-doc.png
similarity index 100%
rename from docs/source/assets/deployment/anything-llm-upload-doc.png
rename to docs/assets/deployment/anything-llm-upload-doc.png
diff --git a/docs/source/assets/deployment/architecture_helm_deployment.png b/docs/assets/deployment/architecture_helm_deployment.png
similarity index 100%
rename from docs/source/assets/deployment/architecture_helm_deployment.png
rename to docs/assets/deployment/architecture_helm_deployment.png
diff --git a/docs/source/assets/deployment/chatbox-chat.png b/docs/assets/deployment/chatbox-chat.png
similarity index 100%
rename from docs/source/assets/deployment/chatbox-chat.png
rename to docs/assets/deployment/chatbox-chat.png
diff --git a/docs/source/assets/deployment/chatbox-settings.png b/docs/assets/deployment/chatbox-settings.png
similarity index 100%
rename from docs/source/assets/deployment/chatbox-settings.png
rename to docs/assets/deployment/chatbox-settings.png
diff --git a/docs/source/assets/deployment/dify-chat.png b/docs/assets/deployment/dify-chat.png
similarity index 100%
rename from docs/source/assets/deployment/dify-chat.png
rename to docs/assets/deployment/dify-chat.png
diff --git a/docs/source/assets/deployment/dify-create-chatbot.png b/docs/assets/deployment/dify-create-chatbot.png
similarity index 100%
rename from docs/source/assets/deployment/dify-create-chatbot.png
rename to docs/assets/deployment/dify-create-chatbot.png
diff --git a/docs/source/assets/deployment/dify-settings.png b/docs/assets/deployment/dify-settings.png
similarity index 100%
rename from docs/source/assets/deployment/dify-settings.png
rename to docs/assets/deployment/dify-settings.png
diff --git a/docs/source/assets/deployment/open_webui.png b/docs/assets/deployment/open_webui.png
similarity index 100%
rename from docs/source/assets/deployment/open_webui.png
rename to docs/assets/deployment/open_webui.png
diff --git a/docs/source/assets/deployment/streamlit-chat.png b/docs/assets/deployment/streamlit-chat.png
similarity index 100%
rename from docs/source/assets/deployment/streamlit-chat.png
rename to docs/assets/deployment/streamlit-chat.png
diff --git a/docs/source/assets/design/arch_overview/entrypoints.excalidraw.png b/docs/assets/design/arch_overview/entrypoints.excalidraw.png
similarity index 100%
rename from docs/source/assets/design/arch_overview/entrypoints.excalidraw.png
rename to docs/assets/design/arch_overview/entrypoints.excalidraw.png
diff --git a/docs/source/assets/design/arch_overview/llm_engine.excalidraw.png b/docs/assets/design/arch_overview/llm_engine.excalidraw.png
similarity index 100%
rename from docs/source/assets/design/arch_overview/llm_engine.excalidraw.png
rename to docs/assets/design/arch_overview/llm_engine.excalidraw.png
diff --git a/docs/source/assets/design/hierarchy.png b/docs/assets/design/hierarchy.png
similarity index 100%
rename from docs/source/assets/design/hierarchy.png
rename to docs/assets/design/hierarchy.png
diff --git a/docs/source/assets/design/v1/metrics/intervals-1.png b/docs/assets/design/v1/metrics/intervals-1.png
similarity index 100%
rename from docs/source/assets/design/v1/metrics/intervals-1.png
rename to docs/assets/design/v1/metrics/intervals-1.png
diff --git a/docs/source/assets/design/v1/metrics/intervals-2.png b/docs/assets/design/v1/metrics/intervals-2.png
similarity index 100%
rename from docs/source/assets/design/v1/metrics/intervals-2.png
rename to docs/assets/design/v1/metrics/intervals-2.png
diff --git a/docs/source/assets/design/v1/metrics/intervals-3.png b/docs/assets/design/v1/metrics/intervals-3.png
similarity index 100%
rename from docs/source/assets/design/v1/metrics/intervals-3.png
rename to docs/assets/design/v1/metrics/intervals-3.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-1.png b/docs/assets/design/v1/prefix_caching/example-time-1.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-1.png
rename to docs/assets/design/v1/prefix_caching/example-time-1.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-3.png b/docs/assets/design/v1/prefix_caching/example-time-3.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-3.png
rename to docs/assets/design/v1/prefix_caching/example-time-3.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-4.png b/docs/assets/design/v1/prefix_caching/example-time-4.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-4.png
rename to docs/assets/design/v1/prefix_caching/example-time-4.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-5.png b/docs/assets/design/v1/prefix_caching/example-time-5.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-5.png
rename to docs/assets/design/v1/prefix_caching/example-time-5.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-6.png b/docs/assets/design/v1/prefix_caching/example-time-6.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-6.png
rename to docs/assets/design/v1/prefix_caching/example-time-6.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-7.png b/docs/assets/design/v1/prefix_caching/example-time-7.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-7.png
rename to docs/assets/design/v1/prefix_caching/example-time-7.png
diff --git a/docs/source/assets/design/v1/prefix_caching/free.png b/docs/assets/design/v1/prefix_caching/free.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/free.png
rename to docs/assets/design/v1/prefix_caching/free.png
diff --git a/docs/source/assets/design/v1/prefix_caching/overview.png b/docs/assets/design/v1/prefix_caching/overview.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/overview.png
rename to docs/assets/design/v1/prefix_caching/overview.png
diff --git a/docs/source/assets/features/disagg_prefill/abstraction.jpg b/docs/assets/features/disagg_prefill/abstraction.jpg
similarity index 100%
rename from docs/source/assets/features/disagg_prefill/abstraction.jpg
rename to docs/assets/features/disagg_prefill/abstraction.jpg
diff --git a/docs/source/assets/features/disagg_prefill/overview.jpg b/docs/assets/features/disagg_prefill/overview.jpg
similarity index 100%
rename from docs/source/assets/features/disagg_prefill/overview.jpg
rename to docs/assets/features/disagg_prefill/overview.jpg
diff --git a/docs/source/assets/kernel/k_vecs.png b/docs/assets/kernel/k_vecs.png
similarity index 100%
rename from docs/source/assets/kernel/k_vecs.png
rename to docs/assets/kernel/k_vecs.png
diff --git a/docs/source/assets/kernel/key.png b/docs/assets/kernel/key.png
similarity index 100%
rename from docs/source/assets/kernel/key.png
rename to docs/assets/kernel/key.png
diff --git a/docs/source/assets/kernel/logits_vec.png b/docs/assets/kernel/logits_vec.png
similarity index 100%
rename from docs/source/assets/kernel/logits_vec.png
rename to docs/assets/kernel/logits_vec.png
diff --git a/docs/source/assets/kernel/q_vecs.png b/docs/assets/kernel/q_vecs.png
similarity index 100%
rename from docs/source/assets/kernel/q_vecs.png
rename to docs/assets/kernel/q_vecs.png
diff --git a/docs/source/assets/kernel/query.png b/docs/assets/kernel/query.png
similarity index 100%
rename from docs/source/assets/kernel/query.png
rename to docs/assets/kernel/query.png
diff --git a/docs/source/assets/kernel/v_vec.png b/docs/assets/kernel/v_vec.png
similarity index 100%
rename from docs/source/assets/kernel/v_vec.png
rename to docs/assets/kernel/v_vec.png
diff --git a/docs/source/assets/kernel/value.png b/docs/assets/kernel/value.png
similarity index 100%
rename from docs/source/assets/kernel/value.png
rename to docs/assets/kernel/value.png
diff --git a/docs/source/assets/logos/vllm-logo-only-light.ico b/docs/assets/logos/vllm-logo-only-light.ico
similarity index 100%
rename from docs/source/assets/logos/vllm-logo-only-light.ico
rename to docs/assets/logos/vllm-logo-only-light.ico
diff --git a/docs/source/assets/logos/vllm-logo-only-light.png b/docs/assets/logos/vllm-logo-only-light.png
similarity index 100%
rename from docs/source/assets/logos/vllm-logo-only-light.png
rename to docs/assets/logos/vllm-logo-only-light.png
diff --git a/docs/source/assets/logos/vllm-logo-text-dark.png b/docs/assets/logos/vllm-logo-text-dark.png
similarity index 100%
rename from docs/source/assets/logos/vllm-logo-text-dark.png
rename to docs/assets/logos/vllm-logo-text-dark.png
diff --git a/docs/source/assets/logos/vllm-logo-text-light.png b/docs/assets/logos/vllm-logo-text-light.png
similarity index 100%
rename from docs/source/assets/logos/vllm-logo-text-light.png
rename to docs/assets/logos/vllm-logo-text-light.png
diff --git a/docs/source/community/meetups.md b/docs/community/meetups.md
similarity index 98%
rename from docs/source/community/meetups.md
rename to docs/community/meetups.md
index aa1a71c86c0a..8ea42e3cad18 100644
--- a/docs/source/community/meetups.md
+++ b/docs/community/meetups.md
@@ -1,6 +1,7 @@
-(meetups)=
-
-# vLLM Meetups
+---
+title: Meetups
+---
+[](){ #meetups }
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
diff --git a/docs/source/community/sponsors.md b/docs/community/sponsors.md
similarity index 100%
rename from docs/source/community/sponsors.md
rename to docs/community/sponsors.md
diff --git a/docs/configuration/README.md b/docs/configuration/README.md
new file mode 100644
index 000000000000..6a8fbc79f4af
--- /dev/null
+++ b/docs/configuration/README.md
@@ -0,0 +1,9 @@
+# Configuration Options
+
+This section lists the most common options for running vLLM.
+
+There are three main levels of configuration, from highest priority to lowest priority:
+
+- [Request parameters][completions-api] and [input arguments][sampling-params]
+- [Engine arguments](./engine_args.md)
+- [Environment variables](./env_vars.md)
diff --git a/docs/configuration/conserving_memory.md b/docs/configuration/conserving_memory.md
new file mode 100644
index 000000000000..a1283a503a6d
--- /dev/null
+++ b/docs/configuration/conserving_memory.md
@@ -0,0 +1,144 @@
+# Conserving Memory
+
+Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
+
+## Tensor Parallelism (TP)
+
+Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
+
+The following code splits the model across 2 GPUs.
+
+```python
+from vllm import LLM
+
+llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
+ tensor_parallel_size=2)
+```
+
+!!! warning
+ To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
+ before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
+
+ To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
+
+!!! note
+ With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
+
+ You can convert the model checkpoint to a sharded checkpoint using . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+
+## Quantization
+
+Quantized models take less memory at the cost of lower precision.
+
+Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
+and used directly without extra configuration.
+
+Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
+
+## Context length and batch size
+
+You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
+and the maximum batch size (`max_num_seqs` option).
+
+```python
+from vllm import LLM
+
+llm = LLM(model="adept/fuyu-8b",
+ max_model_len=2048,
+ max_num_seqs=2)
+```
+
+## Reduce CUDA Graphs
+
+By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
+
+!!! warning
+ CUDA graph capture takes up more memory in V1 than in V0.
+
+You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
+
+```python
+from vllm import LLM
+from vllm.config import CompilationConfig, CompilationLevel
+
+llm = LLM(
+ model="meta-llama/Llama-3.1-8B-Instruct",
+ compilation_config=CompilationConfig(
+ level=CompilationLevel.PIECEWISE,
+ # By default, it goes up to max_num_seqs
+ cudagraph_capture_sizes=[1, 2, 4, 8, 16],
+ ),
+)
+```
+
+You can disable graph capturing completely via the `enforce_eager` flag:
+
+```python
+from vllm import LLM
+
+llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
+ enforce_eager=True)
+```
+
+## Adjust cache size
+
+If you run out of CPU RAM, try the following options:
+
+- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
+- (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
+
+## Multi-modal input limits
+
+You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
+
+```python
+from vllm import LLM
+
+# Accept up to 3 images and 1 video per prompt
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+ limit_mm_per_prompt={"image": 3, "video": 1})
+```
+
+You can go a step further and disable unused modalities completely by setting its limit to zero.
+For example, if your application only accepts image input, there is no need to allocate any memory for videos.
+
+```python
+from vllm import LLM
+
+# Accept any number of images but no videos
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+ limit_mm_per_prompt={"video": 0})
+```
+
+You can even run a multi-modal model for text-only inference:
+
+```python
+from vllm import LLM
+
+# Don't accept images. Just text.
+llm = LLM(model="google/gemma-3-27b-it",
+ limit_mm_per_prompt={"image": 0})
+```
+
+## Multi-modal processor arguments
+
+For certain models, you can adjust the multi-modal processor arguments to
+reduce the size of the processed multi-modal inputs, which in turn saves memory.
+
+Here are some examples:
+
+```python
+from vllm import LLM
+
+# Available for Qwen2-VL series models
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+ mm_processor_kwargs={
+ "max_pixels": 768 * 768, # Default is 1280 * 28 * 28
+ })
+
+# Available for InternVL series models
+llm = LLM(model="OpenGVLab/InternVL2-2B",
+ mm_processor_kwargs={
+ "max_dynamic_patch": 4, # Default is 12
+ })
+```
diff --git a/docs/configuration/engine_args.md b/docs/configuration/engine_args.md
new file mode 100644
index 000000000000..fb2689a56391
--- /dev/null
+++ b/docs/configuration/engine_args.md
@@ -0,0 +1,18 @@
+---
+title: Engine Arguments
+---
+[](){ #engine-args }
+
+Engine arguments control the behavior of the vLLM engine.
+
+- For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class.
+- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
+
+You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
+
+However, these classes are a combination of the configuration classes defined in [vllm.config][]. Therefore, we would recommend you read about them there where they are best documented.
+
+For offline inference you will have access to these configuration classes and for online serving you can cross-reference the configs with `vllm serve --help`, which has its arguments grouped by config.
+
+!!! note
+ Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine] which is used for online serving. These can be found by running `vllm serve --help`
diff --git a/docs/configuration/env_vars.md b/docs/configuration/env_vars.md
new file mode 100644
index 000000000000..f6d548a19d91
--- /dev/null
+++ b/docs/configuration/env_vars.md
@@ -0,0 +1,12 @@
+# Environment Variables
+
+vLLM uses the following environment variables to configure the system:
+
+!!! warning
+ Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
+
+ All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
+
+```python
+--8<-- "vllm/envs.py:env-vars-definition"
+```
diff --git a/docs/configuration/model_resolution.md b/docs/configuration/model_resolution.md
new file mode 100644
index 000000000000..8757c257d3e9
--- /dev/null
+++ b/docs/configuration/model_resolution.md
@@ -0,0 +1,23 @@
+# Model Resolution
+
+vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
+and finding the corresponding implementation that is registered to vLLM.
+Nevertheless, our model resolution may fail for the following reasons:
+
+- The `config.json` of the model repository lacks the `architectures` field.
+- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
+- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
+
+To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
+For example:
+
+```python
+from vllm import LLM
+
+model = LLM(
+ model="cerebras/Cerebras-GPT-1.3B",
+ hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
+)
+```
+
+Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
diff --git a/docs/source/performance/optimization.md b/docs/configuration/optimization.md
similarity index 99%
rename from docs/source/performance/optimization.md
rename to docs/configuration/optimization.md
index 4160f0784962..811925c19e63 100644
--- a/docs/source/performance/optimization.md
+++ b/docs/configuration/optimization.md
@@ -1,5 +1,3 @@
-(optimization-and-tuning)=
-
# Optimization and Tuning
This guide covers optimization strategies and performance tuning for vLLM V1.
@@ -26,7 +24,7 @@ You can monitor the number of preemption requests through Prometheus metrics exp
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
-(chunked-prefill)=
+[](){ #chunked-prefill }
## Chunked Prefill
diff --git a/docs/configuration/serve_args.md b/docs/configuration/serve_args.md
new file mode 100644
index 000000000000..16b4b29f45d9
--- /dev/null
+++ b/docs/configuration/serve_args.md
@@ -0,0 +1,38 @@
+---
+title: Server Arguments
+---
+[](){ #serve-args }
+
+The `vllm serve` command is used to launch the OpenAI-compatible server.
+
+## CLI Arguments
+
+The `vllm serve` command is used to launch the OpenAI-compatible server.
+To see the available CLI arguments, run `vllm serve --help`!
+
+## Configuration file
+
+You can load CLI arguments via a [YAML](https://yaml.org/) config file.
+The argument names must be the long form of those outlined [above][serve-args].
+
+For example:
+
+```yaml
+# config.yaml
+
+model: meta-llama/Llama-3.1-8B-Instruct
+host: "127.0.0.1"
+port: 6379
+uvicorn-log-level: "info"
+```
+
+To use the above config file:
+
+```bash
+vllm serve --config config.yaml
+```
+
+!!! note
+ In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
+ The order of priorities is `command line > config file values > defaults`.
+ e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
diff --git a/docs/source/contributing/overview.md b/docs/contributing/README.md
similarity index 84%
rename from docs/source/contributing/overview.md
rename to docs/contributing/README.md
index 89b31f0311e2..2517436afcc1 100644
--- a/docs/source/contributing/overview.md
+++ b/docs/contributing/README.md
@@ -16,9 +16,9 @@ Finally, one of the most impactful ways to support us is by raising awareness ab
Unsure on where to start? Check out the following links for tasks to work on:
- [Good first issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22)
- - [Selected onboarding tasks](gh-project:6)
+ - [Selected onboarding tasks](gh-project:6)
- [New model requests](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22new-model%22)
- - [Models with multi-modal capabilities](gh-project:10)
+ - [Models with multi-modal capabilities](gh-project:10)
## License
@@ -27,7 +27,21 @@ See .
## Developing
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
-Check out the [building from source](#build-from-source) documentation for details.
+Check out the [building from source][build-from-source] documentation for details.
+
+### Building the docs
+
+Install the dependencies:
+
+```bash
+pip install -r requirements/docs.txt
+```
+
+Start the autoreloading MkDocs server:
+
+```bash
+mkdocs serve
+```
## Testing
@@ -48,29 +62,25 @@ pre-commit run mypy-3.9 --hook-stage manual --all-files
pytest tests/
```
-:::{tip}
-Since the ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
+!!! tip
+ Since the ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
-Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
-:::
+ Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
-:::{note}
-Currently, the repository is not fully checked by `mypy`.
-:::
+!!! note
+ Currently, the repository is not fully checked by `mypy`.
-:::{note}
-Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
-platform to run unit tests locally, rely on the continuous integration system to run the tests for
-now.
-:::
+!!! note
+ Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
+ platform to run unit tests locally, rely on the continuous integration system to run the tests for
+ now.
## Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
-:::{important}
-If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
-:::
+!!! warning
+ If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
## Pull Requests & Code Reviews
@@ -106,9 +116,8 @@ appropriately to indicate the type of change. Please use one of the following:
- `[Misc]` for PRs that do not fit the above categories. Please use this
sparingly.
-:::{note}
-If the PR spans more than one category, please include all relevant prefixes.
-:::
+!!! note
+ If the PR spans more than one category, please include all relevant prefixes.
### Code Quality
@@ -121,9 +130,8 @@ The PR needs to meet the following code quality standards:
understand the code.
- Include sufficient tests to ensure the project stays correct and robust. This
includes both unit tests and integration tests.
-- Please add documentation to `docs/source/` if the PR modifies the
- user-facing behaviors of vLLM. It helps vLLM users understand and utilize the
- new features or changes.
+- Please add documentation to `docs/` if the PR modifies the user-facing behaviors of vLLM.
+ It helps vLLM users understand and utilize the new features or changes.
### Adding or Changing Kernels
diff --git a/docs/source/performance/benchmarks.md b/docs/contributing/benchmarks.md
similarity index 86%
rename from docs/source/performance/benchmarks.md
rename to docs/contributing/benchmarks.md
index 39dc470a1c70..00505fc6f2a9 100644
--- a/docs/source/performance/benchmarks.md
+++ b/docs/contributing/benchmarks.md
@@ -1,13 +1,14 @@
-(benchmarks)=
-
-# Benchmark Suites
+---
+title: Benchmark Suites
+---
+[](){ #benchmarks }
vLLM contains two sets of benchmarks:
-- [Performance benchmarks](#performance-benchmarks)
-- [Nightly benchmarks](#nightly-benchmarks)
+- [Performance benchmarks][performance-benchmarks]
+- [Nightly benchmarks][nightly-benchmarks]
-(performance-benchmarks)=
+[](){ #performance-benchmarks }
## Performance Benchmarks
@@ -17,7 +18,7 @@ The latest performance results are hosted on the public [vLLM Performance Dashbo
More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
-(nightly-benchmarks)=
+[](){ #nightly-benchmarks }
## Nightly Benchmarks
diff --git a/docs/source/contributing/deprecation_policy.md b/docs/contributing/deprecation_policy.md
similarity index 100%
rename from docs/source/contributing/deprecation_policy.md
rename to docs/contributing/deprecation_policy.md
diff --git a/docs/source/contributing/dockerfile/dockerfile.md b/docs/contributing/dockerfile/dockerfile.md
similarity index 82%
rename from docs/source/contributing/dockerfile/dockerfile.md
rename to docs/contributing/dockerfile/dockerfile.md
index 90b9a33cfbe6..a39f335c87b8 100644
--- a/docs/source/contributing/dockerfile/dockerfile.md
+++ b/docs/contributing/dockerfile/dockerfile.md
@@ -1,7 +1,7 @@
# Dockerfile
We provide a to construct the image for running an OpenAI compatible server with vLLM.
-More information about deploying with Docker can be found [here](#deployment-docker).
+More information about deploying with Docker can be found [here][deployment-docker].
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
@@ -17,18 +17,21 @@ The edges of the build graph represent:
- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)
- > :::{figure} /assets/contributing/dockerfile-stages-dependency.png
- > :align: center
- > :alt: query
- > :width: 100%
- > :::
+ >
+ > { align="center" alt="query" width="100%" }
+ >
>
> Made using:
>
> Commands to regenerate the build graph (make sure to run it **from the \`root\` directory of the vLLM repository** where the dockerfile is present):
>
> ```bash
- > dockerfilegraph -o png --legend --dpi 200 --max-label-length 50 --filename docker/Dockerfile
+ > dockerfilegraph \
+ > -o png \
+ > --legend \
+ > --dpi 200 \
+ > --max-label-length 50 \
+ > --filename docker/Dockerfile
> ```
>
> or in case you want to run it directly with the docker image:
diff --git a/docs/contributing/model/README.md b/docs/contributing/model/README.md
new file mode 100644
index 000000000000..b7727f02c11b
--- /dev/null
+++ b/docs/contributing/model/README.md
@@ -0,0 +1,23 @@
+---
+title: Adding a New Model
+---
+[](){ #new-model }
+
+This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
+
+Contents:
+
+- [Basic](basic.md)
+- [Registration](registration.md)
+- [Tests](tests.md)
+- [Multimodal](multimodal.md)
+
+!!! note
+ The complexity of adding a new model depends heavily on the model's architecture.
+ The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
+ However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
+
+!!! tip
+ If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
+ or ask on our [developer slack](https://slack.vllm.ai).
+ We will be happy to help you out!
diff --git a/docs/source/contributing/model/basic.md b/docs/contributing/model/basic.md
similarity index 82%
rename from docs/source/contributing/model/basic.md
rename to docs/contributing/model/basic.md
index ad31995f76be..0c0ba3379257 100644
--- a/docs/source/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
@@ -1,6 +1,7 @@
-(new-model-basic)=
-
-# Implementing a Basic Model
+---
+title: Implementing a Basic Model
+---
+[](){ #new-model-basic }
This guide walks you through the steps to implement a basic vLLM model.
@@ -10,9 +11,8 @@ First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
-:::{warning}
-Make sure to review and adhere to the original code's copyright and licensing terms!
-:::
+!!! warning
+ Make sure to review and adhere to the original code's copyright and licensing terms!
## 2. Make your code compatible with vLLM
@@ -67,7 +67,7 @@ class MyModel(nn.Module):
...
```
-- Rewrite the {meth}`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
+- Rewrite the [forward][torch.nn.Module.forward] method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
```python
def forward(
@@ -78,10 +78,9 @@ def forward(
...
```
-:::{note}
-Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
-If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
-:::
+!!! note
+ Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
+ If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out for more examples.
@@ -89,7 +88,7 @@ For reference, check out our [Llama implementation](gh-file:vllm/model_executor/
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
-For the embedding layer, you can simply replace {class}`torch.nn.Embedding` with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
+For the embedding layer, you can simply replace [torch.nn.Embedding][] with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
When it comes to the linear layers, we provide the following options to parallelize them:
- `ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
@@ -107,7 +106,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a
## 5. Register your model
-See [this page](#new-model-registration) for instructions on how to register your new model to be used by vLLM.
+See [this page][new-model-registration] for instructions on how to register your new model to be used by vLLM.
## Frequently Asked Questions
@@ -117,7 +116,7 @@ For models with interleaving sliding windows (e.g. `google/gemma-2-2b-it` and `m
To support a model with interleaving sliding windows, we need to take care of the following details:
-- Make sure [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/config.py#L308) evaluates `has_interleaved_attention` to `True` for this model, and set `self.hf_text_config.interleaved_sliding_window` to the format of interleaving sliding windows the model can understand. Then, `self.hf_text_config.sliding_window` will be deleted, and the model will be treated as a full-attention model.
+- Make sure the model's `config.json` contains `sliding_window_pattern`. vLLM then sets `self.hf_text_config.interleaved_sliding_window` to the value of `self.hf_text_config.sliding_window` and deletes `sliding_window` from `self.hf_text_config`. The model will then be treated as a full-attention model.
- In the modeling code, parse the correct sliding window value for every layer, and pass it to the attention layer's `per_layer_sliding_window` argument. For reference, check [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/model_executor/models/llama.py#L171).
With these two steps, interleave sliding windows should work with the model.
diff --git a/docs/contributing/model/multimodal.md b/docs/contributing/model/multimodal.md
new file mode 100644
index 000000000000..892ab9098407
--- /dev/null
+++ b/docs/contributing/model/multimodal.md
@@ -0,0 +1,803 @@
+---
+title: Multi-Modal Support
+---
+[](){ #supports-multimodal }
+
+This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs][multimodal-inputs].
+
+## 1. Update the base vLLM model
+
+It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic].
+Further update the model as follows:
+
+- Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example:
+
+ ```diff
+ def forward(
+ self,
+ input_ids: torch.Tensor,
+ positions: torch.Tensor,
+ + pixel_values: torch.Tensor,
+ ) -> SamplerOutput:
+ ```
+
+ More conveniently, you can simply pass `**kwargs` to the [forward][torch.nn.Module.forward] method and retrieve the keyword parameters for multimodal inputs from it.
+
+- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
+
+ ```python
+ class YourModelForImage2Seq(nn.Module):
+ ...
+
+ def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
+
+ assert self.vision_encoder is not None
+ image_features = self.vision_encoder(image_input)
+ return self.multi_modal_projector(image_features)
+
+ def get_multimodal_embeddings(
+ self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
+
+ # Validate the multimodal input keyword arguments
+ image_input = self._parse_and_validate_image_input(**kwargs)
+ if image_input is None:
+ return None
+
+ # Run multimodal inputs through encoder and projector
+ vision_embeddings = self._process_image_input(image_input)
+ return vision_embeddings
+ ```
+
+!!! warning
+ The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request.
+
+- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
+
+ ```python
+ from .utils import merge_multimodal_embeddings
+
+ class YourModelForImage2Seq(nn.Module):
+ ...
+
+ def get_input_embeddings(
+ self,
+ input_ids: torch.Tensor,
+ multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
+ ) -> torch.Tensor:
+
+ # `get_input_embeddings` should already be implemented for the language
+ # model as one of the requirements of basic vLLM model implementation.
+ inputs_embeds = self.language_model.get_input_embeddings(input_ids)
+
+ if multimodal_embeddings is not None:
+ inputs_embeds = merge_multimodal_embeddings(
+ input_ids=input_ids,
+ inputs_embeds=inputs_embeds,
+ multimodal_embeddings=multimodal_embeddings,
+ placeholder_token_id=self.config.image_token_index)
+
+ return inputs_embeds
+ ```
+
+- Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model.
+
+ ```python
+ class YourModelForImage2Seq(nn.Module):
+ ...
+
+ def get_language_model(self) -> torch.nn.Module:
+ # Change `language_model` according to your implementation.
+ return self.language_model
+ ```
+
+- Once the above steps are done, update the model class with the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
+
+ ```diff
+ + from vllm.model_executor.models.interfaces import SupportsMultiModal
+
+ - class YourModelForImage2Seq(nn.Module):
+ + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
+ ```
+
+!!! note
+ The model class does not have to be named `*ForCausalLM`.
+ Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
+
+## 2. Specify processing information
+
+Next, create a subclass of [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo]
+to provide basic information related to HF processing.
+
+### Maximum number of input items
+
+You need to override the abstract method [get_supported_mm_limits][vllm.multimodal.processing.BaseProcessingInfo.get_supported_mm_limits]
+to return the maximum number of input items for each modality supported by the model.
+
+For example, if the model supports any number of images but only one video per prompt:
+
+```python
+def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
+ return {"image": None, "video": 1}
+```
+
+## 3. Specify dummy inputs
+
+Then, inherit [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] to construct dummy inputs for
+HF processing as well as memory profiling.
+
+### For memory profiling
+
+Override the abstract methods [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text] and [get_dummy_mm_data][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_mm_data] to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model so that vLLM can reserve the correct amount of memory for it.
+
+Assuming that the memory usage increases with the number of tokens, the dummy inputs can be constructed to maximize the number of output embeddings, which is the same number as placeholder feature tokens.
+
+=== "Basic example: LLaVA"
+
+ Looking at the code of HF's `LlavaForConditionalGeneration`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
+ n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
+ n_image_features = image_features.shape[0] * image_features.shape[1]
+
+ if n_image_tokens != n_image_features:
+ raise ValueError(
+ f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
+ )
+ special_image_mask = (
+ (input_ids == self.config.image_token_index)
+ .unsqueeze(-1)
+ .expand_as(inputs_embeds)
+ .to(inputs_embeds.device)
+ )
+ image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+ inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
+ ```
+
+ The number of placeholder feature tokens per image is `image_features.shape[1]`.
+ `image_features` is calculated inside the `get_image_features` method:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
+ image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
+
+ selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
+ if vision_feature_select_strategy == "default":
+ selected_image_feature = selected_image_feature[:, 1:]
+ elif vision_feature_select_strategy == "full":
+ selected_image_feature = selected_image_feature
+ else:
+ raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
+ image_features = self.multi_modal_projector(selected_image_feature)
+ return image_features
+ ```
+
+ We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
+ (`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
+ Moreover, we only need the sequence length (the second dimension of the tensor) to get `image_features.shape[1]`.
+ The sequence length is determined by the initial hidden states in `CLIPVisionTransformer` since the attention
+ mechanism doesn't change the sequence length of the output hidden states.
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
+ hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
+ hidden_states = self.pre_layrnorm(hidden_states)
+
+ encoder_outputs = self.encoder(
+ inputs_embeds=hidden_states,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+ ```
+
+ To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
+ target_dtype = self.patch_embedding.weight.dtype
+ patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
+ patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
+
+ class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+ embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+ if interpolate_pos_encoding:
+ embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
+ else:
+ embeddings = embeddings + self.position_embedding(self.position_ids)
+ return embeddings
+ ```
+
+ We can infer that `embeddings.shape[1] == self.num_positions`, where
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
+ self.num_patches = (self.image_size // self.patch_size) ** 2
+ self.num_positions = self.num_patches + 1
+ ```
+
+ Overall, the number of placeholder feature tokens for an image can be calculated as:
+
+ ```python
+ def get_num_image_tokens(
+ self,
+ *,
+ image_width: int,
+ image_height: int,
+ ) -> int:
+ hf_config = self.get_hf_config()
+ hf_processor = self.get_hf_processor()
+
+ image_size = hf_config.vision_config.image_size
+ patch_size = hf_config.vision_config.patch_size
+
+ num_image_tokens = (image_size // patch_size) ** 2 + 1
+ if hf_processor.vision_feature_select_strategy == "default":
+ num_image_tokens -= 1
+
+ return num_image_tokens
+ ```
+
+ Notice that the number of image tokens doesn't depend on the image width and height.
+ We can simply use a dummy `image_size` to calculate the multimodal profiling data:
+
+ ```python
+ # NOTE: In actuality, this is usually implemented as part of the
+ # model's subclass of `BaseProcessingInfo`, but we show it as is
+ # here for simplicity.
+ def get_image_size_with_most_features(self) -> ImageSize:
+ hf_config = self.get_hf_config()
+ width = height = hf_config.image_size
+ return ImageSize(width=width, height=height)
+
+ def get_dummy_mm_data(
+ self,
+ seq_len: int,
+ mm_counts: Mapping[str, int],
+ ) -> MultiModalDataDict:
+ num_images = mm_counts.get("image", 0)
+
+ target_width, target_height = \
+ self.info.get_image_size_with_most_features()
+
+ return {
+ "image":
+ self._get_dummy_images(width=target_width,
+ height=target_height,
+ num_images=num_images)
+ }
+ ```
+
+ For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
+
+ ```python
+ def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
+ num_images = mm_counts.get("image", 0)
+
+ processor = self.info.get_hf_processor()
+ image_token = processor.image_token
+
+ return image_token * num_images
+ ```
+
+=== "No input placeholders: Fuyu"
+
+ Looking at the code of HF's `FuyuForCausalLM`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
+ if image_patches is not None and past_key_values is None:
+ patch_embeddings = [
+ self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
+ .squeeze(0)
+ .to(inputs_embeds.device)
+ for patch in image_patches
+ ]
+ inputs_embeds = self.gather_continuous_embeddings(
+ word_embeddings=inputs_embeds,
+ continuous_embeddings=patch_embeddings,
+ image_patch_input_indices=image_patches_indices,
+ )
+ ```
+
+ The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
+ which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
+
+ Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information?
+ Considering that the model input comes from the output of `FuyuProcessor`, let's **look at the preprocessing files**.
+
+ The image outputs are obtained by calling `FuyuImageProcessor.preprocess` and then
+ `FuyuImageProcessor.preprocess_with_tokenizer_info` inside `FuyuProcessor`.
+
+ In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
+ returning the dimensions after resizing (but before padding) as metadata.
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
+ image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
+ batch_images = image_encoding["images"]
+ image_unpadded_heights = image_encoding["image_unpadded_heights"]
+ image_unpadded_widths = image_encoding["image_unpadded_widths"]
+
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
+ if do_resize:
+ batch_images = [
+ [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
+ for images in batch_images
+ ]
+
+ image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
+ image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
+ image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
+
+ if do_pad:
+ batch_images = [
+ [
+ self.pad_image(
+ image,
+ size=size,
+ mode=padding_mode,
+ constant_values=padding_value,
+ input_data_format=input_data_format,
+ )
+ for image in images
+ ]
+ for images in batch_images
+ ]
+ ```
+
+ In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
+ model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+ image_input=tensor_batch_images,
+ image_present=image_present,
+ image_unpadded_h=image_unpadded_heights,
+ image_unpadded_w=image_unpadded_widths,
+ image_placeholder_id=image_placeholder_id,
+ image_newline_id=image_newline_id,
+ variable_sized=True,
+ )
+
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
+ image_height, image_width = image.shape[1], image.shape[2]
+ if variable_sized: # variable_sized=True
+ new_h = min(
+ image_height,
+ math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
+ )
+ new_w = min(
+ image_width,
+ math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
+ )
+ image = image[:, :new_h, :new_w]
+ image_height, image_width = new_h, new_w
+
+ num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
+ tensor_of_image_ids = torch.full(
+ [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
+ )
+ patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
+ assert num_patches == patches.shape[0]
+ ```
+
+ The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
+ patch_size = patch_size if patch_size is not None else self.patch_size
+ patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
+
+ if image_height % patch_height != 0:
+ raise ValueError(f"{image_height=} must be divisible by {patch_height}")
+ if image_width % patch_width != 0:
+ raise ValueError(f"{image_width=} must be divisible by {patch_width}")
+
+ num_patches_per_dim_h = image_height // patch_height
+ num_patches_per_dim_w = image_width // patch_width
+ num_patches = num_patches_per_dim_h * num_patches_per_dim_w
+ ```
+
+ These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
+ to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
+
+ ```python
+ def get_image_size_with_most_features(self) -> ImageSize:
+ image_processor = self.get_image_processor()
+ return ImageSize(width=image_processor.size["width"],
+ height=image_processor.size["height"])
+ ```
+
+ Fuyu does not expect image placeholders in the inputs to HF processor, so
+ the dummy prompt text is empty regardless of the number of images.
+
+ ```python
+ def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
+ return ""
+ ```
+
+ For the multimodal image profiling data, the logic is very similar to LLaVA:
+
+ ```python
+ def get_dummy_mm_data(
+ self,
+ seq_len: int,
+ mm_counts: Mapping[str, int],
+ ) -> MultiModalDataDict:
+ target_width, target_height = \
+ self.info.get_image_size_with_most_features()
+ num_images = mm_counts.get("image", 0)
+
+ return {
+ "image":
+ self._get_dummy_images(width=target_width,
+ height=target_height,
+ num_images=num_images)
+ }
+ ```
+
+## 4. Specify processing details
+
+Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]
+to fill in the missing details about HF processing.
+
+!!! info
+ [Multi-Modal Data Processing][mm-processing]
+
+### Multi-modal fields
+
+Override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] to
+return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.
+
+=== "Basic example: LLaVA"
+
+ The output of `CLIPImageProcessor` is a simple tensor with shape
+ `(num_images, num_channels, image_height, image_width)`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
+ images = [
+ to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+ for image in all_images
+ ]
+
+ data = {"pixel_values": images}
+ return BatchFeature(data=data, tensor_type=return_tensors)
+ ```
+
+ So, we override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
+
+ ```python
+ def _get_mm_fields_config(
+ self,
+ hf_inputs: BatchFeature,
+ hf_processor_mm_kwargs: Mapping[str, object],
+ ) -> Mapping[str, MultiModalFieldConfig]:
+ return dict(
+ pixel_values=MultiModalFieldConfig.batched("image"),
+ )
+ ```
+
+ !!! note
+ Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
+ pre-computed image embeddings, which can be passed to be model via the `image_embeds` argument.
+
+=== "With postprocessing: Fuyu"
+
+ The `image_patches` output of `FuyuImageProcessor.preprocess_with_tokenizer_info` concatenates
+ the patches from each image belonging to an item in the batch:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
+ image_input_ids.append(tensor_of_image_ids)
+ image_patches.append(patches)
+ else:
+ image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
+
+ batch_image_input_ids.append(image_input_ids)
+ batch_image_patches.append(image_patches)
+ ```
+
+ The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
+ `(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
+
+ In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
+ we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
+
+ ```python
+ def _call_hf_processor(
+ self,
+ prompt: str,
+ mm_data: Mapping[str, object],
+ mm_kwargs: Mapping[str, object],
+ ) -> BatchFeature:
+ processed_outputs = super()._call_hf_processor(
+ prompt=prompt,
+ mm_data=mm_data,
+ mm_kwargs=mm_kwargs,
+ )
+
+ image_patches = processed_outputs.get("image_patches")
+ if image_patches is not None:
+ images = mm_data["images"]
+ assert isinstance(images, list)
+
+ # Original output: (1, num_images, Pn, Px * Py * C)
+ # New output: (num_images, Pn, Px * Py * C)
+ assert (isinstance(image_patches, list)
+ and len(image_patches) == 1)
+ assert (isinstance(image_patches[0], torch.Tensor)
+ and len(image_patches[0]) == len(images))
+
+ processed_outputs["image_patches"] = image_patches[0]
+
+ return processed_outputs
+ ```
+
+ !!! note
+ Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
+ for text-only inputs to prevent unnecessary warnings from HF processor.
+
+ This lets us override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
+
+ ```python
+ def _get_mm_fields_config(
+ self,
+ hf_inputs: BatchFeature,
+ hf_processor_mm_kwargs: Mapping[str, object],
+ ) -> Mapping[str, MultiModalFieldConfig]:
+ return dict(image_patches=MultiModalFieldConfig.batched("image"))
+ ```
+
+### Prompt updates
+
+Override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] to
+return a list of [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instances.
+
+Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies an update operation
+(e.g.: insertion, replacement) performed by the HF processor.
+
+=== "Basic example: LLaVA"
+
+ Looking at HF's `LlavaProcessor`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
+ prompt_strings = []
+ for sample in text:
+ sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
+ prompt_strings.append(sample)
+ ```
+
+ It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
+ Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:
+
+ ```python
+ def _get_prompt_updates(
+ self,
+ mm_items: MultiModalDataItems,
+ hf_processor_mm_kwargs: Mapping[str, object],
+ out_mm_kwargs: MultiModalKwargs,
+ ) -> Sequence[PromptUpdate]:
+ hf_config = self.info.get_hf_config()
+ image_token_id = hf_config.image_token_index
+
+ def get_replacement(item_idx: int):
+ images = mm_items.get_items("image", ImageProcessorItems)
+
+ image_size = images.get_image_size(item_idx)
+ num_image_tokens = self.info.get_num_image_tokens(
+ image_width=image_size.width,
+ image_height=image_size.height,
+ )
+
+ return [image_token_id] * num_image_tokens
+
+ return [
+ PromptReplacement(
+ modality="image",
+ target=[image_token_id],
+ replacement=get_replacement,
+ ),
+ ]
+ ```
+
+=== "Handling additional tokens: Fuyu"
+
+ Recall the layout of feature tokens from Step 2:
+
+ ```
+ |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+ |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+ ...
+ |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+ ```
+
+ We define a helper function to return `ncols` and `nrows` directly:
+
+ ```python
+ def get_image_feature_grid_size(
+ self,
+ *,
+ image_width: int,
+ image_height: int,
+ ) -> tuple[int, int]:
+ image_processor = self.get_image_processor()
+ target_width = image_processor.size["width"]
+ target_height = image_processor.size["height"]
+ patch_width = image_processor.patch_size["width"]
+ patch_height = image_processor.patch_size["height"]
+
+ if not (image_width <= target_width and image_height <= target_height):
+ height_scale_factor = target_height / image_height
+ width_scale_factor = target_width / image_width
+ optimal_scale_factor = min(height_scale_factor, width_scale_factor)
+
+ image_height = int(image_height * optimal_scale_factor)
+ image_width = int(image_width * optimal_scale_factor)
+
+ ncols = math.ceil(image_width / patch_width)
+ nrows = math.ceil(image_height / patch_height)
+ return ncols, nrows
+ ```
+
+ Based on this, we can initially define our replacement tokens as:
+
+ ```python
+ def get_replacement(item_idx: int):
+ images = mm_items.get_items("image", ImageProcessorItems)
+ image_size = images.get_image_size(item_idx)
+
+ ncols, nrows = self.info.get_image_feature_grid_size(
+ image_width=image_size.width,
+ image_height=image_size.height,
+ )
+
+ # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
+ # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
+ return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
+ ```
+
+ However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
+ a BOS token (``) is also added to the promopt:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
+ model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+ image_input=tensor_batch_images,
+ image_present=image_present,
+ image_unpadded_h=image_unpadded_heights,
+ image_unpadded_w=image_unpadded_widths,
+ image_placeholder_id=image_placeholder_id,
+ image_newline_id=image_newline_id,
+ variable_sized=True,
+ )
+ prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
+ tokenizer=self.tokenizer,
+ prompts=prompts,
+ scale_factors=scale_factors,
+ max_tokens_to_generate=self.max_tokens_to_generate,
+ max_position_embeddings=self.max_position_embeddings,
+ add_BOS=True,
+ add_beginning_of_answer_token=True,
+ )
+ ```
+
+ To assign the vision embeddings to only the image tokens, instead of a string
+ you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:
+
+ ```python
+ hf_config = self.info.get_hf_config()
+ bos_token_id = hf_config.bos_token_id # ``
+ assert isinstance(bos_token_id, int)
+
+ def get_replacement_fuyu(item_idx: int):
+ images = mm_items.get_items("image", ImageProcessorItems)
+ image_size = images.get_image_size(item_idx)
+
+ ncols, nrows = self.info.get_image_feature_grid_size(
+ image_width=image_size.width,
+ image_height=image_size.height,
+ )
+ image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+ [_NEWLINE_TOKEN_ID]) * nrows
+
+ return PromptUpdateDetails.select_token_id(
+ image_tokens + [bos_token_id],
+ embed_token_id=_IMAGE_TOKEN_ID,
+ )
+ ```
+
+ Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
+ we can search for it to conduct the replacement at the start of the string:
+
+ ```python
+ def _get_prompt_updates(
+ self,
+ mm_items: MultiModalDataItems,
+ hf_processor_mm_kwargs: Mapping[str, object],
+ out_mm_kwargs: MultiModalKwargs,
+ ) -> Sequence[PromptUpdate]:
+ hf_config = self.info.get_hf_config()
+ bos_token_id = hf_config.bos_token_id
+ assert isinstance(bos_token_id, int)
+
+ tokenizer = self.info.get_tokenizer()
+ eot_token_id = tokenizer.bos_token_id
+ assert isinstance(eot_token_id, int)
+
+ def get_replacement_fuyu(item_idx: int):
+ images = mm_items.get_items("image", ImageProcessorItems)
+ image_size = images.get_image_size(item_idx)
+
+ ncols, nrows = self.info.get_image_feature_grid_size(
+ image_width=image_size.width,
+ image_height=image_size.height,
+ )
+ image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+ [_NEWLINE_TOKEN_ID]) * nrows
+
+ return PromptUpdateDetails.select_token_id(
+ image_tokens + [bos_token_id],
+ embed_token_id=_IMAGE_TOKEN_ID,
+ )
+
+ return [
+ PromptReplacement(
+ modality="image",
+ target=[eot_token_id],
+ replacement=get_replacement_fuyu,
+ )
+ ]
+ ```
+
+## 5. Register processor-related classes
+
+After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
+[BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
+and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
+decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_processor `
+to register them to the multi-modal registry:
+
+```diff
+ from vllm.model_executor.models.interfaces import SupportsMultiModal
++ from vllm.multimodal import MULTIMODAL_REGISTRY
+
++ @MULTIMODAL_REGISTRY.register_processor(YourMultiModalProcessor,
++ info=YourProcessingInfo,
++ dummy_inputs=YourDummyInputsBuilder)
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
+```
+
+## Notes
+
+### Inserting feature tokens without replacement
+
+Some HF processors directly insert feature tokens without replacing anything in the original prompt. In that case, you can use [PromptInsertion][vllm.multimodal.processing.PromptInsertion] instead of [PromptReplacement][vllm.multimodal.processing.PromptReplacement] inside [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
+
+Examples:
+
+- BLIP-2 (insert at start of prompt):
+- Florence2 (insert at start of prompt):
+- Molmo (insert after `<|endoftext|>` token):
+
+### Handling prompt updates unrelated to multi-modal data
+
+[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design][mm-processing].
+
+Examples:
+
+- Chameleon (appends `sep_token`):
+- Fuyu (appends `boa_token`):
+- Molmo (applies chat template which is not defined elsewhere):
+
+### Custom HF processor
+
+Some models don't define a HF processor class on HF Hub. In that case, you can define a custom HF processor that has the same call signature as HF processors and pass it to [_call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor].
+
+Examples:
+
+- DeepSeek-VL2:
+- InternVL:
+- Qwen-VL:
diff --git a/docs/contributing/model/registration.md b/docs/contributing/model/registration.md
new file mode 100644
index 000000000000..7a7bd7914058
--- /dev/null
+++ b/docs/contributing/model/registration.md
@@ -0,0 +1,54 @@
+---
+title: Registering a Model to vLLM
+---
+[](){ #new-model-registration }
+
+vLLM relies on a model registry to determine how to run each model.
+A list of pre-registered architectures can be found [here][supported-models].
+
+If your model is not on this list, you must register it to vLLM.
+This page provides detailed instructions on how to do so.
+
+## Built-in models
+
+To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
+This gives you the ability to modify the codebase and test your model.
+
+After you have implemented your model (see [tutorial][new-model-basic]), put it into the directory.
+Then, add your model class to `_VLLM_MODELS` in so that it is automatically registered upon importing vLLM.
+Finally, update our [list of supported models][supported-models] to promote your model!
+
+!!! warning
+ The list of models in each section should be maintained in alphabetical order.
+
+## Out-of-tree models
+
+You can load an external model [using a plugin][plugin-system] without modifying the vLLM codebase.
+
+To register the model, use the following code:
+
+```python
+# The entrypoint of your plugin
+def register():
+ from vllm import ModelRegistry
+ from your_code import YourModelForCausalLM
+
+ ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
+```
+
+If your model imports modules that initialize CUDA, consider lazy-importing it to avoid errors like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`:
+
+```python
+# The entrypoint of your plugin
+def register():
+ from vllm import ModelRegistry
+
+ ModelRegistry.register_model(
+ "YourModelForCausalLM",
+ "your_code:YourModelForCausalLM"
+ )
+```
+
+!!! warning
+ If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
+ Read more about that [here][supports-multimodal].
diff --git a/docs/source/contributing/model/tests.md b/docs/contributing/model/tests.md
similarity index 74%
rename from docs/source/contributing/model/tests.md
rename to docs/contributing/model/tests.md
index 68d51d89f7cf..67f8eda61dc5 100644
--- a/docs/source/contributing/model/tests.md
+++ b/docs/contributing/model/tests.md
@@ -1,6 +1,7 @@
-(new-model-tests)=
-
-# Writing Unit Tests
+---
+title: Writing Unit Tests
+---
+[](){ #new-model-tests }
This page explains how to write unit tests to verify the implementation of your model.
@@ -14,14 +15,12 @@ Without them, the CI for your PR will fail.
Include an example HuggingFace repository for your model in .
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
-:::{important}
-The list of models in each section should be maintained in alphabetical order.
-:::
+!!! warning
+ The list of models in each section should be maintained in alphabetical order.
-:::{tip}
-If your model requires a development version of HF Transformers, you can set
-`min_transformers_version` to skip the test in CI until the model is released.
-:::
+!!! tip
+ If your model requires a development version of HF Transformers, you can set
+ `min_transformers_version` to skip the test in CI until the model is released.
## Optional Tests
@@ -34,16 +33,16 @@ These tests compare the model outputs of vLLM against [HF Transformers](https://
#### Generative models
-For [generative models](#generative-models), there are two levels of correctness tests, as defined in :
+For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in :
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
#### Pooling models
-For [pooling models](#pooling-models), we simply check the cosine similarity, as defined in .
+For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in .
-(mm-processing-tests)=
+[](){ #mm-processing-tests }
### Multi-modal processing
diff --git a/docs/source/contributing/profiling/profiling_index.md b/docs/contributing/profiling.md
similarity index 90%
rename from docs/source/contributing/profiling/profiling_index.md
rename to docs/contributing/profiling.md
index ce25daa39c5c..be01b9b65f65 100644
--- a/docs/source/contributing/profiling/profiling_index.md
+++ b/docs/contributing/profiling.md
@@ -1,8 +1,7 @@
# Profiling vLLM
-:::{warning}
-Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
-:::
+!!! warning
+ Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
## Profile with PyTorch Profiler
@@ -14,15 +13,13 @@ When using `benchmarks/benchmark_serving.py`, you can enable profiling by passin
Traces can be visualized using .
-:::{tip}
-Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
-:::
+!!! tip
+ Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
-:::{tip}
-To stop the profiler - it flushes out all the profile trace files to the directory. This takes time, for example for about 100 requests worth of data for a llama 70b, it takes about 10 minutes to flush out on a H100.
-Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the server. Say something like 30 minutes.
-`export VLLM_RPC_TIMEOUT=1800000`
-:::
+!!! tip
+ To stop the profiler - it flushes out all the profile trace files to the directory. This takes time, for example for about 100 requests worth of data for a llama 70b, it takes about 10 minutes to flush out on a H100.
+ Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the server. Say something like 30 minutes.
+ `export VLLM_RPC_TIMEOUT=1800000`
### Example commands and usage
diff --git a/docs/source/contributing/vulnerability_management.md b/docs/contributing/vulnerability_management.md
similarity index 100%
rename from docs/source/contributing/vulnerability_management.md
rename to docs/contributing/vulnerability_management.md
diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md
new file mode 100644
index 000000000000..516640f6fd3c
--- /dev/null
+++ b/docs/deployment/docker.md
@@ -0,0 +1,129 @@
+---
+title: Using Docker
+---
+[](){ #deployment-docker }
+
+[](){ #deployment-docker-pre-built-image }
+
+## Use vLLM's Official Docker Image
+
+vLLM offers an official Docker image for deployment.
+The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
+
+```console
+docker run --runtime nvidia --gpus all \
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
+ --env "HUGGING_FACE_HUB_TOKEN=" \
+ -p 8000:8000 \
+ --ipc=host \
+ vllm/vllm-openai:latest \
+ --model mistralai/Mistral-7B-v0.1
+```
+
+This image can also be used with other container engines such as [Podman](https://podman.io/).
+
+```console
+podman run --gpus all \
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ -p 8000:8000 \
+ --ipc=host \
+ vllm/vllm-openai:latest \
+ --model mistralai/Mistral-7B-v0.1
+```
+
+You can add any other [engine-args][engine-args] you need after the image tag (`vllm/vllm-openai:latest`).
+
+!!! note
+ You can either use the `ipc=host` flag or `--shm-size` flag to allow the
+ container to access the host's shared memory. vLLM uses PyTorch, which uses shared
+ memory to share data between processes under the hood, particularly for tensor parallel inference.
+
+!!! note
+ Optional dependencies are not included in order to avoid licensing issues (e.g. ).
+
+ If you need to use those dependencies (having accepted the license terms),
+ create a custom Dockerfile on top of the base image with an extra layer that installs them:
+
+ ```Dockerfile
+ FROM vllm/vllm-openai:v0.8.3
+
+ # e.g. install the `audio` optional dependencies
+ # NOTE: Make sure the version of vLLM matches the base image!
+ RUN uv pip install --system vllm[audio]==0.8.3
+ ```
+
+!!! tip
+ Some new models may only be available on the main branch of [HF Transformers](https://github.com/huggingface/transformers).
+
+ To use the development version of `transformers`, create a custom Dockerfile on top of the base image
+ with an extra layer that installs their code from source:
+
+ ```Dockerfile
+ FROM vllm/vllm-openai:latest
+
+ RUN uv pip install --system git+https://github.com/huggingface/transformers.git
+ ```
+
+[](){ #deployment-docker-build-image-from-source }
+
+## Building vLLM's Docker Image from Source
+
+You can build and run vLLM from source via the provided . To build vLLM:
+
+```console
+# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
+DOCKER_BUILDKIT=1 docker build . \
+ --target vllm-openai \
+ --tag vllm/vllm-openai \
+ --file docker/Dockerfile
+```
+
+!!! note
+ By default vLLM will build for all GPU types for widest distribution. If you are just building for the
+ current GPU type the machine is running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
+ for vLLM to find the current GPU type and build for that.
+
+ If you are using Podman instead of Docker, you might need to disable SELinux labeling by
+ adding `--security-opt label=disable` when running `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).
+
+## Building for Arm64/aarch64
+
+A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At time of this writing, this requires the use
+of PyTorch Nightly and should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.
+
+!!! note
+ Multiple modules must be compiled, so this process can take a while. Recommend using `--build-arg max_jobs=` & `--build-arg nvcc_threads=`
+ flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
+ Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
+
+```console
+# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
+python3 use_existing_torch.py
+DOCKER_BUILDKIT=1 docker build . \
+ --file docker/Dockerfile \
+ --target vllm-openai \
+ --platform "linux/arm64" \
+ -t vllm/vllm-gh200-openai:latest \
+ --build-arg max_jobs=66 \
+ --build-arg nvcc_threads=2 \
+ --build-arg torch_cuda_arch_list="9.0+PTX" \
+ --build-arg vllm_fa_cmake_gpu_arches="90-real"
+```
+
+## Use the custom-built vLLM Docker image
+
+To run vLLM with the custom-built Docker image:
+
+```console
+docker run --runtime nvidia --gpus all \
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
+ -p 8000:8000 \
+ --env "HUGGING_FACE_HUB_TOKEN=" \
+ vllm/vllm-openai
+```
+
+The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
+
+!!! note
+ **For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` .
diff --git a/docs/source/deployment/frameworks/anything-llm.md b/docs/deployment/frameworks/anything-llm.md
similarity index 78%
rename from docs/source/deployment/frameworks/anything-llm.md
rename to docs/deployment/frameworks/anything-llm.md
index d430c170ef54..a89e633c086e 100644
--- a/docs/source/deployment/frameworks/anything-llm.md
+++ b/docs/deployment/frameworks/anything-llm.md
@@ -1,6 +1,7 @@
-(deployment-anything-llm)=
-
-# Anything LLM
+---
+title: Anything LLM
+---
+[](){ #deployment-anything-llm }
[Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.
@@ -25,23 +26,19 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
- Base URL: http://{vllm server host}:{vllm server port}/v1
- Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ`
-:::{image} /assets/deployment/anything-llm-provider.png
-:::
+
- Back to home page, New Workspace --> create `vllm` workspace, and start to chat:
-:::{image} /assets/deployment/anything-llm-chat-without-doc.png
-:::
+
- Click the upload button:
- upload the doc
- select the doc and move to the workspace
- save and embed
-:::{image} /assets/deployment/anything-llm-upload-doc.png
-:::
+
- Chat again:
-:::{image} /assets/deployment/anything-llm-chat-with-doc.png
-:::
+
diff --git a/docs/deployment/frameworks/autogen.md b/docs/deployment/frameworks/autogen.md
new file mode 100644
index 000000000000..ad8c167659ef
--- /dev/null
+++ b/docs/deployment/frameworks/autogen.md
@@ -0,0 +1,83 @@
+---
+title: AutoGen
+---
+[](){ #deployment-autogen }
+
+[AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans.
+
+## Prerequisites
+
+- Setup vLLM environment
+
+- Setup [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment
+
+```console
+pip install vllm
+
+# Install AgentChat and OpenAI client from Extensions
+# AutoGen requires Python 3.10 or later.
+pip install -U "autogen-agentchat" "autogen-ext[openai]"
+```
+
+## Deploy
+
+- Start the vLLM server with the supported chat completion model, e.g.
+
+```console
+python -m vllm.entrypoints.openai.api_server \
+ --model mistralai/Mistral-7B-Instruct-v0.2
+```
+
+- Call it with AutoGen:
+
+```python
+import asyncio
+from autogen_core.models import UserMessage
+from autogen_ext.models.openai import OpenAIChatCompletionClient
+from autogen_core.models import ModelFamily
+
+
+async def main() -> None:
+ # Create a model client
+ model_client = OpenAIChatCompletionClient(
+ model="mistralai/Mistral-7B-Instruct-v0.2",
+ base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
+ api_key="EMPTY",
+ model_info={
+ "vision": False,
+ "function_calling": False,
+ "json_output": False,
+ "family": ModelFamily.MISTRAL,
+ "structured_output": True,
+ },
+ )
+
+ messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
+
+ # Create a stream.
+ stream = model_client.create_stream(messages=messages)
+
+ # Iterate over the stream and print the responses.
+ print("Streamed responses:")
+ async for response in stream:
+ if isinstance(response, str):
+ # A partial response is a string.
+ print(response, flush=True, end="")
+ else:
+ # The last response is a CreateResult object with the complete message.
+ print("\n\n------------\n")
+ print("The complete response:", flush=True)
+ print(response.content, flush=True)
+
+ # Close the client when done.
+ await model_client.close()
+
+
+asyncio.run(main())
+```
+
+For details, see the tutorial:
+
+- [Using vLLM in AutoGen](https://microsoft.github.io/autogen/0.2/docs/topics/non-openai-models/local-vllm/)
+
+- [OpenAI-compatible API examples](https://microsoft.github.io/autogen/stable/reference/python/autogen_ext.models.openai.html#autogen_ext.models.openai.OpenAIChatCompletionClient)
diff --git a/docs/source/deployment/frameworks/bentoml.md b/docs/deployment/frameworks/bentoml.md
similarity index 89%
rename from docs/source/deployment/frameworks/bentoml.md
rename to docs/deployment/frameworks/bentoml.md
index 2bf435bda838..7e64b6eb6fb0 100644
--- a/docs/source/deployment/frameworks/bentoml.md
+++ b/docs/deployment/frameworks/bentoml.md
@@ -1,6 +1,7 @@
-(deployment-bentoml)=
-
-# BentoML
+---
+title: BentoML
+---
+[](){ #deployment-bentoml }
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
diff --git a/docs/source/deployment/frameworks/cerebrium.md b/docs/deployment/frameworks/cerebrium.md
similarity index 98%
rename from docs/source/deployment/frameworks/cerebrium.md
rename to docs/deployment/frameworks/cerebrium.md
index b20c95137b6e..84cb2304fac2 100644
--- a/docs/source/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -1,12 +1,11 @@
-(deployment-cerebrium)=
+---
+title: Cerebrium
+---
+[](){ #deployment-cerebrium }
-# Cerebrium
-
-:::{raw} html
-:::
vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.
diff --git a/docs/source/deployment/frameworks/chatbox.md b/docs/deployment/frameworks/chatbox.md
similarity index 84%
rename from docs/source/deployment/frameworks/chatbox.md
rename to docs/deployment/frameworks/chatbox.md
index e62f4647150f..10da2fc71002 100644
--- a/docs/source/deployment/frameworks/chatbox.md
+++ b/docs/deployment/frameworks/chatbox.md
@@ -1,6 +1,7 @@
-(deployment-chatbox)=
-
-# Chatbox
+---
+title: Chatbox
+---
+[](){ #deployment-chatbox }
[Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, Linux.
@@ -27,10 +28,8 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
- API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat`
-:::{image} /assets/deployment/chatbox-settings.png
-:::
+
- Go to `Just chat`, and start to chat:
-:::{image} /assets/deployment/chatbox-chat.png
-:::
+
diff --git a/docs/source/deployment/frameworks/dify.md b/docs/deployment/frameworks/dify.md
similarity index 90%
rename from docs/source/deployment/frameworks/dify.md
rename to docs/deployment/frameworks/dify.md
index 5cdf6a387637..886484b54347 100644
--- a/docs/source/deployment/frameworks/dify.md
+++ b/docs/deployment/frameworks/dify.md
@@ -1,6 +1,7 @@
-(deployment-dify)=
-
-# Dify
+---
+title: Dify
+---
+[](){ #deployment-dify }
[Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.
@@ -42,15 +43,12 @@ docker compose up -d
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion`
-:::{image} /assets/deployment/dify-settings.png
-:::
+
- To create a test chatbot, go to `Studio โ Chatbot โ Create from Blank`, then select Chatbot as the type:
-:::{image} /assets/deployment/dify-create-chatbot.png
-:::
+
- Click the chatbot you just created to open the chat interface and start interacting with the model:
-:::{image} /assets/deployment/dify-chat.png
-:::
+
diff --git a/docs/source/deployment/frameworks/dstack.md b/docs/deployment/frameworks/dstack.md
similarity index 83%
rename from docs/source/deployment/frameworks/dstack.md
rename to docs/deployment/frameworks/dstack.md
index a16e28f2d898..7de92855745b 100644
--- a/docs/source/deployment/frameworks/dstack.md
+++ b/docs/deployment/frameworks/dstack.md
@@ -1,12 +1,11 @@
-(deployment-dstack)=
+---
+title: dstack
+---
+[](){ #deployment-dstack }
-# dstack
-
-:::{raw} html
-:::
vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
@@ -97,6 +96,5 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
-:::{note}
-dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
-:::
+!!! note
+ dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
diff --git a/docs/deployment/frameworks/haystack.md b/docs/deployment/frameworks/haystack.md
new file mode 100644
index 000000000000..2eac4a5279fd
--- /dev/null
+++ b/docs/deployment/frameworks/haystack.md
@@ -0,0 +1,60 @@
+---
+title: Haystack
+---
+[](){ #deployment-haystack }
+
+# Haystack
+
+[Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case.
+
+It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints.
+
+## Prerequisites
+
+- Setup vLLM and Haystack environment
+
+```console
+pip install vllm haystack-ai
+```
+
+## Deploy
+
+- Start the vLLM server with the supported chat completion model, e.g.
+
+```console
+vllm serve mistralai/Mistral-7B-Instruct-v0.1
+```
+
+- Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
+
+```python
+from haystack.components.generators.chat import OpenAIChatGenerator
+from haystack.dataclasses import ChatMessage
+from haystack.utils import Secret
+
+generator = OpenAIChatGenerator(
+ # for compatibility with the OpenAI API, a placeholder api_key is needed
+ api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
+ model="mistralai/Mistral-7B-Instruct-v0.1",
+ api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
+ generation_kwargs = {"max_tokens": 512}
+)
+
+response = generator.run(
+ messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
+)
+
+print("-"*30)
+print(response)
+print("-"*30)
+```
+
+Output e.g.:
+
+```console
+------------------------------
+{'replies': [ChatMessage(_role=, _content=[TextContent(text=' Of course! Where in Italy would you like to go and what type of trip are you looking to plan?')], _name=None, _meta={'model': 'mistralai/Mistral-7B-Instruct-v0.1', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 23, 'prompt_tokens': 21, 'total_tokens': 44, 'completion_tokens_details': None, 'prompt_tokens_details': None}})]}
+------------------------------
+```
+
+For details, see the tutorial [Using vLLM in Haystack](https://github.com/deepset-ai/haystack-integrations/blob/main/integrations/vllm.md).
diff --git a/docs/deployment/frameworks/helm.md b/docs/deployment/frameworks/helm.md
new file mode 100644
index 000000000000..192b90438acf
--- /dev/null
+++ b/docs/deployment/frameworks/helm.md
@@ -0,0 +1,95 @@
+---
+title: Helm
+---
+[](){ #deployment-helm }
+
+A Helm chart to deploy vLLM for Kubernetes
+
+Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.
+
+This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm installation and documentation on architecture and values file.
+
+## Prerequisites
+
+Before you begin, ensure that you have the following:
+
+- A running Kubernetes cluster
+- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
+- Available GPU resources in your cluster
+- S3 with the model which will be deployed
+
+## Installing the chart
+
+To install the chart with the release name `test-vllm`:
+
+```console
+helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
+```
+
+## Uninstalling the Chart
+
+To uninstall the `test-vllm` deployment:
+
+```console
+helm uninstall test-vllm --namespace=ns-vllm
+```
+
+The command removes all the Kubernetes components associated with the
+chart **including persistent volumes** and deletes the release.
+
+## Architecture
+
+
+
+## Values
+
+| Key | Type | Default | Description |
+|--------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
+| autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
+| autoscaling.enabled | bool | false | Enable autoscaling |
+| autoscaling.maxReplicas | int | 100 | Maximum replicas |
+| autoscaling.minReplicas | int | 1 | Minimum replicas |
+| autoscaling.targetCPUUtilizationPercentage | int | 80 | Target CPU utilization for autoscaling |
+| configs | object | {} | Configmap |
+| containerPort | int | 8000 | Container port |
+| customObjects | list | [] | Custom Objects configuration |
+| deploymentStrategy | object | {} | Deployment strategy configuration |
+| externalConfigs | list | [] | External configuration |
+| extraContainers | list | [] | Additional containers configuration |
+| extraInit | object | {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true} | Additional configuration for the init container |
+| extraInit.pvcStorage | string | "50Gi" | Storage size of the s3 |
+| extraInit.s3modelpath | string | "relative_s3_model_path/opt-125m" | Path of the model on the s3 which hosts model weights and config files |
+| extraInit.awsEc2MetadataDisabled | boolean | true | Disables the use of the Amazon EC2 instance metadata service |
+| extraPorts | list | [] | Additional ports configuration |
+| gpuModels | list | ["TYPE_GPU_USED"] | Type of gpu used |
+| image | object | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration |
+| image.command | list | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] | Container launch command |
+| image.repository | string | "vllm/vllm-openai" | Image repository |
+| image.tag | string | "latest" | Image tag |
+| livenessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} | Liveness probe configuration |
+| livenessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive |
+| livenessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the Kubelet http request on the server |
+| livenessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
+| livenessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
+| livenessProbe.initialDelaySeconds | int | 15 | Number of seconds after the container has started before liveness probe is initiated |
+| livenessProbe.periodSeconds | int | 10 | How often (in seconds) to perform the liveness probe |
+| maxUnavailablePodDisruptionBudget | string | "" | Disruption Budget Configuration |
+| readinessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} | Readiness probe configuration |
+| readinessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready |
+| readinessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the Kubelet http request on the server |
+| readinessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
+| readinessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
+| readinessProbe.initialDelaySeconds | int | 5 | Number of seconds after the container has started before readiness probe is initiated |
+| readinessProbe.periodSeconds | int | 5 | How often (in seconds) to perform the readiness probe |
+| replicaCount | int | 1 | Number of replicas |
+| resources | object | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} | Resource configuration |
+| resources.limits."nvidia.com/gpu" | int | 1 | Number of gpus used |
+| resources.limits.cpu | int | 4 | Number of CPUs |
+| resources.limits.memory | string | "16Gi" | CPU memory configuration |
+| resources.requests."nvidia.com/gpu" | int | 1 | Number of gpus used |
+| resources.requests.cpu | int | 4 | Number of CPUs |
+| resources.requests.memory | string | "16Gi" | CPU memory configuration |
+| secrets | object | {} | Secrets configuration |
+| serviceName | string | Service name | |
+| servicePort | int | 80 | Service port |
+| labels.environment | string | test | Environment name |
diff --git a/docs/source/deployment/frameworks/litellm.md b/docs/deployment/frameworks/litellm.md
similarity index 97%
rename from docs/source/deployment/frameworks/litellm.md
rename to docs/deployment/frameworks/litellm.md
index 6dd3607ca5e3..3011cde83018 100644
--- a/docs/source/deployment/frameworks/litellm.md
+++ b/docs/deployment/frameworks/litellm.md
@@ -1,6 +1,7 @@
-(deployment-litellm)=
-
-# LiteLLM
+---
+title: LiteLLM
+---
+[](){ #deployment-litellm }
[LiteLLM](https://github.com/BerriAI/litellm) call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq etc.]
diff --git a/docs/source/deployment/frameworks/lobe-chat.md b/docs/deployment/frameworks/lobe-chat.md
similarity index 89%
rename from docs/source/deployment/frameworks/lobe-chat.md
rename to docs/deployment/frameworks/lobe-chat.md
index 6d86b7fa9cce..cd95c028155e 100644
--- a/docs/source/deployment/frameworks/lobe-chat.md
+++ b/docs/deployment/frameworks/lobe-chat.md
@@ -1,6 +1,7 @@
-(deployment-lobe-chat)=
-
-# Lobe Chat
+---
+title: Lobe Chat
+---
+[](){ #deployment-lobe-chat }
[Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.
diff --git a/docs/source/deployment/frameworks/lws.md b/docs/deployment/frameworks/lws.md
similarity index 99%
rename from docs/source/deployment/frameworks/lws.md
rename to docs/deployment/frameworks/lws.md
index 4e9a03b5c4c1..18282a89ddff 100644
--- a/docs/source/deployment/frameworks/lws.md
+++ b/docs/deployment/frameworks/lws.md
@@ -1,6 +1,7 @@
-(deployment-lws)=
-
-# LWS
+---
+title: LWS
+---
+[](){ #deployment-lws }
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
diff --git a/docs/source/deployment/frameworks/modal.md b/docs/deployment/frameworks/modal.md
similarity index 85%
rename from docs/source/deployment/frameworks/modal.md
rename to docs/deployment/frameworks/modal.md
index e7c42088e36a..dbdb739a1000 100644
--- a/docs/source/deployment/frameworks/modal.md
+++ b/docs/deployment/frameworks/modal.md
@@ -1,6 +1,7 @@
-(deployment-modal)=
-
-# Modal
+---
+title: Modal
+---
+[](){ #deployment-modal }
vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.
diff --git a/docs/source/deployment/frameworks/open-webui.md b/docs/deployment/frameworks/open-webui.md
similarity index 87%
rename from docs/source/deployment/frameworks/open-webui.md
rename to docs/deployment/frameworks/open-webui.md
index 83e5303a00ef..1ab1931068fa 100644
--- a/docs/source/deployment/frameworks/open-webui.md
+++ b/docs/deployment/frameworks/open-webui.md
@@ -1,6 +1,7 @@
-(deployment-open-webui)=
-
-# Open WebUI
+---
+title: Open WebUI
+---
+[](){ #deployment-open-webui }
1. Install the [Docker](https://docs.docker.com/engine/install/)
@@ -25,5 +26,4 @@ ghcr.io/open-webui/open-webui:main
On the top of the web page, you can see the model `qwen/Qwen1.5-0.5B-Chat`.
-:::{image} /assets/deployment/open_webui.png
-:::
+
diff --git a/docs/source/deployment/frameworks/retrieval_augmented_generation.md b/docs/deployment/frameworks/retrieval_augmented_generation.md
similarity index 96%
rename from docs/source/deployment/frameworks/retrieval_augmented_generation.md
rename to docs/deployment/frameworks/retrieval_augmented_generation.md
index f84451fafe91..cb26c8378dee 100644
--- a/docs/source/deployment/frameworks/retrieval_augmented_generation.md
+++ b/docs/deployment/frameworks/retrieval_augmented_generation.md
@@ -1,6 +1,7 @@
-(deployment-retrieval-augmented-generation)=
-
-# Retrieval-Augmented Generation
+---
+title: Retrieval-Augmented Generation
+---
+[](){ #deployment-retrieval-augmented-generation }
[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
diff --git a/docs/source/deployment/frameworks/skypilot.md b/docs/deployment/frameworks/skypilot.md
similarity index 94%
rename from docs/source/deployment/frameworks/skypilot.md
rename to docs/deployment/frameworks/skypilot.md
index 5e101b900103..9763745f2378 100644
--- a/docs/source/deployment/frameworks/skypilot.md
+++ b/docs/deployment/frameworks/skypilot.md
@@ -1,12 +1,11 @@
-(deployment-skypilot)=
+---
+title: SkyPilot
+---
+[](){ #deployment-skypilot }
-# SkyPilot
-
-:::{raw} html
-:::
vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
@@ -83,7 +82,11 @@ Check the output of the command. There will be a shareable gradio link (like the
**Optional**: Serve the 70B model instead of the default 8B and use more GPU:
```console
-HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
+HF_TOKEN="your-huggingface-token" \
+ sky launch serving.yaml \
+ --gpus A100:8 \
+ --env HF_TOKEN \
+ --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
```
## Scale up to multiple replicas
@@ -104,10 +107,8 @@ service:
max_completion_tokens: 1
```
-:::{raw} html
Click to see the full recipe YAML
-:::
```yaml
service:
@@ -153,14 +154,14 @@ run: |
2>&1 | tee api_server.log
```
-:::{raw} html
-:::
Start the serving the Llama-3 8B model on multiple replicas:
```console
-HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
+HF_TOKEN="your-huggingface-token" \
+ sky serve up -n vllm serving.yaml \
+ --env HF_TOKEN
```
Wait until the service is ready:
@@ -169,10 +170,8 @@ Wait until the service is ready:
watch -n10 sky serve status vllm
```
-:::{raw} html
Example outputs:
-:::
```console
Services
@@ -185,9 +184,7 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
```
-:::{raw} html
-:::
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
@@ -223,10 +220,8 @@ service:
This will scale the service up to when the QPS exceeds 2 for each replica.
-:::{raw} html
Click to see the full recipe YAML
-:::
```yaml
service:
@@ -275,9 +270,7 @@ run: |
2>&1 | tee api_server.log
```
-:::{raw} html
-:::
To update the service with the new config:
@@ -295,10 +288,8 @@ sky serve down vllm
It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas.
-:::{raw} html
Click to see the full GUI YAML
-:::
```yaml
envs:
@@ -328,14 +319,14 @@ run: |
--stop-token-ids 128009,128001 | tee ~/gradio.log
```
-:::{raw} html
-:::
1. Start the chat web UI:
```console
- sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
+ sky launch \
+ -c gui ./gui.yaml \
+ --env ENDPOINT=$(sky serve status --endpoint vllm)
```
2. Then, we can access the GUI at the returned gradio link:
diff --git a/docs/source/deployment/frameworks/streamlit.md b/docs/deployment/frameworks/streamlit.md
similarity index 81%
rename from docs/source/deployment/frameworks/streamlit.md
rename to docs/deployment/frameworks/streamlit.md
index 084550ec991e..33ed8c5f5b54 100644
--- a/docs/source/deployment/frameworks/streamlit.md
+++ b/docs/deployment/frameworks/streamlit.md
@@ -1,6 +1,7 @@
-(deployment-streamlit)=
-
-# Streamlit
+---
+title: Streamlit
+---
+[](){ #deployment-streamlit }
[Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.
@@ -32,11 +33,11 @@ pip install streamlit openai
streamlit run streamlit_openai_chatbot_webserver.py
# or specify the VLLM_API_BASE or VLLM_API_KEY
-VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" streamlit run streamlit_openai_chatbot_webserver.py
+VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" \
+ streamlit run streamlit_openai_chatbot_webserver.py
# start with debug mode to view more details
streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug
```
-:::{image} /assets/deployment/streamlit-chat.png
-:::
+
diff --git a/docs/source/deployment/frameworks/triton.md b/docs/deployment/frameworks/triton.md
similarity index 87%
rename from docs/source/deployment/frameworks/triton.md
rename to docs/deployment/frameworks/triton.md
index 94d87120159c..082bc24d85aa 100644
--- a/docs/source/deployment/frameworks/triton.md
+++ b/docs/deployment/frameworks/triton.md
@@ -1,5 +1,6 @@
-(deployment-triton)=
-
-# NVIDIA Triton
+---
+title: NVIDIA Triton
+---
+[](){ #deployment-triton }
The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
diff --git a/docs/source/deployment/integrations/kserve.md b/docs/deployment/integrations/kserve.md
similarity index 85%
rename from docs/source/deployment/integrations/kserve.md
rename to docs/deployment/integrations/kserve.md
index c780fd74e8f5..754b983dee92 100644
--- a/docs/source/deployment/integrations/kserve.md
+++ b/docs/deployment/integrations/kserve.md
@@ -1,6 +1,7 @@
-(deployment-kserve)=
-
-# KServe
+---
+title: KServe
+---
+[](){ #deployment-kserve }
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
diff --git a/docs/source/deployment/integrations/kubeai.md b/docs/deployment/integrations/kubeai.md
similarity index 93%
rename from docs/source/deployment/integrations/kubeai.md
rename to docs/deployment/integrations/kubeai.md
index 2f5772e075d8..ba0a3c52cca7 100644
--- a/docs/source/deployment/integrations/kubeai.md
+++ b/docs/deployment/integrations/kubeai.md
@@ -1,6 +1,7 @@
-(deployment-kubeai)=
-
-# KubeAI
+---
+title: KubeAI
+---
+[](){ #deployment-kubeai }
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
diff --git a/docs/source/deployment/integrations/llamastack.md b/docs/deployment/integrations/llamastack.md
similarity index 94%
rename from docs/source/deployment/integrations/llamastack.md
rename to docs/deployment/integrations/llamastack.md
index a6c3569637ab..2ae600a423ff 100644
--- a/docs/source/deployment/integrations/llamastack.md
+++ b/docs/deployment/integrations/llamastack.md
@@ -1,6 +1,7 @@
-(deployment-llamastack)=
-
-# Llama Stack
+---
+title: Llama Stack
+---
+[](){ #deployment-llamastack }
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) .
diff --git a/docs/source/deployment/integrations/llmaz.md b/docs/deployment/integrations/llmaz.md
similarity index 87%
rename from docs/source/deployment/integrations/llmaz.md
rename to docs/deployment/integrations/llmaz.md
index cd4a76353d26..03d284c34769 100644
--- a/docs/source/deployment/integrations/llmaz.md
+++ b/docs/deployment/integrations/llmaz.md
@@ -1,6 +1,7 @@
-(deployment-llmaz)=
-
-# llmaz
+---
+title: llmaz
+---
+[](){ #deployment-llmaz }
[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend.
diff --git a/docs/source/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md
similarity index 98%
rename from docs/source/deployment/integrations/production-stack.md
rename to docs/deployment/integrations/production-stack.md
index 05f1568306cc..8288a4b6e6be 100644
--- a/docs/source/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -1,6 +1,7 @@
-(deployment-production-stack)=
-
-# Production stack
+---
+title: Production stack
+---
+[](){ #deployment-production-stack }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
@@ -114,7 +115,7 @@ To remove the deployment, run:
sudo helm uninstall vllm
```
-------
+---
### (Advanced) Configuring vLLM production stack
diff --git a/docs/source/deployment/k8s.md b/docs/deployment/k8s.md
similarity index 98%
rename from docs/source/deployment/k8s.md
rename to docs/deployment/k8s.md
index 9079cfa8e1b6..6b08c4960d02 100644
--- a/docs/source/deployment/k8s.md
+++ b/docs/deployment/k8s.md
@@ -1,6 +1,7 @@
-(deployment-k8s)=
-
-# Using Kubernetes
+---
+title: Using Kubernetes
+---
+[](){ #deployment-k8s }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
@@ -8,6 +9,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le
* [Deployment with GPUs](#deployment-with-gpus)
Alternatively, you can deploy vLLM to Kubernetes using any of the following:
+
* [Helm](frameworks/helm.md)
* [InftyAI/llmaz](integrations/llmaz.md)
* [KServe](integrations/kserve.md)
@@ -19,9 +21,8 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
## Deployment with CPUs
-:::{note}
-The use of CPUs here is for demonstration and testing purposes only and its performance will not be on par with GPUs.
-:::
+!!! note
+ The use of CPUs here is for demonstration and testing purposes only and its performance will not be on par with GPUs.
First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:
diff --git a/docs/source/deployment/nginx.md b/docs/deployment/nginx.md
similarity index 60%
rename from docs/source/deployment/nginx.md
rename to docs/deployment/nginx.md
index bf404f1098c3..80242919ba5b 100644
--- a/docs/source/deployment/nginx.md
+++ b/docs/deployment/nginx.md
@@ -1,20 +1,21 @@
-(nginxloadbalancer)=
-
-# Using Nginx
+---
+title: Using Nginx
+---
+[](){ #nginxloadbalancer }
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
Table of contents:
-1. [Build Nginx Container](#nginxloadbalancer-nginx-build)
-2. [Create Simple Nginx Config file](#nginxloadbalancer-nginx-conf)
-3. [Build vLLM Container](#nginxloadbalancer-nginx-vllm-container)
-4. [Create Docker Network](#nginxloadbalancer-nginx-docker-network)
-5. [Launch vLLM Containers](#nginxloadbalancer-nginx-launch-container)
-6. [Launch Nginx](#nginxloadbalancer-nginx-launch-nginx)
-7. [Verify That vLLM Servers Are Ready](#nginxloadbalancer-nginx-verify-nginx)
+1. [Build Nginx Container][nginxloadbalancer-nginx-build]
+2. [Create Simple Nginx Config file][nginxloadbalancer-nginx-conf]
+3. [Build vLLM Container][nginxloadbalancer-nginx-vllm-container]
+4. [Create Docker Network][nginxloadbalancer-nginx-docker-network]
+5. [Launch vLLM Containers][nginxloadbalancer-nginx-launch-container]
+6. [Launch Nginx][nginxloadbalancer-nginx-launch-nginx]
+7. [Verify That vLLM Servers Are Ready][nginxloadbalancer-nginx-verify-nginx]
-(nginxloadbalancer-nginx-build)=
+[](){ #nginxloadbalancer-nginx-build }
## Build Nginx Container
@@ -39,7 +40,7 @@ Build the container:
docker build . -f Dockerfile.nginx --tag nginx-lb
```
-(nginxloadbalancer-nginx-conf)=
+[](){ #nginxloadbalancer-nginx-conf }
## Create Simple Nginx Config file
@@ -63,7 +64,7 @@ server {
}
```
-(nginxloadbalancer-nginx-vllm-container)=
+[](){ #nginxloadbalancer-nginx-vllm-container }
## Build vLLM Container
@@ -76,10 +77,14 @@ If you are behind proxy, you can pass the proxy settings to the docker build com
```console
cd $vllm_root
-docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
+docker build \
+ -f docker/Dockerfile . \
+ --tag vllm \
+ --build-arg http_proxy=$http_proxy \
+ --build-arg https_proxy=$https_proxy
```
-(nginxloadbalancer-nginx-docker-network)=
+[](){ #nginxloadbalancer-nginx-docker-network }
## Create Docker Network
@@ -87,7 +92,7 @@ docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_prox
docker network create vllm_nginx
```
-(nginxloadbalancer-nginx-launch-container)=
+[](){ #nginxloadbalancer-nginx-launch-container }
## Launch vLLM Containers
@@ -101,23 +106,45 @@ Notes:
```console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
-docker run -itd --ipc host --network vllm_nginx --gpus device=0 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
-docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
+docker run \
+ -itd \
+ --ipc host \
+ --network vllm_nginx \
+ --gpus device=0 \
+ --shm-size=10.24gb \
+ -v $hf_cache_dir:/root/.cache/huggingface/ \
+ -p 8081:8000 \
+ --name vllm0 vllm \
+ --model meta-llama/Llama-2-7b-chat-hf
+docker run \
+ -itd \
+ --ipc host \
+ --network vllm_nginx \
+ --gpus device=1 \
+ --shm-size=10.24gb \
+ -v $hf_cache_dir:/root/.cache/huggingface/ \
+ -p 8082:8000 \
+ --name vllm1 vllm \
+ --model meta-llama/Llama-2-7b-chat-hf
```
-:::{note}
-If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
-:::
+!!! note
+ If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
-(nginxloadbalancer-nginx-launch-nginx)=
+[](){ #nginxloadbalancer-nginx-launch-nginx }
## Launch Nginx
```console
-docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
+docker run \
+ -itd \
+ -p 8000:80 \
+ --network vllm_nginx \
+ -v ./nginx_conf/:/etc/nginx/conf.d/ \
+ --name nginx-lb nginx-lb:latest
```
-(nginxloadbalancer-nginx-verify-nginx)=
+[](){ #nginxloadbalancer-nginx-verify-nginx }
## Verify That vLLM Servers Are Ready
diff --git a/docs/source/design/arch_overview.md b/docs/design/arch_overview.md
similarity index 81%
rename from docs/source/design/arch_overview.md
rename to docs/design/arch_overview.md
index 94bda8b5c58d..75d3e1b7ccc7 100644
--- a/docs/source/design/arch_overview.md
+++ b/docs/design/arch_overview.md
@@ -1,22 +1,18 @@
-(arch-overview)=
-
-# Architecture Overview
+---
+title: Architecture Overview
+---
+[](){ #arch-overview }
This document provides an overview of the vLLM architecture.
-:::{contents} Table of Contents
-:depth: 2
-:local: true
-:::
+[TOC]
## Entrypoints
vLLM provides a number of entrypoints for interacting with the system. The
following diagram shows the relationship between them.
-:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
-:alt: Entrypoints Diagram
-:::
+
### LLM Class
@@ -77,16 +73,14 @@ python -m vllm.entrypoints.openai.api_server --model
That code can be found in .
-More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document.
+More details on the API server can be found in the [OpenAI-Compatible Server][openai-compatible-server] document.
## LLM Engine
The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
the vLLM system, handling model inference and asynchronous request processing.
-:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
-:alt: LLMEngine Diagram
-:::
+
### LLMEngine
@@ -137,18 +131,16 @@ input tensors and capturing cudagraphs.
## Model
Every model runner object has one model object, which is the actual
-`torch.nn.Module` instance. See [huggingface_integration](#huggingface-integration) for how various
+`torch.nn.Module` instance. See [huggingface_integration][huggingface-integration] for how various
configurations affect the class we ultimately get.
## Class Hierarchy
The following figure shows the class hierarchy of vLLM:
-> :::{figure} /assets/design/hierarchy.png
-> :align: center
-> :alt: query
-> :width: 100%
-> :::
+>
+> { align="center" alt="query" width="100%" }
+>
There are several important design choices behind this class hierarchy:
@@ -178,44 +170,43 @@ of a vision model and a language model. By making the constructor uniform, we
can easily create a vision model and a language model and compose them into a
vision-language model.
-:::{note}
-To support this change, all vLLM models' signatures have been updated to:
-
-```python
-def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-```
-
-To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
+!!! note
+ To support this change, all vLLM models' signatures have been updated to:
-```python
-class MyOldModel(nn.Module):
- def __init__(
- self,
- config,
- cache_config: Optional[CacheConfig] = None,
- quant_config: Optional[QuantizationConfig] = None,
- lora_config: Optional[LoRAConfig] = None,
- prefix: str = "",
- ) -> None:
- ...
-
-from vllm.config import VllmConfig
-class MyNewModel(MyOldModel):
+ ```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
- config = vllm_config.model_config.hf_config
- cache_config = vllm_config.cache_config
- quant_config = vllm_config.quant_config
- lora_config = vllm_config.lora_config
- super().__init__(config, cache_config, quant_config, lora_config, prefix)
-
-if __version__ >= "0.6.4":
- MyModel = MyNewModel
-else:
- MyModel = MyOldModel
-```
-
-This way, the model can work with both old and new versions of vLLM.
-:::
+ ```
+
+ To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
+
+ ```python
+ class MyOldModel(nn.Module):
+ def __init__(
+ self,
+ config,
+ cache_config: Optional[CacheConfig] = None,
+ quant_config: Optional[QuantizationConfig] = None,
+ lora_config: Optional[LoRAConfig] = None,
+ prefix: str = "",
+ ) -> None:
+ ...
+
+ from vllm.config import VllmConfig
+ class MyNewModel(MyOldModel):
+ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
+ config = vllm_config.model_config.hf_config
+ cache_config = vllm_config.cache_config
+ quant_config = vllm_config.quant_config
+ lora_config = vllm_config.lora_config
+ super().__init__(config, cache_config, quant_config, lora_config, prefix)
+
+ if __version__ >= "0.6.4":
+ MyModel = MyNewModel
+ else:
+ MyModel = MyOldModel
+ ```
+
+ This way, the model can work with both old and new versions of vLLM.
3\. **Sharding and Quantization at Initialization**: Certain features require
changing the model weights. For example, tensor parallelism needs to shard the
diff --git a/docs/source/design/automatic_prefix_caching.md b/docs/design/automatic_prefix_caching.md
similarity index 98%
rename from docs/source/design/automatic_prefix_caching.md
rename to docs/design/automatic_prefix_caching.md
index 3928e0c16568..80883bb1d90d 100644
--- a/docs/source/design/automatic_prefix_caching.md
+++ b/docs/design/automatic_prefix_caching.md
@@ -1,6 +1,7 @@
-(design-automatic-prefix-caching)=
-
-# Automatic Prefix Caching
+---
+title: Automatic Prefix Caching
+---
+[](){ #design-automatic-prefix-caching }
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
diff --git a/docs/source/design/huggingface_integration.md b/docs/design/huggingface_integration.md
similarity index 64%
rename from docs/source/design/huggingface_integration.md
rename to docs/design/huggingface_integration.md
index 7d271b1cfb3a..2d462ccb6535 100644
--- a/docs/source/design/huggingface_integration.md
+++ b/docs/design/huggingface_integration.md
@@ -1,23 +1,22 @@
-(huggingface-integration)=
-
-# Integration with HuggingFace
+---
+title: Integration with HuggingFace
+---
+[](){ #huggingface-integration }
This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
Let's say we want to serve the popular QWen model by running `vllm serve Qwen/Qwen2-7B`.
1. The `model` argument is `Qwen/Qwen2-7B`. vLLM determines whether this model exists by checking for the corresponding config file `config.json`. See this [code snippet](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L162-L182) for the implementation. Within this process:
-
- - If the `model` argument corresponds to an existing local path, vLLM will load the config file directly from this path.
- - If the `model` argument is a HuggingFace model ID consisting of a username and model name, vLLM will first try to use the config file from the HuggingFace local cache, using the `model` argument as the model name and the `--revision` argument as the revision. See [their website](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information on how the HuggingFace cache works.
- - If the `model` argument is a HuggingFace model ID but it is not found in the cache, vLLM will download the config file from the HuggingFace model hub. Refer to [this function](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L91) for the implementation. The input arguments include the `model` argument as the model name, the `--revision` argument as the revision, and the environment variable `HF_TOKEN` as the token to access the model hub. In our case, vLLM will download the [config.json](https://huggingface.co/Qwen/Qwen2-7B/blob/main/config.json) file.
+ - If the `model` argument corresponds to an existing local path, vLLM will load the config file directly from this path.
+ - If the `model` argument is a HuggingFace model ID consisting of a username and model name, vLLM will first try to use the config file from the HuggingFace local cache, using the `model` argument as the model name and the `--revision` argument as the revision. See [their website](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information on how the HuggingFace cache works.
+ - If the `model` argument is a HuggingFace model ID but it is not found in the cache, vLLM will download the config file from the HuggingFace model hub. Refer to [this function](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L91) for the implementation. The input arguments include the `model` argument as the model name, the `--revision` argument as the revision, and the environment variable `HF_TOKEN` as the token to access the model hub. In our case, vLLM will download the [config.json](https://huggingface.co/Qwen/Qwen2-7B/blob/main/config.json) file.
2. After confirming the existence of the model, vLLM loads its config file and converts it into a dictionary. See this [code snippet](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L185-L186) for the implementation.
3. Next, vLLM [inspects](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L189) the `model_type` field in the config dictionary to [generate](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L190-L216) the config object to use. There are some `model_type` values that vLLM directly supports; see [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L48) for the list. If the `model_type` is not in the list, vLLM will use [AutoConfig.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoConfig.from_pretrained) to load the config class, with `model`, `--revision`, and `--trust_remote_code` as the arguments. Please note that:
-
- - HuggingFace also has its own logic to determine the config class to use. It will again use the `model_type` field to search for the class name in the transformers library; see [here](https://github.com/huggingface/transformers/tree/main/src/transformers/models) for the list of supported models. If the `model_type` is not found, HuggingFace will use the `auto_map` field from the config JSON file to determine the class name. Specifically, it is the `AutoConfig` field under `auto_map`. See [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/main/config.json) for an example.
- - The `AutoConfig` field under `auto_map` points to a module path in the model's repository. To create the config class, HuggingFace will import the module and use the `from_pretrained` method to load the config class. This can generally cause arbitrary code execution, so it is only executed when `--trust_remote_code` is enabled.
+ - HuggingFace also has its own logic to determine the config class to use. It will again use the `model_type` field to search for the class name in the transformers library; see [here](https://github.com/huggingface/transformers/tree/main/src/transformers/models) for the list of supported models. If the `model_type` is not found, HuggingFace will use the `auto_map` field from the config JSON file to determine the class name. Specifically, it is the `AutoConfig` field under `auto_map`. See [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/main/config.json) for an example.
+ - The `AutoConfig` field under `auto_map` points to a module path in the model's repository. To create the config class, HuggingFace will import the module and use the `from_pretrained` method to load the config class. This can generally cause arbitrary code execution, so it is only executed when `--trust_remote_code` is enabled.
4. Subsequently, vLLM applies some historical patches to the config object. These are mostly related to RoPE configuration; see [here](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/config.py#L244) for the implementation.
@@ -28,8 +27,7 @@ Beyond that, there are two more things vLLM depends on HuggingFace for.
1. **Tokenizer**: vLLM uses the tokenizer from HuggingFace to tokenize the input text. The tokenizer is loaded using [AutoTokenizer.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) with the `model` argument as the model name and the `--revision` argument as the revision. It is also possible to use a tokenizer from another model by specifying the `--tokenizer` argument in the `vllm serve` command. Other relevant arguments are `--tokenizer-revision` and `--tokenizer-mode`. Please check HuggingFace's documentation for the meaning of these arguments. This part of the logic can be found in the [get_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L87) function. After obtaining the tokenizer, notably, vLLM will cache some expensive attributes of the tokenizer in [get_cached_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L24).
2. **Model weight**: vLLM downloads the model weight from the HuggingFace model hub using the `model` argument as the model name and the `--revision` argument as the revision. vLLM provides the argument `--load-format` to control what files to download from the model hub. By default, it will try to load the weights in the safetensors format and fall back to the PyTorch bin format if the safetensors format is not available. We can also pass `--load-format dummy` to skip downloading the weights.
-
- - It is recommended to use the safetensors format, as it is efficient for loading in distributed inference and also safe from arbitrary code execution. See the [documentation](https://huggingface.co/docs/safetensors/en/index) for more information on the safetensors format. This part of the logic can be found [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/model_executor/model_loader/loader.py#L385). Please note that:
+ - It is recommended to use the safetensors format, as it is efficient for loading in distributed inference and also safe from arbitrary code execution. See the [documentation](https://huggingface.co/docs/safetensors/en/index) for more information on the safetensors format. This part of the logic can be found [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/model_executor/model_loader/loader.py#L385). Please note that:
This completes the integration between vLLM and HuggingFace.
diff --git a/docs/design/kernel/paged_attention.md b/docs/design/kernel/paged_attention.md
new file mode 100644
index 000000000000..6ebe1ee48acf
--- /dev/null
+++ b/docs/design/kernel/paged_attention.md
@@ -0,0 +1,498 @@
+---
+title: vLLM Paged Attention
+---
+[](){ #design-paged-attention }
+
+Currently, vLLM utilizes its own implementation of a multi-head query
+attention kernel (`csrc/attention/attention_kernels.cu`).
+This kernel is designed to be compatible with
+vLLM's paged KV caches, where the key and value cache are stored in
+separate blocks (note that this block concept differs from the GPU
+thread block. So in a later document, I will refer to vLLM paged
+attention block as "block", while refer to GPU thread block as
+"thread block").
+
+To achieve high performance, this kernel relies on a specially
+designed memory layout and access method, specifically when threads
+read data from global memory to shared memory. The purpose of this
+document is to provide a high-level explanation of the kernel
+implementation step by step, aiding those who wish to learn about the
+vLLM multi-head query attention kernel. After going through this
+document, users will likely have a better understanding and feel easier
+to follow the actual implementation.
+
+Please note that this document may not cover all details, such as how
+to calculate the correct index for the corresponding data or the dot
+multiplication implementation. However, after reading this document
+and becoming familiar with the high-level logic flow, it should be
+easier for you to read the actual code and understand the details.
+
+## Inputs
+
+The kernel function takes a list of arguments for the current thread
+to perform its assigned work. The three most important arguments are
+the input pointers `q`, `k_cache`, and `v_cache`, which point
+to query, key, and value data on global memory that need to be read
+and processed. The output pointer `out` points to global memory
+where the result should be written. These four pointers actually
+refer to multi-dimensional arrays, but each thread only accesses the
+portion of data assigned to it. I have omitted all other runtime
+parameters here for simplicity.
+
+```cpp
+template
+__device__ void paged_attention_kernel(
+ ... // Other side args.
+ const scalar_t* __restrict__ out, // [num_seqs, num_heads, max_num_partitions, head_size]
+ const scalar_t* __restrict__ q, // [num_seqs, num_heads, head_size]
+ const scalar_t* __restrict__ k_cache, // [num_blocks, num_kv_heads, head_size/x, block_size, x]
+ const scalar_t* __restrict__ v_cache, // [num_blocks, num_kv_heads, head_size, block_size]
+ ... // Other side args.
+)
+```
+
+There are also a list of template arguments above the function
+signature that are determined during compilation time. `scalar_t`
+represents the data type of the query, key, and value data elements,
+such as FP16. `HEAD_SIZE` indicates the number of elements in each
+head. `BLOCK_SIZE` refers to the number of tokens in each block.
+`NUM_THREADS` denotes the number of threads in each thread block.
+`PARTITION_SIZE` represents the number of tensor parallel GPUs (For
+simplicity, we assume this is 0 and tensor parallel is disabled).
+
+With these arguments, we need to perform a sequence of preparations.
+This includes calculating the current head index, block index, and
+other necessary variables. However, for now, we can ignore these
+preparations and proceed directly to the actual calculations. It will
+be easier to understand them once we grasp the entire flow.
+
+## Concepts
+
+Just before we dive into the calculation flow, I want to describe a
+few concepts that are needed for later sections. However, you may
+skip this section and return later if you encounter any confusing
+terminologies.
+
+- **Sequence**: A sequence represents a client request. For example,
+ the data pointed to by `q` has a shape of
+ `[num_seqs, num_heads, head_size]`. That represents there are total
+ `num_seqs` of query sequence data are pointed by `q`. Since this
+ kernel is a single query attention kernel, each sequence only has one
+ query token. Hence, the `num_seqs` equals the total number of tokens
+ that are processed in the batch.
+- **Context**: The context consists of the generated tokens from the
+ sequence. For instance, `["What", "is", "your"]` are the context
+ tokens, and the input query token is `"name"`. The model might
+ generate the token `"?"`.
+- **Vec**: The vec is a list of elements that are fetched and
+ calculated together. For query and key data, the vec size
+ (`VEC_SIZE`) is determined so that each thread group can fetch and
+ calculate 16 bytes of data at a time. For value data, the vec size
+ (`V_VEC_SIZE`) is determined so that each thread can fetch and
+ calculate 16 bytes of data at a time. For example, if the
+ `scalar_t` is FP16 (2 bytes) and `THREAD_GROUP_SIZE` is 2, the
+ `VEC_SIZE` will be 4, while the `V_VEC_SIZE` will be 8.
+- **Thread group**: The thread group is a small group of
+ threads(`THREAD_GROUP_SIZE`) that fetches and calculates one
+ query token and one key token at a time. Each thread handles only a
+ portion of the token data. The total number of elements processed by
+ one thread group is referred as `x`. For example, if the thread
+ group contains 2 threads and the head size is 8, then thread 0
+ handles the query and key elements at index 0, 2, 4, 6, while thread
+ 1 handles the elements at index 1, 3, 5, 7.
+- **Block**: The key and value cache data in vLLM are split into
+ blocks. Each block stores data for a fixed number(`BLOCK_SIZE`)
+ of tokens at one head. Each block may contain only a portion of the
+ whole context tokens. For example, if the block size is 16 and the
+ head size is 128, then for one head, one block can store 16 * 128 =
+ 2048 elements.
+- **Warp**: A warp is a group of 32 threads(`WARP_SIZE`) that
+ execute simultaneously on a stream multiprocessor (SM). In this
+ kernel, each warp processes the calculation between one query token
+ and key tokens of one entire block at a time (it may process multiple
+ blocks in multiple iterations). For example, if there are 4 warps and
+ 6 blocks for one context, the assignment would be like warp 0 handles
+ the 0th, 4th blocks, warp 1 handles the 1st, 5th blocks, warp 2
+ handles the 2nd block and warp 3 handles the 3rd block.
+- **Thread block**: A thread block is a group of
+ threads(`NUM_THREADS`) that can access the same shared memory.
+ Each thread block contains multiple warps(`NUM_WARPS`), and in
+ this kernel, each thread block processes the calculation between one
+ query token and key tokens of a whole context.
+- **Grid**: A grid is a collection of thread blocks and defines the
+ shape of the collection. In this kernel, the shape is
+ `(num_heads, num_seqs, max_num_partitions)`. Therefore, each thread
+ block only handles the calculation for one head, one sequence, and
+ one partition.
+
+## Query
+
+This section will introduce how query data is stored in memory and
+fetched by each thread. As mentioned above, each thread group fetches
+one query token data, while each thread itself only handles a part of
+one query token data. Within each warp, every thread group will fetch
+the same query token data, but will multiply it with different key
+token data.
+
+```cpp
+const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
+```
+
+
+ { align="center" alt="query" width="70%" }
+
+
+Each thread defines its own `q_ptr` which points to the assigned
+query token data on global memory. For example, if `VEC_SIZE` is 4
+and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
+total of 128 elements divided into 128 / 4 = 32 vecs.
+
+
+ { align="center" alt="q_vecs" width="70%" }
+
+
+```cpp
+__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
+```
+
+Next, we need to read the global memory data pointed to by `q_ptr`
+into shared memory as `q_vecs`. It is important to note that each
+vecs is assigned to a different row. For example, if the
+`THREAD_GROUP_SIZE` is 2, thread 0 will handle the 0th row vecs,
+while thread 1 handles the 1st row vecs. By reading the query data in
+this way, neighboring threads like thread 0 and thread 1 can read
+neighbor memory, achieving the memory coalescing to improve
+performance.
+
+## Key
+
+Similar to the "Query" section, this section introduces memory layout
+and assignment for keys. While each thread group only handle one
+query token one kernel run, it may handle multiple key tokens across
+multiple iterations. Meanwhile, each warp will process multiple blocks
+of key tokens in multiple iterations, ensuring that all context
+tokens are processed by the entire thread group after the kernel run.
+In this context, "handle" refers to performing the dot multiplication
+between query data and key data.
+
+```cpp
+const scalar_t* k_ptr = k_cache + physical_block_number * kv_block_stride
+ + kv_head_idx * kv_head_stride
+ + physical_block_offset * x;
+```
+
+Unlike to `q_ptr`, `k_ptr` in each thread will point to different
+key token at different iterations. As shown above, that `k_ptr`
+points to key token data based on `k_cache` at assigned block,
+assigned head and assigned token.
+
+
+ { align="center" alt="key" width="70%" }
+
+
+The diagram above illustrates the memory layout for key data. It
+assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
+8, `THREAD_GROUP_SIZE` is 2, and there are a total of 4 warps. Each
+rectangle represents all the elements for one key token at one head,
+which will be processed by one thread group. The left half shows the
+total 16 blocks of key token data for warp 0, while the right half
+represents the remaining key token data for other warps or
+iterations. Inside each rectangle, there are a total 32 vecs (128
+elements for one token) that will be processed by 2 threads (one
+thread group) separately.
+
+
+ { align="center" alt="k_vecs" width="70%" }
+
+
+```cpp
+K_vec k_vecs[NUM_VECS_PER_THREAD]
+```
+
+Next, we need to read the key token data from `k_ptr` and store
+them on register memory as `k_vecs`. We use register memory for
+`k_vecs` because it will only be accessed by one thread once,
+whereas `q_vecs` will be accessed by multiple threads multiple
+times. Each `k_vecs` will contain multiple vectors for later
+calculation. Each vec will be set at each inner iteration. The
+assignment of vecs allows neighboring threads in a warp to read
+neighboring memory together, which again promotes the memory
+coalescing. For instance, thread 0 will read vec 0, while thread 1
+will read vec 1. In the next inner loop, thread 0 will read vec 2,
+while thread 1 will read vec 3, and so on.
+
+You may still be a little confused about the overall flow. Don't
+worry, please keep reading the next "QK" section. It will illustrate
+the query and key calculation flow in a clearer and higher-level
+manner.
+
+## QK
+
+As shown the pseudo code below, before the entire for loop block, we
+fetch the query data for one token and store it in `q_vecs`. Then,
+in the outer for loop, we iterate through different `k_ptrs` that
+point to different tokens and prepare the `k_vecs` in the inner for
+loop. Finally, we perform the dot multiplication between the
+`q_vecs` and each `k_vecs`.
+
+```cpp
+q_vecs = ...
+for ... {
+ k_ptr = ...
+ for ... {
+ k_vecs[i] = ...
+ }
+ ...
+ float qk = scale * Qk_dot::dot(q_vecs[thread_group_offset], k_vecs);
+}
+```
+
+As mentioned before, for each thread, it only fetches part of the
+query and key token data at a time. However, there will be a cross
+thread group reduction happen in the `Qk_dot<>::dot` . So `qk`
+returned here is not just between part of the query and key token dot
+multiplication, but actually a full result between entire query and
+key token data.
+
+For example, if the value of `HEAD_SIZE` is 128 and
+`THREAD_GROUP_SIZE` is 2, each thread's `k_vecs` will contain
+total 64 elements. However, the returned `qk` is actually the
+result of dot multiplication between 128 query elements and 128 key
+elements. If you want to learn more about the details of the dot
+multiplication and reduction, you may refer to the implementation of
+`Qk_dot<>::dot`. However, for the sake of simplicity, I will not
+cover it in this document.
+
+## Softmax
+
+Next, we need to calculate the normalized softmax for all `qk`s,
+as shown above, where each $x$ represents a `qk`. To do this,
+we must obtain the reduced value of `qk_max`($m(x)$) and
+the `exp_sum`($\ell(x)$) of all `qk`s. The reduction
+should be performed across the entire thread block, encompassing
+results between the query token and all context key tokens.
+
+$$
+\begin{gather*}
+m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
+\quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
+\end{gather*}
+$$
+
+### `qk_max` and `logits`
+
+Just right after we get the `qk` result, we can set the temporary
+`logits` result with `qk` (In the end, the `logits` should
+store the normalized softmax result). Also we can compare and collect
+the `qk_max` for all `qk`s that are calculated by current
+thread group.
+
+```cpp
+if (thread_group_offset == 0) {
+ const bool mask = token_idx >= context_len;
+ logits[token_idx - start_token_idx] = mask ? 0.f : qk;
+ qk_max = mask ? qk_max : fmaxf(qk_max, qk);
+}
+```
+
+Please note that the `logits` here is on shared memory, so each
+thread group will set the fields for its own assigned context tokens.
+Overall, the size of logits should be number of context tokens.
+
+```cpp
+for (int mask = WARP_SIZE / 2; mask >= THREAD_GROUP_SIZE; mask /= 2) {
+ qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
+}
+
+if (lane == 0) {
+ red_smem[warp_idx] = qk_max;
+}
+```
+
+Then we need to get the reduced `qk_max` across each warp. The main
+idea is to make threads in warp to communicate with each other and
+get the final max `qk` .
+
+```cpp
+for (int mask = NUM_WARPS / 2; mask >= 1; mask /= 2) {
+ qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
+}
+qk_max = VLLM_SHFL_SYNC(qk_max, 0);
+```
+
+Finally, we can get the reduced `qk_max` from whole thread block by
+compare the `qk_max` from all warps in this thread block. Then we
+need to broadcast the final result to each thread.
+
+### `exp_sum`
+
+Similar to `qk_max`, we need to get the reduced sum value from the
+entire thread block too.
+
+```cpp
+for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
+ float val = __expf(logits[i] - qk_max);
+ logits[i] = val;
+ exp_sum += val;
+}
+...
+exp_sum = block_sum(&red_smem[NUM_WARPS], exp_sum);
+```
+
+Firstly, sum all exp values from each thread group, and meanwhile,
+convert each entry of `logits` from `qk` to `exp(qk - qk_max)`.
+Please note, the `qk_max` here is already the max `qk` across the
+whole thread block. And then we can do reduction for `exp_sum`
+across whole thread block just like the `qk_max`.
+
+```cpp
+const float inv_sum = __fdividef(1.f, exp_sum + 1e-6f);
+for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
+ logits[i] *= inv_sum;
+}
+```
+
+Finally, with the reduced `qk_max` and `exp_sum`, we can obtain
+the final normalized softmax result as `logits`. This `logits`
+variable will be used for dot multiplication with the value data in
+later steps. Now, it should store the normalized softmax result of
+`qk` for all assigned context tokens.
+
+## Value
+
+
+ { align="center" alt="value" width="70%" }
+
+
+
+ { align="center" alt="logits_vec" width="50%" }
+
+
+
+ { align="center" alt="v_vec" width="70%" }
+
+
+Now we need to retrieve the value data and perform dot multiplication
+with `logits`. Unlike query and key, there is no thread group
+concept for value data. As shown in diagram, different from key token
+memory layout, elements from the same column correspond to the same
+value token. For one block of value data, there are `HEAD_SIZE` of
+rows and `BLOCK_SIZE` of columns that are split into multiple
+`v_vecs`.
+
+Each thread always fetches `V_VEC_SIZE` elements from the same
+`V_VEC_SIZE` of tokens at a time. As a result, a single thread
+retrieves multiple `v_vec`s from different rows and the same
+columns through multiple inner iterations. For each `v_vec`, it
+needs to be dot multiplied with the corresponding `logits_vec`,
+which is also `V_VEC_SIZE` elements from `logits`. Overall, with
+multiple inner iterations, each warp will process one block of value
+tokens. And with multiple outer iterations, the whole context value
+tokens are processed
+
+```cpp
+float accs[NUM_ROWS_PER_THREAD];
+for ... { // Iteration over different blocks.
+ logits_vec = ...
+ for ... { // Iteration over different rows.
+ v_vec = ...
+ ...
+ accs[i] += dot(logits_vec, v_vec);
+ }
+}
+```
+
+As shown in the above pseudo code, in the outer loop, similar to
+`k_ptr`, `logits_vec` iterates over different blocks and reads
+`V_VEC_SIZE` elements from `logits`. In the inner loop, each
+thread reads `V_VEC_SIZE` elements from the same tokens as a
+`v_vec` and performs dot multiplication. It is important to note
+that in each inner iteration, the thread fetches different head
+position elements for the same tokens. The dot result is then
+accumulated in `accs`. Therefore, each entry of `accs` is mapped
+to a head position assigned to the current thread.
+
+For example, if `BLOCK_SIZE` is 16 and `V_VEC_SIZE` is 8, each
+thread fetches 8 value elements for 8 tokens at a time. Each element
+is from different tokens at the same head position. If `HEAD_SIZE`
+is 128 and `WARP_SIZE` is 32, for each inner loop, a warp needs to
+fetch `WARP_SIZE * V_VEC_SIZE = 256` elements. This means there are
+a total of 128 * 16 / 256 = 8 inner iterations for a warp to handle
+a whole block of value tokens. And each `accs` in each thread
+contains 8 elements that accumulated at 8 different head positions.
+For the thread 0, the `accs` variable will have 8 elements, which
+are 0th, 32th โฆ 224th elements of a value head that are accumulated
+from all assigned 8 tokens.
+
+## LV
+
+Now, we need to perform reduction for `accs` within each warp. This
+process allows each thread to accumulate the `accs` for the
+assigned head positions of all tokens in one block.
+
+```cpp
+for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
+ float acc = accs[i];
+ for (int mask = NUM_V_VECS_PER_ROW / 2; mask >= 1; mask /= 2) {
+ acc += VLLM_SHFL_XOR_SYNC(acc, mask);
+ }
+ accs[i] = acc;
+}
+```
+
+Next, we perform reduction for `accs` across all warps, allowing
+each thread to have the accumulation of `accs` for the assigned
+head positions of all context tokens. Please note that each `accs`
+in every thread only stores the accumulation for a portion of
+elements of the entire head for all context tokens. However, overall,
+all results for output have been calculated but are just stored in
+different thread register memory.
+
+```cpp
+float* out_smem = reinterpret_cast(shared_mem);
+for (int i = NUM_WARPS; i > 1; i /= 2) {
+ // Upper warps write to shared memory.
+ ...
+ float* dst = &out_smem[(warp_idx - mid) * HEAD_SIZE];
+ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
+ ...
+ dst[row_idx] = accs[i];
+ }
+
+ // Lower warps update the output.
+ const float* src = &out_smem[warp_idx * HEAD_SIZE];
+ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
+ ...
+ accs[i] += src[row_idx];
+ }
+
+ // Write out the accs.
+}
+```
+
+## Output
+
+Now we can write all of calculated result from local register memory
+to final output global memory.
+
+```cpp
+scalar_t* out_ptr = out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
+ + head_idx * max_num_partitions * HEAD_SIZE
+ + partition_idx * HEAD_SIZE;
+```
+
+First, we need to define the `out_ptr` variable, which points to
+the start address of the assigned sequence and assigned head.
+
+```cpp
+for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
+ const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
+ if (row_idx < HEAD_SIZE && lane % NUM_V_VECS_PER_ROW == 0) {
+ from_float(*(out_ptr + row_idx), accs[i]);
+ }
+}
+```
+
+Finally, we need to iterate over different assigned head positions
+and write out the corresponding accumulated result based on the
+`out_ptr`.
diff --git a/docs/source/design/mm_processing.md b/docs/design/mm_processing.md
similarity index 61%
rename from docs/source/design/mm_processing.md
rename to docs/design/mm_processing.md
index dc92a3c2c511..f3685ce76a4b 100644
--- a/docs/source/design/mm_processing.md
+++ b/docs/design/mm_processing.md
@@ -1,10 +1,11 @@
-(mm-processing)=
+---
+title: Multi-Modal Data Processing
+---
+[](){ #mm-processing }
-# Multi-Modal Data Processing
+To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching][automatic-prefix-caching], we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. ``) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
-To enable various optimizations in vLLM such as [chunked prefill](#chunked-prefill) and [prefix caching](#automatic-prefix-caching), we use {class}`~vllm.multimodal.processing.BaseMultiModalProcessor` to provide the correspondence between placeholder feature tokens (e.g. ``) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
-
-Here are the main features of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`:
+Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:
## Prompt Update Detection
@@ -15,7 +16,7 @@ One of the main responsibilities of HF processor is to update the prompt with pl
The information about which tokens have been updated is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.
-In vLLM, this information is specified using {class}`~vllm.multimodal.processing.PromptUpdate` in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
+In vLLM, this information is specified using [PromptUpdate][vllm.multimodal.processing.PromptUpdate] in [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates]. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
## Tokenized Prompt Inputs
@@ -43,22 +44,22 @@ While HF processors support text + multi-modal inputs natively, this is not so f
Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 by ourselves to keep the output tokens and multi-modal data consistent with each other.
-(mm-dummy-text)=
+[](){ #mm-dummy-text }
### Dummy text
-We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text`. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
+We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text]. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
-(mm-automatic-prompt-updating)=
+[](){ #mm-automatic-prompt-updating }
### Automatic prompt updating
We address the second issue by implementing model-agnostic code in
-{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates` to automatically update the prompt with feature placeholder tokens based on the specification outputted by {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`.
+[_apply_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates] to automatically update the prompt with feature placeholder tokens based on the specification outputted by [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
### Summary
-With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main`.
+With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in [_apply_hf_processor_main][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main].
## Processor Output Caching
@@ -66,4 +67,4 @@ Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238)
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
-Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#mm-dummy-text) to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating](#mm-automatic-prompt-updating) afterwards to keep the output tokens and multi-modal data consistent with each other.
+Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text][mm-dummy-text] to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating][mm-automatic-prompt-updating] afterwards to keep the output tokens and multi-modal data consistent with each other.
diff --git a/docs/source/design/multiprocessing.md b/docs/design/multiprocessing.md
similarity index 96%
rename from docs/source/design/multiprocessing.md
rename to docs/design/multiprocessing.md
index 43fe5fe2e5e9..412c42fd580e 100644
--- a/docs/source/design/multiprocessing.md
+++ b/docs/design/multiprocessing.md
@@ -2,14 +2,13 @@
## Debugging
-Please see the [Troubleshooting](#troubleshooting-python-multiprocessing)
+Please see the [Troubleshooting][troubleshooting-python-multiprocessing]
page for information on known issues and how to solve them.
## Introduction
-:::{important}
-The source code references are to the state of the code at the time of writing in December, 2024.
-:::
+!!! warning
+ The source code references are to the state of the code at the time of writing in December, 2024.
The use of Python multiprocessing in vLLM is complicated by:
@@ -124,7 +123,7 @@ what is happening. First, a log message from vLLM:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
- https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
+ https://docs.vllm.ai/en/latest/usage/debugging.html#python-multiprocessing
for more information.
```
diff --git a/docs/source/design/plugin_system.md b/docs/design/plugin_system.md
similarity index 83%
rename from docs/source/design/plugin_system.md
rename to docs/design/plugin_system.md
index 225030885f62..0764dfb6501b 100644
--- a/docs/source/design/plugin_system.md
+++ b/docs/design/plugin_system.md
@@ -1,12 +1,13 @@
-(plugin-system)=
-
-# vLLM's Plugin System
+---
+title: vLLM's Plugin System
+---
+[](){ #plugin-system }
The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
## How Plugins Work in vLLM
-Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [](#arch-overview)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
+Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview][arch-overview]), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
## How vLLM Discovers Plugins
@@ -29,8 +30,10 @@ def register():
from vllm import ModelRegistry
if "MyLlava" not in ModelRegistry.get_supported_archs():
- ModelRegistry.register_model("MyLlava",
- "vllm_add_dummy_model.my_llava:MyLlava")
+ ModelRegistry.register_model(
+ "MyLlava",
+ "vllm_add_dummy_model.my_llava:MyLlava",
+ )
```
For more information on adding entry points to your package, please check the [official documentation](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
diff --git a/docs/source/design/v1/metrics.md b/docs/design/v1/metrics.md
similarity index 97%
rename from docs/source/design/v1/metrics.md
rename to docs/design/v1/metrics.md
index de8022655372..7156ee9dd3ec 100644
--- a/docs/source/design/v1/metrics.md
+++ b/docs/design/v1/metrics.md
@@ -57,11 +57,11 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
-These are documented under [Inferencing and Serving -> Production Metrics](project:../../serving/metrics.md).
+These are documented under [Inferencing and Serving -> Production Metrics](../../usage/metrics.md).
### Grafana Dashboard
-vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/getting_started/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
+vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
@@ -222,9 +222,7 @@ And the calculated intervals are:
Put another way:
-:::{image} /assets/design/v1/metrics/intervals-1.png
-:alt: Interval calculations - common case
-:::
+
We explored the possibility of having the frontend calculate these
intervals using the timing of events visible by the frontend. However,
@@ -239,17 +237,13 @@ When a preemption occurs during decode, since any already generated
tokens are reused, we consider the preemption as affecting the
inter-token, decode, and inference intervals.
-:::{image} /assets/design/v1/metrics/intervals-2.png
-:alt: Interval calculations - preempted decode
-:::
+
When a preemption occurs during prefill (assuming such an event
is possible), we consider the preemption as affecting the
time-to-first-token and prefill intervals.
-:::{image} /assets/design/v1/metrics/intervals-3.png
-:alt: Interval calculations - preempted prefill
-:::
+
### Frontend Stats Collection
@@ -467,7 +461,7 @@ In general:
hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them.
-See the [deprecation policy](project:../../contributing/deprecation_policy.md) for
+See the [deprecation policy](../../contributing/deprecation_policy.md) for
the project-wide deprecation policy.
### Unimplemented - `vllm:tokens_total`
@@ -679,7 +673,7 @@ v0 has support for OpenTelemetry tracing:
- [OpenTelemetry blog
post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing
- docs](https://docs.vllm.ai/en/latest/getting_started/examples/opentelemetry.html)
+ docs](https://docs.vllm.ai/en/latest/examples/opentelemetry.html)
- [Blog
post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product
diff --git a/docs/source/design/v1/prefix_caching.md b/docs/design/v1/prefix_caching.md
similarity index 94%
rename from docs/source/design/v1/prefix_caching.md
rename to docs/design/v1/prefix_caching.md
index 0f7475777797..ad041b0059f5 100644
--- a/docs/source/design/v1/prefix_caching.md
+++ b/docs/design/v1/prefix_caching.md
@@ -122,9 +122,7 @@ There are two design points to highlight:
As a result, we will have the following components when the KV cache manager is initialized:
-:::{image} /assets/design/v1/prefix_caching/overview.png
-:alt: Component Overview
-:::
+
* Block Pool: A list of KVCacheBlock.
* Free Block Queue: Only store the pointers of head and tail blocks for manipulations.
@@ -194,9 +192,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund
When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and block 2, 3, 4, 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in the *reverse* order. This is because the last block of a request must hash more tokens and is less likely to be reused by other requests. As a result, it should be evicted first.
-:::{image} /assets/design/v1/prefix_caching/free.png
-:alt: Free Queue after Free a Request
-:::
+
### Eviction (LRU)
@@ -212,36 +208,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),
**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens.
-:::{image} /assets/design/v1/prefix_caching/example-time-1.png
-:alt: Example Time 1
-:::
+
**Time 3: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
-:::{image} /assets/design/v1/prefix_caching/example-time-3.png
-:alt: Example Time 3
-:::
+
**Time 4: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.
-:::{image} /assets/design/v1/prefix_caching/example-time-4.png
-:alt: Example Time 4
-:::
+
**Time 5: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but block 2 and 3 are still cached). Block 0 and 1 are not added to the free queue because they are being used by Request 1.
-:::{image} /assets/design/v1/prefix_caching/example-time-5.png
-:alt: Example Time 5
-:::
+
**Time 6: Request 1 is finished and free.**
-:::{image} /assets/design/v1/prefix_caching/example-time-6.png
-:alt: Example Time 6
-:::
+
**Time 7: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0\.** Note that even the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).
-:::{image} /assets/design/v1/prefix_caching/example-time-7.png
-:alt: Example Time 7
-:::
+
diff --git a/docs/source/design/v1/torch_compile.md b/docs/design/v1/torch_compile.md
similarity index 98%
rename from docs/source/design/v1/torch_compile.md
rename to docs/design/v1/torch_compile.md
index 4d8ce0fd9227..64b6f0cc0a9b 100644
--- a/docs/source/design/v1/torch_compile.md
+++ b/docs/design/v1/torch_compile.md
@@ -99,7 +99,9 @@ This time, Inductor compilation is completely bypassed, and we will load from di
The above example just uses Inductor to compile for a general shape (i.e. symbolic shape). We can also use Inductor to compile for some of the specific shapes, for example:
-`vllm serve meta-llama/Llama-3.2-1B --compilation_config "{'compile_sizes': [1, 2, 4, 8]}"`
+```
+vllm serve meta-llama/Llama-3.2-1B --compilation_config '{"compile_sizes": [1, 2, 4, 8]}'
+```
Then it will also compile a specific kernel just for batch size `1, 2, 4, 8`. At this time, all of the shapes in the computation graph are static and known, and we will turn on auto-tuning to tune for max performance. This can be slow when you run it for the first time, but the next time you run it, we can directly bypass the tuning and run the tuned kernel.
@@ -134,12 +136,14 @@ The cudagraphs are captured and managed by the compiler backend, and replayed wh
By default, vLLM will try to determine a set of sizes to capture cudagraph. You can also override it using the config `cudagraph_capture_sizes`:
-`vllm serve meta-llama/Llama-3.2-1B --compilation-config "{'cudagraph_capture_sizes': [1, 2, 4, 8]}"`
+```
+vllm serve meta-llama/Llama-3.2-1B --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
+```
Then it will only capture cudagraph for the specified sizes. It can be useful to have fine-grained control over the cudagraph capture.
### Full Cudagraph capture
-It is possible to include attention as part of the cudagraph if using an attention backend that is cudagraph compatible. This can improve performance in some cases such as decode speed for smaller models. Enable this using `--compilation-config "{'full_cuda_graph': True}"`
+It is possible to include attention as part of the cudagraph if using an attention backend that is cudagraph compatible. This can improve performance in some cases such as decode speed for smaller models. Enable this using `--compilation-config '{"full_cuda_graph": true}'`.
Currently only FlashAttention 3 is compatible, and only when cascade attention is disabled.
diff --git a/docs/features/automatic_prefix_caching.md b/docs/features/automatic_prefix_caching.md
new file mode 100644
index 000000000000..5e92796ddda7
--- /dev/null
+++ b/docs/features/automatic_prefix_caching.md
@@ -0,0 +1,28 @@
+---
+title: Automatic Prefix Caching
+---
+[](){ #automatic-prefix-caching }
+
+## Introduction
+
+Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
+
+!!! note
+ Technical details on how vLLM implements APC can be found [here][design-automatic-prefix-caching].
+
+## Enabling APC in vLLM
+
+Set `enable_prefix_caching=True` in vLLM engine to enable APC. Here is an example:
+
+
+
+## Example workloads
+
+We describe two example workloads, where APC can provide huge performance benefit:
+
+- Long document query, where the user repeatedly queries the same long document (e.g. software manual or annual report) with different queries. In this case, instead of processing the long document again and again, APC allows vLLM to process this long document *only once*, and all future requests can avoid recomputing this long document by reusing its KV cache. This allows vLLM to serve future requests with much higher throughput and much lower latency.
+- Multi-round conversation, where the user may chat with the application multiple times in the same chatting session. In this case, instead of processing the whole chatting history again and again, APC allows vLLM to reuse the processing results of the chat history across all future rounds of conversation, allowing vLLM to serve future requests with much higher throughput and much lower latency.
+
+## Limits
+
+APC in general does not reduce the performance of vLLM. With that being said, APC only reduces the time of processing the queries (the prefilling phase) and does not reduce the time of generating new tokens (the decoding phase). So APC does not bring performance gain when vLLM spends most of the time generating answers to the queries (e.g. when the length of the answer is long), or new queries do not share the same prefix with any of existing queries (so that the computation cannot be reused).
diff --git a/docs/features/compatibility_matrix.md b/docs/features/compatibility_matrix.md
new file mode 100644
index 000000000000..77ceea49f173
--- /dev/null
+++ b/docs/features/compatibility_matrix.md
@@ -0,0 +1,77 @@
+---
+title: Compatibility Matrix
+---
+[](){ #compatibility-matrix }
+
+The tables below show mutually exclusive features and the support on some hardware.
+
+The symbols used have the following meanings:
+
+- โ = Full compatibility
+- ๐ = Partial compatibility
+- โ = No compatibility
+
+!!! note
+ Check the โ or ๐ with links to see tracking issue for unsupported feature/hardware combination.
+
+## Feature x Feature
+
+
+
+| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | prmpt adptr | [SD][spec-decode] | CUDA graph | pooling | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search |
+|-----------------------------------------------------------|-------------------------|-----------------------------------|------------------------|---------------------------------------------------|---------------------|--------------|-----------------------------------------------|-------------------------------------------------------|--------------------------------------|---------------------------------------------------|-------------------------------------------------------------|--------------------|---------------------------------------------|-----------|---------------|
+| [CP][chunked-prefill] | โ | | | | | | | | | | | | | | |
+| [APC][automatic-prefix-caching] | โ | โ | | | | | | | | | | | | | |
+| [LoRA][lora-adapter] | โ | โ | โ | | | | | | | | | | | | |
+| prmpt adptr | โ | โ | โ | โ | | | | | | | | | | | |
+| [SD][spec-decode] | โ | โ | โ | โ | โ | | | | | | | | | | |
+| CUDA graph | โ | โ | โ | โ | โ | โ | | | | | | | | | |
+| pooling | โ | โ | โ | โ | โ | โ | โ | | | | | | | | |
+| enc-dec | โ | [โ](gh-issue:7366) | โ | โ | [โ](gh-issue:7366) | โ | โ | โ | | | | | | | |
+| logP | โ | โ | โ | โ | โ | โ | โ | โ | โ | | | | | | |
+| prmpt logP | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | | | | | |
+| async output | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | | | | |
+| multi-step | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | | | |
+| mm | โ | [๐ ](gh-pr:8348) | [๐ ](gh-pr:4194) | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | | |
+| best-of | โ | โ | โ | โ | [โ](gh-issue:6137) | โ | โ | โ | โ | โ | โ | [โ](gh-issue:7968) | โ | โ | |
+| beam-search | โ | โ | โ | โ | [โ](gh-issue:6137) | โ | โ | โ | โ | โ | โ | [โ](gh-issue:7968) | โ | โ | โ |
+
+[](){ #feature-x-hardware }
+
+## Feature x Hardware
+
+| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD |
+|-----------------------------------------------------------|--------------------|----------|----------|-------|----------|--------------------|-------|
+| [CP][chunked-prefill] | [โ](gh-issue:2729) | โ | โ | โ | โ | โ | โ |
+| [APC][automatic-prefix-caching] | [โ](gh-issue:3687) | โ | โ | โ | โ | โ | โ |
+| [LoRA][lora-adapter] | โ | โ | โ | โ | โ | โ | โ |
+| prmpt adptr | โ | โ | โ | โ | โ | [โ](gh-issue:8475) | โ |
+| [SD][spec-decode] | โ | โ | โ | โ | โ | โ | โ |
+| CUDA graph | โ | โ | โ | โ | โ | โ | โ |
+| pooling | โ | โ | โ | โ | โ | โ | โ |
+| enc-dec | โ | โ | โ | โ | โ | โ | โ |
+| mm | โ | โ | โ | โ | โ | โ | โ |
+| logP | โ | โ | โ | โ | โ | โ | โ |
+| prmpt logP | โ | โ | โ | โ | โ | โ | โ |
+| async output | โ | โ | โ | โ | โ | โ | โ |
+| multi-step | โ | โ | โ | โ | โ | [โ](gh-issue:8477) | โ |
+| best-of | โ | โ | โ | โ | โ | โ | โ |
+| beam-search | โ | โ | โ | โ | โ | โ | โ |
diff --git a/docs/source/features/disagg_prefill.md b/docs/features/disagg_prefill.md
similarity index 87%
rename from docs/source/features/disagg_prefill.md
rename to docs/features/disagg_prefill.md
index 2fa20140c086..54be05647d94 100644
--- a/docs/source/features/disagg_prefill.md
+++ b/docs/features/disagg_prefill.md
@@ -1,12 +1,12 @@
-(disagg-prefill)=
-
-# Disaggregated Prefilling (experimental)
+---
+title: Disaggregated Prefilling (experimental)
+---
+[](){ #disagg-prefill }
This page introduces you the disaggregated prefilling feature in vLLM.
-:::{note}
-This feature is experimental and subject to change.
-:::
+!!! note
+ This feature is experimental and subject to change.
## Why disaggregated prefilling?
@@ -15,9 +15,8 @@ Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling put prefill and decode phase of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size also can achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.
-:::{note}
-Disaggregated prefill DOES NOT improve throughput.
-:::
+!!! note
+ Disaggregated prefill DOES NOT improve throughput.
## Usage example
@@ -39,21 +38,16 @@ Key abstractions for disaggregated prefilling:
- **LookupBuffer**: LookupBuffer provides two API: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drop it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
-:::{note}
-`insert` is non-blocking operation but `drop_select` is blocking operation.
-:::
+!!! note
+ `insert` is non-blocking operation but `drop_select` is blocking operation.
Here is a figure illustrating how the above 3 abstractions are organized:
-:::{image} /assets/features/disagg_prefill/abstraction.jpg
-:alt: Disaggregated prefilling abstractions
-:::
+
The workflow of disaggregated prefilling is as follows:
-:::{image} /assets/features/disagg_prefill/overview.jpg
-:alt: Disaggregated prefilling workflow
-:::
+
The `buffer` corresponds to `insert` API in LookupBuffer, and the `drop_select` corresponds to `drop_select` API in LookupBuffer.
diff --git a/docs/source/features/lora.md b/docs/features/lora.md
similarity index 96%
rename from docs/source/features/lora.md
rename to docs/features/lora.md
index 5a3ce0c01f3f..642462f7c455 100644
--- a/docs/source/features/lora.md
+++ b/docs/features/lora.md
@@ -1,10 +1,11 @@
-(lora-adapter)=
-
-# LoRA Adapters
+---
+title: LoRA Adapters
+---
+[](){ #lora-adapter }
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
-LoRA adapters can be used with any vLLM model that implements {class}`~vllm.model_executor.models.interfaces.SupportsLoRA`.
+LoRA adapters can be used with any vLLM model that implements [SupportsLoRA][vllm.model_executor.models.interfaces.SupportsLoRA].
Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save
them locally with
@@ -60,9 +61,8 @@ vllm serve meta-llama/Llama-2-7b-hf \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```
-:::{note}
-The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
-:::
+!!! note
+ The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
diff --git a/docs/source/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md
similarity index 84%
rename from docs/source/features/multimodal_inputs.md
rename to docs/features/multimodal_inputs.md
index bb2997f008ed..19b668172902 100644
--- a/docs/source/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -1,20 +1,20 @@
-(multimodal-inputs)=
+---
+title: Multimodal Inputs
+---
+[](){ #multimodal-inputs }
-# Multimodal Inputs
+This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
-This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
-
-:::{note}
-We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
-and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
-:::
+!!! note
+ We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
+ and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
## Offline Inference
-To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`:
+To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
-- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`.
+- `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
### Image Inputs
@@ -211,16 +211,15 @@ for o in outputs:
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
-:::{important}
-A chat template is **required** to use Chat Completions API.
-For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
+!!! warning
+ A chat template is **required** to use Chat Completions API.
+ For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
-If no default chat template is available, we will first look for a built-in fallback in .
-If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
+ If no default chat template is available, we will first look for a built-in fallback in .
+ If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
-For certain models, we provide alternative chat templates inside .
-For example, VLM2Vec uses which is different from the default one for Phi-3-Vision.
-:::
+ For certain models, we provide alternative chat templates inside .
+ For example, VLM2Vec uses which is different from the default one for Phi-3-Vision.
### Image Inputs
@@ -284,25 +283,21 @@ print("Chat completion output:", chat_response.choices[0].message.content)
Full example:
-:::{tip}
-Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
-and pass the file path as `url` in the API request.
-:::
-
-:::{tip}
-There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
-In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
-:::
+!!! tip
+ Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
+ and pass the file path as `url` in the API request.
-:::{note}
-By default, the timeout for fetching images through HTTP URL is `5` seconds.
-You can override this by setting the environment variable:
+!!! tip
+ There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
+ In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
-```console
-export VLLM_IMAGE_FETCH_TIMEOUT=
-```
+!!! note
+ By default, the timeout for fetching images through HTTP URL is `5` seconds.
+ You can override this by setting the environment variable:
-:::
+ ```console
+ export VLLM_IMAGE_FETCH_TIMEOUT=
+ ```
### Video Inputs
@@ -357,15 +352,13 @@ print("Chat completion output from image url:", result)
Full example:
-:::{note}
-By default, the timeout for fetching videos through HTTP URL is `30` seconds.
-You can override this by setting the environment variable:
+!!! note
+ By default, the timeout for fetching videos through HTTP URL is `30` seconds.
+ You can override this by setting the environment variable:
-```console
-export VLLM_VIDEO_FETCH_TIMEOUT=
-```
-
-:::
+ ```console
+ export VLLM_VIDEO_FETCH_TIMEOUT=
+ ```
### Audio Inputs
@@ -461,15 +454,13 @@ print("Chat completion output from audio url:", result)
Full example:
-:::{note}
-By default, the timeout for fetching audios through HTTP URL is `10` seconds.
-You can override this by setting the environment variable:
-
-```console
-export VLLM_AUDIO_FETCH_TIMEOUT=
-```
+!!! note
+ By default, the timeout for fetching audios through HTTP URL is `10` seconds.
+ You can override this by setting the environment variable:
-:::
+ ```console
+ export VLLM_AUDIO_FETCH_TIMEOUT=
+ ```
### Embedding Inputs
@@ -535,7 +526,6 @@ chat_completion = client.chat.completions.create(
)
```
-:::{note}
-Only one message can contain `{"type": "image_embeds"}`.
-If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
-:::
+!!! note
+ Only one message can contain `{"type": "image_embeds"}`.
+ If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
diff --git a/docs/features/prompt_embeds.md b/docs/features/prompt_embeds.md
new file mode 100644
index 000000000000..6f5616e05d8c
--- /dev/null
+++ b/docs/features/prompt_embeds.md
@@ -0,0 +1,43 @@
+# Prompt Embedding Inputs
+
+This page teaches you how to pass prompt embedding inputs to vLLM.
+
+## What are prompt embeddings?
+
+The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.
+
+!!! note
+ Prompt embeddings are currently only supported in the v0 engine.
+
+## Offline Inference
+
+To input multi-modal data, follow this schema in [vllm.inputs.EmbedsPrompt][]:
+
+- `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape (sequence_length, hidden_size), where sequence length is the number of tokens embeddings and hidden_size is the hidden size (embedding size) of the model.
+
+### Hugging Face Transformers Inputs
+
+You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:
+
+
+
+## Online Serving
+
+Our OpenAI-compatible server accepts prompt embeddings inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embeddings inputs are added via a new `'prompt_embeds'` key in the JSON package.
+
+When a mixture of `'prompt_embeds'` and `'prompt'` inputs are provided in a single request, the prompt embeds are always returned first.
+
+Prompt embeddings are passed in as base64 encoded torch tensors.
+
+### Transformers Inputs via OpenAI Client
+
+First, launch the OpenAI-compatible server:
+
+```bash
+vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
+ --max-model-len 4096 --enable-prompt-embeds
+```
+
+Then, you can use the OpenAI client as follows:
+
+
diff --git a/docs/features/quantization/README.md b/docs/features/quantization/README.md
new file mode 100644
index 000000000000..71f62065f63d
--- /dev/null
+++ b/docs/features/quantization/README.md
@@ -0,0 +1,22 @@
+---
+title: Quantization
+---
+[](){ #quantization-index }
+
+Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
+
+Contents:
+
+- [Supported_Hardware](supported_hardware.md)
+- [Auto_Awq](auto_awq.md)
+- [Bnb](bnb.md)
+- [Bitblas](bitblas.md)
+- [Gguf](gguf.md)
+- [Gptqmodel](gptqmodel.md)
+- [Int4](int4.md)
+- [Int8](int8.md)
+- [Fp8](fp8.md)
+- [Modelopt](modelopt.md)
+- [Quark](quark.md)
+- [Quantized_Kvcache](quantized_kvcache.md)
+- [Torchao](torchao.md)
diff --git a/docs/source/features/quantization/auto_awq.md b/docs/features/quantization/auto_awq.md
similarity index 93%
rename from docs/source/features/quantization/auto_awq.md
rename to docs/features/quantization/auto_awq.md
index b4ac597f5a79..4366a080f52c 100644
--- a/docs/source/features/quantization/auto_awq.md
+++ b/docs/features/quantization/auto_awq.md
@@ -1,6 +1,7 @@
-(auto-awq)=
-
-# AutoAWQ
+---
+title: AutoAWQ
+---
+[](){ #auto-awq }
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
@@ -41,7 +42,9 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console
-python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
+python examples/offline_inference/llm_engine_example.py \
+ --model TheBloke/Llama-2-7b-Chat-AWQ \
+ --quantization awq
```
AWQ models are also supported directly through the LLM entrypoint:
diff --git a/docs/source/features/quantization/bitblas.md b/docs/features/quantization/bitblas.md
similarity index 62%
rename from docs/source/features/quantization/bitblas.md
rename to docs/features/quantization/bitblas.md
index d0b2bf858c9b..9001725d9c02 100644
--- a/docs/source/features/quantization/bitblas.md
+++ b/docs/features/quantization/bitblas.md
@@ -1,14 +1,14 @@
-(bitblas)=
-
-# BitBLAS
+---
+title: BitBLAS
+---
+[](){ #bitblas }
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
-:::{note}
-Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
-Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
-For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
-:::
+!!! note
+ Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
+ Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
+ For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
Below are the steps to utilize BitBLAS with vLLM.
@@ -33,7 +33,12 @@ import torch
# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
-llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, quantization="bitblas")
+llm = LLM(
+ model=model_id,
+ dtype=torch.bfloat16,
+ trust_remote_code=True,
+ quantization="bitblas"
+)
```
## Read gptq format checkpoint
@@ -44,5 +49,11 @@ import torch
# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1"
-llm = LLM(model=model_id, dtype=torch.float16, trust_remote_code=True, quantization="bitblas", max_model_len=1024)
+llm = LLM(
+ model=model_id,
+ dtype=torch.float16,
+ trust_remote_code=True,
+ quantization="bitblas",
+ max_model_len=1024
+)
```
diff --git a/docs/source/features/quantization/bnb.md b/docs/features/quantization/bnb.md
similarity index 79%
rename from docs/source/features/quantization/bnb.md
rename to docs/features/quantization/bnb.md
index 1843a33a3dfd..a8dc2476f30a 100644
--- a/docs/source/features/quantization/bnb.md
+++ b/docs/features/quantization/bnb.md
@@ -1,6 +1,7 @@
-(bits-and-bytes)=
-
-# BitsAndBytes
+---
+title: BitsAndBytes
+---
+[](){ #bits-and-bytes }
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
@@ -14,7 +15,7 @@ pip install bitsandbytes>=0.45.3
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
-You can find bitsandbytes quantized models on .
+You can find bitsandbytes quantized models on [Hugging Face](https://huggingface.co/models?search=bitsandbytes).
And usually, these repositories have a config.json file that includes a quantization_config section.
## Read quantized checkpoint
@@ -26,7 +27,11 @@ from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
-llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True)
+llm = LLM(
+ model=model_id,
+ dtype=torch.bfloat16,
+ trust_remote_code=True
+)
```
## Inflight quantization: load as 4bit quantization
@@ -37,8 +42,12 @@ For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify
from vllm import LLM
import torch
model_id = "huggyllama/llama-7b"
-llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
-quantization="bitsandbytes")
+llm = LLM(
+ model=model_id,
+ dtype=torch.bfloat16,
+ trust_remote_code=True,
+ quantization="bitsandbytes"
+)
```
## OpenAI Compatible Server
diff --git a/docs/source/features/quantization/fp8.md b/docs/features/quantization/fp8.md
similarity index 88%
rename from docs/source/features/quantization/fp8.md
rename to docs/features/quantization/fp8.md
index cb304d54726c..01d5d9da046d 100644
--- a/docs/source/features/quantization/fp8.md
+++ b/docs/features/quantization/fp8.md
@@ -1,6 +1,7 @@
-(fp8)=
-
-# FP8 W8A8
+---
+title: FP8 W8A8
+---
+[](){ #fp8 }
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
@@ -14,10 +15,9 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
-:::{note}
-FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
-FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
-:::
+!!! note
+ FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
+ FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
## Installation
@@ -94,9 +94,8 @@ print(result[0].outputs[0].text)
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
-:::{note}
-Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
-:::
+!!! note
+ Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
@@ -133,6 +132,5 @@ result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)
```
-:::{warning}
-Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-:::
+!!! warning
+ Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
diff --git a/docs/source/features/quantization/gguf.md b/docs/features/quantization/gguf.md
similarity index 64%
rename from docs/source/features/quantization/gguf.md
rename to docs/features/quantization/gguf.md
index e93e4dcd3b57..72f758f653a8 100644
--- a/docs/source/features/quantization/gguf.md
+++ b/docs/features/quantization/gguf.md
@@ -1,39 +1,42 @@
-(gguf)=
+---
+title: GGUF
+---
+[](){ #gguf }
-# GGUF
+!!! warning
+ Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
-:::{warning}
-Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
-:::
-
-:::{warning}
-Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.
-:::
+!!! warning
+ Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
```console
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
+ --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
```console
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
+ --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+ --tensor-parallel-size 2
```
-:::{warning}
-We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.
-:::
+!!! warning
+ We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.
GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path
```console
# If you model is not supported by huggingface you can manually provide a huggingface compatible config path
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
+ --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+ --hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
```
You can also use the GGUF model directly through the LLM entrypoint:
diff --git a/docs/source/features/quantization/gptqmodel.md b/docs/features/quantization/gptqmodel.md
similarity index 95%
rename from docs/source/features/quantization/gptqmodel.md
rename to docs/features/quantization/gptqmodel.md
index 9771d5a4fe9e..53e938d2cbd7 100644
--- a/docs/source/features/quantization/gptqmodel.md
+++ b/docs/features/quantization/gptqmodel.md
@@ -1,6 +1,7 @@
-(gptqmodel)=
-
-# GPTQModel
+---
+title: GPTQModel
+---
+[](){ #gptqmodel }
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
@@ -58,7 +59,8 @@ model.save(quant_path)
To run an GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
```console
-python examples/offline_inference/llm_engine_example.py --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
+python examples/offline_inference/llm_engine_example.py \
+ --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```
## Using GPTQModel with vLLM's Python API
diff --git a/docs/source/features/quantization/int4.md b/docs/features/quantization/int4.md
similarity index 94%
rename from docs/source/features/quantization/int4.md
rename to docs/features/quantization/int4.md
index 7a0ab4ad229e..b7d09206365f 100644
--- a/docs/source/features/quantization/int4.md
+++ b/docs/features/quantization/int4.md
@@ -1,14 +1,14 @@
-(int4)=
-
-# INT4 W4A16
+---
+title: INT4 W4A16
+---
+[](){ #int4 }
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c).
-:::{note}
-INT4 computation is supported on NVIDIA GPUs with compute capability > 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
-:::
+!!! note
+ INT4 computation is supported on NVIDIA GPUs with compute capability > 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
## Prerequisites
@@ -121,9 +121,8 @@ $ lm_eval --model vllm \
--batch_size 'auto'
```
-:::{note}
-Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
-:::
+!!! note
+ Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
## Best Practices
diff --git a/docs/source/features/quantization/int8.md b/docs/features/quantization/int8.md
similarity index 92%
rename from docs/source/features/quantization/int8.md
rename to docs/features/quantization/int8.md
index 1e4b01d35575..1d9fba9dc87f 100644
--- a/docs/source/features/quantization/int8.md
+++ b/docs/features/quantization/int8.md
@@ -1,15 +1,15 @@
-(int8)=
-
-# INT8 W8A8
+---
+title: INT8 W8A8
+---
+[](){ #int8 }
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
-:::{note}
-INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
-:::
+!!! note
+ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
## Prerequisites
@@ -125,9 +125,8 @@ $ lm_eval --model vllm \
--batch_size 'auto'
```
-:::{note}
-Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
-:::
+!!! note
+ Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
## Best Practices
diff --git a/docs/source/features/quantization/modelopt.md b/docs/features/quantization/modelopt.md
similarity index 100%
rename from docs/source/features/quantization/modelopt.md
rename to docs/features/quantization/modelopt.md
diff --git a/docs/source/features/quantization/quantized_kvcache.md b/docs/features/quantization/quantized_kvcache.md
similarity index 98%
rename from docs/source/features/quantization/quantized_kvcache.md
rename to docs/features/quantization/quantized_kvcache.md
index 86e6354ec82e..e3ebd024bab3 100644
--- a/docs/source/features/quantization/quantized_kvcache.md
+++ b/docs/features/quantization/quantized_kvcache.md
@@ -1,6 +1,7 @@
-(quantized-kvcache)=
-
-# Quantized KV Cache
+---
+title: Quantized KV Cache
+---
+[](){ #quantized-kvcache }
## FP8 KV Cache
diff --git a/docs/source/features/quantization/quark.md b/docs/features/quantization/quark.md
similarity index 94%
rename from docs/source/features/quantization/quark.md
rename to docs/features/quantization/quark.md
index 955890dbc75b..51da98cc09d3 100644
--- a/docs/source/features/quantization/quark.md
+++ b/docs/features/quantization/quark.md
@@ -1,6 +1,7 @@
-(quark)=
-
-# AMD QUARK
+---
+title: AMD QUARK
+---
+[](){ #quark }
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
@@ -86,13 +87,12 @@ We need to set the quantization configuration, you can check
for further details. Here we use FP8 per-tensor quantization on weight, activation,
kv-cache and the quantization algorithm is AutoSmoothQuant.
-:::{note}
-Note the quantization algorithm needs a JSON config file and the config file is located in
-[Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html),
-under the directory `examples/torch/language_modeling/llm_ptq/models`. For example,
-AutoSmoothQuant config file for Llama is
-`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
-:::
+!!! note
+ Note the quantization algorithm needs a JSON config file and the config file is located in
+ [Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html),
+ under the directory `examples/torch/language_modeling/llm_ptq/models`. For example,
+ AutoSmoothQuant config file for Llama is
+ `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
```python
from quark.torch.quantization import (Config, QuantizationConfig,
diff --git a/docs/features/quantization/supported_hardware.md b/docs/features/quantization/supported_hardware.md
new file mode 100644
index 000000000000..2967bf9c7504
--- /dev/null
+++ b/docs/features/quantization/supported_hardware.md
@@ -0,0 +1,28 @@
+---
+title: Supported Hardware
+---
+[](){ #quantization-supported-hardware }
+
+The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
+
+| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Inferentia | Google TPU |
+|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
+| AWQ | โ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ ๏ธ | โ ๏ธ | โ | โ |
+| GPTQ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ ๏ธ | โ ๏ธ | โ | โ |
+| Marlin (GPTQ/AWQ/FP8) | โ | โ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ | โ | โ | โ |
+| INT8 (W8A8) | โ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ | โ ๏ธ | โ | โ ๏ธ |
+| FP8 (W8A8) | โ | โ | โ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ | โ | โ |
+| BitBLAS (GPTQ) | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ | โ | โ | โ |
+| AQLM | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ | โ | โ | โ |
+| bitsandbytes | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ | โ | โ | โ |
+| DeepSpeedFP | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ | โ | โ | โ |
+| GGUF | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ ๏ธ | โ | โ | โ | โ |
+
+- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
+- โ ๏ธ indicates that the quantization method is supported on the specified hardware.
+- โ indicates that the quantization method is not supported on the specified hardware.
+
+!!! note
+ This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
+
+ For the most up-to-date information on hardware support and quantization methods, please refer to or consult with the vLLM development team.
diff --git a/docs/source/features/quantization/torchao.md b/docs/features/quantization/torchao.md
similarity index 86%
rename from docs/source/features/quantization/torchao.md
rename to docs/features/quantization/torchao.md
index 82100c6ddcac..a7a517af85aa 100644
--- a/docs/source/features/quantization/torchao.md
+++ b/docs/features/quantization/torchao.md
@@ -7,7 +7,9 @@ We recommend installing the latest torchao nightly with
```console
# Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
-pip install --pre torchao>=10.0.0 --index-url https://download.pytorch.org/whl/nightly/cu126
+pip install \
+ --pre torchao>=10.0.0 \
+ --index-url https://download.pytorch.org/whl/nightly/cu126
```
## Quantizing HuggingFace Models
@@ -20,7 +22,12 @@ from torchao.quantization import Int8WeightOnlyConfig
model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
-quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
+quantized_model = AutoModelForCausalLM.from_pretrained(
+ model_name,
+ torch_dtype="auto",
+ device_map="auto",
+ quantization_config=quantization_config
+)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
diff --git a/docs/source/features/reasoning_outputs.md b/docs/features/reasoning_outputs.md
similarity index 96%
rename from docs/source/features/reasoning_outputs.md
rename to docs/features/reasoning_outputs.md
index bf4f8901a11a..cbcb246912f4 100644
--- a/docs/source/features/reasoning_outputs.md
+++ b/docs/features/reasoning_outputs.md
@@ -1,6 +1,7 @@
-(reasoning-outputs)=
-
-# Reasoning Outputs
+---
+title: Reasoning Outputs
+---
+[](){ #reasoning-outputs }
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
@@ -17,17 +18,17 @@ vLLM currently supports the following reasoning models:
| [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | โ | โ |
| [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | โ |
-:::{note}
-IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
-The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
-:::
+!!! note
+ IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
+ The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
## Quickstart
To use reasoning models, you need to specify the `--reasoning-parser` flags when making a request to the chat completion endpoint. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.
```bash
-vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
+vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
+ --reasoning-parser deepseek_r1
```
Next, make a request to the model that should return the reasoning content in the response.
@@ -167,12 +168,10 @@ client = OpenAI(
models = client.models.list()
model = models.data[0].id
-
class People(BaseModel):
name: str
age: int
-
json_schema = People.model_json_schema()
prompt = ("Generate a JSON with the name and age of one random person.")
diff --git a/docs/source/features/spec_decode.md b/docs/features/spec_decode.md
similarity index 91%
rename from docs/source/features/spec_decode.md
rename to docs/features/spec_decode.md
index f16e0d96522d..5080960f72dd 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/features/spec_decode.md
@@ -1,16 +1,15 @@
-(spec-decode)=
+---
+title: Speculative Decoding
+---
+[](){ #spec-decode }
-# Speculative Decoding
+!!! warning
+ Please note that speculative decoding in vLLM is not yet optimized and does
+ not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
+ The work to optimize it is ongoing and can be followed here:
-:::{warning}
-Please note that speculative decoding in vLLM is not yet optimized and does
-not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
-The work to optimize it is ongoing and can be followed here:
-:::
-
-:::{warning}
-Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
-:::
+!!! warning
+ Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
@@ -46,14 +45,18 @@ for output in outputs:
To perform the same with an online mode launch the server:
```bash
-python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
- --seed 42 -tp 1 --gpu_memory_utilization 0.8 \
+python -m vllm.entrypoints.openai.api_server \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --model facebook/opt-6.7b \
+ --seed 42 \
+ -tp 1 \
+ --gpu_memory_utilization 0.8 \
--speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
```
-:::{warning}
-Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now.
-:::
+!!! warning
+ Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now.
Then use a client:
@@ -172,7 +175,7 @@ A variety of speculative models of this type are available on HF hub:
## Speculating using EAGLE based draft models
The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here]().
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
```python
from vllm import LLM, SamplingParams
@@ -255,7 +258,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
- titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
+ titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors:
@@ -264,7 +267,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
## Resources for vLLM contributors
diff --git a/docs/source/features/structured_outputs.md b/docs/features/structured_outputs.md
similarity index 96%
rename from docs/source/features/structured_outputs.md
rename to docs/features/structured_outputs.md
index 03119ec7441c..f96b598cff98 100644
--- a/docs/source/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -1,6 +1,7 @@
-(structured-outputs)=
-
-# Structured Outputs
+---
+title: Structured Outputs
+---
+[](){ #structured-outputs }
vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or
@@ -20,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
-You can see the complete list of supported parameters on the [OpenAI-Compatible Server](#openai-compatible-server) page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the
@@ -83,13 +84,11 @@ class CarType(str, Enum):
truck = "Truck"
coupe = "Coupe"
-
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
-
json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create(
@@ -105,11 +104,10 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
-:::{tip}
-While not strictly necessary, normally itยดs better to indicate in the prompt the
-JSON schema and how the fields should be populated. This can improve the
-results notably in most cases.
-:::
+!!! tip
+ While not strictly necessary, normally itยดs better to indicate in the prompt the
+ JSON schema and how the fields should be populated. This can improve the
+ results notably in most cases.
Finally we have the `guided_grammar` option, which is probably the most
difficult to use, but itยดs really powerful. It allows us to define complete
@@ -160,12 +158,10 @@ Here is a simple example demonstrating how to get structured output using Pydant
from pydantic import BaseModel
from openai import OpenAI
-
class Info(BaseModel):
name: str
age: int
-
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
@@ -199,17 +195,14 @@ from typing import List
from pydantic import BaseModel
from openai import OpenAI
-
class Step(BaseModel):
explanation: str
output: str
-
class MathResponse(BaseModel):
steps: list[Step]
final_answer: str
-
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
diff --git a/docs/source/features/tool_calling.md b/docs/features/tool_calling.md
similarity index 95%
rename from docs/source/features/tool_calling.md
rename to docs/features/tool_calling.md
index 2795b769345e..6ee1060dd050 100644
--- a/docs/source/features/tool_calling.md
+++ b/docs/features/tool_calling.md
@@ -93,7 +93,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha
## Required Function Calling
-vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html#feature-model) for the V1 engine.
+vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](https://docs.vllm.ai/en/latest/usage/v1_guide.html#feature-model) for the V1 engine.
When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.
@@ -158,13 +158,13 @@ All Llama 3.1, 3.2 and 4 models should be supported.
* `meta-llama/Llama-3.2-*`
* `meta-llama/Llama-4-*`
-The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below.
+The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. As for llama 4 models, it is recommended to use the `llama4_pythonic` tool parser.
Other tool calling formats like the built in python tool calling or custom tool calling are not supported.
Known issues:
-1. Parallel tool calls are not supported.
+1. Parallel tool calls are not supported for llama 3, but it is supported in llama 4 models.
2. The model can generate parameters with a wrong format, such as generating
an array serialized as string instead of an array.
@@ -177,11 +177,10 @@ images.
Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}`
-VLLM also provides a JSON based chat template for Llama 4:
-* - this is based on the "official" chat template for the Llama 4
-models, but tweaked so that it works better with vLLM.
+VLLM also provides a pythonic and JSON based chat template for Llama 4, but pythonic tool calling is recommended:
+* - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models.
-For Llama 4 use `--tool-call-parser llama4_json examples/tool_chat_template_llama4_json.jinja`.
+For Llama 4 model, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`.
#### IBM Granite
@@ -323,7 +322,6 @@ class ExampleToolParser(ToolParser):
tool_calls=[],
content=text)
-
```
Then you can use this plugin in the command line like this.
diff --git a/docs/getting_started/installation/.nav.yml b/docs/getting_started/installation/.nav.yml
new file mode 100644
index 000000000000..7acfc015ff50
--- /dev/null
+++ b/docs/getting_started/installation/.nav.yml
@@ -0,0 +1,5 @@
+nav:
+ - README.md
+ - gpu.md
+ - cpu.md
+ - ai_accelerator.md
\ No newline at end of file
diff --git a/docs/getting_started/installation/README.md b/docs/getting_started/installation/README.md
new file mode 100644
index 000000000000..36bb16cc0224
--- /dev/null
+++ b/docs/getting_started/installation/README.md
@@ -0,0 +1,20 @@
+---
+title: Installation
+---
+[](){ #installation-index }
+
+vLLM supports the following hardware platforms:
+
+- [GPU](gpu.md)
+ - [NVIDIA CUDA](gpu.md#nvidia-cuda)
+ - [AMD ROCm](gpu.md#amd-rocm)
+ - [Intel XPU](gpu.md#intel-xpu)
+- [CPU](cpu.md)
+ - [Intel/AMD x86](cpu.md#intelamd-x86)
+ - [ARM AArch64](cpu.md#arm-aarch64)
+ - [Apple silicon](cpu.md#apple-silicon)
+ - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
+- [Other AI accelerators](ai_accelerator.md)
+ - [Google TPU](ai_accelerator.md#google-tpu)
+ - [Intel Gaudi](ai_accelerator.md#intel-gaudi)
+ - [AWS Neuron](ai_accelerator.md#aws-neuron)
diff --git a/docs/getting_started/installation/ai_accelerator.md b/docs/getting_started/installation/ai_accelerator.md
new file mode 100644
index 000000000000..a4f136a172fe
--- /dev/null
+++ b/docs/getting_started/installation/ai_accelerator.md
@@ -0,0 +1,117 @@
+# Other AI accelerators
+
+vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
+
+## Requirements
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
+
+## Configure a new environment
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
+
+## Set up using Python
+
+### Pre-built wheels
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
+
+### Build wheel from source
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
+
+## Set up using Docker
+
+### Pre-built images
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
+
+### Build image from source
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
+
+## Extra information
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"
diff --git a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md b/docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
similarity index 83%
rename from docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
rename to docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
index 78938de317c4..00935a37417e 100644
--- a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
@@ -1,12 +1,12 @@
-# Installation
+# --8<-- [start:installation]
This tab provides instructions on running vLLM with Intel Gaudi devices.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- OS: Ubuntu 22.04 LTS
- Python: 3.10
@@ -45,16 +45,27 @@ Use the following commands to run a Docker image:
```console
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
-docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
+docker run \
+ -it \
+ --runtime=habana \
+ -e HABANA_VISIBLE_DEVICES=all \
+ -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
+ --cap-add=sys_nice \
+ --net=host \
+ --ipc=host \
+ vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built Intel Gaudi wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
To build and install vLLM from source, run:
@@ -75,29 +86,39 @@ pip install -r requirements/hpu.txt
python setup.py develop
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
Currently, there are no pre-built Intel Gaudi images.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
```console
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
-docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
+docker run \
+ -it \
+ --runtime=habana \
+ -e HABANA_VISIBLE_DEVICES=all \
+ -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
+ --cap-add=sys_nice \
+ --net=host \
+ --rm vllm-hpu-env
```
-:::{tip}
-If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
-:::
+!!! tip
+ If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
## Supported features
-- [Offline inference](#offline-inference)
-- Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
+- [Offline inference][offline-inference]
+- Online serving via [OpenAI-Compatible Server][openai-compatible-server]
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -157,41 +178,25 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag.
-:::{list-table} vLLM execution modes
-:widths: 25 25 50
-:header-rows: 1
-
-- * `PT_HPU_LAZY_MODE`
- * `enforce_eager`
- * execution mode
-- * 0
- * 0
- * torch.compile
-- * 0
- * 1
- * PyTorch eager mode
-- * 1
- * 0
- * HPU Graphs
-- * 1
- * 1
- * PyTorch lazy mode
-:::
-
-:::{warning}
-In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
-:::
-
-(gaudi-bucketing-mechanism)=
+| `PT_HPU_LAZY_MODE` | `enforce_eager` | execution mode |
+|----------------------|-------------------|--------------------|
+| 0 | 0 | torch.compile |
+| 0 | 1 | PyTorch eager mode |
+| 1 | 0 | HPU Graphs |
+ vLLM execution modes
+
+!!! warning
+ In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
+
+[](){ #gaudi-bucketing-mechanism }
### Bucketing mechanism
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently it is achieved by "bucketing" model's forward pass across two dimensions - `batch_size` and `sequence_length`.
-:::{note}
-Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
-:::
+!!! note
+ Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
@@ -224,15 +229,13 @@ min = 128, step = 128, max = 512
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.
-:::{warning}
-If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario.
-:::
+!!! warning
+ If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario.
As an example, if a request of 3 sequences, with max sequence length of 412 comes in to an idle vLLM server, it will be padded executed as `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After prefill stage, it will be executed as `(4, 512)` decode bucket and will continue as that bucket until either batch dimension changes (due to request being finished) - in which case it will become a `(2, 512)` bucket, or context length increases above 512 tokens, in which case it will become `(4, 640)` bucket.
-:::{note}
-Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
-:::
+!!! note
+ Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
### Warmup
@@ -252,11 +255,10 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
```
-This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
+This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
-:::{tip}
-Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
-:::
+!!! tip
+ Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
### HPU Graph capture
@@ -271,9 +273,8 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil
Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), both stages have equal memory constraints.
Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
-:::{note}
-`gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
-:::
+!!! note
+ `gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
User can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. Strategy affects the order of capturing graphs. There are two strategies implemented:
@@ -282,9 +283,8 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec
When there's large amount of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by `max_bs` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in `min_tokens` strategy.
-:::{note}
-`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, next it will attempt do the same for decode graphs and usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding reserved memory pool. The behavior on that mechanism can be observed in the example below.
-:::
+!!! note
+ `VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, next it will attempt do the same for decode graphs and usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding reserved memory pool. The behavior on that mechanism can be observed in the example below.
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
@@ -401,3 +401,4 @@ the below:
higher batches. You can do that by adding `--enforce-eager` flag to
server (for online serving), or by passing `enforce_eager=True`
argument to LLM constructor (for offline inference).
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md b/docs/getting_started/installation/ai_accelerator/neuron.inc.md
similarity index 74%
rename from docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
rename to docs/getting_started/installation/ai_accelerator/neuron.inc.md
index b4bfb696faa2..f08c78fba6c8 100644
--- a/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/neuron.inc.md
@@ -1,14 +1,14 @@
-# Installation
+# --8<-- [start:installation]
vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching.
Paged Attention and Chunked Prefill are currently in development and will be available soon.
Data types currently supported in Neuron SDK are FP16 and BF16.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- OS: Linux
- Python: 3.9 -- 3.11
@@ -38,7 +38,8 @@ The installation of drivers and tools wouldn't be necessary, if [Deep Learning A
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
-:::
+!!! note
+ The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
Following instructions are applicable to Neuron SDK 2.16 and beyond.
@@ -94,12 +97,17 @@ source aws_neuron_venv_pytorch/bin/activate
# Install Jupyter notebook kernel
pip install ipykernel
-python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
+python3.10 -m ipykernel install \
+ --user \
+ --name aws_neuron_venv_pytorch \
+ --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels
# Set pip repository pointing to the Neuron repository
-python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
+python -m pip config set \
+ global.extra-index-url \
+ https://pip.repos.neuron.amazonaws.com
# Install wget, awscli
python -m pip install wget
@@ -122,18 +130,23 @@ VLLM_TARGET_DEVICE="neuron" pip install .
If neuron packages are detected correctly in the installation process, `vllm-0.3.0+neuron212` will be installed.
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
Currently, there are no pre-built Neuron images.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
-See for instructions on building the Docker image.
+See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
Make sure to use in place of the default Dockerfile.
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
There is no extra information for this device.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md b/docs/getting_started/installation/ai_accelerator/tpu.inc.md
similarity index 55%
rename from docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
rename to docs/getting_started/installation/ai_accelerator/tpu.inc.md
index 4459cc61e1cd..d0b168120137 100644
--- a/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/tpu.inc.md
@@ -1,4 +1,4 @@
-# Installation
+# --8<-- [start:installation]
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
@@ -30,11 +30,11 @@ For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tp
You may need additional persistent storage for your TPU VMs. For more
information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp.google.com/tpu/docs/storage-options).
-:::{attention}
-There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4
@@ -51,10 +51,9 @@ When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use.
-:::{note}
-In all of the following commands, replace the ALL CAPS parameter names with
-appropriate values. See the parameter descriptions table for more information.
-:::
+!!! note
+ In all of the following commands, replace the ALL CAPS parameter names with
+ appropriate values. See the parameter descriptions table for more information.
### Provision Cloud TPUs with GKE
@@ -79,33 +78,15 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--service-account SERVICE_ACCOUNT
```
-:::{list-table} Parameter descriptions
-:header-rows: 1
-
-- * Parameter name
- * Description
-- * QUEUED_RESOURCE_ID
- * The user-assigned ID of the queued resource request.
-- * TPU_NAME
- * The user-assigned name of the TPU which is created when the queued
- resource request is allocated.
-- * PROJECT_ID
- * Your Google Cloud project
-- * ZONE
- * The GCP zone where you want to create your Cloud TPU. The value you use
- depends on the version of TPUs you are using. For more information, see
- `TPU regions and zones `_
-- * ACCELERATOR_TYPE
- * The TPU version you want to use. Specify the TPU version, for example
- `v5litepod-4` specifies a v5e TPU with 4 cores, `v6e-1` specifies a v6e TPU with 1 core. For more information,
- see [TPU versions](https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions).
-- * RUNTIME_VERSION
- * The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes).
-- * SERVICE_ACCOUNT
- * The email address for your service account. You can find it in the IAM
- Cloud Console under *Service Accounts*. For example:
- `tpu-service-account@.iam.gserviceaccount.com`
-:::
+| Parameter name | Description |
+|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
+| TPU_NAME | The user-assigned name of the TPU which is created when the queued |
+| PROJECT_ID | Your Google Cloud project |
+| ZONE | The GCP zone where you want to create your Cloud TPU. The value you use |
+| ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example |
+| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes). |
+ Parameter descriptions
Connect to your TPU using SSH:
@@ -113,13 +94,16 @@ Connect to your TPU using SSH:
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
```
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built TPU wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
Install Miniconda:
@@ -161,13 +145,16 @@ Run the setup script:
VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
-See for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
+See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
You can use to build a Docker image with TPU support.
@@ -182,31 +169,30 @@ Run the Docker image with the following command:
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```
-:::{note}
-Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
-possible input shapes and compiles an XLA graph for each shape. The
-compilation time may take 20~30 minutes in the first run. However, the
-compilation time reduces to ~5 minutes afterwards because the XLA graphs are
-cached in the disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
-:::
+!!! note
+ Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
+ possible input shapes and compiles an XLA graph for each shape. The
+ compilation time may take 20~30 minutes in the first run. However, the
+ compilation time reduces to ~5 minutes afterwards because the XLA graphs are
+ cached in the disk (in `VLLM_XLA_CACHE_PATH` or `~/.cache/vllm/xla_cache` by default).
-:::{tip}
-If you encounter the following error:
+!!! tip
+ If you encounter the following error:
-```console
-from torch._C import * # noqa: F403
-ImportError: libopenblas.so.0: cannot open shared object file: No such
-file or directory
-```
-
-Install OpenBLAS with the following command:
+ ```console
+ from torch._C import * # noqa: F403
+ ImportError: libopenblas.so.0: cannot open shared object file: No such
+ file or directory
+ ```
-```console
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
-```
+ Install OpenBLAS with the following command:
-:::
+ ```console
+ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+ ```
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
There is no extra information for this device.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/cpu.md b/docs/getting_started/installation/cpu.md
similarity index 74%
rename from docs/source/getting_started/installation/cpu.md
rename to docs/getting_started/installation/cpu.md
index 2c0ec60d7100..18c96b264ad8 100644
--- a/docs/source/getting_started/installation/cpu.md
+++ b/docs/getting_started/installation/cpu.md
@@ -2,107 +2,47 @@
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
-:::::{tab-set}
-:sync-group: device
+=== "Intel/AMD x86"
-::::{tab-item} Intel/AMD x86
-:selected:
-:sync: x86
+ --8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"
-:::{include} cpu/x86.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
+=== "ARM AArch64"
-::::
+ --8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"
-::::{tab-item} ARM AArch64
-:sync: arm
+=== "Apple silicon"
-:::{include} cpu/arm.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
+ --8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"
-::::
+=== "IBM Z (S390X)"
-::::{tab-item} Apple silicon
-:sync: apple
-
-:::{include} cpu/apple.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} IBM Z (S390X)
-:sync: s390x
-
-:::{include} cpu/s390x.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-:::::
+ --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"
## Requirements
- Python: 3.9 -- 3.12
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Intel/AMD x86
-:sync: x86
-
-:::{include} cpu/x86.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} ARM AArch64
-:sync: arm
-
-:::{include} cpu/arm.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
+=== "Intel/AMD x86"
-::::
+ --8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"
-::::{tab-item} Apple silicon
-:sync: apple
+=== "ARM AArch64"
-:::{include} cpu/apple.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
+ --8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"
-::::
+=== "Apple silicon"
-::::{tab-item} IBM Z (S390X)
-:sync: s390x
+ --8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"
-:::{include} cpu/s390x.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
+=== "IBM Z (S390X)"
-::::
-
-:::::
+ --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"
## Set up using Python
### Create a new Python environment
-:::{include} python_env_setup.inc.md
-:::
+--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
### Pre-built wheels
@@ -110,69 +50,29 @@ Currently, there are no pre-built CPU wheels.
### Build wheel from source
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Intel/AMD x86
-:sync: x86
-
-:::{include} cpu/x86.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} ARM AArch64
-:sync: arm
+=== "Intel/AMD x86"
-:::{include} cpu/arm.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
+ --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"
-::::
+=== "ARM AArch64"
-::::{tab-item} Apple silicon
-:sync: apple
+ --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"
-:::{include} cpu/apple.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
+=== "Apple silicon"
-::::
+ --8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"
-::::{tab-item} IBM Z (s390x)
-:sync: s390x
+=== "IBM Z (s390x)"
-:::{include} cpu/s390x.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-:::::
+ --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"
## Set up using Docker
### Pre-built images
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Intel/AMD x86
-:sync: x86
-
-:::{include} cpu/x86.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
+=== "Intel/AMD x86"
-:::::
+ --8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"
### Build image from source
@@ -192,13 +92,11 @@ $ docker run --rm \
other vLLM OpenAI server arguments
```
-::::{tip}
-For ARM or Apple silicon, use `docker/Dockerfile.arm`
-::::
+!!! tip
+ For ARM or Apple silicon, use `docker/Dockerfile.arm`
-::::{tip}
-For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
-::::
+!!! tip
+ For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
## Supported features
diff --git a/docs/source/getting_started/installation/cpu/apple.inc.md b/docs/getting_started/installation/cpu/apple.inc.md
similarity index 58%
rename from docs/source/getting_started/installation/cpu/apple.inc.md
rename to docs/getting_started/installation/cpu/apple.inc.md
index 7bc9e85ecd96..7a91e3ce5e5b 100644
--- a/docs/source/getting_started/installation/cpu/apple.inc.md
+++ b/docs/getting_started/installation/cpu/apple.inc.md
@@ -1,24 +1,27 @@
-# Installation
+# --8<-- [start:installation]
vLLM has experimental support for macOS with Apple silicon. For now, users shall build from the source vLLM to natively run on macOS.
Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- OS: `macOS Sonoma` or later
- SDK: `XCode 15.4` or later with Command Line Tools
- Compiler: `Apple Clang >= 15.0.0`
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.
@@ -29,9 +32,8 @@ pip install -r requirements/cpu.txt
pip install -e .
```
-:::{note}
-On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
-:::
+!!! note
+ On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
#### Troubleshooting
@@ -51,10 +53,15 @@ If the build has error like the following snippet where standard C++ headers can
1 error generated.
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
+# --8<-- [end:extra-information]
diff --git a/docs/getting_started/installation/cpu/arm.inc.md b/docs/getting_started/installation/cpu/arm.inc.md
new file mode 100644
index 000000000000..59b71dcaf911
--- /dev/null
+++ b/docs/getting_started/installation/cpu/arm.inc.md
@@ -0,0 +1,41 @@
+# --8<-- [start:installation]
+
+vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform.
+
+ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.
+
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
+
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
+
+- OS: Linux
+- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
+- Instruction Set Architecture (ISA): NEON support is required
+
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
+
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
+
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
+
+--8<-- "docs/getting_started/installation/cpu/cpu/build.inc.md"
+
+Testing has been conducted on AWS Graviton3 instances for compatibility.
+
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
+
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
+
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
+
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/cpu/build.inc.md b/docs/getting_started/installation/cpu/build.inc.md
similarity index 96%
rename from docs/source/getting_started/installation/cpu/build.inc.md
rename to docs/getting_started/installation/cpu/build.inc.md
index f385f3d5b198..7d6472afa7ea 100644
--- a/docs/source/getting_started/installation/cpu/build.inc.md
+++ b/docs/getting_started/installation/cpu/build.inc.md
@@ -32,3 +32,5 @@ If you want to develop vllm, install it in editable mode instead.
```console
VLLM_TARGET_DEVICE=cpu python setup.py develop
```
+
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/cpu/s390x.inc.md b/docs/getting_started/installation/cpu/s390x.inc.md
similarity index 64%
rename from docs/source/getting_started/installation/cpu/s390x.inc.md
rename to docs/getting_started/installation/cpu/s390x.inc.md
index 9b41173b44ce..670485feefb6 100644
--- a/docs/source/getting_started/installation/cpu/s390x.inc.md
+++ b/docs/getting_started/installation/cpu/s390x.inc.md
@@ -1,25 +1,28 @@
-# Installation
+# --8<-- [start:installation]
vLLM has experimental support for s390x architecture on IBM Z platform. For now, users shall build from the vLLM source to natively run on IBM Z platform.
Currently the CPU implementation for s390x architecture supports FP32 datatype only.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- OS: `Linux`
- SDK: `gcc/g++ >= 12.3.0` or later with Command Line Tools
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
- Build install python packages: `pyarrow`, `torch` and `torchvision`
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
Install the following packages from the package manager before building the vLLM. For example on RHEL 9.4:
@@ -39,9 +42,8 @@ curl https://sh.rustup.rs -sSf | sh -s -- -y && \
Execute the following commands to build and install vLLM from the source.
-::::{tip}
-Please build the following dependencies, `torchvision`, `pyarrow` from the source before building vLLM.
-::::
+!!! tip
+ Please build the following dependencies, `torchvision`, `pyarrow` from the source before building vLLM.
```console
sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds
@@ -53,10 +55,15 @@ Please build the following dependencies, `torchvision`, `pyarrow` from the sourc
pip install dist/*.whl
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
+# --8<-- [end:extra-information]
diff --git a/docs/getting_started/installation/cpu/x86.inc.md b/docs/getting_started/installation/cpu/x86.inc.md
new file mode 100644
index 000000000000..9434eeea8b4a
--- /dev/null
+++ b/docs/getting_started/installation/cpu/x86.inc.md
@@ -0,0 +1,46 @@
+# --8<-- [start:installation]
+
+vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
+
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
+
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
+
+- OS: Linux
+- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
+- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
+
+!!! tip
+ [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware.
+
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
+
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
+
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
+
+--8<-- "docs/getting_started/installation/cpu/cpu/build.inc.md"
+
+!!! note
+ - AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
+ - If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
+
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
+
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
+
+See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
+
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
+
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/device.template.md b/docs/getting_started/installation/device.template.md
similarity index 100%
rename from docs/source/getting_started/installation/device.template.md
rename to docs/getting_started/installation/device.template.md
diff --git a/docs/getting_started/installation/gpu.md b/docs/getting_started/installation/gpu.md
new file mode 100644
index 000000000000..3c983f600673
--- /dev/null
+++ b/docs/getting_started/installation/gpu.md
@@ -0,0 +1,124 @@
+# GPU
+
+vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:installation"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:installation"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:installation"
+
+## Requirements
+
+- OS: Linux
+- Python: 3.9 -- 3.12
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:requirements"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:requirements"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:requirements"
+
+## Set up using Python
+
+### Create a new Python environment
+
+--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:create-a-new-python-environment"
+
+=== "AMD ROCm"
+
+ There is no extra information on creating a new Python environment for this device.
+
+=== "Intel XPU"
+
+ There is no extra information on creating a new Python environment for this device.
+
+### Pre-built wheels
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-wheels"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-wheels"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-wheels"
+
+[](){ #build-from-source }
+
+### Build wheel from source
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-wheel-from-source"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-wheel-from-source"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-wheel-from-source"
+
+## Set up using Docker
+
+### Pre-built images
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-images"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-images"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-images"
+
+### Build image from source
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-image-from-source"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-image-from-source"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-image-from-source"
+
+## Supported features
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:supported-features"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:supported-features"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:supported-features"
diff --git a/docs/source/getting_started/installation/gpu/cuda.inc.md b/docs/getting_started/installation/gpu/cuda.inc.md
similarity index 62%
rename from docs/source/getting_started/installation/gpu/cuda.inc.md
rename to docs/getting_started/installation/gpu/cuda.inc.md
index 06915f09dd51..64dccef63d73 100644
--- a/docs/source/getting_started/installation/gpu/cuda.inc.md
+++ b/docs/getting_started/installation/gpu/cuda.inc.md
@@ -1,43 +1,52 @@
-# Installation
+# --8<-- [start:installation]
-vLLM contains pre-compiled C++ and CUDA (12.6) binaries.
+vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
### Create a new Python environment
-:::{note}
-PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See for more details.
-:::
+!!! note
+ PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See for more details.
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
-Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-from-source) for more details.
+Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below][build-from-source] for more details.
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
You can install vLLM using either `pip` or `uv pip`:
```console
-# Install vLLM with CUDA 12.6.
-pip install vllm # If you are using pip.
-uv pip install vllm # If you are using uv.
+# Install vLLM with CUDA 12.8.
+# If you are using pip.
+pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
+# If you are using uv.
+uv pip install vllm --torch-backend=auto
```
-As of now, vLLM's binaries are compiled with CUDA 12.6 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.8, 11.8, and public PyTorch release versions:
+We recommend leveraging `uv` to [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`). If this doesn't work, try running `uv self update` to update `uv` first.
+
+!!! note
+ NVIDIA Blackwell GPUs (B200, GB200) require a minimum of CUDA 12.8, so make sure you are installing PyTorch wheels with at least that version. PyTorch itself offers a [dedicated interface](https://pytorch.org/get-started/locally/) to determine the appropriate pip command to run for a given target configuration.
+
+As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:
```console
# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.6.1.post1
-export PYTHON_VERSION=310
-pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
+export PYTHON_VERSION=312
+uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
-(install-the-latest-code)=
+[](){ #install-the-latest-code }
#### Install the latest code
@@ -46,40 +55,47 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe
##### Install the latest code using `pip`
```console
-pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+pip install -U vllm \
+ --pre \
+ --extra-index-url https://wheels.vllm.ai/nightly
```
`--pre` is required for `pip` to consider pre-released versions.
-If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
+Another way to install the latest code is to use `uv`:
```console
-export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
-pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
+uv pip install -U vllm \
+ --torch-backend=auto \
+ --extra-index-url https://wheels.vllm.ai/nightly
```
-Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before.
-
-##### Install the latest code using `uv`
+##### Install specific revisions using `pip`
-Another way to install the latest code is to use `uv`:
+If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
```console
-uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
+export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
+pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
```
+Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before.
+
##### Install specific revisions using `uv`
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
```console
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
-uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
+uv pip install vllm \
+ --torch-backend=auto \
+ --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
```
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
#### Set up using Python-only build (without compilation)
@@ -92,15 +108,15 @@ VLLM_USE_PRECOMPILED=1 pip install --editable .
```
This command will do the following:
+
1. Look for the current branch in your vLLM clone.
-2. Identify the corresponding base commit in the main branch.
-3. Download the pre-built wheel of the base commit.
-4. Use its compiled libraries in the installation.
+1. Identify the corresponding base commit in the main branch.
+1. Download the pre-built wheel of the base commit.
+1. Use its compiled libraries in the installation.
-:::{note}
-1. If you change C++ or kernel code, you cannot use Python-only build; otherwise you will see an import error about library not found or undefined symbol.
-2. If you rebase your dev branch, it is recommended to uninstall vllm and re-run the above command to make sure your libraries are up to date.
-:::
+!!! note
+ 1. If you change C++ or kernel code, you cannot use Python-only build; otherwise you will see an import error about library not found or undefined symbol.
+ 2. If you rebase your dev branch, it is recommended to uninstall vllm and re-run the above command to make sure your libraries are up to date.
In case you see an error about wheel not found when running the above command, it might be because the commit you based on in the main branch was just merged and the wheel is being built. In this case, you can wait for around an hour to try again, or manually assign the previous commit in the installation using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.
@@ -110,12 +126,11 @@ export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vll
pip install --editable .
```
-You can find more information about vLLM's wheels in .
+You can find more information about vLLM's wheels in [install-the-latest-code][install-the-latest-code].
-:::{note}
-There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
-It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to for instructions on how to install a specified wheel.
-:::
+!!! note
+ There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
+ It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [install-the-latest-code][install-the-latest-code] for instructions on how to install a specified wheel.
#### Full build (with compilation)
@@ -127,17 +142,16 @@ cd vllm
pip install -e .
```
-:::{tip}
-Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
+!!! tip
+ Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
-For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
-As long as `which ccache` command can find the `ccache` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
+ For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
+ As long as `which ccache` command can find the `ccache` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
-When using `ccache` with `pip install -e .`, you should run `CCACHE_NOHASHDIR="true" pip install --no-build-isolation -e .`. This is because `pip` creates a new folder with a random name for each build, preventing `ccache` from recognizing that the same files are being built.
+ When using `ccache` with `pip install -e .`, you should run `CCACHE_NOHASHDIR="true" pip install --no-build-isolation -e .`. This is because `pip` creates a new folder with a random name for each build, preventing `ccache` from recognizing that the same files are being built.
-[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
-The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
-:::
+ [sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
+ The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
##### Use an existing PyTorch installation
@@ -184,7 +198,11 @@ Additionally, if you have trouble building vLLM, we recommend using the NVIDIA P
```console
# Use `--ipc=host` to make sure the shared memory is large enough.
-docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
+docker run \
+ --gpus all \
+ -it \
+ --rm \
+ --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
```
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
@@ -212,11 +230,13 @@ export VLLM_TARGET_DEVICE=empty
pip install -e .
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
-See for instructions on using the official Docker image.
+See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image.
Another way to access the latest code is to use the docker images:
@@ -229,10 +249,12 @@ These docker images are used for CI and testing only, and they are not intended
The latest code can contain bugs and may not be stable. Please use it with caution.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
-See for instructions on building the Docker image.
+See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
## Supported features
-See compatibility matrix for feature support information.
+See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/gpu/rocm.inc.md b/docs/getting_started/installation/gpu/rocm.inc.md
similarity index 66%
rename from docs/source/getting_started/installation/gpu/rocm.inc.md
rename to docs/getting_started/installation/gpu/rocm.inc.md
index dc74368fe2c9..0029b3a24496 100644
--- a/docs/source/getting_started/installation/gpu/rocm.inc.md
+++ b/docs/getting_started/installation/gpu/rocm.inc.md
@@ -1,28 +1,31 @@
-# Installation
+# --8<-- [start:installation]
vLLM supports AMD GPUs with ROCm 6.3.
-:::{attention}
-There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201)
- ROCm 6.3
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built ROCm wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
-- [ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/index.html)
-- [PyTorch](https://pytorch.org/)
+ - [ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/index.html)
+ - [PyTorch](https://pytorch.org/)
For installing PyTorch, you can start from a fresh docker image, e.g, `rocm/pytorch:rocm6.3_ubuntu24.04_py3.12_pytorch_release_2.4.0`, `rocm/pytorch-nightly`. If you are using docker image, you can skip to Step 3.
@@ -49,9 +52,8 @@ Currently, there are no pre-built ROCm wheels.
cd ../..
```
- :::{note}
- If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
- :::
+ !!! note
+ If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention)
@@ -69,9 +71,8 @@ Currently, there are no pre-built ROCm wheels.
cd ..
```
- :::{note}
- You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
- :::
+ !!! note
+ You might need to downgrade the "ninja" version to 1.10 as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
@@ -84,55 +85,56 @@ Currently, there are no pre-built ROCm wheels.
python3 setup.py develop
```
- :::{note}
- You will need to config the `$AITER_BRANCH_OR_COMMIT` for your purpose.
- :::
+ !!! note
+ You will need to config the `$AITER_BRANCH_OR_COMMIT` for your purpose.
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
```bash
- $ pip install --upgrade pip
+ pip install --upgrade pip
# Build & install AMD SMI
- $ pip install /opt/rocm/share/amd_smi
+ pip install /opt/rocm/share/amd_smi
# Install dependencies
- $ pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
- $ pip install "numpy<2"
- $ pip install -r requirements/rocm.txt
+ pip install --upgrade numba \
+ scipy \
+ huggingface-hub[cli,hf_transfer] \
+ setuptools_scm
+ pip install "numpy<2"
+ pip install -r requirements/rocm.txt
# Build vLLM for MI210/MI250/MI300.
- $ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
- $ python3 setup.py develop
+ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
+ python3 setup.py develop
```
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
- :::{tip}
- - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
- :::
+ !!! tip
+ - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
+ - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
+ - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
+ - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
-:::{tip}
-- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
- For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
-:::
+!!! tip
+ - For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
+ For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
## Set up using Docker (Recommended)
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinctโข MI300X accelerator.
-:::{tip}
-Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
-for instructions on how to use this prebuilt docker image.
-:::
+!!! tip
+ Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
+ for instructions on how to use this prebuilt docker image.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
Building the Docker image from source is the recommended way to use vLLM with ROCm.
@@ -155,7 +157,9 @@ It is important that the user kicks off the docker build using buildkit. Either
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
```console
-DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm_base -t rocm/vllm-dev:base .
+DOCKER_BUILDKIT=1 docker build \
+ -f docker/Dockerfile.rocm_base \
+ -t rocm/vllm-dev:base .
```
#### Build an image with vLLM
@@ -190,7 +194,11 @@ DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
To build vllm on ROCm 6.3 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:
```console
-DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" -f docker/Dockerfile.rocm -t vllm-rocm .
+DOCKER_BUILDKIT=1 docker build \
+ --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" \
+ -f docker/Dockerfile.rocm \
+ -t vllm-rocm \
+ .
```
To run the above docker image `vllm-rocm`, use the below command:
@@ -213,4 +221,5 @@ Where the `` is the location where the model is stored, for examp
## Supported features
-See compatibility matrix for feature support information.
+See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/gpu/xpu.inc.md b/docs/getting_started/installation/gpu/xpu.inc.md
similarity index 67%
rename from docs/source/getting_started/installation/gpu/xpu.inc.md
rename to docs/getting_started/installation/gpu/xpu.inc.md
index 4ab41a21c2a1..bee9a7ebb717 100644
--- a/docs/source/getting_started/installation/gpu/xpu.inc.md
+++ b/docs/getting_started/installation/gpu/xpu.inc.md
@@ -1,23 +1,26 @@
-# Installation
+# --8<-- [start:installation]
vLLM initially supports basic model inference and serving on Intel GPU platform.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU
- OneAPI requirements: oneAPI 2025.0
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built XPU wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
- First, install required driver and Intel OneAPI 2025.0 or later.
- Second, install Python packages for vLLM XPU backend building:
@@ -35,18 +38,20 @@ pip install -v -r requirements/xpu.txt
VLLM_TARGET_DEVICE=xpu python setup.py install
```
-:::{note}
-- FP16 is the default data type in the current XPU backend. The BF16 data
- type is supported on Intel Data Center GPU, not supported on Intel Arc GPU yet.
-:::
+!!! note
+ - FP16 is the default data type in the current XPU backend. The BF16 data
+ type is supported on Intel Data Center GPU, not supported on Intel Arc GPU yet.
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
Currently, there are no pre-built XPU images.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
```console
$ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
@@ -66,7 +71,6 @@ XPU platform supports **tensor parallel** inference/serving and also supports **
python -m vllm.entrypoints.openai.api_server \
--model=facebook/opt-13b \
--dtype=bfloat16 \
- --device=xpu \
--max_model_len=1024 \
--distributed-executor-backend=ray \
--pipeline-parallel-size=2 \
@@ -74,3 +78,4 @@ python -m vllm.entrypoints.openai.api_server \
```
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the helper script.
+# --8<-- [end:extra-information]
diff --git a/docs/getting_started/installation/python_env_setup.inc.md b/docs/getting_started/installation/python_env_setup.inc.md
new file mode 100644
index 000000000000..911301d68335
--- /dev/null
+++ b/docs/getting_started/installation/python_env_setup.inc.md
@@ -0,0 +1,6 @@
+It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
+
+```console
+uv venv --python 3.12 --seed
+source .venv/bin/activate
+```
diff --git a/docs/source/getting_started/quickstart.md b/docs/getting_started/quickstart.md
similarity index 68%
rename from docs/source/getting_started/quickstart.md
rename to docs/getting_started/quickstart.md
index 298ba59f7d8b..d24e75e8141d 100644
--- a/docs/source/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -1,11 +1,12 @@
-(quickstart)=
-
-# Quickstart
+---
+title: Quickstart
+---
+[](){ #quickstart }
This guide will help you quickly get started with vLLM to perform:
-- [Offline batched inference](#quickstart-offline)
-- [Online serving using OpenAI-compatible server](#quickstart-online)
+- [Offline batched inference][quickstart-offline]
+- [Online serving using OpenAI-compatible server][quickstart-online]
## Prerequisites
@@ -21,48 +22,49 @@ It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python env
```console
uv venv --python 3.12 --seed
source .venv/bin/activate
-uv pip install vllm
+uv pip install vllm --torch-backend=auto
```
-Another delightful way is to use `uv run` with `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating an environment:
+`uv` can [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`).
+
+Another delightful way is to use `uv run` with `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating any permanent environment:
```console
uv run --with vllm vllm --help
```
-You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
+You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` to the conda environment through `pip` if you want to manage it within the environment.
```console
conda create -n myenv python=3.12 -y
conda activate myenv
-pip install vllm
+pip install --upgrade uv
+uv pip install vllm --torch-backend=auto
```
-:::{note}
-For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
-:::
+!!! note
+ For more detail and non-CUDA platforms, please refer [here][installation-index] for specific instructions on how to install vLLM.
-(quickstart-offline)=
+[](){ #quickstart-offline }
## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script:
-The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
+The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
-- {class}`~vllm.LLM` is the main class for running offline inference with vLLM engine.
-- {class}`~vllm.SamplingParams` specifies the parameters for the sampling process.
+- [LLM][vllm.LLM] is the main class for running offline inference with vLLM engine.
+- [SamplingParams][vllm.SamplingParams] specifies the parameters for the sampling process.
```python
from vllm import LLM, SamplingParams
```
-The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](#sampling-params).
-:::{important}
-By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if {class}`~vllm.SamplingParams` is not specified.
+The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
+!!! warning
+ By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
-However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the {class}`~vllm.LLM` instance.
-:::
+ However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
```python
prompts = [
@@ -74,15 +76,18 @@ prompts = [
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
-The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](#supported-models).
+The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models].
```python
llm = LLM(model="facebook/opt-125m")
```
-:::{note}
-By default, vLLM downloads models from [Hugging Face](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
-:::
+!!! note
+ By default, vLLM downloads models from [Hugging Face](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
+
+ ```shell
+ export VLLM_USE_MODELSCOPE=True
+ ```
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
@@ -95,7 +100,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
-(quickstart-online)=
+[](){ #quickstart-online }
## OpenAI-Compatible Server
@@ -108,15 +113,13 @@ Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instru
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```
-:::{note}
-By default, the server uses a predefined chat template stored in the tokenizer.
-You can learn about overriding it [here](#chat-template).
-:::
-:::{important}
-By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
+!!! note
+ By default, the server uses a predefined chat template stored in the tokenizer.
+ You can learn about overriding it [here][chat-template].
+!!! warning
+ By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
-To disable this behavior, please pass `--generation-config vllm` when launching the server.
-:::
+ To disable this behavior, please pass `--generation-config vllm` when launching the server.
This server can be queried in the same format as OpenAI API. For example, to list the models:
@@ -207,6 +210,5 @@ Currently, vLLM supports multiple backends for efficient Attention computation a
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
-```{attention}
-There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see for instructions on how to install it.
-```
+!!! warning
+ There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see for instructions on how to install it.
diff --git a/docs/make.bat b/docs/make.bat
deleted file mode 100644
index 747ffb7b3033..000000000000
--- a/docs/make.bat
+++ /dev/null
@@ -1,35 +0,0 @@
-@ECHO OFF
-
-pushd %~dp0
-
-REM Command file for Sphinx documentation
-
-if "%SPHINXBUILD%" == "" (
- set SPHINXBUILD=sphinx-build
-)
-set SOURCEDIR=source
-set BUILDDIR=build
-
-%SPHINXBUILD% >NUL 2>NUL
-if errorlevel 9009 (
- echo.
- echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
- echo.installed, then set the SPHINXBUILD environment variable to point
- echo.to the full path of the 'sphinx-build' executable. Alternatively you
- echo.may add the Sphinx directory to PATH.
- echo.
- echo.If you don't have Sphinx installed, grab it from
- echo.https://www.sphinx-doc.org/
- exit /b 1
-)
-
-if "%1" == "" goto help
-
-%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-goto end
-
-:help
-%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-
-:end
-popd
diff --git a/docs/mkdocs/hooks/generate_examples.py b/docs/mkdocs/hooks/generate_examples.py
new file mode 100644
index 000000000000..6f290efe45c2
--- /dev/null
+++ b/docs/mkdocs/hooks/generate_examples.py
@@ -0,0 +1,162 @@
+# SPDX-License-Identifier: Apache-2.0
+import itertools
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Literal
+
+import regex as re
+
+ROOT_DIR = Path(__file__).parent.parent.parent.parent
+ROOT_DIR_RELATIVE = '../../../../..'
+EXAMPLE_DIR = ROOT_DIR / "examples"
+EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples"
+print(ROOT_DIR.resolve())
+print(EXAMPLE_DIR.resolve())
+print(EXAMPLE_DOC_DIR.resolve())
+
+
+def fix_case(text: str) -> str:
+ subs = {
+ "api": "API",
+ "cli": "CLI",
+ "cpu": "CPU",
+ "llm": "LLM",
+ "mae": "MAE",
+ "tpu": "TPU",
+ "aqlm": "AQLM",
+ "gguf": "GGUF",
+ "lora": "LoRA",
+ "rlhf": "RLHF",
+ "vllm": "vLLM",
+ "openai": "OpenAI",
+ "lmcache": "LMCache",
+ "multilora": "MultiLoRA",
+ "mlpspeculator": "MLPSpeculator",
+ r"fp\d+": lambda x: x.group(0).upper(), # e.g. fp16, fp32
+ r"int\d+": lambda x: x.group(0).upper(), # e.g. int8, int16
+ }
+ for pattern, repl in subs.items():
+ text = re.sub(rf'\b{pattern}\b', repl, text, flags=re.IGNORECASE)
+ return text
+
+
+@dataclass
+class Example:
+ """
+ Example class for generating documentation content from a given path.
+
+ Attributes:
+ path (Path): The path to the main directory or file.
+ category (str): The category of the document.
+ main_file (Path): The main file in the directory.
+ other_files (list[Path]): list of other files in the directory.
+ title (str): The title of the document.
+
+ Methods:
+ __post_init__(): Initializes the main_file, other_files, and title attributes.
+ determine_main_file() -> Path: Determines the main file in the given path.
+ determine_other_files() -> list[Path]: Determines other files in the directory excluding the main file.
+ determine_title() -> str: Determines the title of the document.
+ generate() -> str: Generates the documentation content.
+ """ # noqa: E501
+ path: Path
+ category: str = None
+ main_file: Path = field(init=False)
+ other_files: list[Path] = field(init=False)
+ title: str = field(init=False)
+
+ def __post_init__(self):
+ self.main_file = self.determine_main_file()
+ self.other_files = self.determine_other_files()
+ self.title = self.determine_title()
+
+ def determine_main_file(self) -> Path:
+ """
+ Determines the main file in the given path.
+ If the path is a file, it returns the path itself. Otherwise, it searches
+ for Markdown files (*.md) in the directory and returns the first one found.
+ Returns:
+ Path: The main file path, either the original path if it's a file or the first
+ Markdown file found in the directory.
+ Raises:
+ IndexError: If no Markdown files are found in the directory.
+ """ # noqa: E501
+ return self.path if self.path.is_file() else list(
+ self.path.glob("*.md")).pop()
+
+ def determine_other_files(self) -> list[Path]:
+ """
+ Determine other files in the directory excluding the main file.
+
+ This method checks if the given path is a file. If it is, it returns an empty list.
+ Otherwise, it recursively searches through the directory and returns a list of all
+ files that are not the main file.
+
+ Returns:
+ list[Path]: A list of Path objects representing the other files in the directory.
+ """ # noqa: E501
+ if self.path.is_file():
+ return []
+ is_other_file = lambda file: file.is_file() and file != self.main_file
+ return [file for file in self.path.rglob("*") if is_other_file(file)]
+
+ def determine_title(self) -> str:
+ return fix_case(self.path.stem.replace("_", " ").title())
+
+ def generate(self) -> str:
+ content = f"---\ntitle: {self.title}\n---\n\n"
+ content += f"Source .\n\n"
+
+ # Use long code fence to avoid issues with
+ # included files containing code fences too
+ code_fence = "``````"
+ is_code = self.main_file.suffix != ".md"
+ if is_code:
+ content += f"{code_fence}{self.main_file.suffix[1:]}\n"
+ content += f'--8<-- "{self.main_file}"\n'
+ if is_code:
+ content += f"{code_fence}\n"
+ content += "\n"
+
+ if not self.other_files:
+ return content
+
+ content += "## Example materials\n\n"
+ for file in sorted(self.other_files):
+ content += f'??? abstract "{file.relative_to(self.path)}"\n'
+ if file.suffix != ".md":
+ content += f" {code_fence}{file.suffix[1:]}\n"
+ content += f' --8<-- "{file}"\n'
+ if file.suffix != ".md":
+ content += f" {code_fence}\n"
+
+ return content
+
+
+def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
+ # Create the EXAMPLE_DOC_DIR if it doesn't exist
+ if not EXAMPLE_DOC_DIR.exists():
+ EXAMPLE_DOC_DIR.mkdir(parents=True)
+
+ categories = sorted(p for p in EXAMPLE_DIR.iterdir() if p.is_dir())
+
+ examples = []
+ glob_patterns = ["*.py", "*.md", "*.sh"]
+ # Find categorised examples
+ for category in categories:
+ globs = [category.glob(pattern) for pattern in glob_patterns]
+ for path in itertools.chain(*globs):
+ examples.append(Example(path, category.stem))
+ # Find examples in subdirectories
+ for path in category.glob("*/*.md"):
+ examples.append(Example(path.parent, category.stem))
+
+ # Generate the example documentation
+ for example in sorted(examples, key=lambda e: e.path.stem):
+ example_name = f"{example.path.stem}.md"
+ doc_path = EXAMPLE_DOC_DIR / example.category / example_name
+ print(doc_path)
+ if not doc_path.parent.exists():
+ doc_path.parent.mkdir(parents=True)
+ with open(doc_path, "w+") as f:
+ f.write(example.generate())
diff --git a/docs/mkdocs/hooks/remove_announcement.py b/docs/mkdocs/hooks/remove_announcement.py
new file mode 100644
index 000000000000..e5f8549d8383
--- /dev/null
+++ b/docs/mkdocs/hooks/remove_announcement.py
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: Apache-2.0
+import os
+from typing import Literal
+
+
+def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
+ # see https://docs.readthedocs.io/en/stable/reference/environment-variables.html # noqa
+ if os.getenv('READTHEDOCS_VERSION_TYPE') == "tag":
+ # remove the warning banner if the version is a tagged release
+ docs_dir = os.path.dirname(__file__)
+ announcement_path = os.path.join(docs_dir,
+ "mkdocs/overrides/main.html")
+ # The file might be removed already if the build is triggered multiple
+ # times (readthedocs build both HTML and PDF versions separately)
+ if os.path.exists(announcement_path):
+ os.remove(announcement_path)
diff --git a/docs/mkdocs/hooks/url_schemes.py b/docs/mkdocs/hooks/url_schemes.py
new file mode 100644
index 000000000000..c738828085ba
--- /dev/null
+++ b/docs/mkdocs/hooks/url_schemes.py
@@ -0,0 +1,53 @@
+# SPDX-License-Identifier: Apache-2.0
+import regex as re
+from mkdocs.config.defaults import MkDocsConfig
+from mkdocs.structure.files import Files
+from mkdocs.structure.pages import Page
+
+
+def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
+ files: Files):
+ gh_icon = ":octicons-mark-github-16:"
+ gh_url = "https://github.com"
+ repo_url = f"{gh_url}/vllm-project/vllm"
+ org_url = f"{gh_url}/orgs/vllm-project"
+ urls = {
+ "issue": f"{repo_url}/issues",
+ "pr": f"{repo_url}/pull",
+ "project": f"{org_url}/projects",
+ "dir": f"{repo_url}/tree/main",
+ "file": f"{repo_url}/blob/main",
+ }
+ titles = {
+ "issue": "Issue #",
+ "pr": "Pull Request #",
+ "project": "Project #",
+ "dir": "",
+ "file": "",
+ }
+
+ scheme = r"gh-(?P.+?):(?P.+?)(#(?P.+?))?"
+ inline_link = re.compile(r"\[(?P[^\[]+?)\]\(" + scheme + r"\)")
+ auto_link = re.compile(f"<{scheme}>")
+
+ def replace_inline_link(match: re.Match) -> str:
+ url = f'{urls[match.group("type")]}/{match.group("path")}'
+ if fragment := match.group("fragment"):
+ url += f"#{fragment}"
+
+ return f'[{gh_icon} {match.group("title")}]({url})'
+
+ def replace_auto_link(match: re.Match) -> str:
+ type = match.group("type")
+ path = match.group("path")
+ title = f"{titles[type]}{path}"
+ url = f"{urls[type]}/{path}"
+ if fragment := match.group("fragment"):
+ url += f"#{fragment}"
+
+ return f"[{gh_icon} {title}]({url})"
+
+ markdown = inline_link.sub(replace_inline_link, markdown)
+ markdown = auto_link.sub(replace_auto_link, markdown)
+
+ return markdown
diff --git a/docs/source/_static/custom.js b/docs/mkdocs/javascript/run_llm_widget.js
similarity index 54%
rename from docs/source/_static/custom.js
rename to docs/mkdocs/javascript/run_llm_widget.js
index 58bc2ebb9614..d0e5560e92b4 100644
--- a/docs/source/_static/custom.js
+++ b/docs/mkdocs/javascript/run_llm_widget.js
@@ -17,22 +17,3 @@ document.addEventListener("DOMContentLoaded", function () {
script.async = true;
document.head.appendChild(script);
});
-
-// Update URL search params when tab is clicked
- document.addEventListener("DOMContentLoaded", function () {
- const tabs = document.querySelectorAll(".sd-tab-label");
-
- function updateURL(tab) {
- const syncGroup = tab.getAttribute("data-sync-group");
- const syncId = tab.getAttribute("data-sync-id");
- if (syncGroup && syncId) {
- const url = new URL(window.location);
- url.searchParams.set(syncGroup, syncId);
- window.history.replaceState(null, "", url);
- }
- }
-
- tabs.forEach(tab => {
- tab.addEventListener("click", () => updateURL(tab));
- });
-});
diff --git a/docs/mkdocs/overrides/main.html b/docs/mkdocs/overrides/main.html
new file mode 100644
index 000000000000..bdd62ebc158d
--- /dev/null
+++ b/docs/mkdocs/overrides/main.html
@@ -0,0 +1,5 @@
+{% extends "base.html" %}
+
+{% block announce %}
+
You are viewing the latest developer preview docs. Click here to view docs for the latest stable release.
+{% endblock %}
diff --git a/docs/mkdocs/stylesheets/extra.css b/docs/mkdocs/stylesheets/extra.css
new file mode 100644
index 000000000000..088143ed5956
--- /dev/null
+++ b/docs/mkdocs/stylesheets/extra.css
@@ -0,0 +1,36 @@
+/* Warning for latest docs */
+.md-banner {
+ background-color: var(--md-warning-bg-color);
+ color: var(--md-warning-fg-color);
+}
+
+/* https://christianoliff.com/blog/styling-external-links-with-an-icon-in-css/ */
+a:not(:has(svg)):not(.md-icon):not(.autorefs-external) {
+ align-items: center;
+
+ &[href^="//"]::after,
+ &[href^="http://"]::after,
+ &[href^="https://"]::after {
+ content: "";
+ width: 12px;
+ height: 12px;
+ margin-left: 4px;
+ background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='16' height='16' stroke='gray' viewBox='0 0 16 16'%3E%3Cpath fill-rule='evenodd' d='M8.636 3.5a.5.5 0 0 0-.5-.5H1.5A1.5 1.5 0 0 0 0 4.5v10A1.5 1.5 0 0 0 1.5 16h10a1.5 1.5 0 0 0 1.5-1.5V7.864a.5.5 0 0 0-1 0V14.5a.5.5 0 0 1-.5.5h-10a.5.5 0 0 1-.5-.5v-10a.5.5 0 0 1 .5-.5h6.636a.5.5 0 0 0 .5-.5z'/%3E%3Cpath fill-rule='evenodd' d='M16 .5a.5.5 0 0 0-.5-.5h-5a.5.5 0 0 0 0 1h3.793L6.146 9.146a.5.5 0 1 0 .708.708L15 1.707V5.5a.5.5 0 0 0 1 0v-5z'/%3E%3C/svg%3E");
+ background-position: center;
+ background-repeat: no-repeat;
+ background-size: contain;
+ display: inline-block;
+ }
+}
+
+/* Light mode: darker section titles */
+body[data-md-color-scheme="default"] .md-nav__item--section > label.md-nav__link .md-ellipsis {
+ color: rgba(0, 0, 0, 0.7) !important;
+ font-weight: 700;
+}
+
+/* Dark mode: lighter gray section titles */
+body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .md-ellipsis {
+ color: rgba(255, 255, 255, 0.75) !important;
+ font-weight: 700;
+}
diff --git a/docs/source/models/extensions/fastsafetensor.md b/docs/models/extensions/fastsafetensor.md
similarity index 100%
rename from docs/source/models/extensions/fastsafetensor.md
rename to docs/models/extensions/fastsafetensor.md
diff --git a/docs/source/models/extensions/runai_model_streamer.md b/docs/models/extensions/runai_model_streamer.md
similarity index 61%
rename from docs/source/models/extensions/runai_model_streamer.md
rename to docs/models/extensions/runai_model_streamer.md
index e0daa6f86dde..6755b574ea67 100644
--- a/docs/source/models/extensions/runai_model_streamer.md
+++ b/docs/models/extensions/runai_model_streamer.md
@@ -1,6 +1,7 @@
-(runai-model-streamer)=
-
-# Loading models with Run:ai Model Streamer
+---
+title: Loading models with Run:ai Model Streamer
+---
+[](){ #runai-model-streamer }
Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
@@ -15,19 +16,25 @@ pip3 install vllm[runai]
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
```console
-vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer
+vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
+ --load-format runai_streamer
```
To run model from AWS S3 object store run:
```console
-vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
+vllm serve s3://core-llm/Llama-3-8b \
+ --load-format runai_streamer
```
To run model from a S3 compatible object store run:
```console
-RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
+RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 \
+AWS_EC2_METADATA_DISABLED=true \
+AWS_ENDPOINT_URL=https://storage.googleapis.com \
+vllm serve s3://core-llm/Llama-3-8b \
+ --load-format runai_streamer
```
## Tunable parameters
@@ -38,19 +45,22 @@ You can tune `concurrency` that controls the level of concurrency and number of
For reading from S3, it will be the number of client instances the host is opening to the S3 server.
```console
-vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'
+vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
+ --load-format runai_streamer \
+ --model-loader-extra-config '{"concurrency":16}'
```
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
```console
-vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
+vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
+ --load-format runai_streamer \
+ --model-loader-extra-config '{"memory_limit":5368709120}'
```
-:::{note}
-For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
-:::
+!!! note
+ For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
## Sharded Model Loading
@@ -63,7 +73,9 @@ vllm serve /path/to/sharded/model --load-format runai_streamer_sharded
The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`:
```console
-vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
+vllm serve /path/to/sharded/model \
+ --load-format runai_streamer_sharded \
+ --model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
```
To create sharded model files, you can use the script provided in . This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
@@ -71,9 +83,10 @@ To create sharded model files, you can use the script provided in for a list of options when initializing the model.
+The [LLM][vllm.LLM] class provides various methods for offline inference.
+See [configuration][configuration] for a list of options when initializing the model.
### `LLM.generate`
-The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
+The [generate][vllm.LLM.generate] method is available to all generative models in vLLM.
It is similar to [its counterpart in HF Transformers](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate),
except that tokenization and detokenization are also performed automatically.
@@ -34,7 +35,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
-You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
+You can optionally control the language generation by passing [SamplingParams][vllm.SamplingParams].
For example, you can use greedy sampling by setting `temperature=0`:
```python
@@ -50,16 +51,15 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
-:::{important}
-By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if {class}`~vllm.SamplingParams` is not specified.
+!!! warning
+ By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
-However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the {class}`~vllm.LLM` instance.
-:::
+ However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
A code example can be found here:
### `LLM.beam_search`
-The {class}`~vllm.LLM.beam_search` method implements [beam search](https://huggingface.co/docs/transformers/en/generation_strategies#beam-search) on top of {class}`~vllm.LLM.generate`.
+The [beam_search][vllm.LLM.beam_search] method implements [beam search](https://huggingface.co/docs/transformers/en/generation_strategies#beam-search) on top of [generate][vllm.LLM.generate].
For example, to search using 5 beams and output at most 50 tokens:
```python
@@ -77,14 +77,13 @@ for output in outputs:
### `LLM.chat`
-The {class}`~vllm.LLM.chat` method implements chat functionality on top of {class}`~vllm.LLM.generate`.
+The [chat][vllm.LLM.chat] method implements chat functionality on top of [generate][vllm.LLM.generate].
In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.
-:::{important}
-In general, only instruction-tuned models have a chat template.
-Base models may perform poorly as they are not trained to respond to the chat conversation.
-:::
+!!! warning
+ In general, only instruction-tuned models have a chat template.
+ Base models may perform poorly as they are not trained to respond to the chat conversation.
```python
from vllm import LLM
@@ -133,7 +132,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
## Online Serving
-Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server][openai-compatible-server] provides endpoints that correspond to the offline APIs:
-- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
-- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
+- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
+- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template.
diff --git a/docs/source/models/pooling_models.md b/docs/models/pooling_models.md
similarity index 62%
rename from docs/source/models/pooling_models.md
rename to docs/models/pooling_models.md
index 3fd35e2e8bd1..89a128915a76 100644
--- a/docs/source/models/pooling_models.md
+++ b/docs/models/pooling_models.md
@@ -1,70 +1,48 @@
-(pooling-models)=
-
-# Pooling Models
+---
+title: Pooling Models
+---
+[](){ #pooling-models }
vLLM also supports pooling models, including embedding, reranking and reward models.
-In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmModelForPooling` interface.
-These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
+In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
+These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
before returning them.
-:::{note}
-We currently support pooling models primarily as a matter of convenience.
-As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
-pooling models as they only work on the generation or decode stage, so performance may not improve as much.
-:::
+!!! note
+ We currently support pooling models primarily as a matter of convenience.
+ As shown in the [Compatibility Matrix][compatibility-matrix], most vLLM features are not applicable to
+ pooling models as they only work on the generation or decode stage, so performance may not improve as much.
For pooling models, we support the following `--task` options.
The selected option sets the default pooler used to extract the final hidden states:
-:::{list-table}
-:widths: 50 25 25 25
-:header-rows: 1
-
-- * Task
- * Pooling Type
- * Normalization
- * Softmax
-- * Embedding (`embed`)
- * `LAST`
- * โ ๏ธ
- * โ
-- * Classification (`classify`)
- * `LAST`
- * โ
- * โ ๏ธ
-- * Sentence Pair Scoring (`score`)
- * \*
- * \*
- * \*
-- * Reward Modeling (`reward`)
- * `ALL`
- * โ
- * โ
-:::
+| Task | Pooling Type | Normalization | Softmax |
+|---------------------------------|----------------|-----------------|-----------|
+| Embedding (`embed`) | `LAST` | โ ๏ธ | โ |
+| Classification (`classify`) | `LAST` | โ | โ ๏ธ |
+| Sentence Pair Scoring (`score`) | \* | \* | \* |
\*The default pooler is always defined by the model.
-:::{note}
-If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
-:::
+!!! note
+ If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
-:::{tip}
-You can customize the model's pooling method via the `--override-pooler-config` option,
-which takes priority over both the model's and Sentence Transformers's defaults.
-:::
+!!! tip
+ You can customize the model's pooling method via the `--override-pooler-config` option,
+ which takes priority over both the model's and Sentence Transformers's defaults.
## Offline Inference
-The {class}`~vllm.LLM` class provides various methods for offline inference.
-See for a list of options when initializing the model.
+The [LLM][vllm.LLM] class provides various methods for offline inference.
+See [configuration][configuration] for a list of options when initializing the model.
### `LLM.encode`
-The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM.
+The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
It returns the extracted hidden states directly, which is useful for reward models.
```python
@@ -79,7 +57,7 @@ print(f"Data: {data!r}")
### `LLM.embed`
-The {class}`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
+The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.
```python
@@ -96,7 +74,7 @@ A code example can be found here: .
+
+These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
+
+[](){ #transformers-backend }
+
+### Transformers
+
+vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models are supported, and vision language model support is planned!
+
+To check if the modeling backend is Transformers, you can simply do this:
+
+```python
+from vllm import LLM
+llm = LLM(model=..., task="generate") # Name or path of your model
+llm.apply_model(lambda model: print(type(model)))
+```
+
+If it is `TransformersForCausalLM` then it means it's based on Transformers!
+
+!!! tip
+ You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][openai-compatible-server].
+
+!!! note
+ vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
+
+#### Custom models
+
+If a model is neither supported natively by vLLM or Transformers, it can still be used in vLLM!
+
+For a model to be compatible with the Transformers backend for vLLM it must:
+
+- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
+ * The model directory must have the correct structure (e.g. `config.json` is present).
+ * `config.json` must contain `auto_map.AutoModel`.
+- be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]):
+ * Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
+
+If the compatible model is:
+
+- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][openai-compatible-server].
+- in a local directory, simply pass directory path to `model=` for [offline-inference][offline-inference] or `vllm serve ` for the [openai-compatible-server][openai-compatible-server].
+
+This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
+
+[](){ #writing-custom-models }
+
+#### Writing custom models
+
+This section details the necessary modifications to make to a Transformers compatible custom model that make it compatible with the Transformers backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).
+
+To make your model compatible with the Transformers backend, it needs:
+
+1. `kwargs` passed down through all modules from `MyModel` to `MyAttention`.
+2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
+3. `MyModel` must contain `_supports_attention_backend = True`.
+
+```python title="modeling_my_model.py"
+
+from transformers import PreTrainedModel
+from torch import nn
+
+class MyAttention(nn.Module):
+
+ def forward(self, hidden_states, **kwargs):
+ ...
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+ attn_output, attn_weights = attention_interface(
+ self,
+ query_states,
+ key_states,
+ value_states,
+ **kwargs,
+ )
+ ...
+
+class MyModel(PreTrainedModel):
+ _supports_attention_backend = True
+```
+
+Here is what happens in the background when this model is loaded:
+
+1. The config is loaded.
+2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
+3. `MyModel` is loaded into `TransformersForCausalLM` (see ) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
+
+That's it!
+
+For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:
+
+```python title="configuration_my_model.py"
+
+from transformers import PretrainedConfig
+
+class MyConfig(PretrainedConfig):
+ base_model_tp_plan = {
+ "layers.*.self_attn.k_proj": "colwise",
+ "layers.*.self_attn.v_proj": "colwise",
+ "layers.*.self_attn.o_proj": "rowwise",
+ "layers.*.mlp.gate_proj": "colwise",
+ "layers.*.mlp.up_proj": "colwise",
+ "layers.*.mlp.down_proj": "rowwise",
+ }
+ base_model_pp_plan = {
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+ "norm": (["hidden_states"], ["hidden_states"]),
+ }
+```
+
+- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
+- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
+ * You only need to do this for layers which are not present on all pipeline stages
+ * vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
+ * The `list` in the first element of the `tuple` contains the names of the input arguments
+ * The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
+
+## Loading a Model
+
+### Hugging Face Hub
+
+By default, vLLM loads models from [Hugging Face (HF) Hub](https://huggingface.co/models). To change the download path for models, you can set the `HF_HOME` environment variable; for more details, refer to [their official documentation](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhome).
+
+To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
+If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
+
+Models do not _need_ to be natively supported to be used in vLLM.
+The [Transformers backend][transformers-backend] enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
+
+!!! tip
+ The easiest way to check if your model is really supported at runtime is to run the program below:
+
+ ```python
+ from vllm import LLM
+
+ # For generative models (task=generate) only
+ llm = LLM(model=..., task="generate") # Name or path of your model
+ output = llm.generate("Hello, my name is")
+ print(output)
+
+ # For pooling models (task={embed,classify,reward,score}) only
+ llm = LLM(model=..., task="embed") # Name or path of your model
+ output = llm.encode("Hello, my name is")
+ print(output)
+ ```
+
+ If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
+
+Otherwise, please refer to [Adding a New Model][new-model] for instructions on how to implement your model in vLLM.
+Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
+
+#### Download a model
+
+If you prefer, you can use the Hugging Face CLI to [download a model](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download) or specific files from a model repository:
+
+```console
+# Download a model
+huggingface-cli download HuggingFaceH4/zephyr-7b-beta
+
+# Specify a custom cache directory
+huggingface-cli download HuggingFaceH4/zephyr-7b-beta --cache-dir ./path/to/cache
+
+# Download a specific file from a model repo
+huggingface-cli download HuggingFaceH4/zephyr-7b-beta eval_results.json
+```
+
+#### List the downloaded models
+
+Use the Hugging Face CLI to [manage models](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-your-cache) stored in local cache:
+
+```console
+# List cached models
+huggingface-cli scan-cache
+
+# Show detailed (verbose) output
+huggingface-cli scan-cache -v
+
+# Specify a custom cache directory
+huggingface-cli scan-cache --dir ~/.cache/huggingface/hub
+```
+
+#### Delete a cached model
+
+Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache:
+
+```console
+# The `delete-cache` command requires extra dependencies to work with the TUI.
+# Please run `pip install huggingface_hub[cli]` to install them.
+
+# Launch the interactive TUI to select models to delete
+$ huggingface-cli delete-cache
+? Select revisions to delete: 1 revisions selected counting for 438.9M.
+ โ None of the following (if selected, nothing will be deleted).
+Model BAAI/bge-base-en-v1.5 (438.9M, used 1 week ago)
+โฏ โ a5beb1e3: main # modified 1 week ago
+
+Model BAAI/bge-large-en-v1.5 (1.3G, used 1 week ago)
+ โ d4aa6901: main # modified 1 week ago
+
+Model BAAI/bge-reranker-base (1.1G, used 4 weeks ago)
+ โ 2cfc18c9: main # modified 4 weeks ago
+
+Press to select, to validate and to quit without modification.
+
+# Need to confirm after selected
+? Select revisions to delete: 1 revision(s) selected.
+? 1 revisions selected counting for 438.9M. Confirm deletion ? Yes
+Start deletion.
+Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M.
+```
+
+#### Using a proxy
+
+Here are some tips for loading/downloading models from Hugging Face using a proxy:
+
+- Set the proxy globally for your session (or set it in the profile file):
+
+```shell
+export http_proxy=http://your.proxy.server:port
+export https_proxy=http://your.proxy.server:port
+```
+
+- Set the proxy for just the current command:
+
+```shell
+https_proxy=http://your.proxy.server:port huggingface-cli download