Merged
4 changes: 3 additions & 1 deletion docs/getting_started/installation/.nav.yml
nav:
  - README.md
  - gpu.md
  - cpu.md
  - google_tpu.md
  - intel_gaudi.md
  - aws_neuron.md
7 changes: 3 additions & 4 deletions docs/getting_started/installation/README.md
vLLM supports the following hardware platforms:
- [ARM AArch64](cpu.md#arm-aarch64)
- [Apple silicon](cpu.md#apple-silicon)
- [IBM Z (S390X)](cpu.md#ibm-z-s390x)
- [Google TPU](google_tpu.md)
- [Intel Gaudi](intel_gaudi.md)
- [AWS Neuron](aws_neuron.md)
117 changes: 0 additions & 117 deletions docs/getting_started/installation/ai_accelerator.md

This file was deleted.

# AWS Neuron

[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
This page describes how to set up your environment to run vLLM on Neuron.

!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.

## Requirements

- OS: Linux
- Python: 3.9 or newer
- PyTorch: 2.5 or 2.6
- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
- AWS Neuron SDK 2.23

## Configure a new environment

### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies

The easiest way to launch a Trainium or Inferentia instance with pre-installed Neuron

- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
- Once inside your instance, activate the pre-installed virtual environment for inference by running

```console
source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
```
for alternative setup instructions, including using Docker and manually installing
NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).

## Set up using Python

### Pre-built wheels

Currently, there are no pre-built Neuron wheels.


### Build wheel from source

To build and install vLLM from source, run:

```console
git clone https://github.com/vllm-project/vllm.git
VLLM_TARGET_DEVICE="neuron" pip install -e .
```

AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
available on vLLM V0. Please use the AWS fork for the following features:

- Llama-3.2 multi-modal support
- Multi-node distributed inference

Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardware is not tested.

## Set up using Docker

### Pre-built images

Currently, there are no pre-built Neuron images.

### Build image from source

See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.

Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.

## Extra information

[](){ #feature-support-through-nxd-inference-backend }

### Feature support through NxD Inference backend

The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
to perform most of the heavy lifting, which includes PyTorch model initialization

To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include

```console
override_neuron_config={
    "enable_bucketing": False,
}
```

or when launching vLLM from the CLI, pass

```console
--override-neuron-config "{\"enable_bucketing\":false}"
```
Alternatively, users can directly call the NxDI library to trace and compile your model.
### Known limitations

- EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
  [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
  for how to convert pretrained EAGLE model checkpoints to be compatible with NxDI.
- Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
  to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
- Multi-LoRA serving: NxD Inference only supports loading LoRA adapters at server startup. Dynamic loading of LoRA adapters at
  runtime is not currently supported. Refer to this [multi-LoRA example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py).
- Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
  to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
- Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
  to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
  to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
- Known edge case bug in speculative decoding: an edge-case failure may occur in speculative decoding when the sequence length approaches
  the max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
  to allocate an additional block to ensure there is enough memory for the number of lookahead slots, but since paged attention is not
  well supported, there isn't another Neuron block for vLLM to allocate. A workaround (terminating one iteration early) is
  implemented in the AWS Neuron fork but is not upstreamed to vLLM main because it modifies core vLLM logic.

### Environment variables

- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
  compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
  artifacts under a `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set
  but the directory does not exist, or its contents are invalid, Neuron will fall back to a new compilation and store the artifacts
  under the specified path.
- `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend).
- `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend).
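
The artifact lookup order described for `NEURON_COMPILED_ARTIFACTS` can be sketched as follows. This is a hypothetical helper for illustration, not vLLM's actual code; `unique_hash` stands in for the hash the Neuron module computes:

```python
import os

def compiled_artifacts_dir(model_path: str, unique_hash: str) -> str:
    """Sketch of where Neuron reads/writes compiled artifacts."""
    override = os.environ.get("NEURON_COMPILED_ARTIFACTS")
    if override is not None:
        # Used as-is; if the directory is missing or invalid, Neuron
        # recompiles and stores fresh artifacts under this same path.
        return override
    # Default: a sub-directory inside the model path.
    return os.path.join(model_path, "neuron-compiled-artifacts", unique_hash)
```

Setting the variable before launching the server therefore skips compilation only when the directory already holds valid artifacts.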

# Google TPU

Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
!!! warning
    There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.

## Requirements

- Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4
For more information about using TPUs with GKE, see:
- <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
- <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>

## Configure a new environment

### Provision a Cloud TPU with the queued resource API

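As a sketch, a queued-resource request might look like the following. All capitalized values are placeholders, and the accelerator type and runtime version shown are assumptions; check the Cloud TPU documentation for values valid in your zone:

```console
gcloud compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --node-id TPU_NAME \
  --project PROJECT_ID \
  --zone ZONE \
  --accelerator-type v5litepod-4 \
  --runtime-version v2-alpha-tpuv5-lite
```

The request stays queued until capacity is available, at which point the TPU VM is provisioned.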
```console
gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
```
[TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
[TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones

## Set up using Python

### Pre-built wheels

Currently, there are no pre-built TPU wheels.

### Build wheel from source

Install Miniconda:


```bash
pip install -r requirements/tpu.txt
sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
```

Run the setup script:
```bash
VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
```
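
After the setup script finishes, you can optionally confirm that PyTorch/XLA sees the TPU. This is a hedged sanity check run on the TPU VM, assuming `torch_xla` was pulled in by the TPU requirements above:

```console
python -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"
```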

## Set up using Docker

### Pre-built images

See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.

### Build image from source

You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.

```console
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```
Install OpenBLAS with the following command:

```console
sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
```
