Merged
4 changes: 3 additions & 1 deletion docs/getting_started/installation/.nav.yml
nav:
  - README.md
  - gpu.md
  - cpu.md
  - google_tpu.md
  - intel_gaudi.md
  - aws_neuron.md
7 changes: 3 additions & 4 deletions docs/getting_started/installation/README.md
vLLM supports the following hardware platforms:
- [ARM AArch64](cpu.md#arm-aarch64)
- [Apple silicon](cpu.md#apple-silicon)
- [IBM Z (S390X)](cpu.md#ibm-z-s390x)
- [Google TPU](google_tpu.md)
- [Intel Gaudi](intel_gaudi.md)
- [AWS Neuron](aws_neuron.md)
117 changes: 0 additions & 117 deletions docs/getting_started/installation/ai_accelerator.md

This file was deleted.

# AWS Neuron

[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
This page describes how to set up your environment to run vLLM on Neuron.

!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.

## Requirements

- OS: Linux
- Python: 3.9 or newer
- PyTorch: 2.5 or 2.6
- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
- AWS Neuron SDK 2.23

## Configure a new environment

### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies

The easiest way to launch a Trainium or Inferentia instance with pre-installed Neuron

- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
- Once inside your instance, activate the pre-installed virtual environment for inference by running

```console
source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
```
for alternative setup instructions, including using Docker and manually installing
NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).

## Set up using Python

### Pre-built wheels

Currently, there are no pre-built Neuron wheels.


### Build wheel from source

To build and install vLLM from source, run:

```console
git clone https://github.com/vllm-project/vllm.git
VLLM_TARGET_DEVICE="neuron" pip install -e .
```

AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
available on vLLM V0. Please use the AWS fork for the following features:

- Llama-3.2 multi-modal support
- Multi-node distributed inference

Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardware is not tested.

## Set up using Docker

### Pre-built images

Currently, there are no pre-built Neuron images.

### Build image from source

See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.

Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.

## Extra information

[](){ #feature-support-through-nxd-inference-backend }

### Feature support through NxD Inference backend

The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
to perform most of the heavy lifting, which includes PyTorch model initialization

To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include

```console
override_neuron_config={
    "enable_bucketing": False,
}
```

or when launching vLLM from the CLI, pass

```console
--override-neuron-config "{\"enable_bucketing\":false}"
```
Alternatively, users can directly call the NxDI library to trace and compile your model.
### Known limitations

- EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
  [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
  for how to convert pretrained EAGLE model checkpoints to be compatible with NxDI.
- Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
  to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
- Multi-LoRA serving: NxD Inference only supports loading LoRA adapters at server startup. Dynamic loading of LoRA adapters at
  runtime is not currently supported. Refer to this [multi-LoRA example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py).
- Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
  to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
- Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
  to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
  to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
- Known edge case bug in speculative decoding: an edge-case failure may occur in speculative decoding when the sequence length approaches
  the max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
  to allocate an additional block to ensure there is enough memory for the number of lookahead slots, but since paged attention is not
  well supported, there isn't another Neuron block for vLLM to allocate. A workaround (terminating one iteration early) is
  implemented in the AWS Neuron fork but is not upstreamed to vLLM main because it modifies core vLLM logic.

### Environment variables

- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
  compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
  artifacts under a `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set
  but the directory does not exist, or its contents are invalid, Neuron will fall back to a new compilation and store the artifacts
  under the specified path.
- `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend).
- `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend).
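
The artifact lookup order described for `NEURON_COMPILED_ARTIFACTS` can be sketched as follows. This is a hypothetical helper for illustration, not vLLM's actual code; `unique_hash` stands in for the hash the Neuron module computes:

```python
import os

def compiled_artifacts_dir(model_path: str, unique_hash: str) -> str:
    """Sketch of where Neuron reads/writes compiled artifacts."""
    override = os.environ.get("NEURON_COMPILED_ARTIFACTS")
    if override is not None:
        # Used as-is; if the directory is missing or invalid, Neuron
        # recompiles and stores fresh artifacts under this same path.
        return override
    # Default: a sub-directory inside the model path.
    return os.path.join(model_path, "neuron-compiled-artifacts", unique_hash)
```

Setting the variable before launching the server therefore skips compilation only when the directory already holds valid artifacts.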

# Google TPU

Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
!!! warning
    There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.

## Requirements

- Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4
For more information about using TPUs with GKE, see:
- <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
- <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>

## Configure a new environment

### Provision a Cloud TPU with the queued resource API

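As a sketch, a queued-resource request might look like the following. All capitalized values are placeholders, and the accelerator type and runtime version shown are assumptions; check the Cloud TPU documentation for values valid in your zone:

```console
gcloud compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --node-id TPU_NAME \
  --project PROJECT_ID \
  --zone ZONE \
  --accelerator-type v5litepod-4 \
  --runtime-version v2-alpha-tpuv5-lite
```

The request stays queued until capacity is available, at which point the TPU VM is provisioned.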
```console
gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
```
[TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
[TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones

## Set up using Python

### Pre-built wheels

Currently, there are no pre-built TPU wheels.

### Build wheel from source

Install Miniconda:


```bash
pip install -r requirements/tpu.txt
sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
```

Run the setup script:
```bash
VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
```
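
After the setup script finishes, you can optionally confirm that PyTorch/XLA sees the TPU. This is a hedged sanity check run on the TPU VM, assuming `torch_xla` was pulled in by the TPU requirements above:

```console
python -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"
```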

## Set up using Docker

### Pre-built images

See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.

### Build image from source

You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.

```console
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```
Install OpenBLAS with the following command:

```console
sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
```
