Merge branch 'main' into chat-template-content-format

DarkLight1337 committed Nov 2, 2024
2 parents 82051bf + 74b529c commit c37af03
Showing 110 changed files with 2,355 additions and 1,504 deletions.
9 changes: 9 additions & 0 deletions .github/dependabot.yml
@@ -14,6 +14,15 @@ updates:
     reviewers: ["khluu", "simon-mo"]
     allow:
       - dependency-type: "all"
+    ignore:
+      - dependency-name: "torch"
+      - dependency-name: "torchvision"
+      - dependency-name: "xformers"
+      - dependency-name: "lm-format-enforcer"
+      - dependency-name: "gguf"
+      - dependency-name: "compressed-tensors"
+      - dependency-name: "ray[adag]"
+      - dependency-name: "lm-eval"
     groups:
       patch-update:
         applies-to: version-updates
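
For reference, Dependabot ignore entries can also be scoped to specific update types instead of skipping a dependency entirely. A hypothetical sketch (not part of this change):

.. code-block:: yaml

   # Hypothetical: skip only major version bumps for torch, rather than all updates.
   ignore:
     - dependency-name: "torch"
       update-types: ["version-update:semver-major"]
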
158 changes: 144 additions & 14 deletions docs/source/getting_started/tpu-installation.rst
@@ -1,35 +1,167 @@
.. _installation_tpu:

+#####################
Installation with TPU
-=====================
+#####################

-vLLM supports Google Cloud TPUs using PyTorch XLA.
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
are available in different versions, each with different hardware specifications.
For more information about TPUs, see `TPU System Architecture <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm>`_.
For more information on the TPU versions supported with vLLM, see:

* `TPU v6e <https://cloud.google.com/tpu/docs/v6e>`_
* `TPU v5e <https://cloud.google.com/tpu/docs/v5e>`_
* `TPU v5p <https://cloud.google.com/tpu/docs/v5p>`_
* `TPU v4 <https://cloud.google.com/tpu/docs/v4>`_

These TPU versions allow you to configure the physical arrangements of the TPU
chips. This can improve throughput and networking performance. For more
information, see:

* `TPU v6e topologies <https://cloud.google.com/tpu/docs/v6e#configurations>`_
* `TPU v5e topologies <https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config>`_
* `TPU v5p topologies <https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config>`_
* `TPU v4 topologies <https://cloud.google.com/tpu/docs/v4#tpu-v4-config>`_

To use Cloud TPUs, you need TPU quota granted to your Google Cloud
Platform project. TPU quotas specify how many TPUs you can use in a
GCP project and are specified in terms of TPU version, the number of TPUs you
want to use, and quota type. For more information, see `TPU quota <https://cloud.google.com/tpu/docs/quota#tpu_quota>`_.

For TPU pricing information, see `Cloud TPU pricing <https://cloud.google.com/tpu/pricing>`_.

You may need additional persistent storage for your TPU VMs. For more
information, see `Storage options for Cloud TPU data <https://cloud.google.com/tpu/docs/storage-options>`_.

Requirements
------------

-* Google Cloud TPU VM (single & multi host)
-* TPU versions: v5e, v5p, v4
-* Python: 3.10
+* Google Cloud TPU VM
+* TPU versions: v6e, v5e, v5p, v4
+* Python: 3.10 or newer

Provision Cloud TPUs
====================

You can provision Cloud TPUs using the `Cloud TPU API <https://cloud.google.com/tpu/docs/reference/rest>`_
or the `queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_
API. This section shows how to create TPUs using the queued resources API.
For more information about using the Cloud TPU API, see `Create a Cloud TPU using the Create Node API <https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api>`_.
`Queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_
enable you to request Cloud TPU resources in a queued manner. When you request
queued resources, the request is added to a queue maintained by the Cloud TPU
service. When the requested resource becomes available, it's assigned to your
Google Cloud project for your immediate exclusive use.
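
While a request waits in the queue, you can poll its state. A minimal sketch, assuming the same :code:`gcloud` alpha component used below:

.. code-block:: console

   # Check whether the queued resource request has been provisioned yet.
   gcloud alpha compute tpus queued-resources describe QUEUED_RESOURCE_ID \
   --project PROJECT_ID \
   --zone ZONE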

Provision a Cloud TPU with the queued resource API
--------------------------------------------------

Create a TPU v5e with 4 TPU chips:

.. code-block:: console

   gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
   --node-id TPU_NAME \
   --project PROJECT_ID \
   --zone ZONE \
   --accelerator-type ACCELERATOR_TYPE \
   --runtime-version RUNTIME_VERSION \
   --service-account SERVICE_ACCOUNT

.. list-table:: Parameter descriptions
   :header-rows: 1

   * - Parameter name
     - Description
   * - QUEUED_RESOURCE_ID
     - The user-assigned ID of the queued resource request.
   * - TPU_NAME
     - The user-assigned name of the TPU which is created when the queued
       resource request is allocated.
   * - PROJECT_ID
     - Your Google Cloud project.
   * - ZONE
     - The `zone <https://cloud.google.com/tpu/docs/regions-zones>`_ where you
       want to create your Cloud TPU.
   * - ACCELERATOR_TYPE
     - The TPU version you want to use. Specify the TPU version, followed by a
       '-' and the number of TPU cores. For example, `v5e-4` specifies a v5e TPU
       with 4 cores. For more information, see `TPU versions <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
   * - RUNTIME_VERSION
     - The TPU VM runtime version to use. For more information, see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
   * - SERVICE_ACCOUNT
     - The email address for your service account. You can find it in the IAM
       Cloud Console under *Service Accounts*. For example:
       `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
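
For example, a request with the placeholders filled in might look like the following; all values are illustrative, not recommendations:

.. code-block:: console

   # Illustrative values only; substitute your own project, zone, and service account.
   gcloud alpha compute tpus queued-resources create my-queued-resource \
   --node-id my-tpu-v5e \
   --project my-gcp-project \
   --zone us-west4-a \
   --accelerator-type v5e-4 \
   --runtime-version v2-alpha-tpuv5-lite \
   --service-account tpu-service-account@my-gcp-project.iam.gserviceaccount.com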

Connect to your TPU using SSH:

.. code-block:: bash

   gcloud compute tpus tpu-vm ssh TPU_NAME

Create and activate a Conda environment for vLLM:

.. code-block:: bash

   conda create -n vllm python=3.10 -y
   conda activate vllm

-Installation options:
-1. :ref:`Build a docker image with Dockerfile <build_docker_tpu>`.
-2. :ref:`Build from source <build_from_source_tpu>`.

Clone the vLLM repository and go to the vLLM directory:

.. code-block:: bash

   git clone https://github.com/vllm-project/vllm.git && cd vllm

Uninstall the existing `torch` and `torch_xla` packages:

.. code-block:: bash

   pip uninstall torch torch-xla -y

Install `torch` and `torch_xla`:

.. code-block:: bash

   pip install --pre torch==2.6.0.dev20241028+cpu torchvision==0.20.0.dev20241028+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
   pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html

Install JAX and Pallas:

.. code-block:: bash

   pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
   pip install jaxlib==0.4.32.dev20240829 jax==0.4.32.dev20240829 -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
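
Before building vLLM, you can sanity-check that the nightly wheels import cleanly. A minimal sketch; the printed versions are whatever the wheels above installed:

.. code-block:: bash

   # Verify that the freshly installed nightlies import without errors.
   python -c "import torch, torch_xla, jax; print(torch.__version__, jax.__version__)"
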
Install other build dependencies:

.. code-block:: bash

   pip install -r requirements-tpu.txt
   VLLM_TARGET_DEVICE="tpu" python setup.py develop
   sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
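
Once the build finishes, a short offline-inference run can confirm that the TPU backend comes up. This is a hypothetical smoke test, not part of the documented steps, and the model name is only an example:

.. code-block:: python

   # Hypothetical smoke test: load a small model and generate a few tokens.
   from vllm import LLM, SamplingParams

   llm = LLM(model="facebook/opt-125m")  # example model, not a TPU-specific recommendation
   outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
   print(outputs[0].outputs[0].text)
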
Provision Cloud TPUs with GKE
-----------------------------

For more information about using TPUs with GKE, see:

* https://cloud.google.com/kubernetes-engine/docs/how-to/tpus
* https://cloud.google.com/kubernetes-engine/docs/concepts/tpus
* https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus

.. _build_docker_tpu:

Build a docker image with :code:`Dockerfile.tpu`
------------------------------------------------

-`Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_ is provided to build a docker image with TPU support.
+You can use `Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_
+to build a Docker image with TPU support.

.. code-block:: console

   $ docker build -f Dockerfile.tpu -t vllm-tpu .

-You can run the docker image with the following command:
+Run the Docker image with the following command:

.. code-block:: console
@@ -75,14 +207,12 @@ Next, build vLLM from source. This will only take a few seconds:
   $ VLLM_TARGET_DEVICE="tpu" python setup.py develop

.. note::

   Since TPU relies on XLA, which requires static shapes, vLLM bucketizes the possible input shapes and compiles an XLA graph for each shape.
   Compilation may take 20-30 minutes the first time it runs.
   Afterwards it drops to about 5 minutes because the XLA graphs are cached on disk (in :code:`VLLM_XLA_CACHE_PATH`, :code:`~/.cache/vllm/xla_cache` by default).
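
For example, to keep the compiled graphs on persistent storage that survives VM recreation, you could point :code:`VLLM_XLA_CACHE_PATH` at a mounted disk (the path is illustrative):

.. code-block:: bash

   # Reuse cached XLA graphs across runs by storing them on a persistent disk.
   VLLM_XLA_CACHE_PATH=/mnt/disks/persist/xla_cache python examples/offline_inference.py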


.. tip::

   If you encounter the following error:
@@ -93,7 +223,7 @@ Next, build vLLM from source. This will only take a few seconds:
   ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory

-   Please install OpenBLAS with the following command:
+   Install OpenBLAS with the following command:

.. code-block:: console
8 changes: 4 additions & 4 deletions docs/source/models/supported_models.rst
@@ -160,13 +160,13 @@ Text Generation
    -
    - ✅︎
  * - :code:`GraniteForCausalLM`
-   - PowerLM
-   - :code:`ibm/PowerLM-3b` etc.
+   - Granite 3.0, PowerLM
+   - :code:`ibm-granite/granite-3.0-2b-base`, :code:`ibm-granite/granite-3.0-8b-instruct`, :code:`ibm/PowerLM-3b`, etc.
    - ✅︎
    - ✅︎
  * - :code:`GraniteMoeForCausalLM`
-   - PowerMoE
-   - :code:`ibm/PowerMoE-3b` etc.
+   - Granite 3.0 MoE, PowerMoE
+   - :code:`ibm-granite/granite-3.0-1b-a400m-base`, :code:`ibm-granite/granite-3.0-3b-a800m-instruct`, :code:`ibm/PowerMoE-3b`, etc.
    - ✅︎
    - ✅︎
  * - :code:`InternLMForCausalLM`
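
As a quick illustration of the new table entries, the Granite checkpoints load by name like any other supported model. A minimal sketch; the model choice is just an example:

.. code-block:: python

   # Load one of the newly listed Granite 3.0 checkpoints and generate.
   from vllm import LLM

   llm = LLM(model="ibm-granite/granite-3.0-2b-base")
   print(llm.generate(["The capital of France is"])[0].outputs[0].text)
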
6 changes: 1 addition & 5 deletions examples/offline_inference_audio_language.py
@@ -34,11 +34,7 @@ def run_ultravox(question: str, audio_count: int):
                                          tokenize=False,
                                          add_generation_prompt=True)

-    llm = LLM(model=model_name,
-              enforce_eager=True,
-              enable_chunked_prefill=False,
-              max_model_len=8192,
-              limit_mm_per_prompt={"audio": audio_count})
+    llm = LLM(model=model_name, limit_mm_per_prompt={"audio": audio_count})
     stop_token_ids = None
     return llm, prompt, stop_token_ids

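For context, a hypothetical sketch of how the simplified setup might be driven end to end; the question, audio asset, and sampling values are illustrative:

.. code-block:: python

   # Hypothetical driver for the simplified Ultravox configuration.
   from vllm import SamplingParams
   from vllm.assets.audio import AudioAsset

   llm, prompt, stop_token_ids = run_ultravox("What is in this audio?", audio_count=1)
   params = SamplingParams(temperature=0.2, max_tokens=64, stop_token_ids=stop_token_ids)
   audio = AudioAsset("mary_had_lamb").audio_and_sample_rate
   outputs = llm.generate({"prompt": prompt, "multi_modal_data": {"audio": audio}}, params)
   print(outputs[0].outputs[0].text)
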
2 changes: 1 addition & 1 deletion requirements-lint.txt
@@ -1,7 +1,7 @@
# formatting
yapf==0.32.0
toml==0.10.2
-tomli==2.0.1
+tomli==2.0.2
ruff==0.6.5
codespell==2.3.0
isort==5.13.2
2 changes: 1 addition & 1 deletion requirements-test.in
@@ -32,6 +32,6 @@ aiohttp

# quantization
bitsandbytes>=0.44.0
-buildkite-test-collector==0.1.8
+buildkite-test-collector==0.1.9

numpy < 2.0.0
12 changes: 6 additions & 6 deletions requirements-test.txt
@@ -36,20 +36,20 @@ attrs==24.2.0
# referencing
audioread==3.0.1
# via librosa
-awscli==1.35.16
+awscli==1.35.19
# via -r requirements-test.in
bitsandbytes==0.44.1
# via -r requirements-test.in
black==24.10.0
# via datamodel-code-generator
-boto3==1.35.50
+boto3==1.35.53
# via tensorizer
-botocore==1.35.50
+botocore==1.35.53
# via
# awscli
# boto3
# s3transfer
-buildkite-test-collector==0.1.8
+buildkite-test-collector==0.1.9
# via -r requirements-test.in
certifi==2024.8.30
# via
@@ -426,7 +426,7 @@ requests==2.32.3
# transformers
rouge-score==0.1.2
# via lm-eval
-rpds-py==0.20.0
+rpds-py==0.20.1
# via
# jsonschema
# referencing
@@ -552,7 +552,7 @@ xxhash==3.5.0
# via
# datasets
# evaluate
-yarl==1.17.0
+yarl==1.17.1
# via aiohttp
zstandard==0.23.0
# via lm-eval