Restructured and standardized READMEs

Showing 34 changed files with 1,327 additions and 493 deletions.

# Distributed Inferencing on vLLM

There are three common strategies for inference on vLLM:

- Single GPU (no distributed inference)
- Single-Node Multi-GPU (tensor parallel inference)
- Multi-Node Multi-GPU

In this guide, you will serve a fine-tuned Gemma large language model (LLM) using graphics processing units (GPUs) on Google Kubernetes Engine (GKE) with the vLLM serving framework, using the deployment strategies mentioned above. You can choose to swap the Gemma model with any other fine-tuned or instruction-tuned model for inference on GKE.

- Single GPU (no distributed inference) - If your model fits on a single GPU, you probably don't need distributed inference. Just use the single GPU to run the inference.
- Single-Node Multi-GPU (tensor parallel inference) - If your model is too large to fit on a single GPU but can fit on a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use; for example, if you need 4 GPUs, set the tensor parallel size to 4 (see the sketch after this list).
- Multi-Node Multi-GPU - If your model is too large to fit on a single node, you can combine tensor parallelism with pipeline parallelism across multiple nodes.
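
To make the tensor parallel option concrete, here is a minimal sketch of running the vLLM OpenAI-compatible server across 4 GPUs on one machine, outside of Kubernetes. The container image is the same one used later in this guide; the model path is a placeholder for a directory that already contains the model weights.

```sh
# Illustrative only: serve a local model across 4 GPUs with tensor parallelism.
# /path/to/model is a placeholder, not a path used elsewhere in this guide.
docker run --gpus all --shm-size=10g -p 8000:8000 \
    -v /path/to/model:/model \
    vllm/vllm-openai:v0.6.3.post1 \
    --model /model \
    --tensor-parallel-size 4
```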

By the end of this guide, you should be able to perform the following steps:

- Deploy a vLLM container to your cluster to host your model
- Use vLLM to serve the fine-tuned Gemma model
- View production metrics for your model serving
- Use custom metrics and the Horizontal Pod Autoscaler (HPA) to scale your model

## Prerequisites

- This guide was developed to be run on the [playground AI/ML platform](/platforms/gke-aiml/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
- A bucket containing the fine-tuned model from the [Fine-tuning example](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md)

## Preparation

- Clone the repository

```sh
git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
cd accelerated-platforms
```

- Change directory to the guide directory

```sh
cd use-cases/inferencing/serving/vllm/gcsfuse
```

- Ensure that your `MLP_ENVIRONMENT_FILE` is configured

```sh
cat ${MLP_ENVIRONMENT_FILE} && \
source ${MLP_ENVIRONMENT_FILE}
```

> You should see the various variables populated with the information specific to your environment.

- Configure the environment

| Variable        | Description                              | Example      |
| --------------- | ---------------------------------------- | ------------ |
| SERVE_KSA       | The Kubernetes service account           | ml-serve-gcs |
| SERVE_NAMESPACE | Namespace where the model will be served | ml-serve     |

```sh
SERVE_KSA=ml-serve-gcs
SERVE_NAMESPACE=ml-serve
```

- Get credentials for the GKE cluster

```sh
gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME} --project ${MLP_PROJECT_ID}
```

- Create the namespace and Kubernetes service account, and grant the service account read access to the model bucket

```sh
kubectl create ns ${SERVE_NAMESPACE}
kubectl create sa ${SERVE_KSA} -n ${SERVE_NAMESPACE}
gcloud storage buckets add-iam-policy-binding "gs://${MLP_MODEL_BUCKET}" \
    --member "principal://iam.googleapis.com/projects/${MLP_PROJECT_NUMBER}/locations/global/workloadIdentityPools/${MLP_PROJECT_ID}.svc.id.goog/subject/ns/${SERVE_NAMESPACE}/sa/${SERVE_KSA}" \
    --role "roles/storage.objectViewer"
```
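
Optionally, confirm that the Workload Identity principal was granted access to the bucket. This is just a quick check using the same `gcloud storage` command group as above.

```sh
# List the bucket's IAM bindings and look for the service account principal
gcloud storage buckets get-iam-policy "gs://${MLP_MODEL_BUCKET}" \
    --format=json | grep "${SERVE_KSA}"
```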

## Prepare the Persistent Disk (PD)

Loading model weights from a PersistentVolume is a way to load models faster. In GKE, PersistentVolumes backed by Google Cloud Persistent Disks can be mounted read-only simultaneously by multiple nodes (ReadOnlyMany), which allows multiple pods to access the model weights from a single volume.

- Configure the environment

| Variable      | Description                                                                                    | Example       |
| ------------- | ---------------------------------------------------------------------------------------------- | ------------- |
| ACCELERATOR   | Type of GPU accelerator to use (l4, a100, h100)                                                | l4            |
| MODEL_NAME    | The name of the model folder in the root of the GCS model bucket                               | model-gemma2  |
| MODEL_VERSION | The name of the version folder inside the model folder of the GCS model bucket                 | experiment    |
| ZONE          | GCP zone where you have accelerators available. The zone must be in the region ${MLP_REGION}.  | us-central1-a |

```sh
ACCELERATOR=l4
MODEL_NAME=model-gemma2
MODEL_VERSION=experiment
ZONE=us-central1-a
```
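
The manifests in this guide mount the model bucket directly with the GCS FUSE CSI driver, but if you want to experiment with the PersistentVolume approach described above, a minimal sketch might look like the following. The disk name, capacity, and the assumption that a zonal persistent disk already exists in `${ZONE}` and contains the model weights are all illustrative, not part of this guide.

```sh
# Illustrative only: a ReadOnlyMany PersistentVolume/PersistentVolumeClaim backed by a
# pre-created Compute Engine persistent disk that already contains the model weights.
# The disk name and sizes are assumptions.
kubectl --namespace ${SERVE_NAMESPACE} apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vllm-model-weights-pv
spec:
  accessModes:
    - ReadOnlyMany
  capacity:
    storage: 100Gi
  csi:
    driver: pd.csi.storage.gke.io
    volumeHandle: projects/${MLP_PROJECT_ID}/zones/${ZONE}/disks/vllm-model-weights-disk
    fsType: ext4
    readOnly: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-weights-pvc
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: ""
  volumeName: vllm-model-weights-pv
EOF
```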

## Serve the model with vLLM

- Configure the deployment

```sh
VLLM_IMAGE_NAME="vllm/vllm-openai:v0.6.3.post1"
```

```sh
sed \
    -i -e "s|V_MODEL_BUCKET|${MLP_MODEL_BUCKET}|" \
    -i -e "s|V_MODEL_NAME|${MODEL_NAME}|" \
    -i -e "s|V_MODEL_VERSION|${MODEL_VERSION}|" \
    -i -e "s|V_IMAGE_NAME|${VLLM_IMAGE_NAME}|" \
    -i -e "s|V_KSA|${SERVE_KSA}|" \
    manifests/model-deployment-${ACCELERATOR}.yaml
```
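
Optionally, spot-check that the placeholders were replaced before applying the manifest.

```sh
# These fields should now show your image, bucket, and Kubernetes service account
grep -E "image:|bucketName:|serviceAccountName:" manifests/model-deployment-${ACCELERATOR}.yaml
```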

- Create the deployment

```sh
kubectl --namespace ${SERVE_NAMESPACE} apply -f manifests/model-deployment-${ACCELERATOR}.yaml
```

- Wait for the deployment to be ready

```sh
kubectl --namespace ${SERVE_NAMESPACE} wait --for=condition=ready --timeout=900s pod --selector app=vllm-openai-gcs-${ACCELERATOR}
```
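
Before moving on, you can send a quick smoke test request to the OpenAI-compatible API. This sketch assumes the Service created by the manifest is named `vllm-openai-gcs-${ACCELERATOR}` and listens on port 8000, as in the manifests in this guide; vLLM uses the `--model` path as the served model name, so the request references `/gcs/${MODEL_NAME}/${MODEL_VERSION}`.

```sh
# Forward the vLLM Service to your machine (run in a separate terminal or background it)
kubectl --namespace ${SERVE_NAMESPACE} port-forward service/vllm-openai-gcs-${ACCELERATOR} 8000:8000 &

# Send a test completion request to the OpenAI-compatible endpoint
curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "/gcs/'"${MODEL_NAME}"'/'"${MODEL_VERSION}"'",
      "prompt": "I am looking for comfortable cycling shorts for women, what are some good options?",
      "max_tokens": 256
    }'
```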

## Serve the model through a web chat interface

- Configure the deployment

```sh
sed \
    -i -e "s|V_ACCELERATOR|${ACCELERATOR}|g" \
    -i -e "s|V_MODEL_NAME|${MODEL_NAME}|g" \
    -i -e "s|V_MODEL_VERSION|${MODEL_VERSION}|g" \
    manifests/gradio.yaml
```

- Create the deployment

```sh
kubectl --namespace ${SERVE_NAMESPACE} apply -f manifests/gradio.yaml
```

- Verify the deployment is ready
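
The `gradio.yaml` manifest is not shown in this guide, so the exact label is an assumption; if the Gradio pods are labeled `app=gradio`, a readiness check similar to the one used for the model server would be:

```sh
# Assumes the Gradio pods carry the label app=gradio
kubectl --namespace ${SERVE_NAMESPACE} wait --for=condition=ready --timeout=300s pod --selector app=gradio
```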

- Access the chat interface

```sh
echo -e "\nGradio chat interface: ${MLP_GRADIO_NAMESPACE_ENDPOINT}\n"
```

- Enter the following prompt in the chat text box to get a response from the model.

```
I'm looking for comfortable cycling shorts for women, what are some good options?
```

## Metrics

vLLM exposes a number of metrics that can be used to monitor the health of the system. For more information about accessing these metrics, see [vLLM Metrics](/use-cases/inferencing/serving/vllm/metrics/README.md).
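
For a quick look at the raw Prometheus metrics without setting up a scraping pipeline, you can port-forward the serving Service and query the `/metrics` endpoint that the vLLM OpenAI server exposes on the same port. A different local port is used here to avoid clashing with an earlier port-forward.

```sh
# Forward the vLLM Service on a spare local port and list a few vllm-prefixed metrics
kubectl --namespace ${SERVE_NAMESPACE} port-forward service/vllm-openai-gcs-${ACCELERATOR} 8001:8000 &
curl -s http://127.0.0.1:8001/metrics | grep "^vllm:" | head -20
```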

### Run batch inference on GKE

Once a model has completed fine-tuning and is deployed on GKE, you can run batch inference on it. Follow the instructions in the [batch inference README](/use-cases/inferencing/batch-inference/README.md) to run batch inference.

### Run benchmarks for inference

The model is now ready for benchmarking. Follow the [benchmark README](/use-cases/inferencing/benchmarks/README.md) to run inference benchmarks on your model.

### Inference at scale

You can configure the Horizontal Pod Autoscaler (HPA) to scale your inference deployment based on relevant metrics. Follow the instructions in the [inference at scale README](./inference-scale/README.md) to scale your deployed model.
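
As a rough illustration of what that can look like, here is a hedged sketch of an `autoscaling/v2` HorizontalPodAutoscaler driven by a vLLM metric. It assumes a custom metrics adapter (for example, the Custom Metrics Stackdriver Adapter) is already installed and exposes vLLM's `vllm:num_requests_running` gauge to the Kubernetes custom metrics API under that name; the exact metric name and target value depend on your adapter and workload, and the linked guide is the authoritative walkthrough.

```sh
# Illustrative only: scale the vLLM deployment on in-flight request count.
# The metric name below is an assumption; it must match what your custom
# metrics adapter exposes to the Kubernetes custom metrics API.
kubectl --namespace ${SERVE_NAMESPACE} apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-openai-gcs-${ACCELERATOR}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-openai-gcs-${ACCELERATOR}
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm:num_requests_running
      target:
        type: AverageValue
        averageValue: "5"
EOF
```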

use-cases/inferencing/serving/vllm/gcsfuse/manifests/model-deployment-a100.yaml (107 additions, 0 deletions)

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai-gcs-a100
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai-gcs-a100
  template:
    metadata:
      labels:
        app: vllm-openai-gcs-a100
      annotations:
        # Inject the GCS FUSE CSI sidecar into this Pod
        gke-gcsfuse/volumes: "true"
    spec:
      containers:
        - name: inference-server
          args:
            - --model=$(MODEL)
            - --tensor-parallel-size=2
          env:
            # Model weights are read from the GCS bucket mounted at /gcs
            - name: MODEL
              value: /gcs/V_MODEL_NAME/V_MODEL_VERSION
            - name: VLLM_ATTENTION_BACKEND
              value: FLASHINFER
          image: V_IMAGE_NAME
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 240
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            requests:
              cpu: "2"
              memory: "25Gi"
              ephemeral-storage: "25Gi"
              nvidia.com/gpu: "2"
            limits:
              cpu: "2"
              memory: "25Gi"
              ephemeral-storage: "25Gi"
              nvidia.com/gpu: "2"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - name: gcs-fuse-csi-ephemeral
              mountPath: /gcs
              readOnly: true
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      serviceAccountName: V_KSA
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
        - key: "on-demand"
          value: "true"
          operator: "Equal"
          effect: "NoSchedule"
      volumes:
        # Shared memory used by tensor parallel inter-process communication
        - name: dshm
          emptyDir:
            medium: Memory
        # Ephemeral GCS FUSE volume backed by the model bucket, mounted read-only
        # with file caching and parallel downloads enabled
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: V_MODEL_BUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:max-parallel-downloads:-1"
              fileCacheCapacity: "20Gi"
              fileCacheForRangeRead: "true"
              metadataStatCacheCapacity: "-1"
              metadataTypeCacheCapacity: "-1"
              metadataCacheTTLSeconds: "-1"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-openai-gcs-a100
spec:
  selector:
    app: vllm-openai-gcs-a100
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000