
Commit

Restructured and standardized READMEs
arueth committed Nov 8, 2024
1 parent 38f0b4e commit 42aced7
Showing 39 changed files with 1,588 additions and 493 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -6,6 +6,9 @@ __pycache__/
.venv/
venv/

# Repositories
monitoring-dashboard-samples/

# Terraform
*.terraform/
*.terraform-*/
File renamed without changes.
File renamed without changes
418 changes: 0 additions & 418 deletions use-cases/inferencing/serving-with-vllm/README.md

This file was deleted.

163 changes: 163 additions & 0 deletions use-cases/inferencing/serving/vllm/autoscaling/README.md
@@ -0,0 +1,163 @@
# Inferencing at scale

## Preparation

- Clone the repository

```sh
git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
cd accelerated-platforms
```

- Change directory to the guide directory

```sh
cd use-cases/inferencing/serving/vllm/autoscaling
```

- Ensure that your `MLP_ENVIRONMENT_FILE` is configured

```sh
cat ${MLP_ENVIRONMENT_FILE} && \
source ${MLP_ENVIRONMENT_FILE}
```

> You should see the various variables populated with the information specific to your environment.

- Configure the environment

| Variable | Description | Example |
| --------------- | ---------------------------------------- | -------- |
| SERVE_NAMESPACE | Namespace where the model will be served | ml-serve |

```sh
SERVE_NAMESPACE=ml-serve
```
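
- Optional: verify that the namespace exists. It should have been created by the serving guide listed in the prerequisites below.

```sh
kubectl get namespace ${SERVE_NAMESPACE}
```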

## Pre-requisites

- A model is deployed using one of the vLLM guides
  - [Serving the model using vLLM and GCSFuse](/use-cases/inferencing/serving/vllm/gcsfuse/README.md)
  - [Serving the model using vLLM and Persistent Disk](/use-cases/inferencing/serving/vllm/persistent-disk/README.md)
- Metrics are being scraped from the vLLM server as shown in the [vLLM Metrics](/use-cases/inferencing/serving/vllm/metrics/README.md) guide.

## Metrics for scaling inference workloads

There are several metrics you can use to scale your inference workloads on
GKE:

- Server metrics: LLM inference servers such as vLLM provide workload-specific
performance metrics. GKE simplifies scraping those metrics and autoscaling
workloads based on these server-level metrics. You can use them to gain
visibility into performance indicators like batch size, queue size, and decode
latencies. In the case of vLLM, the
[production metrics](https://docs.vllm.ai/en/latest/serving/metrics.html)
exposed by the server include several that GKE can use to horizontally scale
inference workloads; a quick way to inspect them directly is sketched after
this list.

```none
vllm:num_requests_running - Number of requests currently running on GPU.
vllm:num_requests_waiting - Number of requests waiting to be processed.
```

Here is an example of the metric `vllm:num_requests_running` in Metrics Explorer:
![metrics graph](images/cloud-monitoring-metrics-inference.png)

- GPU metrics: Metrics related to the GPU.

```none
GPU Utilization (DCGM_FI_DEV_GPU_UTIL) - Measures the duty cycle, which is the
amount of time that the GPU is active.
GPU Memory Usage (DCGM_FI_DEV_FB_USED) - Measures how much GPU memory is being
used at a given point in time. This is useful for workloads that implement
dynamic allocation of GPU memory.
```

- CPU metrics: Since the inference workloads primarily rely on GPU resources,
we don't recommend CPU and memory utilization as the only indicators of the
amount of resources a job consumes. Therefore, using CPU metrics alone for
autoscaling can lead to suboptimal performance and costs.
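
To quickly confirm that the vLLM server is exposing the server metrics above,
you can query its Prometheus endpoint directly. This is a minimal sketch; the
Service name `vllm-openai` and port `8000` are assumptions, so adjust them to
match the serving guide you followed:

```sh
# Port-forward to the vLLM Service (name and port are assumptions; adjust as needed).
kubectl --namespace ${SERVE_NAMESPACE} port-forward service/vllm-openai 8000:8000 &

# Fetch the Prometheus metrics and filter for the autoscaling signals discussed above.
curl --silent http://localhost:8000/metrics | grep --extended-regexp "vllm:num_requests_(running|waiting)"
```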

The Horizontal Pod Autoscaler (HPA) is an efficient way to ensure that your
model servers scale appropriately with load. Fine-tuning the HPA settings is
the primary way to align the cost of your provisioned hardware with traffic
demand while achieving your inference server performance goals.

We recommend setting these HPA configuration options:

- Stabilization window: Use this HPA configuration option to prevent rapid
replica count changes due to fluctuating metrics. Defaults are 5 minutes for
scale-down (avoiding premature scale-down) and 0 for scale-up (ensuring responsiveness).
Adjust the value based on your workload's volatility and your preferred responsiveness.

- Scaling policies: Use this HPA configuration option to fine-tune the scale-up
and scale-down behavior. You can set the "Pods" policy limit to specify the
absolute number of replicas changed per time unit, and the "Percent" policy
limit to specify the percentage change per time unit.

For more details, see [Horizontal Pod autoscaling](https://cloud.google.com/kubernetes-engine/docs/horizontal-pod-autoscaling)
in the GKE documentation; an example of setting these options on the HPA is
sketched later in this guide.

### Autoscale with HPA metrics

- Install the Custom Metrics Stackdriver Adapter. This adapter makes the custom
metrics that you exported to Cloud Monitoring visible to the HPA controller.
For more details, see [Horizontal pod autoscaling](https://cloud.google.com/stackdriver/docs/managed-prometheus/hpa)
in the Google Cloud Managed Service for Prometheus documentation. After
applying the manifest, you can verify the adapter rollout with the command
shown after this step.

```sh
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
```
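
Before creating the HPA, you can optionally verify that the adapter is running.
This is a quick check, assuming the adapter manifest above still installs its
resources into the `custom-metrics` namespace (adjust the namespace and
deployment name if the upstream manifest changes):

```sh
# Wait for the Custom Metrics Stackdriver Adapter deployment to become available.
kubectl --namespace custom-metrics wait deployment/custom-metrics-stackdriver-adapter \
  --for=condition=Available --timeout=300s
```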

- Deploy a metrics-based HPA resource based on your preferred custom metric.

Choose one of the options below, `Queue-depth` or `Batch-size`, to configure
the HPA resource in your manifest:

- Queue-depth

```sh
kubectl --namespace ${SERVE_NAMESPACE} apply -f manifests/hpa-vllm-openai-queue-size.yaml
```

- Batch-size

```sh
kubectl --namespace ${SERVE_NAMESPACE} apply -f manifests/hpa-vllm-openai-batch-size.yaml
```

> NOTE: Adjust the target value for `vllm:num_requests_running` or
> `vllm:num_requests_waiting` in the corresponding manifest as appropriate for
> your workload.
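
Optionally, you can apply the stabilization window and scaling policy
recommendations described earlier by adding a `behavior` block to the HPA. The
following is a minimal sketch with illustrative values only (not taken from
this guide's manifests); tune them for your workload:

```sh
# Illustrative values only; adjust the stabilization windows and policies for your workload.
kubectl --namespace ${SERVE_NAMESPACE} patch hpa vllm-openai-hpa --type merge --patch '
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
'
```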

Once the HPA has been created for a given metric, GKE autoscales the model
deployment when that metric exceeds the specified threshold. Watching the HPA
will show something like the following:

```sh
kubectl --namespace ${SERVE_NAMESPACE} get hpa vllm-openai-hpa --watch
NAME              REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
vllm-openai-hpa   Deployment/vllm-openai   1/1       1         5         1          27s
vllm-openai-hpa   Deployment/vllm-openai   0/1       1         5         1          76s
vllm-openai-hpa   Deployment/vllm-openai   1/1       1         5         1          95s
```
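
If the replica count is not changing as expected, the HPA's events usually
explain why, for example a metric that is not yet available or a target that
has not been exceeded:

```sh
kubectl --namespace ${SERVE_NAMESPACE} describe hpa vllm-openai-hpa
```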

You can also see the new pods coming online:

```sh
kubectl --namespace ${SERVE_NAMESPACE} get pods --watch
NAME                           READY   STATUS    RESTARTS   AGE
vllm-openai-767b477b77-2jm4v   1/1     Running   0          3d17h
vllm-openai-767b477b77-82l8v   0/1     Pending   0          9s
```

Eventually, the new pod will be running and the deployment will have scaled up:

```sh
kubectl --namespace ${SERVE_NAMESPACE} get pods --watch
NAME                           READY   STATUS    RESTARTS   AGE
vllm-openai-767b477b77-2jm4v   1/1     Running   0          3d17h
vllm-openai-767b477b77-82l8v   1/1     Running   0          111s
```

If GPU resources are available on an existing node, the new pod may be
scheduled there. Otherwise, the cluster autoscaler provisions a new node with
the required resources and starts the new pod on it.
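
To actually trigger a scale-up, the chosen metric has to stay above its target,
which means sending sustained traffic to the model server. The following is a
minimal load sketch; the Service name `vllm-openai`, port `8000`, and
`${MODEL_NAME}` are assumptions, so replace them with the Service, port, and
model name from the serving guide you followed:

```sh
# Port-forward to the vLLM Service (name and port are assumptions; adjust as needed).
kubectl --namespace ${SERVE_NAMESPACE} port-forward service/vllm-openai 8000:8000 &

# Send a burst of concurrent completion requests to push up vllm:num_requests_running
# and vllm:num_requests_waiting.
for i in $(seq 1 20); do
  curl --silent http://localhost:8000/v1/completions \
    --header "Content-Type: application/json" \
    --data '{"model": "'"${MODEL_NAME}"'", "prompt": "Tell me a long story about Kubernetes.", "max_tokens": 512}' \
    > /dev/null &
done
wait
```
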
@@ -0,0 +1,36 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  labels:
    app.kubernetes.io/name: nvidia-dcgm-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
  name: nvidia-dcgm-exporter-for-hpa
spec:
  endpoints:
  - interval: 15s
    metricRelabeling:
    - action: keep
      sourceLabels: [__name__]
    - action: replace
      regex: DCGM_FI_DEV_GPU_UTIL
      replacement: dcgm_fi_dev_gpu_util
      sourceLabels: [__name__]
      targetLabel: __name__
    port: metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-dcgm-exporter
@@ -0,0 +1,33 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-openai-hpa
spec:
  maxReplicas: 5
  metrics:
  - pods:
      metric:
        name: prometheus.googleapis.com|vllm:num_requests_running|gauge
      target:
        averageValue: 10
        type: AverageValue
    type: Pods
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-openai
@@ -0,0 +1,33 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-openai-hpa
spec:
  maxReplicas: 5
  metrics:
  - pods:
      metric:
        name: prometheus.googleapis.com|vllm:num_requests_waiting|gauge
      target:
        averageValue: 10
        type: AverageValue
    type: Pods
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-openai
