SGLang doc user flow updates #703

Merged: 19 commits, Dec 23, 2024
Changes from 12 commits
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -7,6 +7,7 @@ repos:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
+ args: ['--allow-multiple-documents']
- id: check-added-large-files
- repo: https://github.com/psf/black
rev: 22.10.0
42 changes: 42 additions & 0 deletions docs/shortfin/llm/user/e2e_llama8b_k8s.md
@@ -0,0 +1,42 @@
# Llama 8b GPU instructions on Kubernetes
saienduri marked this conversation as resolved.

Review comment (Member): I'd also keep this guide general, maybe keep it next to llama_end_to_end.md as llama_serving_on_kubernetes.md, dropping "8B" and "GPU" from the title. Could then also rename llama_end_to_end.md to llama_serving.md? IDK. Naming is hard.

I'm being picky about file names since I want to link to these guides in the release notes, which will make renaming them later harder without creating 404s.

Reply (Contributor): Cool, I think we should go with llama_serving_on_kubernetes.md and llama_serving.md. "End to end" can be confusing as to what it entails (especially with the SGLang layer on top).

## Setup

We will use a `llama_8b_f16` example to describe the process of exporting a model
and deploying four instances of a shortfin LLM server behind a load balancer on
MI300X GPUs.

### Prerequisites

- A Kubernetes cluster available to use
- kubectl installed on your system and configured for the cluster of interest
  - To install kubectl, please check out [kubectl install](https://kubernetes.io/docs/tasks/tools/#kubectl),
    and make sure to set the `KUBECONFIG` environment variable to point to your kubeconfig file to authorize
    the connection to the cluster (a quick sanity check is shown below).
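
For example, a quick check that kubectl is configured and can reach the cluster might look like the following; the kubeconfig path is a placeholder, not a path from this repository:

```
# Point kubectl at your cluster; the path below is a placeholder.
export KUBECONFIG=/path/to/your/kubeconfig.yaml

# Confirm the cluster is reachable and its nodes are visible.
kubectl cluster-info
kubectl get nodes
```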

### Deploy shortfin llama app service

Save [llama-app-deployment.yaml](../../../../shortfin/deployment/shortfin_apps/llm/k8s/llama-app-deployment.yaml) locally and edit it to include your artifacts and intended configuration.

To deploy the llama app:

```
kubectl apply -f llama-app-deployment.yaml
```
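
Optionally, you can wait for the rollout to finish and check that the pods are running. This assumes the deployment name in your manifest matches the one used later in this guide:

```
kubectl rollout status deployment/shark-llama-app-deployment
kubectl get pods
```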

To retrieve the external IP for targeting the llama app load balancer:

```
kubectl get service shark-llama-app-service
```
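
If you prefer to capture the address in a shell variable, one option (assuming the service exposes a LoadBalancer IP rather than a hostname) is:

```
# Extract the external IP assigned to the load balancer service.
EXTERNAL_IP=$(kubectl get service shark-llama-app-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "$EXTERNAL_IP"
```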

Now you can use the external IP for SGLang integration or simply for sending text generation requests directly.
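
As a rough sketch, a text generation request could look like the following. The `/generate` path and the request body fields (`text`, `sampling_params`) are assumptions based on the shortfin LLM server's API described elsewhere in these docs, and the default HTTP port 80 is assumed; verify both against your deployment configuration:

```
# Send a sample text generation request to the load balancer.
curl http://${EXTERNAL_IP}/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'
```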

### Delete shortfin llama app service

When you are done using the service, make sure to delete the resources:

```
kubectl delete deployment shark-llama-app-deployment
kubectl delete service shark-llama-app-service
```