Integrating the Yi series models #3958

Merged · 16 commits · Sep 19, 2024
60 changes: 60 additions & 0 deletions llm/yi/README.md
@@ -0,0 +1,60 @@
# Serving Yi on Your Own Kubernetes or Cloud

🤖 The Yi series models are the next generation of open-source large language models trained from scratch by [01.AI](https://www.lingyiwanwu.com/en).

**Update (Sep 19, 2024) -** SkyPilot now supports the [**Yi**](https://01-ai.github.io/) models (Yi-Coder, Yi-1.5)!

<p align="center">
<img src="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/coder/bench1.webp" alt="yi" width="600"/>
</p>

## Why use SkyPilot instead of commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and multiple regions/clouds.
* Pay the absolute minimum: SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed-solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
* Everything stays in your Kubernetes or cloud account (your VMs & buckets).
* Completely private: no one else sees your chat history.


## Running Yi models with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), you can serve your own Yi model on vLLM with a single command:

1. Start serving Yi-1.5 34B on a single instance with any available GPU from the list in [yi15-34b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yi15-34b.yaml), behind a vLLM-powered OpenAI-compatible endpoint (switch to [yicoder-9b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yicoder-9b.yaml) or [another model](https://github.com/skypilot-org/skypilot/tree/master/llm/yi) to serve a smaller model):

```console
sky launch -c yi yi15-34b.yaml
```
2. Send a request to the endpoint for completion:
```bash
ENDPOINT=$(sky status --endpoint 8000 yi)

curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "01-ai/Yi-1.5-34B-Chat",
"prompt": "Who are you?",
"max_tokens": 512
}' | jq -r '.choices[0].text'
```

3. Send a request for chat completion:
```bash
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "01-ai/Yi-1.5-34B-Chat",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
```
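
After the walkthrough, two optional housekeeping commands are useful. Both are standard SkyPilot CLI, shown here as a sketch using the cluster name `yi` from step 1:

```console
# Stream the serving logs, e.g., while the weights are still downloading:
sky logs yi

# Tear down the cluster when finished, so you stop paying for the GPUs:
sky down yi
```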
20 changes: 20 additions & 0 deletions llm/yi/yi15-34b.yaml
@@ -0,0 +1,20 @@
envs:
  MODEL_NAME: 01-ai/Yi-1.5-34B-Chat

resources:
  # Any one of these GPU sets can host the 34B model; SkyPilot picks
  # whichever is available and cheapest.
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_size: 1024  # In GB; leaves room for the 34B checkpoint.
  disk_tier: best
  memory: 32+  # At least 32 GB of host RAM.
  ports: 8000  # vLLM's OpenAI-compatible server listens here.

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
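
Since the recipe reads the checkpoint from the `MODEL_NAME` environment variable, the same YAML can launch a different Yi checkpoint without editing the file. A sketch, assuming the substituted checkpoint (here `01-ai/Yi-1.5-34B-Chat-16K`, a long-context variant) fits on the requested GPUs:

```console
sky launch -c yi yi15-34b.yaml --env MODEL_NAME=01-ai/Yi-1.5-34B-Chat-16K
```

Note that `--max-model-len 1024` in the `run` section still caps the served context length regardless of the checkpoint's native window.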
18 changes: 18 additions & 0 deletions llm/yi/yi15-6b.yaml
@@ -0,0 +1,18 @@
envs:
  MODEL_NAME: 01-ai/Yi-1.5-6B-Chat

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
18 changes: 18 additions & 0 deletions llm/yi/yi15-9b.yaml
@@ -0,0 +1,18 @@
envs:
  MODEL_NAME: 01-ai/Yi-1.5-9B-Chat

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
18 changes: 18 additions & 0 deletions llm/yi/yicoder-1_5b.yaml
@@ -0,0 +1,18 @@
envs:
  MODEL_NAME: 01-ai/Yi-Coder-1.5B-Chat

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
18 changes: 18 additions & 0 deletions llm/yi/yicoder-9b.yaml
@@ -0,0 +1,18 @@
envs:
  MODEL_NAME: 01-ai/Yi-Coder-9B-Chat

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
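
The README's bullet about scaling to multiple replicas maps onto SkyServe: any of the task YAMLs above can be extended with a `service` section and deployed with `sky serve up -n yi <file>.yaml`. A minimal sketch, assuming the default SkyServe autoscaling behavior is acceptable (the probe path and replica count below are illustrative):

```yaml
# Append to any of the recipes above.
service:
  readiness_probe: /v1/models  # Route served by vLLM's OpenAI-compatible API.
  replicas: 2                  # Replicas share one SkyServe endpoint.
```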