Integrating the Yi series models #3958

Merged · 16 commits · Sep 19, 2024
60 changes: 60 additions & 0 deletions llm/yi/README.md
@@ -0,0 +1,60 @@
# Serving Yi on Your Own Kubernetes or Cloud

🤖 The Yi series models are the next generation of open-source large language models trained from scratch by [01.AI](https://www.lingyiwanwu.com/en).

**Update (Sep 19, 2024) -** SkyPilot now supports the [**Yi**](https://01-ai.github.io/) models (Yi-Coder, Yi-1.5)!

<p align="center">
<img src="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/coder/bench1.webp" alt="yi" width="600"/>
</p>

## Why use SkyPilot to deploy over commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and multiple regions/clouds.
* Pay the absolute minimum — SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed-solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
* Everything stays in your Kubernetes or cloud account (your VMs & buckets).
* Completely private - no one else sees your chat history.


## Running Yi models with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), launch your own Yi model on vLLM with SkyPilot in a single command:

1. Start serving Yi-1.5 34B on a single instance with any available GPU listed in [yi15-34b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yi15-34b.yaml), behind a vLLM-powered OpenAI-compatible endpoint (you can also switch to [yicoder-9b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/yi/yicoder-9b.yaml) or [another model](https://github.com/skypilot-org/skypilot/tree/master/llm/yi) for a smaller model; see the example after this list):

```console
sky launch -c yi yi15-34b.yaml
```
2. Send a request to the endpoint for completion:
```bash
ENDPOINT=$(sky status --endpoint 8000 yi)

curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "01-ai/Yi-1.5-34B-Chat",
"prompt": "Who are you?",
"max_tokens": 512
}' | jq -r '.choices[0].text'
```

3. Send a request for chat completion:
```bash
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "01-ai/Yi-1.5-34B-Chat",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
```
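
For example, to serve the smaller Yi-Coder 9B model instead, launch it on a separate cluster (the cluster name `yi-coder` here is illustrative; any name works) and query it the same way after fetching its endpoint with `sky status --endpoint 8000 yi-coder`. Remember to change `"model"` in the request body to `01-ai/Yi-Coder-9B-Chat`:

```console
sky launch -c yi-coder yicoder-9b.yaml
```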
35 changes: 35 additions & 0 deletions llm/yi/yi15-34b.yaml
@@ -0,0 +1,35 @@
envs:
MODEL_NAME: 01-ai/Yi-1.5-34B-Chat

service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
Review discussion on the `service:` section:

**Collaborator:** If we are not adding instructions for `sky serve up`, we can leave these sections out.

**Contributor (author):** Okay. Thanks.

**Collaborator:** Optimizing the docs later sounds good to me. Merging for now.

**Contributor (author):** @Michaelvll Thanks a lot, I've changed the file!
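
For reference, a minimal sketch of what `sky serve up` instructions for this YAML would look like, assuming the service is named `yi` (the name and exact flags are illustrative; see the SkyServe docs for the authoritative workflow):

```bash
# Illustrative only: deploy the YAML as a SkyServe service with managed replicas,
# then query through the single service endpoint instead of one cluster.
sky serve up -n yi yi15-34b.yaml
ENDPOINT=$(sky serve status --endpoint yi)
```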



resources:
accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
disk_size: 1024
disk_tier: best
memory: 32+
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn

run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
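
Since the model is parameterized through the `envs:` block, the same YAML can serve a different Yi variant by overriding `MODEL_NAME` at launch time. A sketch using SkyPilot's `--env` override (the variant shown is just an example):

```bash
# Illustrative: override the MODEL_NAME default defined in the YAML's envs block.
sky launch -c yi yi15-34b.yaml --env MODEL_NAME=01-ai/Yi-1.5-9B-Chat
```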
33 changes: 33 additions & 0 deletions llm/yi/yi15-6b.yaml
@@ -0,0 +1,33 @@
envs:
MODEL_NAME: 01-ai/Yi-1.5-6B-Chat

service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2


resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn

run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
33 changes: 33 additions & 0 deletions llm/yi/yi15-9b.yaml
@@ -0,0 +1,33 @@
envs:
MODEL_NAME: 01-ai/Yi-1.5-9B-Chat

service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2


resources:
accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
disk_tier: best
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn

run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
33 changes: 33 additions & 0 deletions llm/yi/yicoder-1_5b.yaml
@@ -0,0 +1,33 @@
envs:
MODEL_NAME: 01-ai/Yi-Coder-1.5B-Chat

service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2


resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn

run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
33 changes: 33 additions & 0 deletions llm/yi/yicoder-9b.yaml
@@ -0,0 +1,33 @@
envs:
MODEL_NAME: 01-ai/Yi-Coder-9B-Chat

service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2


resources:
accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
disk_tier: best
ports: 8000

setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn

run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log