FEAT: Support gemma series model #1024

Merged

merged 3 commits on Feb 22, 2024
45 changes: 45 additions & 0 deletions doc/source/models/builtin/llm/gemma-it.rst
@@ -0,0 +1,45 @@
.. _models_llm_gemma-it:

========================================
gemma-it
========================================

- **Context Length:** 8192
- **Model Name:** gemma-it
- **Languages:** en
- **Abilities:** chat
- **Description:** Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 2 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 2
- **Quantizations:** none, 4-bit, 8-bit
- **Model ID:** google/gemma-2b-it
- **Model Hubs**: `Hugging Face <https://huggingface.co/google/gemma-2b-it>`__

Execute the following command to launch the model; remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

xinference launch --model-name gemma-it --size-in-billions 2 --model-format pytorch --quantization ${quantization}
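The same launch can be issued from Python. A minimal sketch, assuming a local Xinference
supervisor at the default ``http://localhost:9997`` endpoint; the endpoint and the ``4-bit``
quantization below are illustrative choices, not part of this PR::

    from xinference.client import Client

    client = Client("http://localhost:9997")
    # Programmatic equivalent of the CLI call above; returns the new model's UID.
    model_uid = client.launch_model(
        model_name="gemma-it",
        model_format="pytorch",
        model_size_in_billions=2,
        quantization="4-bit",
    )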


Model Spec 2 (pytorch, 7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 7
- **Quantizations:** none, 4-bit, 8-bit
- **Model ID:** google/gemma-7b-it
- **Model Hubs**: `Hugging Face <https://huggingface.co/google/gemma-7b-it>`__

Execute the following command to launch the model; remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

xinference launch --model-name gemma-it --size-in-billions 7 --model-format pytorch --quantization ${quantization}
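Once launched at either size, the model can be queried through the same client. A minimal
sketch, where ``"gemma-it-uid"`` is a placeholder for the UID returned by ``launch_model``
in the sketch above (or printed by ``xinference launch``); the prompt and ``generate_config``
values are illustrative::

    from xinference.client import Client

    client = Client("http://localhost:9997")
    # "gemma-it-uid" stands in for the UID returned when the model was launched.
    model = client.get_model("gemma-it-uid")
    # gemma-it declares the "chat" ability, so the handle exposes a chat call
    # that returns an OpenAI-style completion dict.
    response = model.chat(
        prompt="Summarize what the Gemma model family is.",
        generate_config={"max_tokens": 256},
    )
    print(response["choices"][0]["message"]["content"])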

7 changes: 7 additions & 0 deletions doc/source/models/builtin/llm/index.rst
@@ -96,6 +96,11 @@ The following is a list of built-in LLMs in Xinference:
- 2048
- Falcon-instruct is a fine-tuned version of the Falcon LLM, specializing in chatting.

* - :ref:`gemma-it <models_llm_gemma-it>`
- chat
- 8192
- Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.

* - :ref:`glaive-coder <models_llm_glaive-coder>`
- chat
- 100000
@@ -358,6 +363,8 @@ The following is a list of built-in LLMs in Xinference:

falcon-instruct

gemma-it

glaive-coder

gorilla-openfunctions-v1
2 changes: 1 addition & 1 deletion doc/source/models/builtin/llm/llama-2-chat.rst
@@ -139,7 +139,7 @@ Model Spec 9 (ggufv2, 70 Billion)

- **Model Format:** ggufv2
- **Model Size (in billions):** 70
- - **Quantizations:** Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_S, Q5_K_M, Q6_K, Q8_0
+ - **Quantizations:** Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_S, Q5_K_M
- **Model ID:** TheBloke/Llama-2-70B-Chat-GGUF
- **Model Hubs**: `Hugging Face <https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGUF>`__

38 changes: 19 additions & 19 deletions doc/source/models/builtin/llm/qwen1.5-chat.rst
@@ -284,57 +284,57 @@ chosen quantization method from the options listed above::
xinference launch --model-name qwen1.5-chat --size-in-billions 72 --model-format awq --quantization ${quantization}


- Model Spec 19 (ggufv2, 1_8 Billion)
+ Model Spec 19 (ggufv2, 0_5 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- - **Model Size (in billions):** 1_8
- - **Quantizations:** q8_0
+ - **Model Size (in billions):** 0_5
+ - **Quantizations:** q2_k, q3_k_m, q4_0, q4_k_m, q5_0, q5_k_m, q6_k, q8_0
- **Model ID:** Qwen/Qwen1.5-0.5B-Chat-GGUF
- **Model Hubs**: `Hugging Face <https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF>`__, `ModelScope <https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat-GGUF>`__

Execute the following command to launch the model; remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

- xinference launch --model-name qwen1.5-chat --size-in-billions 1_8 --model-format ggufv2 --quantization ${quantization}
+ xinference launch --model-name qwen1.5-chat --size-in-billions 0_5 --model-format ggufv2 --quantization ${quantization}


- Model Spec 20 (ggufv2, 4 Billion)
+ Model Spec 20 (ggufv2, 1_8 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- - **Model Size (in billions):** 4
- - **Quantizations:** q8_0
- - **Model ID:** Qwen/Qwen1.5-4B-Chat-GGUF
- - **Model Hubs**: `Hugging Face <https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GGUF>`__, `ModelScope <https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat-GGUF>`__
+ - **Model Size (in billions):** 1_8
+ - **Quantizations:** q2_k, q3_k_m, q4_0, q4_k_m, q5_0, q5_k_m, q6_k, q8_0
+ - **Model ID:** Qwen/Qwen1.5-1.8B-Chat-GGUF
+ - **Model Hubs**: `Hugging Face <https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF>`__, `ModelScope <https://modelscope.cn/models/qwen/Qwen1.5-1.8B-Chat-GGUF>`__

Execute the following command to launch the model; remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

- xinference launch --model-name qwen1.5-chat --size-in-billions 4 --model-format ggufv2 --quantization ${quantization}
+ xinference launch --model-name qwen1.5-chat --size-in-billions 1_8 --model-format ggufv2 --quantization ${quantization}


- Model Spec 21 (ggufv2, 7 Billion)
+ Model Spec 21 (ggufv2, 4 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- - **Model Size (in billions):** 7
- - **Quantizations:** q5_k_m
- - **Model ID:** Qwen/Qwen1.5-7B-Chat-GGUF
- - **Model Hubs**: `Hugging Face <https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GGUF>`__, `ModelScope <https://modelscope.cn/models/qwen/Qwen1.5-7B-Chat-GGUF>`__
+ - **Model Size (in billions):** 4
+ - **Quantizations:** q2_k, q3_k_m, q4_0, q4_k_m, q5_0, q5_k_m, q6_k, q8_0
+ - **Model ID:** Qwen/Qwen1.5-4B-Chat-GGUF
+ - **Model Hubs**: `Hugging Face <https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GGUF>`__, `ModelScope <https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat-GGUF>`__

Execute the following command to launch the model; remember to replace ``${quantization}`` with your
chosen quantization method from the options listed above::

- xinference launch --model-name qwen1.5-chat --size-in-billions 7 --model-format ggufv2 --quantization ${quantization}
+ xinference launch --model-name qwen1.5-chat --size-in-billions 4 --model-format ggufv2 --quantization ${quantization}


Model Spec 22 (ggufv2, 7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 7
- - **Quantizations:** q5_k_m
+ - **Quantizations:** q2_k, q3_k_m, q4_0, q4_k_m, q5_0, q5_k_m, q6_k, q8_0
- **Model ID:** Qwen/Qwen1.5-7B-Chat-GGUF
- **Model Hubs**: `Hugging Face <https://huggingface.co/Qwen/Qwen1.5-7B-Chat-GGUF>`__, `ModelScope <https://modelscope.cn/models/qwen/Qwen1.5-7B-Chat-GGUF>`__

@@ -349,7 +349,7 @@ Model Spec 23 (ggufv2, 14 Billion)

- **Model Format:** ggufv2
- **Model Size (in billions):** 14
- - **Quantizations:** q5_k_m
+ - **Quantizations:** q2_k, q3_k_m, q4_0, q4_k_m, q5_0, q5_k_m, q6_k, q8_0
- **Model ID:** Qwen/Qwen1.5-14B-Chat-GGUF
- **Model Hubs**: `Hugging Face <https://huggingface.co/Qwen/Qwen1.5-14B-Chat-GGUF>`__, `ModelScope <https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat-GGUF>`__

@@ -364,7 +364,7 @@ Model Spec 24 (ggufv2, 72 Billion)

- **Model Format:** ggufv2
- **Model Size (in billions):** 72
- - **Quantizations:** q2_k
+ - **Quantizations:** q2_k, q3_k_m
- **Model ID:** Qwen/Qwen1.5-72B-Chat-GGUF
- **Model Hubs**: `Hugging Face <https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GGUF>`__, `ModelScope <https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat-GGUF>`__

2 changes: 1 addition & 1 deletion xinference/deploy/docker/Dockerfile
@@ -1,4 +1,4 @@
- FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-devel
+ FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-devel

COPY . /opt/inference

45 changes: 45 additions & 0 deletions xinference/model/llm/llm_family.json
@@ -3753,5 +3753,50 @@
"<|im_sep|>"
]
}
},
{
"version": 1,
"context_length": 8192,
"model_name": "gemma-it",
"model_lang": [
"en"
],
"model_ability": [
"chat"
],
"model_description": "Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": 2,
"quantizations": [
"none",
"4-bit",
"8-bit"
],
"model_id": "google/gemma-2b-it"
},
{
"model_format": "pytorch",
"model_size_in_billions": 7,
"quantizations": [
"none",
"4-bit",
"8-bit"
],
"model_id": "google/gemma-7b-it"
}
],
"prompt_style": {
"style_name": "gemma",
"roles": [
"user",
"model"
],
"stop": [
"<end_of_turn>",
"<start_of_turn>"
]
}
}
]
9 changes: 9 additions & 0 deletions xinference/model/llm/utils.py
@@ -402,6 +402,15 @@ def get_role(role_name: str):
else:
ret += role + ": </s>"
return ret
elif prompt_style.style_name == "gemma":
ret = ""
for message in chat_history:
content = message["content"]
role = get_role(message["role"])
ret += "<start_of_turn>" + role + "\n"
if content:
ret += content + "<end_of_turn>\n"
return ret
else:
raise ValueError(f"Invalid prompt style: {prompt_style.style_name}")
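For reference, a hand-worked rendering of the ``gemma`` branch above (derived from the
code, not output captured from this PR, with an illustrative user message): a history
containing one user turn plus an empty trailing "model" turn produces:

    <start_of_turn>user
    Hello, Gemma!<end_of_turn>
    <start_of_turn>model

The final ``<start_of_turn>model`` line is left open so generation continues the model's
turn, and ``<end_of_turn>`` doubles as a stop token in the family definition above.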
