FEAT: Supports qwen-chat 1.8B #757

Merged · 3 commits · Dec 14, 2023
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -20,7 +20,7 @@ repos:
      - id: isort
        args: [--sp, setup.cfg]
  - repo: https://github.com/pre-commit/mirrors-mypy
-   rev: v1.4.1
+   rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies: ["tokenize-rt==3.2.0", "types-requests", "types-tabulate"]
26 changes: 20 additions & 6 deletions doc/source/models/builtin/llm/qwen-chat.rst
@@ -42,7 +42,21 @@ chosen quantization method from the options listed above::
xinference launch --model-name qwen-chat --size-in-billions 14 --model-format ggmlv3 --quantization ${quantization}


-Model Spec 3 (pytorch, 7 Billion)
+Model Spec 3 (pytorch, 1_8 Billion)
++++++++++++++++++++++++++++++++++++++++

+- **Model Format:** pytorch
+- **Model Size (in billions):** 1_8
+- **Quantizations:** 4-bit, 8-bit, none
+- **Model ID:** Qwen/Qwen-1_8B-Chat
+
+Execute the following command to launch the model, remember to replace ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-name qwen-chat --size-in-billions 1_8 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 4 (pytorch, 7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
@@ -56,7 +70,7 @@ chosen quantization method from the options listed above::
xinference launch --model-name qwen-chat --size-in-billions 7 --model-format pytorch --quantization ${quantization}


-Model Spec 4 (pytorch, 14 Billion)
+Model Spec 5 (pytorch, 14 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
@@ -70,7 +84,7 @@ chosen quantization method from the options listed above::
xinference launch --model-name qwen-chat --size-in-billions 14 --model-format pytorch --quantization ${quantization}


-Model Spec 5 (pytorch, 72 Billion)
+Model Spec 6 (pytorch, 72 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
@@ -84,7 +98,7 @@ chosen quantization method from the options listed above::
xinference launch --model-name qwen-chat --size-in-billions 72 --model-format pytorch --quantization ${quantization}


-Model Spec 6 (gptq, 7 Billion)
+Model Spec 7 (gptq, 7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** gptq
@@ -98,7 +112,7 @@ chosen quantization method from the options listed above::
xinference launch --model-name qwen-chat --size-in-billions 7 --model-format gptq --quantization ${quantization}


-Model Spec 7 (gptq, 14 Billion)
+Model Spec 8 (gptq, 14 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** gptq
@@ -112,7 +126,7 @@ chosen quantization method from the options listed above::
xinference launch --model-name qwen-chat --size-in-billions 14 --model-format gptq --quantization ${quantization}


-Model Spec 8 (gptq, 72 Billion)
+Model Spec 9 (gptq, 72 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** gptq
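
Once this spec lands, the 1.8B variant can be launched programmatically as well as via the CLI shown above. A minimal sketch using xinference's Python client — the endpoint is a local default, and passing the underscore size string to launch_model is an assumption based on the CLI form documented above:

# Minimal sketch, assuming a local supervisor at the default endpoint.
# Passing "1_8" for model_size_in_billions mirrors the CLI's underscore form.
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_format="pytorch",
    model_size_in_billions="1_8",
    quantization="none",
)
model = client.get_model(model_uid)
print(model.chat("Briefly introduce yourself."))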
2 changes: 1 addition & 1 deletion doc/source/models/builtin/rerank/bge-reranker-base.rst
@@ -6,7 +6,7 @@ bge-reranker-base

- **Model Name:** bge-reranker-base
- **Languages:** en, zh
-- **Abilities:** embed
+- **Abilities:** rerank

Specifications
^^^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion doc/source/models/builtin/rerank/bge-reranker-large.rst
@@ -6,7 +6,7 @@ bge-reranker-large

- **Model Name:** bge-reranker-large
- **Languages:** en, zh
-- **Abilities:** embed
+- **Abilities:** rerank

Specifications
^^^^^^^^^^^^^^
12 changes: 11 additions & 1 deletion xinference/model/llm/core.py
@@ -17,7 +17,7 @@
import os
import platform
from abc import abstractmethod
-from typing import TYPE_CHECKING, List, Optional, Tuple
+from typing import TYPE_CHECKING, List, Optional, Tuple, Union

from ...core.utils import parse_replica_model_uid
from ..core import ModelDescription
@@ -51,6 +51,16 @@ def __init__(
        if kwargs:
            raise ValueError(f"Unrecognized keyword arguments: {kwargs}")

+    @staticmethod
+    def handle_model_size(model_size_in_billions: Union[str, int]) -> Union[int, float]:
+        if isinstance(model_size_in_billions, str):
+            if "_" in model_size_in_billions:
+                # "_" acts as a radix point: "1_8" -> "1.8" -> 1.8
+                ms = model_size_in_billions.replace("_", ".")
+                return float(ms)
+            else:
+                raise ValueError("Invalid format for `model_size_in_billions`")
+        return model_size_in_billions

    @staticmethod
    def _is_darwin_and_apple_silicon():
        return platform.system() == "Darwin" and platform.processor() == "arm"
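
For reference, the underscore convention the new helper implements: a size like "1_8" uses "_" as a radix point and is converted to a float for arithmetic, while plain ints pass through unchanged. A standalone sketch of the same logic (illustrative, not part of the diff):

# Illustrative copy of the handle_model_size logic above.
from typing import Union

def handle_model_size(model_size_in_billions: Union[str, int]) -> Union[int, float]:
    if isinstance(model_size_in_billions, str):
        if "_" in model_size_in_billions:
            # "1_8" -> "1.8" -> 1.8
            return float(model_size_in_billions.replace("_", "."))
        raise ValueError("Invalid format for `model_size_in_billions`")
    return model_size_in_billions

assert handle_model_size("1_8") == 1.8
assert handle_model_size(7) == 7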
4 changes: 3 additions & 1 deletion xinference/model/llm/ggml/ctransformers.py
@@ -93,7 +93,9 @@ def __init__(
        self._model_type = None
        closest_size = min(
            SIZE_TO_GPU_LAYERS.keys(),
-            key=lambda x: abs(x - model_spec.model_size_in_billions),
+            key=lambda x: abs(
+                x - self.handle_model_size(model_spec.model_size_in_billions)
+            ),
        )

        self._model_family = model_family
4 changes: 3 additions & 1 deletion xinference/model/llm/ggml/llamacpp.py
@@ -59,7 +59,9 @@ def __init__(

        closest_size = min(
            SIZE_TO_GPU_LAYERS.keys(),
-            key=lambda x: abs(x - model_spec.model_size_in_billions),
+            key=lambda x: abs(
+                x - self.handle_model_size(model_spec.model_size_in_billions)
+            ),
        )
        self._gpu_layers = SIZE_TO_GPU_LAYERS[closest_size]
        self._llamacpp_model_config: LlamaCppModelConfig = self._sanitize_model_config(
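
Both ggml backends (ctransformers above and llamacpp here) pick a GPU layer count by snapping the model size to the nearest key in SIZE_TO_GPU_LAYERS; without the conversion, the new string size "1_8" would break the abs() arithmetic. A sketch of the selection, reusing the helper sketched after the core.py hunk — the layer counts below are placeholders, not the project's real table:

# Illustrative nearest-size lookup; layer counts are placeholders.
SIZE_TO_GPU_LAYERS = {3: 26, 7: 32, 13: 40, 30: 60, 65: 80}

def closest_size(model_size_in_billions):
    size = handle_model_size(model_size_in_billions)  # "1_8" -> 1.8
    return min(SIZE_TO_GPU_LAYERS.keys(), key=lambda x: abs(x - size))

assert closest_size("1_8") == 3   # 1.8B snaps to the 3B bucket
assert closest_size(14) == 13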
11 changes: 11 additions & 0 deletions xinference/model/llm/llm_family.json
@@ -1116,6 +1116,17 @@
"model_file_name_template": "qwen14b-ggml-{quantization}.bin",
"model_revision": "11efca556af372b6f3c730322a4962e9900a2990"
},
+    {
+      "model_format": "pytorch",
+      "model_size_in_billions": "1_8",
+      "quantizations": [
+        "4-bit",
+        "8-bit",
+        "none"
+      ],
+      "model_id": "Qwen/Qwen-1_8B-Chat",
+      "model_revision": "c3db8007171847931da7efa4b2ed4309afcce021"
+    },
    {
      "model_format": "pytorch",
      "model_size_in_billions": 7,
32 changes: 28 additions & 4 deletions xinference/model/llm/llm_family.py
@@ -19,7 +19,7 @@
from threading import Lock
from typing import Any, Dict, List, Optional, Tuple, Type, Union

-from pydantic import BaseModel, Field, Protocol, ValidationError
+from pydantic import BaseModel, Field, Protocol, ValidationError, validator
from pydantic.error_wrappers import ErrorWrapper
from pydantic.parse import load_str_bytes
from pydantic.types import StrBytes
@@ -45,24 +45,48 @@

class GgmlLLMSpecV1(BaseModel):
    model_format: Literal["ggmlv3", "ggufv2"]
-    model_size_in_billions: int
+    # Note: `str` must come before `int` in the Union
+    model_size_in_billions: Union[str, int]
    quantizations: List[str]
    model_id: str
    model_file_name_template: str
    model_hub: str = "huggingface"
    model_uri: Optional[str]
    model_revision: Optional[str]

+    @validator("model_size_in_billions", pre=False)
+    def validate_model_size_with_radix(cls, v: object) -> object:
+        if isinstance(v, str):
+            if (
+                "_" in v
+            ):  # for example, "1_8" just returns "1_8", otherwise int("1_8") returns 18
+                return v
+            else:
+                return int(v)
+        return v


class PytorchLLMSpecV1(BaseModel):
    model_format: Literal["pytorch", "gptq"]
-    model_size_in_billions: int
+    # Note: `str` must come before `int` in the Union
+    model_size_in_billions: Union[str, int]
    quantizations: List[str]
    model_id: str
    model_hub: str = "huggingface"
    model_uri: Optional[str]
    model_revision: Optional[str]

+    @validator("model_size_in_billions", pre=False)
+    def validate_model_size_with_radix(cls, v: object) -> object:
+        if isinstance(v, str):
+            if (
+                "_" in v
+            ):  # for example, "1_8" just returns "1_8", otherwise int("1_8") returns 18
+                return v
+            else:
+                return int(v)
+        return v


class PromptStyleV1(BaseModel):
    style_name: str
@@ -152,7 +176,7 @@ def download_from_self_hosted_storage() -> bool:
def get_legacy_cache_path(
    model_name: str,
    model_format: str,
-    model_size_in_billions: Optional[int] = None,
+    model_size_in_billions: Optional[Union[str, int]] = None,
    quantization: Optional[str] = None,
) -> str:
    full_name = f"{model_name}-{model_format}-{model_size_in_billions}b-{quantization}"
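
The effect of the validator (pydantic v1 API, per the `validator` import above): because `str` precedes `int` in the Union, "1_8" survives coercion intact, while purely numeric strings are normalized back to ints after validation (`pre=False`). A self-contained sketch:

# Illustrative: how the Union ordering plus the validator treat each input.
from typing import Union
from pydantic import BaseModel, validator

class Spec(BaseModel):
    model_size_in_billions: Union[str, int]  # `str` first, as in the diff

    @validator("model_size_in_billions", pre=False)
    def validate_model_size_with_radix(cls, v: object) -> object:
        if isinstance(v, str):
            return v if "_" in v else int(v)
        return v

assert Spec(model_size_in_billions="1_8").model_size_in_billions == "1_8"
assert Spec(model_size_in_billions=7).model_size_in_billions == 7
assert Spec(model_size_in_billions="7").model_size_in_billions == 7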
12 changes: 12 additions & 0 deletions xinference/model/llm/llm_family_modelscope.json
@@ -1366,6 +1366,18 @@
"model_file_name_template": "qwen14b-ggml-{quantization}.bin",
"model_revision": "v0.0.2"
},
+    {
+      "model_format": "pytorch",
+      "model_size_in_billions": "1_8",
+      "quantizations": [
+        "4-bit",
+        "8-bit",
+        "none"
+      ],
+      "model_hub": "modelscope",
+      "model_id": "qwen/Qwen-1_8B-Chat",
+      "model_revision": "v1.0.0"
+    },
    {
      "model_format": "pytorch",
      "model_size_in_billions": 7,
3 changes: 2 additions & 1 deletion xinference/web/ui/src/scenes/launch_model/modelCard.js
@@ -91,7 +91,8 @@ const ModelCard = ({ url, modelData, gpuAvailable, is_custom = false }) => {
      .filter(
        (spec) =>
          spec.model_format === modelFormat &&
-          spec.model_size_in_billions === parseFloat(modelSize)
+          spec.model_size_in_billions ===
+            (modelSize.includes('_') ? modelSize : parseFloat(modelSize))
      )
      .flatMap((spec) => spec.quantizations)
    ),