From 170e5b481ba78a5122456f74b41a553a3d16311e Mon Sep 17 00:00:00 2001
From: qinxuye
Date: Fri, 6 Dec 2024 02:53:07 +0000
Subject: [PATCH 1/5] DOC: update new models

---
 README.md                                    |   4 +-
 README_zh_CN.md                              |   4 +-
 .../models/builtin/llm/glm-edge-chat.rst     | 111 ++++++++++++++
 doc/source/models/builtin/llm/glm-edge-v.rst | 143 ++++++++++++++++++
 doc/source/models/builtin/llm/index.rst      |  14 ++
 5 files changed, 272 insertions(+), 4 deletions(-)
 create mode 100644 doc/source/models/builtin/llm/glm-edge-chat.rst
 create mode 100644 doc/source/models/builtin/llm/glm-edge-v.rst

diff --git a/README.md b/README.md
index d705bdb9d7..242677abbc 100644
--- a/README.md
+++ b/README.md
@@ -46,14 +46,14 @@ potential of cutting-edge AI models.
 - Support speech recognition model: [#929](https://github.com/xorbitsai/inference/pull/929)
 - Metrics support: [#906](https://github.com/xorbitsai/inference/pull/906)
 ### New Models
+- Built-in support for [GLM Edge](https://github.com/THUDM/GLM-Edge): [#2582](https://github.com/xorbitsai/inference/pull/2582)
+- Built-in support for [QwQ-32B-Preview](https://qwenlm.github.io/blog/qwq-32b-preview/): [#2602](https://github.com/xorbitsai/inference/pull/2602)
 - Built-in support for [Qwen 2.5 Series](https://qwenlm.github.io/blog/qwen2.5/): [#2325](https://github.com/xorbitsai/inference/pull/2325)
 - Built-in support for [Fish Speech V1.4](https://huggingface.co/fishaudio/fish-speech-1.4): [#2295](https://github.com/xorbitsai/inference/pull/2295)
 - Built-in support for [DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5): [#2292](https://github.com/xorbitsai/inference/pull/2292)
 - Built-in support for [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio): [#2271](https://github.com/xorbitsai/inference/pull/2271)
 - Built-in support for [Qwen2-vl-instruct](https://github.com/QwenLM/Qwen2-VL): [#2205](https://github.com/xorbitsai/inference/pull/2205)
 - Built-in support for [MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B): [#2263](https://github.com/xorbitsai/inference/pull/2263)
-- Built-in support for [CogVideoX](https://github.com/THUDM/CogVideo): [#2049](https://github.com/xorbitsai/inference/pull/2049)
-- Built-in support for [flux.1-schnell & flux.1-dev](https://www.basedlabs.ai/tools/flux1): [#2007](https://github.com/xorbitsai/inference/pull/2007)
 ### Integrations
 - [Dify](https://docs.dify.ai/advanced/model-configuration/xinference): an LLMOps platform that enables developers (and even non-developers) to quickly build useful applications based on large language models, ensuring they are visual, operable, and improvable.
 - [FastGPT](https://github.com/labring/FastGPT): a knowledge-based platform built on the LLM, offers out-of-the-box data processing and model invocation capabilities, allows for workflow orchestration through Flow visualization.

diff --git a/README_zh_CN.md b/README_zh_CN.md
index b75bb6e905..9672fa23db 100644
--- a/README_zh_CN.md
+++ b/README_zh_CN.md
@@ -42,14 +42,14 @@ Xorbits Inference(Xinference)是一个性能强大且功能全面的分布
 - 支持语音识别模型: [#929](https://github.com/xorbitsai/inference/pull/929)
 - 增加 Metrics 统计信息: [#906](https://github.com/xorbitsai/inference/pull/906)
 ### 新模型
+- 内置 [GLM Edge](https://github.com/THUDM/GLM-Edge): [#2582](https://github.com/xorbitsai/inference/pull/2582)
+- 内置 [QwQ-32B-Preview](https://qwenlm.github.io/blog/qwq-32b-preview/): [#2602](https://github.com/xorbitsai/inference/pull/2602)
 - 内置 [Qwen 2.5 Series](https://qwenlm.github.io/blog/qwen2.5/): [#2325](https://github.com/xorbitsai/inference/pull/2325)
 - 内置 [Fish Speech V1.4](https://huggingface.co/fishaudio/fish-speech-1.4): [#2295](https://github.com/xorbitsai/inference/pull/2295)
 - 内置 [DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5): [#2292](https://github.com/xorbitsai/inference/pull/2292)
 - 内置 [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio): [#2271](https://github.com/xorbitsai/inference/pull/2271)
 - 内置 [Qwen2-vl-instruct](https://github.com/QwenLM/Qwen2-VL): [#2205](https://github.com/xorbitsai/inference/pull/2205)
 - 内置 [MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B): [#2263](https://github.com/xorbitsai/inference/pull/2263)
-- 内置 [CogVideoX](https://github.com/THUDM/CogVideo): [#2049](https://github.com/xorbitsai/inference/pull/2049)
-- 内置 [flux.1-schnell & flux.1-dev](https://www.basedlabs.ai/tools/flux1): [#2007](https://github.com/xorbitsai/inference/pull/2007)
 ### 集成
 - [FastGPT](https://doc.fastai.site/docs/development/custom-models/xinference/):一个基于 LLM 大模型的开源 AI 知识库构建平台。提供了开箱即用的数据处理、模型调用、RAG 检索、可视化 AI 工作流编排等能力,帮助您轻松实现复杂的问答场景。
 - [Dify](https://docs.dify.ai/advanced/model-configuration/xinference): 一个涵盖了大型语言模型开发、部署、维护和优化的 LLMOps 平台。

diff --git a/doc/source/models/builtin/llm/glm-edge-chat.rst b/doc/source/models/builtin/llm/glm-edge-chat.rst
new file mode 100644
index 0000000000..ff9e31d1cc
--- /dev/null
+++ b/doc/source/models/builtin/llm/glm-edge-chat.rst
@@ -0,0 +1,111 @@
+.. _models_llm_glm-edge-chat:

+========================================
+glm-edge-chat
+========================================

+- **Context Length:** 8192
+- **Model Name:** glm-edge-chat
+- **Languages:** en, zh
+- **Abilities:** chat
+- **Description:** The GLM-Edge series is our attempt to address real-world scenarios on edge devices. It consists of two sizes of large language dialogue models and multimodal understanding models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models mainly target platforms such as mobile phones and in-vehicle systems, while the 4B / 5B models mainly target platforms such as PCs.
+
+Specifications
+^^^^^^^^^^^^^^
+
+
+Model Spec 1 (pytorch, 1_5 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 1_5
+- **Quantizations:** 4-bit, 8-bit, none
+- **Engines**: Transformers
+- **Model ID:** THUDM/glm-edge-1.5b-chat
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-1.5b-chat>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 1_5 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 2 (pytorch, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 4
+- **Quantizations:** 4-bit, 8-bit, none
+- **Engines**: Transformers
+- **Model ID:** THUDM/glm-edge-4b-chat
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-4b-chat>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 4 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 3 (ggufv2, 1_5 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 1_5
+- **Quantizations:** Q4_0, Q4_1, Q4_K, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K, Q5_K_M, Q5_K_S, Q6_K, Q8_0
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-1.5b-chat-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-1.5b-chat-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 1_5 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 4 (ggufv2, 1_5 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 1_5
+- **Quantizations:** F16
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-1.5b-chat-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-1.5b-chat-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 1_5 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 5 (ggufv2, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 4
+- **Quantizations:** Q4_0, Q4_1, Q4_K, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K, Q5_K_M, Q5_K_S, Q6_K, Q8_0
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-4b-chat-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-4b-chat-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 4 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 6 (ggufv2, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 4
+- **Quantizations:** F16
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-4b-chat-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-4b-chat-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 4 --model-format ggufv2 --quantization ${quantization}
+
diff --git a/doc/source/models/builtin/llm/glm-edge-v.rst b/doc/source/models/builtin/llm/glm-edge-v.rst
new file mode 100644
index 0000000000..cca4b184d8
--- /dev/null
+++ b/doc/source/models/builtin/llm/glm-edge-v.rst
@@ -0,0 +1,143 @@
+.. _models_llm_glm-edge-v:
+
+========================================
+glm-edge-v
+========================================
+
+- **Context Length:** 8192
+- **Model Name:** glm-edge-v
+- **Languages:** en, zh
+- **Abilities:** chat, vision
+- **Description:** The GLM-Edge series is our attempt to address real-world scenarios on edge devices. It consists of two sizes of large language dialogue models and multimodal understanding models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models mainly target platforms such as mobile phones and in-vehicle systems, while the 4B / 5B models mainly target platforms such as PCs.
+
+Specifications
+^^^^^^^^^^^^^^
+
+
+Model Spec 1 (pytorch, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 2
+- **Quantizations:** 4-bit, 8-bit, none
+- **Engines**: Transformers
+- **Model ID:** THUDM/glm-edge-v-2b
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-2b>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 2 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 2 (pytorch, 5 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 5
+- **Quantizations:** 4-bit, 8-bit, none
+- **Engines**: Transformers
+- **Model ID:** THUDM/glm-edge-v-5b
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-5b>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-5b>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 5 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 3 (ggufv2, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 2
+- **Quantizations:** Q4_0, Q4_1, Q4_K, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K, Q5_K_M, Q5_K_S, Q6_K, Q8_0
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-v-2b-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-2b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 2 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 4 (ggufv2, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 2
+- **Quantizations:** F16
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-v-2b-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-2b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 2 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 5 (ggufv2, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 2
+- **Quantizations:** f16
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-v-2b-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-2b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 2 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 6 (ggufv2, 5 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 5
+- **Quantizations:** Q4_0, Q4_1, Q4_K, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K, Q5_K_M, Q5_K_S, Q6_K, Q8_0
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-v-5b-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-5b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-5b-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 5 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 7 (ggufv2, 5 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 5
+- **Quantizations:** F16
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-v-5b-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-5b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-5b-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 5 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 8 (ggufv2, 5 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 5
+- **Quantizations:** f16
+- **Engines**: llama.cpp
+- **Model ID:** THUDM/glm-edge-v-5b-gguf
+- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-5b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-5b-gguf>`__
+
+Execute the following command to launch the model. Remember to replace ``${quantization}`` with your
+chosen quantization method and ``${engine}`` with one of the engines listed above::
+
+   xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 5 --model-format ggufv2 --quantization ${quantization}
+
diff --git a/doc/source/models/builtin/llm/index.rst b/doc/source/models/builtin/llm/index.rst
index 4d55123965..89ee963dab 100644
--- a/doc/source/models/builtin/llm/index.rst
+++ b/doc/source/models/builtin/llm/index.rst
@@ -166,6 +166,16 @@ The following is a list of built-in LLM in Xinference:
      - 8192
      - GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI.
 
+   * - :ref:`glm-edge-chat <models_llm_glm-edge-chat>`
+     - chat
+     - 8192
+     - The GLM-Edge series is our attempt to address real-world scenarios on edge devices. It consists of two sizes of large language dialogue models and multimodal understanding models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models mainly target platforms such as mobile phones and in-vehicle systems, while the 4B / 5B models mainly target platforms such as PCs.
+
+   * - :ref:`glm-edge-v <models_llm_glm-edge-v>`
+     - chat, vision
+     - 8192
+     - The GLM-Edge series is our attempt to address real-world scenarios on edge devices. It consists of two sizes of large language dialogue models and multimodal understanding models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models mainly target platforms such as mobile phones and in-vehicle systems, while the 4B / 5B models mainly target platforms such as PCs.
+
    * - :ref:`glm4-chat <models_llm_glm4-chat>`
      - chat, tools
      - 131072
@@ -616,6 +626,10 @@ The following is a list of built-in LLM in Xinference:
 
    glm-4v
 
+   glm-edge-chat
+
+   glm-edge-v
+
    glm4-chat
 
    glm4-chat-1m

From 1cf7ebcae16e7f45947e60d08b0effc2ff4a226d Mon Sep 17 00:00:00 2001
From: qinxuye
Date: Fri, 6 Dec 2024 02:57:36 +0000
Subject: [PATCH 2/5] update

---
 doc/source/models/model_abilities/vision.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/source/models/model_abilities/vision.rst b/doc/source/models/model_abilities/vision.rst
index 65c854b24c..a08c71c0a6 100644
--- a/doc/source/models/model_abilities/vision.rst
+++ b/doc/source/models/model_abilities/vision.rst
@@ -33,6 +33,7 @@ The ``vision`` ability is supported with the following models in Xinference:
 * :ref:`qwen2-vl-instruct <models_llm_qwen2-vl-instruct>`
 * :ref:`llama-3.2-vision <models_llm_llama-3.2-vision>`
 * :ref:`llama-3.2-vision-instruct <models_llm_llama-3.2-vision-instruct>`
+* :ref:`glm-edge-v <models_llm_glm-edge-v>`
 
 
 Quickstart

From 7c7b063b9ab18a0ae9c4047a6993d05d16a9b882 Mon Sep 17 00:00:00 2001
From: qinxuye
Date: Fri, 6 Dec 2024 03:02:57 +0000
Subject: [PATCH 3/5] update

---
 doc/source/models/model_abilities/tools.rst | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/doc/source/models/model_abilities/tools.rst b/doc/source/models/model_abilities/tools.rst
index 6679005387..1b4ba9223d 100644
--- a/doc/source/models/model_abilities/tools.rst
+++ b/doc/source/models/model_abilities/tools.rst
@@ -33,9 +33,11 @@ Supported models
 The ``tools`` ability is supported with the following models in Xinference:
 
 * :ref:`models_llm_qwen-chat`
-* :ref:`models_llm_chatglm3`
-* :ref:`models_llm_gorilla-openfunctions-v1`
-
+* :ref:`models_llm_glm4-chat`
+* :ref:`models_llm_glm4-chat-1m`
+* :ref:`models_llm_llama-3.1-instruct`
+* :ref:`models_llm_qwen2.5-instruct`
+* :ref:`models_llm_qwen2.5-coder-instruct`
 
 Quickstart
 ==============

From 7aa74f821126c73a02ebfd258bd09e67beafc941 Mon Sep 17 00:00:00 2001
From: qinxuye
Date: Fri, 6 Dec 2024 03:08:43 +0000
Subject: [PATCH 4/5] change to macos-13

---
 .github/workflows/python.yaml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/python.yaml b/.github/workflows/python.yaml
index 75136ec76a..4cbef6df75 100644
--- a/.github/workflows/python.yaml
+++ b/.github/workflows/python.yaml
@@ -73,12 +73,12 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        os: [ "ubuntu-latest", "macos-12", "windows-latest" ]
+        os: [ "ubuntu-latest", "macos-13", "windows-latest" ]
         python-version: [ "3.9", "3.10", "3.11", "3.12" ]
        module: [ "xinference" ]
        exclude:
-          - { os: macos-12, python-version: 3.10 }
-          - { os: macos-12, python-version: 3.11 }
+          - { os: macos-13, python-version: 3.10 }
+          - { os: macos-13, python-version: 3.11 }
           - { os: windows-latest, python-version: 3.10 }
           - { os: windows-latest, python-version: 3.11 }
         include:

From d916a854e19782704e793d7b1fa43a1af95d2764 Mon Sep 17 00:00:00 2001
From: qinxuye
Date: Fri, 6 Dec 2024 03:27:10 +0000
Subject: [PATCH 5/5] upgrade diffusers

---
 .github/workflows/python.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/python.yaml b/.github/workflows/python.yaml
index 4cbef6df75..77ff0ef0e9 100644
--- a/.github/workflows/python.yaml
+++ b/.github/workflows/python.yaml
@@ -185,6 +185,7 @@ jobs:
           ${{ env.SELF_HOST_PYTHON }} -m pip install -U cachetools
           ${{ env.SELF_HOST_PYTHON }} -m pip install -U silero-vad
           ${{ env.SELF_HOST_PYTHON }} -m pip install -U pydantic
+          ${{ env.SELF_HOST_PYTHON }} -m pip install -U diffusers
           ${{ env.SELF_HOST_PYTHON }} -m pytest --timeout=1500 \
             --disable-warnings \
             --cov-config=setup.cfg --cov-report=xml --cov=xinference xinference/core/tests/test_continuous_batching.py && \
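All of the model cards added in PATCH 1/5 share one launch-command shape. As a quick sanity check on that shape, the snippet below assembles the documented ``xinference launch`` invocation for any spec; the helper function and its defaults are illustrative only, not part of Xinference itself.

```python
from typing import List


def build_launch_cmd(
    model_name: str,
    size_in_billions: str,
    model_format: str,
    quantization: str,
    engine: str,
) -> List[str]:
    """Assemble an ``xinference launch`` command matching the docs above.

    Note that sizes use underscores (``1_5`` for 1.5B), exactly as in the
    generated model cards.
    """
    return [
        "xinference", "launch",
        "--model-engine", engine,
        "--model-name", model_name,
        "--size-in-billions", size_in_billions,
        "--model-format", model_format,
        "--quantization", quantization,
    ]


# Example: the 1.5B GGUF spec of glm-edge-chat with Q4_K_M quantization.
cmd = build_launch_cmd("glm-edge-chat", "1_5", "ggufv2", "Q4_K_M", "llama.cpp")
print(" ".join(cmd))
# → xinference launch --model-engine llama.cpp --model-name glm-edge-chat --size-in-billions 1_5 --model-format ggufv2 --quantization Q4_K_M
```

Substituting the pytorch format with a ``4-bit``/``8-bit``/``none`` quantization and the ``Transformers`` engine covers the remaining specs in the tables above.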