DOC: update new models #2632

Merged: 5 commits, Dec 6, 2024
.github/workflows/python.yaml: 7 changes (4 additions & 3 deletions)

@@ -73,12 +73,12 @@ jobs:
strategy:
fail-fast: false
matrix:
os: [ "ubuntu-latest", "macos-12", "windows-latest" ]
os: [ "ubuntu-latest", "macos-13", "windows-latest" ]
python-version: [ "3.9", "3.10", "3.11", "3.12" ]
module: [ "xinference" ]
exclude:
-- { os: macos-12, python-version: 3.10 }
-- { os: macos-12, python-version: 3.11 }
+- { os: macos-13, python-version: 3.10 }
+- { os: macos-13, python-version: 3.11 }
- { os: windows-latest, python-version: 3.10 }
- { os: windows-latest, python-version: 3.11 }
include:
@@ -185,6 +185,7 @@ jobs:
${{ env.SELF_HOST_PYTHON }} -m pip install -U cachetools
${{ env.SELF_HOST_PYTHON }} -m pip install -U silero-vad
${{ env.SELF_HOST_PYTHON }} -m pip install -U pydantic
+${{ env.SELF_HOST_PYTHON }} -m pip install -U diffusers
${{ env.SELF_HOST_PYTHON }} -m pytest --timeout=1500 \
--disable-warnings \
--cov-config=setup.cfg --cov-report=xml --cov=xinference xinference/core/tests/test_continuous_batching.py && \
README.md: 4 changes (2 additions & 2 deletions)

@@ -46,14 +46,14 @@ potential of cutting-edge AI models.
- Support speech recognition model: [#929](https://github.com/xorbitsai/inference/pull/929)
- Metrics support: [#906](https://github.com/xorbitsai/inference/pull/906)
### New Models
+- Built-in support for [GLM Edge](https://github.com/THUDM/GLM-Edge): [#2582](https://github.com/xorbitsai/inference/pull/2582)
+- Built-in support for [QwQ-32B-Preview](https://qwenlm.github.io/blog/qwq-32b-preview/): [#2602](https://github.com/xorbitsai/inference/pull/2602)
- Built-in support for [Qwen 2.5 Series](https://qwenlm.github.io/blog/qwen2.5/): [#2325](https://github.com/xorbitsai/inference/pull/2325)
- Built-in support for [Fish Speech V1.4](https://huggingface.co/fishaudio/fish-speech-1.4): [#2295](https://github.com/xorbitsai/inference/pull/2295)
- Built-in support for [DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5): [#2292](https://github.com/xorbitsai/inference/pull/2292)
- Built-in support for [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio): [#2271](https://github.com/xorbitsai/inference/pull/2271)
- Built-in support for [Qwen2-vl-instruct](https://github.com/QwenLM/Qwen2-VL): [#2205](https://github.com/xorbitsai/inference/pull/2205)
- Built-in support for [MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B): [#2263](https://github.com/xorbitsai/inference/pull/2263)
-- Built-in support for [CogVideoX](https://github.com/THUDM/CogVideo): [#2049](https://github.com/xorbitsai/inference/pull/2049)
-- Built-in support for [flux.1-schnell & flux.1-dev](https://www.basedlabs.ai/tools/flux1): [#2007](https://github.com/xorbitsai/inference/pull/2007)
### Integrations
- [Dify](https://docs.dify.ai/advanced/model-configuration/xinference): an LLMOps platform that enables developers (and even non-developers) to quickly build useful applications based on large language models, ensuring they are visual, operable, and improvable.
- [FastGPT](https://github.com/labring/FastGPT): a knowledge-based platform built on the LLM, offers out-of-the-box data processing and model invocation capabilities, allows for workflow orchestration through Flow visualization.
README_zh_CN.md: 4 changes (2 additions & 2 deletions)
@@ -42,14 +42,14 @@ Xorbits Inference (Xinference) is a powerful and versatile distributed
- Support speech recognition model: [#929](https://github.com/xorbitsai/inference/pull/929)
- Metrics support: [#906](https://github.com/xorbitsai/inference/pull/906)
### New Models
+- Built-in support for [GLM Edge](https://github.com/THUDM/GLM-Edge): [#2582](https://github.com/xorbitsai/inference/pull/2582)
+- Built-in support for [QwQ-32B-Preview](https://qwenlm.github.io/blog/qwq-32b-preview/): [#2602](https://github.com/xorbitsai/inference/pull/2602)
- Built-in support for [Qwen 2.5 Series](https://qwenlm.github.io/blog/qwen2.5/): [#2325](https://github.com/xorbitsai/inference/pull/2325)
- Built-in support for [Fish Speech V1.4](https://huggingface.co/fishaudio/fish-speech-1.4): [#2295](https://github.com/xorbitsai/inference/pull/2295)
- Built-in support for [DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5): [#2292](https://github.com/xorbitsai/inference/pull/2292)
- Built-in support for [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio): [#2271](https://github.com/xorbitsai/inference/pull/2271)
- Built-in support for [Qwen2-vl-instruct](https://github.com/QwenLM/Qwen2-VL): [#2205](https://github.com/xorbitsai/inference/pull/2205)
- Built-in support for [MiniCPM3-4B](https://huggingface.co/openbmb/MiniCPM3-4B): [#2263](https://github.com/xorbitsai/inference/pull/2263)
-- Built-in support for [CogVideoX](https://github.com/THUDM/CogVideo): [#2049](https://github.com/xorbitsai/inference/pull/2049)
-- Built-in support for [flux.1-schnell & flux.1-dev](https://www.basedlabs.ai/tools/flux1): [#2007](https://github.com/xorbitsai/inference/pull/2007)
### Integrations
- [FastGPT](https://doc.fastai.site/docs/development/custom-models/xinference/): an open-source AI knowledge-base platform built on LLMs, offering out-of-the-box data processing, model invocation, RAG retrieval, and visual AI workflow orchestration to help you easily build complex question-answering scenarios.
- [Dify](https://docs.dify.ai/advanced/model-configuration/xinference): an LLMOps platform covering the development, deployment, maintenance, and optimization of large language models.
doc/source/models/builtin/llm/glm-edge-chat.rst: 111 changes (111 additions & 0 deletions, new file)
@@ -0,0 +1,111 @@
.. _models_llm_glm-edge-chat:

========================================
glm-edge-chat
========================================

- **Context Length:** 8192
- **Model Name:** glm-edge-chat
- **Languages:** en, zh
- **Abilities:** chat
- **Description:** The GLM-Edge series targets real-world edge deployment scenarios. It consists of two sizes each of large-language dialogue models and multimodal understanding models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models are aimed mainly at platforms such as mobile phones and in-car systems, while the 4B / 5B models target platforms such as PCs.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 1_5 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 1_5
- **Quantizations:** 4-bit, 8-bit, none
- **Engines**: Transformers
- **Model ID:** THUDM/glm-edge-1.5b-chat
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-1.5b-chat>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 1_5 --model-format pytorch --quantization ${quantization}
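
The model can also be launched and queried programmatically. The following is a
minimal sketch using the Python client, assuming a local Xinference supervisor at
the default ``http://127.0.0.1:9997``; exact parameter names may differ slightly
between Xinference versions::

    from xinference.client import RESTfulClient

    # Connect to a locally running Xinference supervisor (default endpoint assumed).
    client = RESTfulClient("http://127.0.0.1:9997")

    # Launch glm-edge-chat; "1_5" mirrors --size-in-billions 1_5 from the CLI.
    model_uid = client.launch_model(
        model_name="glm-edge-chat",
        model_engine="transformers",
        model_format="pytorch",
        model_size_in_billions="1_5",
        quantization="none",
    )

    # Send a simple chat request through the returned model handle.
    model = client.get_model(model_uid)
    print(model.chat(messages=[{"role": "user", "content": "Hello!"}]))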


Model Spec 2 (pytorch, 4 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 4
- **Quantizations:** 4-bit, 8-bit, none
- **Engines**: Transformers
- **Model ID:** THUDM/glm-edge-4b-chat
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-4b-chat>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 4 --model-format pytorch --quantization ${quantization}


Model Spec 3 (ggufv2, 1_5 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 1_5
- **Quantizations:** Q4_0, Q4_1, Q4_K, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K, Q5_K_M, Q5_K_S, Q6_K, Q8_0
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-1.5b-chat-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-1.5b-chat-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 1_5 --model-format ggufv2 --quantization ${quantization}


Model Spec 4 (ggufv2, 1_5 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 1_5
- **Quantizations:** F16
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-1.5b-chat-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-1.5b-chat-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 1_5 --model-format ggufv2 --quantization ${quantization}


Model Spec 5 (ggufv2, 4 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 4
- **Quantizations:** Q4_0, Q4_1, Q4_K, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K, Q5_K_M, Q5_K_S, Q6_K, Q8_0
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-4b-chat-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-4b-chat-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 4 --model-format ggufv2 --quantization ${quantization}


Model Spec 6 (ggufv2, 4 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 4
- **Quantizations:** F16
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-4b-chat-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-4b-chat-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-chat --size-in-billions 4 --model-format ggufv2 --quantization ${quantization}
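
Regardless of which spec is launched, the model is also reachable through
Xinference's OpenAI-compatible API. Below is a hypothetical sketch using the
``openai`` package, assuming the default endpoint and that the model UID equals
the model name::

    from openai import OpenAI

    # Point the OpenAI client at the local Xinference endpoint (assumed default).
    client = OpenAI(api_key="not-needed", base_url="http://127.0.0.1:9997/v1")

    response = client.chat.completions.create(
        model="glm-edge-chat",  # assumed model UID; check `xinference list` for yours
        messages=[{"role": "user", "content": "Summarize GLM-Edge in one sentence."}],
    )
    print(response.choices[0].message.content)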

doc/source/models/builtin/llm/glm-edge-v.rst: 143 changes (143 additions & 0 deletions, new file)
@@ -0,0 +1,143 @@
.. _models_llm_glm-edge-v:

========================================
glm-edge-v
========================================

- **Context Length:** 8192
- **Model Name:** glm-edge-v
- **Languages:** en, zh
- **Abilities:** chat, vision
- **Description:** The GLM-Edge series targets real-world edge deployment scenarios. It consists of two sizes each of large-language dialogue models and multimodal understanding models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models are aimed mainly at platforms such as mobile phones and in-car systems, while the 4B / 5B models target platforms such as PCs.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 2 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 2
- **Quantizations:** 4-bit, 8-bit, none
- **Engines**: Transformers
- **Model ID:** THUDM/glm-edge-v-2b
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-2b>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 2 --model-format pytorch --quantization ${quantization}
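
After launching, vision requests can be sent through the OpenAI-compatible
endpoint using the standard image-content message format. A sketch, assuming the
default endpoint, a model UID equal to the model name, and an illustrative image
URL::

    from openai import OpenAI

    client = OpenAI(api_key="not-needed", base_url="http://127.0.0.1:9997/v1")

    # Vision chat: the image is passed as an OpenAI-style image_url content part.
    response = client.chat.completions.create(
        model="glm-edge-v",  # assumed model UID
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)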


Model Spec 2 (pytorch, 5 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 5
- **Quantizations:** 4-bit, 8-bit, none
- **Engines**: Transformers
- **Model ID:** THUDM/glm-edge-v-5b
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-5b>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-5b>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 5 --model-format pytorch --quantization ${quantization}


Model Spec 3 (ggufv2, 2 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 2
- **Quantizations:** Q4_0, Q4_1, Q4_K, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K, Q5_K_M, Q5_K_S, Q6_K, Q8_0
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-v-2b-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-2b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 2 --model-format ggufv2 --quantization ${quantization}


Model Spec 4 (ggufv2, 2 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 2
- **Quantizations:** F16
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-v-2b-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-2b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 2 --model-format ggufv2 --quantization ${quantization}


Model Spec 5 (ggufv2, 2 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 2
- **Quantizations:** f16
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-v-2b-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-2b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 2 --model-format ggufv2 --quantization ${quantization}


Model Spec 6 (ggufv2, 5 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 5
- **Quantizations:** Q4_0, Q4_1, Q4_K, Q4_K_M, Q4_K_S, Q5_0, Q5_1, Q5_K, Q5_K_M, Q5_K_S, Q6_K, Q8_0
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-v-5b-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-5b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-5b-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 5 --model-format ggufv2 --quantization ${quantization}


Model Spec 7 (ggufv2, 5 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 5
- **Quantizations:** F16
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-v-5b-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-5b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-5b-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 5 --model-format ggufv2 --quantization ${quantization}


Model Spec 8 (ggufv2, 5 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 5
- **Quantizations:** f16
- **Engines**: llama.cpp
- **Model ID:** THUDM/glm-edge-v-5b-gguf
- **Model Hubs**: `Hugging Face <https://huggingface.co/THUDM/glm-edge-v-5b-gguf>`__, `ModelScope <https://modelscope.cn/models/ZhipuAI/glm-edge-v-5b-gguf>`__

Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with a quantization method from the options listed above::

xinference launch --model-engine ${engine} --model-name glm-edge-v --size-in-billions 5 --model-format ggufv2 --quantization ${quantization}

doc/source/models/builtin/llm/index.rst: 14 changes (14 additions & 0 deletions)

@@ -166,6 +166,16 @@ The following is a list of built-in LLM in Xinference:
- 8192
- GLM4 is the open source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI.

+* - :ref:`glm-edge-chat <models_llm_glm-edge-chat>`
+  - chat
+  - 8192
+  - The GLM-Edge series targets real-world edge deployment scenarios. It consists of two sizes each of large-language dialogue models and multimodal understanding models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models are aimed mainly at platforms such as mobile phones and in-car systems, while the 4B / 5B models target platforms such as PCs.

+* - :ref:`glm-edge-v <models_llm_glm-edge-v>`
+  - chat, vision
+  - 8192
+  - The GLM-Edge series targets real-world edge deployment scenarios. It consists of two sizes each of large-language dialogue models and multimodal understanding models (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models are aimed mainly at platforms such as mobile phones and in-car systems, while the 4B / 5B models target platforms such as PCs.

* - :ref:`glm4-chat <models_llm_glm4-chat>`
- chat, tools
- 131072
@@ -616,6 +626,10 @@ The following is a list of built-in LLM in Xinference:

glm-4v

+glm-edge-chat

+glm-edge-v

glm4-chat

glm4-chat-1m
doc/source/models/model_abilities/tools.rst: 8 changes (5 additions & 3 deletions)

@@ -33,9 +33,11 @@ Supported models
The ``tools`` ability is supported with the following models in Xinference:

* :ref:`models_llm_qwen-chat`
-* :ref:`models_llm_chatglm3`
-* :ref:`models_llm_gorilla-openfunctions-v1`
+* :ref:`models_llm_glm4-chat`
+* :ref:`models_llm_glm4-chat-1m`
+* :ref:`models_llm_llama-3.1-instruct`
+* :ref:`models_llm_qwen2.5-instruct`
+* :ref:`models_llm_qwen2.5-coder-instruct`
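
For a feel of the request shape before the Quickstart below, a hypothetical
tool-calling request against the OpenAI-compatible endpoint might look like this
(the ``get_weather`` tool and the model UID are illustrative assumptions)::

    from openai import OpenAI

    client = OpenAI(api_key="not-needed", base_url="http://127.0.0.1:9997/v1")

    # A single illustrative tool following the OpenAI function-calling schema.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="qwen2.5-instruct",  # assumed UID of a tools-capable model from the list above
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)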

Quickstart
==============
doc/source/models/model_abilities/vision.rst: 1 change (1 addition & 0 deletions)

@@ -33,6 +33,7 @@ The ``vision`` ability is supported with the following models in Xinference:
* :ref:`qwen2-vl-instruct <models_llm_qwen2-vl-instruct>`
* :ref:`llama-3.2-vision <models_llm_llama-3.2-vision>`
* :ref:`llama-3.2-vision-instruct <models_llm_llama-3.2-vision-instruct>`
+* :ref:`glm-edge-v <models_llm_glm-edge-v>`


Quickstart