From 4f033b0a9d5faadecea09b224b49f84d42d2f2b7 Mon Sep 17 00:00:00 2001
From: ChengjieLi
Date: Fri, 7 Jun 2024 12:46:44 +0800
Subject: [PATCH 1/2] doc

---
 doc/source/user_guide/continuous_batching.rst | 69 +++++++++++++++++++
 doc/source/user_guide/index.rst               |  1 +
 2 files changed, 70 insertions(+)
 create mode 100644 doc/source/user_guide/continuous_batching.rst

diff --git a/doc/source/user_guide/continuous_batching.rst b/doc/source/user_guide/continuous_batching.rst
new file mode 100644
index 0000000000..7c3a468099
--- /dev/null
+++ b/doc/source/user_guide/continuous_batching.rst
@@ -0,0 +1,69 @@
+.. _user_guide_continuous_batching:
+
+==================================
+Continuous Batching (experimental)
+==================================
+
+Continuous batching, as a means to improve throughput during model serving, has already been implemented in inference engines like ``vLLM``.
+Xinference aims to provide this optimization capability when using the transformers engine as well.
+
+Usage
+=====
+Currently, this feature can be enabled under the following conditions:
+
+* First, set the environment variable ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` to ``1`` when starting Xinference. For example:
+
+.. code-block:: bash
+
+    XINFERENCE_TRANSFORMERS_ENABLE_BATCHING=1 xinference-local --log-level debug
+
+
+* Then, ensure that the ``transformers`` engine is selected when launching the model. For example:
+
+.. tabs::
+
+  .. code-tab:: bash shell
+
+    xinference launch -e http://127.0.0.1:9997 --model-engine transformers -n qwen1.5-chat -s 4 -f pytorch -q none
+
+  .. code-tab:: bash cURL
+
+    curl -X 'POST' \
+      'http://127.0.0.1:9997/v1/models' \
+      -H 'accept: application/json' \
+      -H 'Content-Type: application/json' \
+      -d '{
+        "model_engine": "transformers",
+        "model_name": "qwen1.5-chat",
+        "model_format": "pytorch",
+        "size_in_billions": 4,
+        "quantization": "none"
+      }'
+
+  .. code-tab:: python
+
+    from xinference.client import Client
+    client = Client("http://127.0.0.1:9997")
+    model_uid = client.launch_model(
+      model_engine="transformers",
+      model_name="qwen1.5-chat",
+      model_format="pytorch",
+      model_size_in_billions=4,
+      quantization="none"
+    )
+    print('Model uid: ' + model_uid)
+
+
+Once this feature is enabled, all ``chat`` requests will be managed by continuous batching,
+and the average throughput of requests made to a single model will increase.
+The usage of the ``chat`` interface remains exactly the same as before.
+
+Note
+====
+
+* Currently, this feature only supports the ``chat`` interface for ``LLM`` models.
+
+* When using GPU inference, this feature will consume more GPU memory. Please be cautious when increasing the number of concurrent requests to the same model.
+  The ``launch_model`` interface provides the ``max_num_seqs`` parameter to adjust the concurrency level, with a default value of ``16``.
+
+* This feature is still in the experimental stage, and we welcome your active feedback on any issues.
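+
+Example
+=======
+
+The following is a minimal sketch of sending several concurrent ``chat`` requests to a model
+launched as shown above, using Xinference's OpenAI-compatible ``/v1/chat/completions`` endpoint.
+The endpoint address, model name, prompts, and worker count are illustrative and should be adapted
+to your own deployment.
+
+.. code-block:: python
+
+    import concurrent.futures
+
+    import requests
+
+    # Xinference exposes an OpenAI-compatible chat endpoint; adjust host/port to your deployment.
+    URL = "http://127.0.0.1:9997/v1/chat/completions"
+    # Use the model uid returned by launch_model (often identical to the model name).
+    MODEL = "qwen1.5-chat"
+
+    PROMPTS = [
+        "Summarize continuous batching in one sentence.",
+        "What is the capital of France?",
+        "Write a haiku about GPUs.",
+        "Explain what a KV cache is in one sentence.",
+    ]
+
+    def chat(prompt: str) -> str:
+        # A standard OpenAI-style chat payload; nothing extra is required for batching.
+        payload = {
+            "model": MODEL,
+            "messages": [{"role": "user", "content": prompt}],
+        }
+        resp = requests.post(URL, json=payload, timeout=120)
+        resp.raise_for_status()
+        return resp.json()["choices"][0]["message"]["content"]
+
+    # Send the requests concurrently. With batching enabled, the server schedules them
+    # together instead of processing them strictly one after another, which is where the
+    # throughput gain comes from.
+    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
+        for prompt, answer in zip(PROMPTS, pool.map(chat, PROMPTS)):
+            print(f"Q: {prompt}\nA: {answer}\n")
+
+If you need to raise or lower the concurrency ceiling, ``max_num_seqs`` can be passed to
+``launch_model`` alongside the other parameters shown earlier (for example, something like
+``client.launch_model(..., max_num_seqs=32)``); the value ``32`` here is purely illustrative.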
diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
index edacf8ba10..8ba1eeb5d9 100644
--- a/doc/source/user_guide/index.rst
+++ b/doc/source/user_guide/index.rst
@@ -11,3 +11,4 @@ User Guide
    client_api
    auth_system
    metrics
+   continuous_batching

From 4534279d65e4a56efee2dc467f796dcaab20943c Mon Sep 17 00:00:00 2001
From: ChengjieLi
Date: Fri, 7 Jun 2024 14:48:47 +0800
Subject: [PATCH 2/2] chinese doc

---
 .../user_guide/continuous_batching.po         | 93 +++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100644 doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po

diff --git a/doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po b/doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po
new file mode 100644
index 0000000000..b192ebc6df
--- /dev/null
+++ b/doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po
@@ -0,0 +1,93 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2023, Xorbits Inc.
+# This file is distributed under the same license as the Xinference package.
+# FIRST AUTHOR , 2024.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: Xinference \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2024-06-07 14:38+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.11.0\n"
+
+#: ../../source/user_guide/continuous_batching.rst:5
+msgid "Continuous Batching (experimental)"
+msgstr "连续批处理(实验性质)"
+
+#: ../../source/user_guide/continuous_batching.rst:7
+msgid ""
+"Continuous batching, as a means to improve throughput during model "
+"serving, has already been implemented in inference engines like ``vLLM``."
+" Xinference aims to provide this optimization capability when using the "
+"transformers engine as well."
+msgstr ""
+"连续批处理是诸如 ``vLLM`` 这样的推理引擎中提升吞吐的重要技术。Xinference 旨在"
+"通过这项技术提升 ``transformers`` 推理引擎的吞吐。"
+
+#: ../../source/user_guide/continuous_batching.rst:11
+msgid "Usage"
+msgstr "使用方式"
+
+#: ../../source/user_guide/continuous_batching.rst:12
+msgid "Currently, this feature can be enabled under the following conditions:"
+msgstr "当前,此功能在满足以下条件时开启:"
+
+#: ../../source/user_guide/continuous_batching.rst:14
+msgid ""
+"First, set the environment variable "
+"``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` to ``1`` when starting "
+"Xinference. For example:"
+msgstr ""
+"首先,启动 Xinference 时需要将环境变量 ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` 置为 ``1`` 。"
+
+#: ../../source/user_guide/continuous_batching.rst:21
+msgid ""
+"Then, ensure that the ``transformers`` engine is selected when launching "
+"the model. For example:"
+msgstr ""
+"然后,启动 LLM 模型时选择 ``transformers`` 推理引擎。例如:"
+
+#: ../../source/user_guide/continuous_batching.rst:57
+msgid ""
+"Once this feature is enabled, all ``chat`` requests will be managed by "
+"continuous batching, and the average throughput of requests made to a "
+"single model will increase. The usage of the ``chat`` interface remains "
+"exactly the same as before."
+msgstr ""
+"一旦此功能开启,``chat`` 接口将被此功能接管,别的接口不受影响。``chat`` 接口的使用方式没有任何变化。"
+
+#: ../../source/user_guide/continuous_batching.rst:62
+msgid "Note"
+msgstr "注意事项"
+
+#: ../../source/user_guide/continuous_batching.rst:64
+msgid ""
+"Currently, this feature only supports the ``chat`` interface for ``LLM`` "
+"models."
+msgstr "当前,此功能仅支持 LLM 模型的 ``chat`` 功能。" + +#: ../../source/user_guide/continuous_batching.rst:66 +msgid "" +"If using GPU inference, this method will consume more GPU memory. Please " +"be cautious when increasing the number of concurrent requests to the same" +" model. The ``launch_model`` interface provides the ``max_num_seqs`` " +"parameter to adjust the concurrency level, with a default value of " +"``16``." +msgstr "" +"如果使用 GPU 推理,此功能对显存要求较高。因此请谨慎提高对同一个模型的并发请求量。" +"``launch_model`` 接口提供可选参数 ``max_num_seqs`` 用于调整并发度,默认值为 ``16`` 。" + +#: ../../source/user_guide/continuous_batching.rst:69 +msgid "" +"This feature is still in the experimental stage, and we welcome your " +"active feedback on any issues." +msgstr "" +"此功能仍处于实验阶段,欢迎反馈任何问题。" +