[LLM] Support deploy LLM model #2515

Merged
merged 2 commits on Aug 29, 2024
18 changes: 17 additions & 1 deletion .gitignore
@@ -50,4 +50,20 @@ python/fastdeploy/code_version.py
log.txt
serving/build
serving/build.encrypt
serving/build.encrypt.auth
serving/build.encrypt.auth
output
res
tmp
log
nohup.out
llm/server/__pycache__
llm/server/data/__pycache__
llm/server/engine/__pycache__
llm/server/http_server/__pycache__
llm/server/log/
llm/client/build/
llm/client/dist/
llm/client/fastdeploy_client.egg-info/
llm/client/fastdeploy_client/tests/log/
*.pyc
*.log
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -1,6 +1,6 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: a11d9314b22d8f8c7556443875b731ef05965464
rev: ed714747d7acbc5790b171702bb012af3b0fe145
hooks:
- id: check-merge-conflict
- id: check-symlinks
@@ -9,8 +9,8 @@ repos:
- id: detect-private-key
- id: check-symlinks
- id: check-added-large-files
- repo: local

- repo: local
hooks:
- id: copyright_checker
name: copyright_checker
11 changes: 11 additions & 0 deletions llm/.dockerignore
@@ -0,0 +1,11 @@
README.md
requirements-dev.txt
pyproject.toml
Makefile

dockerfiles/
docs/
server/__pycache__
server/http_server
server/engine
server/data
39 changes: 39 additions & 0 deletions llm/README.md
@@ -0,0 +1,39 @@

<h1 align="center"><b><em>FastDeploy: PaddlePaddle's High-Performance Large Model Deployment Toolkit</em></b></h1>

*FastDeploy is a large-model serving solution built on NVIDIA's Triton framework and designed for server-side deployment scenarios. It provides service interfaces over gRPC and HTTP, as well as streaming token output. The underlying inference engine supports acceleration and optimization strategies such as continuous batching, weight-only int8, and post-training quantization (PTQ), delivering an easy-to-use, high-performance deployment experience.*

# Quick Start

Deployment is based on the prebuilt image. This section uses Meta-Llama-3-8B-Instruct-A8W8C8 as an example; for more models, see [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md), [Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md), and [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md). For more detailed tutorials on model inference and quantization, see the [large model inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md):

```
# Download the model
wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar
mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8

# Mount the model files
export MODEL_PATH=${PWD}/Llama-3-8B-A8W8C8

docker run --gpus all --shm-size 5G --network=host \
-v ${MODEL_PATH}:/models/ \
-dit registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.0 \
bash -c 'export USE_CACHE_KV_INT8=1 && cd /opt/output/Serving && bash start_server.sh; exec bash'
```

Wait for the service to start (the first startup takes roughly 40 s), then test it with the following command:

```
curl 127.0.0.1:9965/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"text": "hello, llm"}'
```
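
The same smoke test can be sent from Python. Below is a minimal sketch using the `requests` library (assumed to be installed); it simply prints whatever the service returns:

```
import requests

# Same request as the curl command above, against the default port 9965.
resp = requests.post(
    "http://127.0.0.1:9965/v1/chat/completions",
    json={"text": "hello, llm"},
)
print(resp.text)
```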

Note:
1. Make sure shm-size >= 5G; otherwise the service may fail to start.

For more on using FastDeploy, see the [serving deployment tutorial](https://github.com/PaddlePaddle/FastDeploy/blob/develop/llm/docs/FastDeploy_usage_tutorial.md).

# License

FastDeploy is licensed under the [Apache-2.0 License](https://github.com/PaddlePaddle/FastDeploy/blob/develop/LICENSE).
110 changes: 110 additions & 0 deletions llm/client/README.md
@@ -0,0 +1,110 @@
# Client Usage

## Introduction

The FastDeploy client provides a command-line interface and a Python interface for quickly calling LLM services deployed with the FastDeploy backend.

## Installation

Install from source:
```
pip install .
```

## Command-Line Interface

First set the model service URL through the environment variable below, then call the model service from the command line.

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| FASTDEPLOY_MODEL_URL | IP address and port of the deployed model service, in the format `x.x.x.x:xxx`. | Yes | |

```
export FASTDEPLOY_MODEL_URL="x.x.x.x:xxx"

# Streaming interface
fdclient stream_generate "你好?"

# Non-streaming interface
fdclient generate "你好,你是谁?"
```

## Python Interface

First set the model service URL (hostname + port) in Python code, then call the model service through the Python interface.

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| hostname + port | IP address and port of the deployed model service; hostname in the format `x.x.x.x`. | Yes | |


```
from fastdeploy_client.chatbot import ChatBot

hostname = "x.x.x.x"
port = xxx

# Streaming interface; see the parameter description below for the stream_generate API
chatbot = ChatBot(hostname=hostname, port=port)
stream_result = chatbot.stream_generate("你好", topp=0.8)
for res in stream_result:
    print(res)

# Non-streaming interface; see the parameter description below for the generate API
chatbot = ChatBot(hostname=hostname, port=port)
result = chatbot.generate("你好", topp=0.8)
print(result)
```

### API Reference
```
ChatBot.stream_generate(message,
                        max_dec_len=1024,
                        min_dec_len=2,
                        topp=0.0,
                        temperature=1.0,
                        frequency_score=0.0,
                        penalty_score=1.0,
                        presence_score=0.0,
                        eos_token_ids=254186)

# This function returns an iterator whose elements are dicts, e.g. {"token": "好的", "is_end": 0},
# where token is the generated text and is_end indicates whether it is the last token (0 = no, 1 = yes).
# Note: if generation fails, an error message is returned; eos_token_ids differs between models.
```
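
Putting this together, here is a minimal sketch of consuming the stream. The hostname and port are placeholders, and the error handling is an assumption, since the doc only states that an error message is returned on failure:

```
from fastdeploy_client.chatbot import ChatBot

# Placeholder address; replace with the actual service hostname and port.
chatbot = ChatBot(hostname="127.0.0.1", port=9965)

# Collect the streamed tokens until is_end marks the last one.
pieces = []
for res in chatbot.stream_generate("你好", topp=0.8):
    if "token" not in res:  # assumed error shape: failures carry no token field
        print("generation failed:", res)
        break
    pieces.append(res["token"])
    if res["is_end"] == 1:
        break
print("".join(pieces))
```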

```
ChatBot.generate(message,
                 max_dec_len=1024,
                 min_dec_len=2,
                 topp=0.0,
                 temperature=1.0,
                 frequency_score=0.0,
                 penalty_score=1.0,
                 presence_score=0.0,
                 eos_token_ids=254186)

# This function returns a dict, e.g. {"results": "好的,我知道了。"}, where results is the generated text.
# Note: if generation fails, an error message is returned; eos_token_ids differs between models.
```
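
Correspondingly, a non-streaming result can be checked for the `results` key before use (again a sketch; the error shape is an assumption):

```
# Reusing the chatbot from the streaming sketch above.
result = chatbot.generate("你好", topp=0.8)
if isinstance(result, dict) and "results" in result:
    print(result["results"])
else:
    print("generation failed:", result)  # error message on failure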

### Parameters

| Field | Type | Description | Required | Default | Notes |
| :---: | :-----: | :---: | :---: | :-----: | :----: |
| req_id | str | Request ID identifying a request; setting a unique req_id is recommended | No | random ID | If the inference service receives two requests with the same req_id at the same time, a duplicate-req_id error is returned |
| text | str | Input text of the request | Yes | none | |
| max_dec_len | int | Maximum number of generated tokens; if the input's token length plus max_dec_len exceeds the model's max_seq_len, a length-exceeded error is returned | No | max_seq_len minus the input token length | |
| min_dec_len | int | Minimum number of generated tokens; at least 1 | No | 1 | |
| topp | float | Sampling randomness; larger values are more random; range 0-1 | No | 0.7 | |
| temperature | float | Sampling randomness; smaller values are more random; must be greater than 0 | No | 0.95 | |
| frequency_score | float | Frequency penalty score | No | 0 | |
| penalty_score | float | Repetition penalty score | No | 1 | |
| presence_score | float | Presence penalty score | No | 0 | |
| stream | bool | Whether to stream the results | No | False | See the notes below the table for how this differs from return_all_tokens |
| return_all_tokens | bool | Whether to return all tokens at once | No | False | See the notes below the table for how this differs from stream |
| timeout | int | Request timeout in seconds | No | 300 | |

* With the PUSH_MODE_HTTP_PORT field correctly configured, the service accepts both gRPC and HTTP requests
* The stream parameter only takes effect for HTTP requests
* The return_all_tokens parameter applies to both gRPC and HTTP requests (an HTTP sketch follows this list)
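
To illustrate the HTTP-only fields, here is a hedged sketch of a raw HTTP request built from the table above. It assumes the service exposes the same `/v1/chat/completions` endpoint as the quick start, with `PUSH_MODE_HTTP_PORT` set to 9965, and that the request body carries the fields exactly as named in the table:

```
import requests

# Placeholder endpoint; assumes PUSH_MODE_HTTP_PORT=9965 on localhost.
url = "http://127.0.0.1:9965/v1/chat/completions"

payload = {
    "text": "你好",             # required input text
    "max_dec_len": 256,         # cap the generated length
    "topp": 0.7,
    "stream": False,            # stream only takes effect over HTTP
    "return_all_tokens": True,  # return the full result at once
    "timeout": 60,              # seconds
}
resp = requests.post(url, json=payload)
print(resp.json())
```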
20 changes: 20 additions & 0 deletions llm/client/fastdeploy_client/__init__.py
@@ -0,0 +1,20 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import sys

__version__ = "4.4.0"

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)