Update: zh/inference-endpoints-llm.md (huggingface#1331)
* update soc3-zn

* Update _blog.yml

Try to resolve conflicts

* Update: proofreading zh/ethics-soc-3.md

* add how-to-generate cn version

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* unity game in hf space translation completed

* Update: punctuation of how-to-generate.md

* hf-bitsandbytes-integration cn done

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Proofread hf-bitsandbytes-integration.md

* Proofread: red-teaming.md

* Update: add red-teaming to zh/_blog.yml

* Update _blog.yml

* Update: add red-teaming to zh/_blog.yml

Fix: red-teaming title in zh/_blog.yml

* Fix: red-teaming PPLM translation

* deep-learning-with-proteins cn done

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Add: stackllama.md

* if blog translation completed

* Update unity-in-spaces.md

Add a link for AI game

* Update if.md

Fix “普罗大众” to “普惠大众”

* deep-learning-with-proteins cn done

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* add starcoder cn

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

Update: formatting and punctuation of starcoder.md

* add starcoder cn

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update: proofreading zh/unity-in-spaces.md

* fix(annotated-diffusion.md): fix image shape desc in PIL and Tensor (huggingface#1080)

modify the comment after ToTensor with the correct image shape CHW

* Add text-to-video blog (huggingface#1058)

Adds an overview of text-to-video generative models, task-specific challenges, datasets, and more.

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Fix broken link in text-to-video.md (huggingface#1083)

* Update: proofreading zh/unity-in-spaces.md

Fix: incorrect _blog.yml format

* Update: proofreading zh/deep-learning-with-proteins.md

* update ethics-diffusers-cn (#6)

* update ethics-diffusers

* update ethics-diffusers

---------

Co-authored-by: Zhongdong Yang <zhongdong_y@outlook.com>

* Update: proofreading zh/ethics-diffusers.md

* 1. introducing-csearch done (#11)

2. text-to-video done

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update: proofread zh/text-to-video.md

* Update: proofreading zh/introducing-csearch.md

* generative-ai-models-on-intel-cpu cn done (#13)

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

Update: proofread zh/generative-ai-models-on-intel-cpu.md
Signed-off-by: Yang, Zhongdong <zhongdong_y@outlook.com>

* add starchat-alpha zh translation (#10)

* Preparing blog post announcing `safetensors` security audit + official support. (huggingface#1096)

* Preparing blog post announcing `safetensors` security audit + official
support.

* Taking into account comments + Grammarly.

* Update safetensors-official.md

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update safetensors-official.md

* Apply suggestions from code review

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Apply suggestions from code review

* Update safetensors-official.md

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Apply suggestions from code review

* Adding thumbnail.

* Include changes from Stella.

* Update safetensors-official.md

* Update with Stella's comments.

* Remove problematic sentence.

* Rename + some rephrasing.

* Apply suggestions from code review

Co-authored-by: DeltaPenrose <128761972+DeltaPenrose@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: DeltaPenrose <128761972+DeltaPenrose@users.noreply.github.com>

* Update safetensors-security-audit.md

Co-authored-by: DeltaPenrose <128761972+DeltaPenrose@users.noreply.github.com>

* Last fixes.

* Apply suggestions from code review

Co-authored-by: DeltaPenrose <128761972+DeltaPenrose@users.noreply.github.com>

---------

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
Co-authored-by: DeltaPenrose <128761972+DeltaPenrose@users.noreply.github.com>

* Hotfixing safetensors. (huggingface#1131)

* Removing the checklist; formatting is busted. (huggingface#1132)

* Update safetensors-security-audit.md (huggingface#1134)

* [time series transformers] update dataloader API (huggingface#1135)

* update dataloader API

* revert comment

* add back Cached transform

* New post: Hugging Face and IBM (huggingface#1130)

* Initial version

* Minor fixes

* Update huggingface-and-ibm.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update huggingface-and-ibm.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Resize image

* Update blog index

---------

Co-authored-by: Julien Simon <julsimon@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Show authors of safetensors blog post (huggingface#1137)

Update: proofread zh/starchat-alpha.md

* add megatron-training & assisted-generation (#8)

* add megatron-training

* add megatron-training

* add megatron-training

* add megatron-training

* add assisted-generation

* add assisted-generation

* add assisted-generation

* Update: proofreading zh/assisted-generation

* Update: proofread zh/megatron-training.md

* rwkv model blog translation completed (#12)

* rwkv model blog translation completed

* add 3 additional parts in the blog tail

* Update: proofread zh/rwkv.md

* Fix: missing subtitle/notes for image references.

* encoder-decoder cn done (#14)

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Zhongdong Yang <zhongdong_y@outlook.com>

* Update: proofread zh/encoder-decoder.md

* constrained-beam-search cn done (#15)

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

Update: proofread zh/constrained-beam-search.md

* Update: zh/unity-api.md + zh/unity-asr.md

* unity ai speech recognition blog translation completed

* add (GameObject) to attach its Chinese translation

* finish unity-api translation

* add unity series entry to zh/_blog.yml

* Update: proofread zh/unity-{api,asr}.md

* Update zh/falcon.md

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

Update: zh/falcon.md

* instruction-tuning-sd cn done (#21)

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update: zh/instruction-tuning-sd.md

* fine-tune-whisper cn done (#23)

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update: zh/fine-tune-whisper.md

* add mms_adapters and policy (#22)

Update: zh/policy-ntia-rfc.md

* Update: refine zh/mms_adapters.md

Update: remove incomplete file

* Update: zh/llm-leaderboard.md, zh/autoformer.md

* add llm-leaderboard CN translation

* add CN translation for autoformer

* Update: proofreading zh/autoformer.md

* BridgeTower blog post (huggingface#1118)

* Update BridgeTower blog post (huggingface#1277)

* LLM Eval: minor typos and nits (huggingface#1263)

* Fix anchor link to custom pipeline section. (huggingface#485)

* Update: zh/llm-leaderboard.md, zh/autoformer.md

* add llm-leaderboard CN translation

* add CN translation for autoformer

Update: proofreading zh/autoformer.md

Update: proofreading zh/llm-leaderboard.md

* Update: proofreading zh/ethics-soc-4.md

* Update "How to deploy LLM" blog post to use `huggingface_hub` in example  (huggingface#1290)

* Use InferenceClient from huggingface_hub

* Update inference-endpoints-llm.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update BridgeTower blog post (huggingface#1295)

* Removed duplicate numbering (huggingface#1171)

* Update: zh/evaluating-mmlu-leaderboard.md

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Zhongdong Yang <zhongdong_y@outlook.com>

Update: proofreading zh/evaluating-mmlu-leaderboard.md

* Translate train-optimize-sd-intel.md to zh (#16)

* Translate "stackllama" into Chinese

* Create train-optimize-sd-intel.md

Add new

Update: zh/train-optimize-sd-intel.md

* Update: zh/dedup.md & zh/stable-diffusion-finetuning-intel.md

* dedup cn done

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* stable-diffusion-finetuning-intel cn done

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

Update: proofread zh/stable-diffusion-finetuning-intel.md

* Update: proofread zh/dedup.md

* Update: zh/inference-endpoints-llm.md

Co-authored-by: Zhongdong Yang <zhongdong_y@outlook.com>

Update: proofread zh/inference-endpoints-llm.md

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Signed-off-by: Yang, Zhongdong <zhongdong_y@outlook.com>
Co-authored-by: innovation64 <liyang19991126@126.com>
Co-authored-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: SuSung-boy <872414318@qq.com>
Co-authored-by: Luke Cheng <2258420+chenglu@users.noreply.github.com>
Co-authored-by: yaoqih <40328311+yaoqih@users.noreply.github.com>
Co-authored-by: Shiliang Chen <36809537+csl122@users.noreply.github.com>
Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: 李洋 <45715979+innovation64@users.noreply.github.com>
Co-authored-by: Yao Matrix <yaoweifeng0301@126.com>
Co-authored-by: Hoi2022 <120370631+Hoi2022@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
Co-authored-by: DeltaPenrose <128761972+DeltaPenrose@users.noreply.github.com>
Co-authored-by: Victor Muštar <victor.mustar@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Julien Simon <3436143+juliensimon@users.noreply.github.com>
Co-authored-by: Julien Simon <julsimon@huggingface.co>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: gxy-gxy <57594446+gxy-gxy@users.noreply.github.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Eswar Divi <76403422+EswarDivi@users.noreply.github.com>
Co-authored-by: Qi Zhang <82949744+Vermillion-de@users.noreply.github.com>
1 parent ec4acf0 commit 048a187
Showing 2 changed files with 208 additions and 0 deletions.
11 changes: 11 additions & 0 deletions zh/_blog.yml
@@ -696,6 +696,17 @@
  tags:
    - ethics

- local: inference-endpoints-llm
  title: "Deploy LLMs with Hugging Face Inference Endpoints"
  author: philschmid
  thumbnail: /blog/assets/155_inference_endpoints_llm/thumbnail.jpg
  date: July 4, 2023
  tags:
    - guide
    - llm
    - apps
    - inference

- local: stable-diffusion-finetuning-intel
  title: "Fine-tuning Stable Diffusion Models on Intel CPUs"
  author: juliensimon
197 changes: 197 additions & 0 deletions zh/inference-endpoints-llm.md
@@ -0,0 +1,197 @@
---
title: Deploy LLMs with Hugging Face Inference Endpoints
thumbnail: /blog/assets/155_inference_endpoints_llm/thumbnail.jpg
authors:
- user: philschmid
translators:
- user: innovation64
- user: zhongdongy
proofreader: true
---

# Deploy LLMs with Hugging Face Inference Endpoints

<!-- {blog_metadata} -->
<!-- {authors} -->

Open-source LLMs like [Falcon](https://huggingface.co/tiiuae/falcon-40b), [(Open-)LLaMA](https://huggingface.co/openlm-research/open_llama_13b), [X-Gen](https://huggingface.co/Salesforce/xgen-7b-8k-base), [StarCoder](https://huggingface.co/bigcode/starcoder), and [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Base) have come a long way in recent months and can compete with closed-source models like ChatGPT or GPT4 for certain use cases. However, deploying these models in an efficient and optimized way remains a challenge.

In this blog post, we will show you how to deploy open-source LLMs to [Hugging Face Inference Endpoints](https://ui.endpoints.huggingface.co/), our managed SaaS solution that makes it easy to deploy models. Additionally, we will teach you how to stream responses and test the performance of our endpoint. So let's get started!

1. [How to deploy Falcon 40B instruct](#1-how-to-deploy-falcon-40b-instruct)
2. [Test the LLM endpoint](#2-test-the-llm-endpoint)
3. [Stream responses in JavaScript and Python](#3-stream-responses-in-javascript-and-python)

Before we get started, let's refresh our knowledge about Inference Endpoints.

## What are Hugging Face Inference Endpoints?

[Hugging Face Inference Endpoints](https://ui.endpoints.huggingface.co/) offer an easy and secure way to deploy machine learning models for use in production. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.

Here are some of the most important features for LLM deployment:

1. [Easy Deployment](https://huggingface.co/docs/inference-endpoints/index): Deploy models as production-ready APIs with just a few clicks, eliminating the need to handle infrastructure or MLOps.
2. [Cost Efficiency](https://huggingface.co/docs/inference-endpoints/autoscaling): Benefit from automatic scale-to-zero, reducing costs by scaling down the infrastructure when the endpoint is not in use, while paying based on the uptime of the endpoint.
3. [Enterprise Security](https://huggingface.co/docs/inference-endpoints/security): Deploy models in secure offline endpoints only accessible via direct VPC connections, backed by SOC2 Type 2 certification, and offering BAA and GDPR data processing agreements for enhanced data security and compliance.
4. [LLM Optimization](https://huggingface.co/text-generation-inference): Optimized for LLMs, enabling high throughput and low latency with custom transformers code and Flash Attention.
5. [Comprehensive Task Support](https://huggingface.co/docs/inference-endpoints/supported_tasks): Out-of-the-box support for 🤗 Transformers, Sentence-Transformers, and Diffusers tasks and models, plus easy customization to enable advanced tasks like speaker diarization or any machine learning task and library.

You can get started with Inference Endpoints at [https://ui.endpoints.huggingface.co/](https://ui.endpoints.huggingface.co/).

## 1. How to deploy Falcon 40B instruct

To get started, you need to be logged in with a user or organization account that has a payment method on file (you can add one **[here](https://huggingface.co/settings/billing)**), then access Inference Endpoints at **[https://ui.endpoints.huggingface.co](https://ui.endpoints.huggingface.co/endpoints)**.

Then, click on "New endpoint". Select the repository, the cloud, and the region, adjust the instance and security settings, and deploy, in our case `tiiuae/falcon-40b-instruct`.

![Select Hugging Face Repository](https://huggingface.co/blog/assets/155_inference_endpoints_llm/repository.png "Select Hugging Face Repository")

Inference Endpoints suggest an instance type based on the model size, which should be big enough to run the model. Here it suggests `4x NVIDIA T4` GPUs. To get the best performance for the LLM, change the instance to `GPU [xlarge] · 1x Nvidia A100`.

_Note: If the instance type cannot be selected, you need to [contact us](mailto:api-enterprise@huggingface.co?subject=Quota%20increase%20HF%20Endpoints&body=Hello,%0D%0A%0D%0AI%20would%20like%20to%20request%20access/quota%20increase%20for%20{INSTANCE%20TYPE}%20for%20the%20following%20account%20{HF%20ACCOUNT}.) and request an instance quota._

![Select Instance Type](https://huggingface.co/blog/assets/155_inference_endpoints_llm/instance-selection.png "Select Instance Type")

You can then deploy your model by clicking on "Create Endpoint". After about 10 minutes, the endpoint should be online and available to serve requests.
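
If you prefer to script this step rather than click through the UI, recent versions of the `huggingface_hub` library expose a `create_inference_endpoint` helper. The sketch below is illustrative only: the endpoint name is hypothetical, and the exact `instance_type` and `instance_size` values available depend on your account quota and region.

```python
from huggingface_hub import create_inference_endpoint

# A minimal sketch, not the exact setup from this post: the endpoint name
# is hypothetical and the instance values below are assumptions that may
# differ from what your account offers.
endpoint = create_inference_endpoint(
    "falcon-40b-instruct-demo",   # hypothetical endpoint name
    repository="tiiuae/falcon-40b-instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="xlarge",       # assumed value
    instance_type="nvidia-a100",  # assumed value
)

endpoint.wait()      # block until the endpoint is up
print(endpoint.url)  # the URL to send requests to
```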

## 2. Test the LLM endpoint

The endpoint overview provides access to the Inference Widget, which can be used to send requests manually. This allows you to quickly test your endpoint with different inputs and share it with team members. Note that the widget does not support parameters, which in this case results in a "short" generation.

![Test Inference Widget](https://huggingface.co/blog/assets/155_inference_endpoints_llm/widget.png "Test Inference Widget")

The widget also generates a cURL command you can use. Just add your `hf_xxx` token and test it.

```bash
curl https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud \
  -X POST \
  -d '{"inputs":"Once upon a time,"}' \
  -H "Authorization: Bearer <hf_token>" \
  -H "Content-Type: application/json"
```

You can use different parameters to control the generation, defining them in the `parameters` attribute of the payload. As of today, the following parameters are supported (a payload sketch follows this list):

- `temperature`: Controls the randomness of the model. Lower values make the model more deterministic and higher values make it more random. Default is 1.0.
- `max_new_tokens`: The maximum number of tokens to generate. Default is 20, maximum is 512.
- `repetition_penalty`: Controls the likelihood of repetition. Default is `null`.
- `seed`: The seed to use for random generation. Default is `null`.
- `stop`: A list of tokens that stop the generation. Generation stops when one of these tokens is generated.
- `top_k`: The number of highest-probability vocabulary tokens to keep for top-k filtering. Default is `null`, which disables top-k filtering.
- `top_p`: The cumulative probability of the highest-probability vocabulary tokens kept for nucleus sampling. Default is `null`.
- `do_sample`: Whether or not to use sampling; greedy decoding is used otherwise. Default is `false`.
- `best_of`: Generate `best_of` sequences and return the one with the highest token logprobs. Default is `null`.
- `details`: Whether or not to return details about the generation. Default is `false`.
- `return_full_text`: Whether or not to return the full text or only the generated part. Default is `false`.
- `truncate`: Whether or not to truncate the input to the model's maximum length. Default is `true`.
- `typical_p`: The typical probability of a token. Default is `null`.
- `watermark`: The watermark to use for the generation. Default is `false`.
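
For example, to reproduce the cURL request above with a few of these parameters, you can send the `parameters` attribute alongside `inputs`. A minimal sketch using Python's `requests` library, where the endpoint URL and token are placeholders for your own:

```python
import requests

# placeholders: replace with your own endpoint URL and token
API_URL = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
headers = {
    "Authorization": "Bearer hf_YOUR_TOKEN",
    "Content-Type": "application/json",
}

# the generation parameters go in the "parameters" attribute of the payload
payload = {
    "inputs": "Once upon a time,",
    "parameters": {
        "do_sample": True,
        "temperature": 0.7,
        "top_k": 50,
        "max_new_tokens": 256,
        "stop": ["\n\n"],  # illustrative stop sequence
    },
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())
```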

## 3. Stream responses in JavaScript and Python

Requesting and generating text with LLMs can be a time-consuming and iterative process. A great way to improve the user experience is to stream tokens to the user as they are generated. Below are two examples of how to stream tokens with Python and JavaScript. For Python, we will use the [client from Text Generation Inference](https://github.com/huggingface/text-generation-inference/tree/main/clients/python), and for JavaScript, the [HuggingFace.js library](https://huggingface.co/docs/huggingface.js/main/en/index).

### Streaming requests with Python

First, you need to install the `huggingface_hub` library:

```bash
pip install -U huggingface_hub
```

We can create an `InferenceClient`, providing our endpoint URL and credentials alongside the hyperparameters we want to use.

```python
from huggingface_hub import InferenceClient

# HF Inference Endpoints parameter
endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
hf_token = "hf_YOUR_TOKEN"

# Streaming Client
client = InferenceClient(endpoint_url, token=hf_token)

# generation parameters
gen_kwargs = dict(
    max_new_tokens=512,
    top_k=30,
    top_p=0.9,
    temperature=0.2,
    repetition_penalty=1.02,
    stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
)

# prompt
prompt = "What can you do in Nuremberg, Germany? Give me 3 Tips"

stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)

# yield each generated token
for r in stream:
    # skip special tokens
    if r.token.special:
        continue
    # stop if we encounter a stop sequence
    if r.token.text in gen_kwargs["stop_sequences"]:
        break
    # yield the generated token
    print(r.token.text, end="")
    # yield r.token.text
```

Replace the `print` command with a `yield` or with a function you want to stream the tokens to, as in the sketch below.
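
As a minimal sketch, assuming the `client` and `gen_kwargs` defined in the snippet above, the loop can be wrapped in a generator so another component (for example a web framework) can consume the tokens:

```python
def generate_stream(prompt: str):
    """Yield generated tokens one by one; reuses client/gen_kwargs from above."""
    stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
    for r in stream:
        # skip special tokens
        if r.token.special:
            continue
        # stop if we encounter a stop sequence
        if r.token.text in gen_kwargs["stop_sequences"]:
            break
        yield r.token.text

# usage: print tokens as they arrive
for token in generate_stream("What can you do in Nuremberg, Germany? Give me 3 Tips"):
    print(token, end="")
```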

![Python Streaming](https://huggingface.co/blog/assets/155_inference_endpoints_llm/python-stream.gif "Python Streaming")

### Streaming requests with JavaScript

First, you need to install the `@huggingface/inference` library:

```bash
npm install @huggingface/inference
```

We can create an `HfInferenceEndpoint`, providing our endpoint URL and credentials alongside the hyperparameters we want to use.

```jsx
import { HfInferenceEndpoint } from '@huggingface/inference'

const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.cloud', 'hf_YOUR_TOKEN')

// generation parameters
const gen_kwargs = {
  max_new_tokens: 512,
  top_k: 30,
  top_p: 0.9,
  temperature: 0.2,
  repetition_penalty: 1.02,
  stop_sequences: ['\nUser:', '<|endoftext|>', '</s>'],
}
// prompt
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'

const stream = hf.textGenerationStream({ inputs: prompt, parameters: gen_kwargs })
for await (const r of stream) {
  // skip special tokens
  if (r.token.special) {
    continue
  }
  // stop if we encounter a stop sequence
  if (gen_kwargs['stop_sequences'].includes(r.token.text)) {
    break
  }
  // yield the generated token
  process.stdout.write(r.token.text)
}
```

Replace the `process.stdout` call with a `yield` or with a function you want to stream the tokens to.

![Javascript Streaming](https://huggingface.co/blog/assets/155_inference_endpoints_llm/js-stream.gif "Javascript Streaming")

## Conclusion

In this blog post, we showed you how to deploy open-source LLMs using Hugging Face Inference Endpoints, how to control the text generation with advanced parameters, and how to stream responses to a Python or JavaScript client to improve the user experience. By using Hugging Face Inference Endpoints, you can deploy models as production-ready APIs with just a few clicks, reduce your costs with automatic scale-to-zero, and deploy models into secure offline endpoints backed by SOC2 Type 2 certification.

---

Thanks for reading! If you have any questions, feel free to contact me on [Twitter](https://twitter.com/_philschmid) or [LinkedIn](https://www.linkedin.com/in/philipp-schmid-a6a2bb196/).
