# 🚀 Intel Gaudi Meets Hugging Face: Supercharging Text Generation Inference

We’re thrilled to announce that **Intel Gaudi accelerators** are now integrated into Hugging Face’s [**Text Generation Inference (TGI)**](https://github.com/huggingface/text-generation-inference) project! This collaboration brings together the power of Intel’s high-performance AI hardware and Hugging Face’s state-of-the-art NLP software stack, enabling faster, more efficient, and scalable text generation for everyone.

Whether you’re building chatbots, generating creative content, or deploying large language models (LLMs) in production, this integration unlocks new possibilities for performance and cost-efficiency. Let’s dive into the details!

---

## 🤖 What is Text Generation Inference (TGI)?

Hugging Face’s **Text Generation Inference** is an open-source project designed to make deploying and serving large language models (LLMs) for text generation as seamless as possible. It powers popular models such as GPT, T5, and BLOOM, and provides features like:

- **High-performance inference**: Optimized for low latency and high throughput.
- **Scalability**: Built to handle large-scale deployments.
- **Ease of use**: Simple APIs and integrations for developers.

With TGI, you can deploy LLMs in production with confidence, knowing you’re leveraging the latest advancements in inference optimization.

---

## 🚀 Introducing Intel Gaudi Accelerators

Intel Gaudi accelerators are designed to deliver exceptional performance for AI workloads, particularly in training and inference for deep learning models. With features like high memory bandwidth, efficient tensor processing, and scalability across multiple devices, Gaudi accelerators are a perfect match for demanding NLP tasks like text generation.

By integrating Gaudi into TGI, we’re enabling users to:

- **Reduce inference costs**: Gaudi’s efficiency translates to lower operational expenses.
- **Scale seamlessly**: Handle larger models and higher request volumes with ease.
- **Achieve faster response times**: Optimized hardware for faster text generation.

---

## 🛠️ How It Works

The integration of Intel Gaudi into TGI leverages the **Habana SynapseAI SDK**, which provides optimized libraries and tools for running AI workloads on Gaudi hardware. Here’s how it works under the hood:

1. **Model Optimization**: TGI now supports Gaudi’s custom kernels and optimizations, ensuring that text generation models run efficiently on Gaudi accelerators.
2. **Seamless Deployment**: With just a few configuration changes, you can deploy your favorite Hugging Face models on Gaudi-powered infrastructure.
3. **Scalable Inference**: Gaudi’s architecture allows for multi-device setups, enabling you to scale inference horizontally as your needs grow.
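
Before deploying, it can help to confirm that the host sees the Gaudi devices and that the Habana container runtime is registered with Docker. A minimal sanity check, assuming the Gaudi driver and container runtime are already installed:

```bash
# List the Gaudi devices visible on the host (similar to nvidia-smi for GPUs)
hl-smi

# Confirm that Docker knows about the "habana" runtime used in the commands below
docker info | grep -i runtimes
```
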
---

## 🚀 Getting Started with TGI on Intel Gaudi

Ready to try it out? Here’s a quick guide to deploying a text generation model on Intel Gaudi using TGI:

### Step 1: Build the tgi-gaudi image
Ensure you have access to a Gaudi accelerator and install the required dependencies:
```bash
# Build the Text Generation Inference image with Gaudi support
git clone https://github.com/huggingface/text-generation-inference.git
cd text-generation-inference/backends/gaudi
make image
```
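
The run commands below reference the image as `tgi-gaudi:latest`; you can confirm the build produced it with:

```bash
# The resulting image should appear in your local Docker image list
docker images | grep tgi-gaudi
```
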

### Step 2: Deploy Your Model
Run the image you just built with Docker to serve a model on Gaudi:

```bash
MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
HF_HOME=<your huggingface home directory>
HF_TOKEN=<your huggingface token>

# --runtime=habana and HABANA_VISIBLE_DEVICES expose the Gaudi devices to the
# container; the HPU graph and FlashAttention variables enable the Gaudi-specific
# optimizations described above.
docker run -it -p 8080:80 \
   --runtime=habana \
   -v $HF_HOME:/data \
   -e HABANA_VISIBLE_DEVICES=all \
   -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
   -e PREFILL_BATCH_BUCKET_SIZE=16 \
   -e BATCH_BUCKET_SIZE=16 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
   -e ENABLE_HPU_GRAPH=true \
   -e LIMIT_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-gaudi:latest --model-id $MODEL \
   --max-input-length 1024 --max-total-tokens 2048 \
   --max-batch-prefill-tokens 65536 --max-batch-size 64 \
   --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 256
```
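
Startup includes the model download and a warmup phase, so the first launch can take a while. Once the container logs report that the server is ready, you can sanity-check the endpoint; a quick check, assuming the port mapping above:

```bash
# Returns HTTP 200 once the server is ready to accept requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8080/health

# Prints the loaded model id and the serving configuration
curl -s localhost:8080/info
```
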
### Step 3: Generate Text
Send requests to your deployed model using the TGI API:
```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```
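
The request above uses TGI’s OpenAI-compatible Messages API; the native `/generate` route works as well. A minimal non-streaming sketch (prompt and token count are just illustrative):

```bash
curl localhost:8080/generate \
    -X POST \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}' \
    -H 'Content-Type: application/json'
```
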
---

## 📊 Performance Benchmarks

### Step 1: Deploy meta-llama/Meta-Llama-3.1-8B-Instruct
Adjust parameters such as `--max-batch-size` and `--max-input-length` to match the workload you are running, then deploy the model:
```bash
MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
HF_HOME=<your huggingface home directory>
HF_TOKEN=<your huggingface token>

docker run -it -p 8080:80 \
   --runtime=habana \
   -v $HF_HOME:/data \
   -e HABANA_VISIBLE_DEVICES=all \
   -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
   -e PREFILL_BATCH_BUCKET_SIZE=16 \
   -e BATCH_BUCKET_SIZE=16 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
   -e ENABLE_HPU_GRAPH=true \
   -e LIMIT_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-gaudi:latest --model-id $MODEL \
   --max-input-length 512 --max-total-tokens 1536 \
   --max-batch-prefill-tokens 131072 --max-batch-size 256 \
   --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512
```
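
The server performs a warmup phase before it starts serving, so wait for readiness before launching the benchmark. One way to do that, a hedged sketch assuming the health route and port mapping used above:

```bash
# Poll the health endpoint until the server has finished warmup and is ready
until curl -sf localhost:8080/health > /dev/null; do
  sleep 10
done
echo "TGI server is ready"
```
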
### Step 2: Run the inference-benchmarker
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your huggingface token>
RESULT=<directory of the result data>
docker run \
    --rm \
    -it \
    --net host \
    --cap-add=sys_nice \
    -v $RESULT:/opt/inference-benchmarker/results \
    -e "HF_TOKEN=$HF_TOKEN" \
    ghcr.io/huggingface/inference-benchmarker:latest \
    inference-benchmarker \
    --tokenizer-name "$MODEL" \
    --profile fixed-length \
    --url http://localhost:8080
```
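
The run above produces a warmup pass, a maximum-throughput pass, and a sweep of constant request rates; the raw measurements are written into the mounted results directory, and one such run is summarized in the table below.

```bash
# Raw benchmark output lands in the directory mounted at /opt/inference-benchmarker/results
ls $RESULT
```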

| Benchmark | QPS | E2E Latency (avg) | TTFT (avg) | ITL (avg) | Throughput | Error Rate | Successful Requests | Prompt tokens per req (avg) | Decoded tokens per req (avg) |
|--------------------|------------|-------------------|------------|-----------|--------------------|------------|---------------------|-----------------------------|------------------------------|
| warmup | 0.13 req/s | 7.68 sec | 171.90 ms | 9.40 ms | 104.11 tokens/sec | 0.00% | 3/3 | 200.00 | 800.00 |
| throughput | 4.30 req/s | 25.65 sec | 656.54 ms | 32.96 ms | 3253.86 tokens/sec | 0.00% | 518/518 | 200.00 | 756.78 |
| constant@0.52req/s | 0.49 req/s | 8.14 sec | 175.20 ms | 10.30 ms | 379.90 tokens/sec | 0.00% | 57/57 | 200.00 | 774.46 |
| constant@1.03req/s | 0.97 req/s | 8.81 sec | 175.69 ms | 11.42 ms | 730.78 tokens/sec | 0.00% | 114/114 | 200.00 | 756.49 |
| constant@1.55req/s | 1.42 req/s | 11.53 sec | 179.17 ms | 15.00 ms | 1078.02 tokens/sec | 0.00% | 168/168 | 200.00 | 757.45 |
| constant@2.06req/s | 1.86 req/s | 13.47 sec | 179.41 ms | 17.53 ms | 1408.29 tokens/sec | 0.00% | 219/219 | 200.00 | 758.79 |
| constant@2.58req/s | 2.11 req/s | 20.39 sec | 183.98 ms | 26.50 ms | 1611.33 tokens/sec | 0.00% | 252/252 | 200.00 | 763.28 |
| constant@3.10req/s | 2.24 req/s | 31.81 sec | 191.35 ms | 41.75 ms | 1701.30 tokens/sec | 0.00% | 265/265 | 200.00 | 759.18 |
| constant@3.61req/s | 2.36 req/s | 36.68 sec | 285.70 ms | 47.94 ms | 1796.89 tokens/sec | 0.00% | 283/283 | 200.00 | 759.86 |
| constant@4.13req/s | 2.67 req/s | 35.12 sec | 306.93 ms | 46.37 ms | 2004.35 tokens/sec | 0.00% | 311/311 | 200.00 | 751.81 |
| constant@4.64req/s | 2.79 req/s | 33.87 sec | 315.08 ms | 44.51 ms | 2102.61 tokens/sec | 0.00% | 332/332 | 200.00 | 754.55 |
| constant@5.16req/s | 3.01 req/s | 32.82 sec | 317.72 ms | 42.91 ms | 2280.97 tokens/sec | 0.00% | 355/355 | 200.00 | 758.23 |

---

## 🌟 What’s Next?

This integration is just the beginning of our collaboration with Intel. We’re excited to continue working together to bring even more optimizations and features to the Hugging Face ecosystem. Stay tuned for updates on:

- **Support for more models**: Expanding Gaudi compatibility to additional architectures.
- **Enhanced tooling**: Improved developer experience for deploying on Gaudi.
- **Community contributions**: Open-source contributions to make Gaudi accessible to everyone.

---

## 🎉 Join the Revolution

We can’t wait to see what you build with Hugging Face’s Text Generation Inference and Intel Gaudi accelerators. Whether you’re a researcher, developer, or enterprise, this integration opens up new possibilities for scaling and optimizing your text generation workflows.

Try it out today and let us know what you think! Share your feedback, benchmarks, and use cases with us on [GitHub](https://github.com/huggingface/text-generation-inference) or [Twitter](https://twitter.com/huggingface).

Happy text generating! 🚀
