Add Gaudi Backend #3055

Merged: 5 commits from add-gaudi-backend into main on Feb 28, 2025

Conversation


@baptistecolle baptistecolle commented Feb 25, 2025

What does this PR do?

This PR integrates the Intel Gaudi backend into TGI's main codebase. Previously, we maintained a separate fork with Intel for Gaudi devices at https://github.com/huggingface/tgi-gaudi. Incorporating Gaudi support as a backend within TGI significantly improves maintainability and reduces development overhead.

Current Status and Implementation

The current tgi-gaudi fork has drifted from standard TGI's API, with divergence in the launcher, router, and client code.

This PR enables the Gaudi backend to work seamlessly with unmodified TGI components:

  • Launcher
  • Router
  • Client

Now the Gaudi backend is responsible only for the server component of TGI.

You can build the image for the Gaudi backend with make -C backends/gaudi image. For more information, please refer to the Gaudi backend's README.md.
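
For reference, a minimal build-and-check sequence could look like the following (a sketch only; the exact image tag is defined by the backend's Makefile, so adjust the filter as needed):

# Build the Gaudi backend image from the TGI repository root
make -C backends/gaudi image

# Confirm the image was produced (the tag is set by the Makefile)
docker images | grep -i gaudi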

To make it easier to review the difference between this new server and the old one, the PR is structured so that the first commit imports the server from the tgi-gaudi fork with zero modifications. The subsequent commits then modify this server to make it work with main TGI.
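
As a hedged sketch for reviewers, the follow-up changes can be isolated by diffing against that import commit (the SHA below is a placeholder, and backends/gaudi is assumed to be the backend's location in the repo):

# <IMPORT_COMMIT_SHA> is a placeholder for the first commit of this PR
git diff <IMPORT_COMMIT_SHA>..HEAD -- backends/gaudi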

Differences with the TGI-Gaudi fork server

There is one key behavioral difference compared to the tgi-gaudi fork:

The tgi-gaudi fork uses --max-batch-total-tokens for warmup, which isn't available in standard TGI. However, since --max-batch-size is available during warmup, we can compute the equivalent value:

max-batch-size = max-batch-total-tokens / max-total-tokens

Migration Example

Instead of:

docker run ... tgi_gaudi ... --max-total-tokens 1024 --max-batch-total-tokens 8192

Use:

docker run ... tgi_gaudi ... --max-total-tokens 1024 --max-batch-size 8

Where --max-batch-size (8) = --max-batch-total-tokens (8192) / --max-total-tokens (1024)
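
As a small sketch, the conversion can also be computed in the shell when writing the new command (the values are the ones from the example above):

# Derive --max-batch-size from the old fork's flags
MAX_TOTAL_TOKENS=1024
MAX_BATCH_TOTAL_TOKENS=8192                                   # value previously passed to the tgi-gaudi fork
MAX_BATCH_SIZE=$((MAX_BATCH_TOTAL_TOKENS / MAX_TOTAL_TOKENS)) # 8192 / 1024 = 8
docker run ... tgi_gaudi ... --max-total-tokens $MAX_TOTAL_TOKENS --max-batch-size $MAX_BATCH_SIZE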

Validation and Performance

We've validated that both non-sharded and sharded deployments of meta-llama/Llama-3.1-8B-Instruct function correctly. Performance benchmarks using https://github.com/huggingface/inference-benchmarker show comparable results:

Metric                          Gaudi Backend   TGI Upstream
Token throughput (tokens/sec)   125             126
Time to first token (p50, ms)   73,349          58,469

As expected, performance remains consistent since the underlying code is nearly identical across implementations.

Next Steps

Once these next steps are done, tgi-gaudi will be deprecated, and all future development for Intel Gaudi can happen in the Gaudi backend in the main TGI repo.

  • Manually validate all models supported by the Gaudi backend (https://github.com/huggingface/tgi-gaudi). Unfortunately, there is no automated testing on tgi-gaudi at the moment.
  • Add CI pipeline in TGI to publish the image under ghcr.io/huggingface/text-generation-inference:3.1.0-gaudi
  • Create Gaudi-specific documentation in the TGI docs

Future Improvements

  • Refactor the Gaudi server code to remove unsupported models and unused code. The current implementation inherits unnecessary code from the upstream TGI NVIDIA backend; for example, the tgi-gaudi server currently contains modeling code for idefics, which is not supported on Gaudi hardware. A more targeted approach, similar to the Neuron server in TGI or optimum-tpu, would improve maintainability.
  • Implement automated integration tests for the container, as there is currently no automated testing on tgi-gaudi (a rough sketch follows this list). We could use AWS Gaudi1 instances (dl1.24xlarge), as they are more readily available on AWS than Gaudi2.
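
As a rough sketch of what such an automated container smoke test could look like (the image tag, model, and port are placeholders; a real CI job on a Gaudi instance would pin them):

#!/usr/bin/env bash
set -euo pipefail
IMAGE=ghcr.io/huggingface/text-generation-inference:gaudi   # placeholder tag
MODEL=meta-llama/Llama-3.1-8B-Instruct
# Start the container in the background
docker run -d --name tgi-gaudi-smoke --runtime=habana --cap-add=sys_nice --ipc=host \
  -p 8080:80 -e HF_TOKEN=$HF_TOKEN $IMAGE --model-id $MODEL --max-total-tokens 1024
# Wait until the server reports ready (TGI exposes a /health endpoint)
until curl -sf 127.0.0.1:8080/health > /dev/null; do sleep 10; done
# Send one generation request and fail the job if no generated_text comes back
curl -sf 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' | grep -q generated_text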

@baptistecolle

FYI @dacorvo

@baptistecolle baptistecolle marked this pull request as ready for review February 25, 2025 12:55
@baptistecolle baptistecolle marked this pull request as draft February 25, 2025 13:05
@baptistecolle baptistecolle marked this pull request as ready for review February 25, 2025 13:16
@baptistecolle baptistecolle marked this pull request as draft February 25, 2025 14:56
@baptistecolle baptistecolle marked this pull request as ready for review February 25, 2025 15:25

@regisss regisss left a comment

I left several comments, mainly about removing files that we don't need in the Gaudi backend. There are probably more files that could be deleted (basically everything that is specific to CUDA/ROCm, e.g. many of the custom modeling files for flash attention, etc.).

docker run -it \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
Collaborator

Do we want to always set this to true? For example, for a single-device deployment, do we get the same performance with and without it?


@baptistecolle baptistecolle Feb 26, 2025

Yes! I think we could simplify the docker command a lot. I was planning to do it in a later refactoring, but I can already do some of it now if preferred.

Based on the 13 docker run commands in the tgi-gaudi fork's README, here is the distribution (i.e., how many of those commands use a given argument):

Environment Variables

Environment Variable                            Count   Percentage
HABANA_VISIBLE_DEVICES=all                      13/13   100%
OMPI_MCA_btl_vader_single_copy_mechanism=none   13/13   100%
ENABLE_HPU_GRAPH=true                           13/13   100%
LIMIT_HPU_GRAPH=true                            13/13   100%
USE_FLASH_ATTENTION=true                        13/13   100%
FLASH_ATTENTION_RECOMPUTE=true                  13/13   100%
TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true    10/13   76.9%
PT_HPU_ENABLE_LAZY_COLLECTIVES=true             8/13    61.5%

I have now adapted the code to align with those:
  • HABANA_VISIBLE_DEVICES is set to all by default (previously unset)
  • OMPI_MCA_btl_vader_single_copy_mechanism is set to none by default (previously unset)
  • ENABLE_HPU_GRAPH is unchanged (it was already true by default)
  • LIMIT_HPU_GRAPH now defaults to true (previously false)
  • USE_FLASH_ATTENTION now defaults to true (previously false)
  • FLASH_ATTENTION_RECOMPUTE now defaults to true (previously false)
  • PT_HPU_ENABLE_LAZY_COLLECTIVES is set to true whenever the model is sharded, since sharded models always require it

I believe TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN should be set to false and not be included in the examples. Right now, all the examples set this variable to true. It was added for benchmarking (huggingface#234), but it should not be the default behavior when running the container, since it breaks any generation request that does not set the max_new_tokens parameter.

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

If the user does not set max_new_tokens, generation just continues and the user never gets a proper response.
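
For illustration, here is the same request without max_new_tokens; with the ignore-EOS behavior enabled, the server would keep generating instead of stopping at the end-of-sequence token:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'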

Collaborator

I was mainly commenting about PT_HPU_ENABLE_LAZY_COLLECTIVES as I'm not sure if it has any impact on single-device configurations. But if the latency is the same with and without it in a single-device deployment, then let's keep it.

@baptistecolle baptistecolle marked this pull request as draft February 26, 2025 04:46

baptistecolle commented Feb 26, 2025

Thanks for the review @regisss and @yao-matrix

@regisss I wanted to refactor the server later (cf. Future Improvements in the PR description), but yeah, it's a great initiative to start it now! I also believe we can cut a lot of code from the tgi-gaudi server. Should we continue further here, or would you rather leave it for another PR?
@yao-matrix Maybe you also want to comment on that?

(Btw, I recommend Meld as a GUI for the diff. You can take the server from TGI (version 2.0.4) and compare it with the tgi-gaudi server to see which files were modified in the fork. I guess you can do that with diff, but the GUI is quite nice for this task.)
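
A rough sketch of that comparison (the v2.0.4 tag name and the server paths are assumptions; adjust them to the actual checkouts you want to compare):

# Check out the TGI server at the version the fork was based on
git clone https://github.com/huggingface/text-generation-inference tgi && git -C tgi checkout v2.0.4
git clone https://github.com/huggingface/tgi-gaudi
# Compare the two server trees; `diff -ru` works too, but Meld gives a side-by-side view
meld tgi/server tgi-gaudi/server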

regisss commented Feb 26, 2025

@baptistecolle I think it's fine to leave it for another PR as it will be easier to review (this PR is pretty big already hehe).

Regarding your questions:

  • I'm 99% sure we don't need any of the code in the layers folder
  • I think we just need causal_lm.py and vlm_causal_lm.py. Basically, we need the files that import from optimum-habana (a quick grep sketch for finding them follows below).
  • We don't need the custom kernels. There are probably some other things too, but we can deal with that later.
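
A hedged way to enumerate those files (assuming the server now lives under backends/gaudi/server; adjust the path as needed):

# List server files that import from optimum-habana (candidates to keep)
grep -rlE "from optimum.habana|import optimum.habana" backends/gaudi/server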

wip(gaudi): fix typos

wip(gaudi): refactor version numbers for pytorch and habana software to make it more flexible

wip(gaudi): debugging the refactored server

wip(gaudi): delete useless files

fix(gaudi): server working after refactoring

fix(gaudi): refactor and implement requested changes
@baptistecolle

I implemented the requested changes, namely removing redundant files from the server and turning the Habana and PyTorch versions into parameters for easier maintenance. More can still be done, but that should happen in a new PR. #3055 (comment)

One small additional refactoring was done:
I simplified the docker run commands to make them more user-friendly, based on this comment: #3055 (comment)

For example, before, we had a command like this:

docker run -p 8080:80 -v $volume:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
 -e  HF_TOKEN=$hf_token -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true \
 -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice \
 --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model --sharded true \
 --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048

Now, it will be:

docker run -p 8080:80 -v $volume:/data --runtime=habana \
 -e HF_TOKEN=$hf_token --cap-add=sys_nice \
 --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model --sharded true \
 --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048

In short, the image now sets sensible defaults for the arguments that were the same across nearly all of the commands.
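
Since these now ship as defaults in the image, they can presumably still be overridden per run with -e when a deployment needs a different value, for example:

# Example: override one of the new defaults for a specific run
docker run -p 8080:80 -v $volume:/data --runtime=habana \
 -e HF_TOKEN=$hf_token -e LIMIT_HPU_GRAPH=false --cap-add=sys_nice \
 --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model \
 --max-input-tokens 1024 --max-total-tokens 2048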

Again, I performed a sanity check and benchmarked the changes; performance is the same as that of the previous upstream implementation (tokens per second and time to first token).

@baptistecolle baptistecolle marked this pull request as ready for review February 27, 2025 13:10

@regisss regisss left a comment

Huge PR @baptistecolle 🚀 🚀 🚀


@Narsil Narsil left a comment

LGTM, had some nits.

@baptistecolle baptistecolle merged commit 683ff53 into main Feb 28, 2025
28 checks passed
@baptistecolle baptistecolle deleted the add-gaudi-backend branch February 28, 2025 11:15