Add Gaudi Backend #3055

Merged: 5 commits from add-gaudi-backend into main on Feb 28, 2025

Conversation


@baptistecolle baptistecolle commented Feb 25, 2025

What does this PR do?

This PR integrates the Intel Gaudi backend into TGI's main codebase. Previously, we maintained a separate fork with Intel for Gaudi devices at https://github.com/huggingface/tgi-gaudi. Incorporating Gaudi support as a backend within TGI significantly improves maintainability and reduces development overhead.

Current Status and Implementation

The current tgi-gaudi fork has drifted from standard TGI's API, with divergence in the launcher, router, and client code.

This PR enables the Gaudi backend to work seamlessly with unmodified TGI components:

  • Launcher
  • Router
  • Client

Now the Gaudi backend is responsible only for the server component of TGI.

You can build the image for the Gaudi backend with make -C backends/gaudi image. For more information, please refer to the Gaudi backend's README.md.
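
For reference, a minimal build-and-check sequence could look like the following (a sketch only; the exact image tag is defined by the backend's Makefile, so adjust the filter as needed):

# Build the Gaudi backend image from the TGI repository root
make -C backends/gaudi image

# Confirm the image was produced (the tag is set by the Makefile)
docker images | grep -i gaudi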

To make it easier to review the difference between this new server and the old one, the PR is structured so that the first commit imports the server from the tgi-gaudi fork with zero modifications. The subsequent commits then modify this server to make it work with main TGI.
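
As a hedged sketch for reviewers, the follow-up changes can be isolated by diffing against that import commit (the SHA below is a placeholder, and backends/gaudi is assumed to be the backend's location in the repo):

# <IMPORT_COMMIT_SHA> is a placeholder for the first commit of this PR
git diff <IMPORT_COMMIT_SHA>..HEAD -- backends/gaudi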

Differences with the TGI-Gaudi fork server

There is one key behavioral difference compared to the tgi-gaudi fork:

The tgi-gaudi fork uses --max-batch-total-tokens for warmup, which isn't available in standard TGI. However, since --max-batch-size is available during warmup, we can compute the equivalent value:

max-batch-size = max-batch-total-tokens / max-total-tokens

Migration Example

Instead of:

docker run ... tgi_gaudi ... --max-total-tokens 1024 --max-batch-total-tokens 8192

Use:

docker run ... tgi_gaudi ... --max-total-tokens 1024 --max-batch-size 8

Where --max-batch-size (8) = --max-batch-total-tokens (8192) / --max-total-tokens (1024)
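
As a small sketch, the conversion can also be computed in the shell when writing the new command (the values are the ones from the example above):

# Derive --max-batch-size from the old fork's flags
MAX_TOTAL_TOKENS=1024
MAX_BATCH_TOTAL_TOKENS=8192                                   # value previously passed to the tgi-gaudi fork
MAX_BATCH_SIZE=$((MAX_BATCH_TOTAL_TOKENS / MAX_TOTAL_TOKENS)) # 8192 / 1024 = 8
docker run ... tgi_gaudi ... --max-total-tokens $MAX_TOTAL_TOKENS --max-batch-size $MAX_BATCH_SIZE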

Validation and Performance

We've validated that both non-sharded and sharded deployments of meta-llama/Llama-3.1-8B-Instruct function correctly. Performance benchmarks using https://github.com/huggingface/inference-benchmarker show comparable results:

Metric                          Gaudi Backend   TGI Upstream
Token throughput (tokens/sec)   125             126
Time to first token (p50, ms)   73,349          58,469

As expected, performance remains consistent since the underlying code is nearly identical across implementations.

Next Steps

Once these next steps are done, tgi-gaudi will be deprecated, and all future development for Intel Gaudi can happen in the Gaudi backend in the main TGI repo.

  • Manually validate all models supported by the Gaudi backend (https://github.com/huggingface/tgi-gaudi). Unfortunately, there is no automated testing on tgi-gaudi at the moment.
  • Add CI pipeline in TGI to publish the image under ghcr.io/huggingface/text-generation-inference:3.1.0-gaudi
  • Create Gaudi-specific documentation in the TGI docs

Future Improvements

  • Refactor the Gaudi server code to remove unsupported models and unused code. The current implementation inherits unnecessary code from the upstream TGI NVIDIA backend; for example, the tgi-gaudi server currently contains modeling code for idefics, which is not supported on Gaudi hardware. A more targeted approach, similar to the Neuron server in TGI or optimum-tpu, would improve maintainability.
  • Implement automated integration tests for the container, as there is currently no automated testing on tgi-gaudi (a rough sketch follows this list). We could use AWS Gaudi1 instances (dl1.24xlarge), as they are more readily available on AWS than Gaudi2.
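
As a rough sketch of what such an automated container smoke test could look like (the image tag, model, and port are placeholders; a real CI job on a Gaudi instance would pin them):

#!/usr/bin/env bash
set -euo pipefail
IMAGE=ghcr.io/huggingface/text-generation-inference:gaudi   # placeholder tag
MODEL=meta-llama/Llama-3.1-8B-Instruct
# Start the container in the background
docker run -d --name tgi-gaudi-smoke --runtime=habana --cap-add=sys_nice --ipc=host \
  -p 8080:80 -e HF_TOKEN=$HF_TOKEN $IMAGE --model-id $MODEL --max-total-tokens 1024
# Wait until the server reports ready (TGI exposes a /health endpoint)
until curl -sf 127.0.0.1:8080/health > /dev/null; do sleep 10; done
# Send one generation request and fail the job if no generated_text comes back
curl -sf 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' | grep -q generated_text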

@baptistecolle

FYI @dacorvo

@baptistecolle baptistecolle marked this pull request as ready for review February 25, 2025 12:55
@baptistecolle baptistecolle marked this pull request as draft February 25, 2025 13:05
@baptistecolle baptistecolle marked this pull request as ready for review February 25, 2025 13:16
@baptistecolle baptistecolle marked this pull request as draft February 25, 2025 14:56
@baptistecolle baptistecolle marked this pull request as ready for review February 25, 2025 15:25

@regisss regisss left a comment

I left several comments, mainly about removing files that we don't need in the Gaudi backend. There are probably more files that could be deleted (basically everything that is specific to CUDA/ROCm, e.g. many of the custom modeling files for flash attention, etc.).

docker run -it \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
Collaborator

Do we want to always set this to true? For example, for a single-device deployment, do we get the same performance with and without it?


@baptistecolle baptistecolle Feb 26, 2025

Yes! I think we could simplify the docker command a lot. I was planning to do it in a later refactoring, but I can already do some of it now if preferred.

Based on the 13 docker run commands in the tgi-gaudi fork's README, here is the distribution (i.e., how many of those commands use a given argument):

Environment Variables

Environment Variable                            Count   Percentage
HABANA_VISIBLE_DEVICES=all                      13/13   100%
OMPI_MCA_btl_vader_single_copy_mechanism=none   13/13   100%
ENABLE_HPU_GRAPH=true                           13/13   100%
LIMIT_HPU_GRAPH=true                            13/13   100%
USE_FLASH_ATTENTION=true                        13/13   100%
FLASH_ATTENTION_RECOMPUTE=true                  13/13   100%
TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true    10/13   76.9%
PT_HPU_ENABLE_LAZY_COLLECTIVES=true             8/13    61.5%

I have now adapted the code to align with those:
  • HABANA_VISIBLE_DEVICES is set to all by default (previously unset)
  • OMPI_MCA_btl_vader_single_copy_mechanism is set to none by default (previously unset)
  • ENABLE_HPU_GRAPH is unchanged (it was already true by default)
  • LIMIT_HPU_GRAPH now defaults to true (previously false)
  • USE_FLASH_ATTENTION now defaults to true (previously false)
  • FLASH_ATTENTION_RECOMPUTE now defaults to true (previously false)
  • PT_HPU_ENABLE_LAZY_COLLECTIVES is set to true whenever the model is sharded, since sharded models always require it

I believe TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN should be set to false and not be included in the examples. Right now, all the examples set this variable to true. It was added for benchmarking (huggingface#234), but it should not be the default behavior when running the container, since it breaks any generation request that does not set the max_new_tokens parameter.

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

If the user does not set max_new_tokens, generation just continues and the user never gets a proper response.
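
For illustration, here is the same request without max_new_tokens; with the ignore-EOS behavior enabled, the server would keep generating instead of stopping at the end-of-sequence token:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'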

Collaborator

I was mainly commenting about PT_HPU_ENABLE_LAZY_COLLECTIVES as I'm not sure if it has any impact on single-device configurations. But if the latency is the same with and without it in a single-device deployment, then let's keep it.

@baptistecolle baptistecolle marked this pull request as draft February 26, 2025 04:46

baptistecolle commented Feb 26, 2025

Thanks for the review @regisss and @yao-matrix

@regisss I wanted to refactor the server later (cf. Future Improvements in the PR description), but yeah, it's a great initiative to start it now! I also believe we can cut a lot of code from the tgi-gaudi server. Should we continue further here, or would you rather leave it for another PR?
@yao-matrix Maybe you also want to comment on that?

(Btw, I recommend Meld as a GUI for the diff. You can take the server from TGI (version 2.0.4) and compare it with the tgi-gaudi server to see which files were modified in the fork. I guess you can do that with diff, but the GUI is quite nice for this task.)
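
A rough sketch of that comparison (the v2.0.4 tag name and the server paths are assumptions; adjust them to the actual checkouts you want to compare):

# Check out the TGI server at the version the fork was based on
git clone https://github.com/huggingface/text-generation-inference tgi && git -C tgi checkout v2.0.4
git clone https://github.com/huggingface/tgi-gaudi
# Compare the two server trees; `diff -ru` works too, but Meld gives a side-by-side view
meld tgi/server tgi-gaudi/server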

regisss commented Feb 26, 2025

@baptistecolle I think it's fine to leave it for another PR as it will be easier to review (this PR is pretty big already hehe).

Regarding your questions:

  • I'm 99% sure we don't need any of the code in the layers folder
  • I think we just need causal_lm.py and vlm_causal_lm.py. Basically, we need the files that import from optimum-habana (a quick grep sketch for finding them follows below).
  • We don't need the custom kernels. There are probably some other things too, but we can deal with that later.
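
A hedged way to enumerate those files (assuming the server now lives under backends/gaudi/server; adjust the path as needed):

# List server files that import from optimum-habana (candidates to keep)
grep -rlE "from optimum.habana|import optimum.habana" backends/gaudi/server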

wip(gaudi): fix typos

wip(gaudi): refactor version numbers for pytorch and habana software to make it more flexible

wip(gaudi): debugging the refactored server

wip(gaudi): delete useless files

fix(gaudi): server working after refactoring

fix(gaudi): refactor and implement requested changes
@baptistecolle

I implemented the requested changes, namely removing redundant files from the server and turning the Habana and PyTorch versions into parameters for easier maintenance. More can still be done, but that should happen in a new PR. #3055 (comment)

One small additional refactoring was done:
I simplified the docker run commands to make them more user-friendly, based on this comment: #3055 (comment)

For example, before, we had a command like this:

docker run -p 8080:80 -v $volume:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
 -e  HF_TOKEN=$hf_token -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true \
 -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice \
 --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model --sharded true \
 --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048

Now, it will be:

docker run -p 8080:80 -v $volume:/data --runtime=habana \
 -e HF_TOKEN=$hf_token --cap-add=sys_nice \
 --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model --sharded true \
 --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048

In short, the image now sets sensible defaults for the arguments that were the same across nearly all of the commands.
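
Since these now ship as defaults in the image, they can presumably still be overridden per run with -e when a deployment needs a different value, for example:

# Example: override one of the new defaults for a specific run
docker run -p 8080:80 -v $volume:/data --runtime=habana \
 -e HF_TOKEN=$hf_token -e LIMIT_HPU_GRAPH=false --cap-add=sys_nice \
 --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model \
 --max-input-tokens 1024 --max-total-tokens 2048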

Again, I performed a sanity check and benchmarked the changes; performance is the same as that of the previous upstream implementation (tokens per second and time to first token).

@baptistecolle baptistecolle marked this pull request as ready for review February 27, 2025 13:10

@regisss regisss left a comment

Huge PR @baptistecolle 🚀 🚀 🚀


@Narsil Narsil left a comment

LGTM, had some nits.

@baptistecolle baptistecolle merged commit 683ff53 into main Feb 28, 2025
28 checks passed
@baptistecolle baptistecolle deleted the add-gaudi-backend branch February 28, 2025 11:15