Add Gaudi Backend #3055
Conversation
FYI @dacorvo
I left several comments, mainly about removing files that we don't need in the Gaudi backend. There are probably more files that could be deleted (basically everything that is specific to cuda/rocm, e.g. many custom modelings for flash attention etc).
backends/gaudi/Makefile
```
docker run -it \
    --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
```
Do we want to always set this to true? For example for a single-device deployment, do we get the same performance with and without it?
Yes! I think we could simplify the docker command a lot. I was planning on doing it in a refactoring later but I can do some refactoring now already if preferred.
Based on the 13 docker run commands in the TGI-fork README, here is the distribution (i.e., how many of those docker run commands use a given argument):
Environment Variables
Environment Variable | Count | Percentage
---|---|---
`HABANA_VISIBLE_DEVICES=all` | 13/13 | 100%
`OMPI_MCA_btl_vader_single_copy_mechanism=none` | 13/13 | 100%
`ENABLE_HPU_GRAPH=true` | 13/13 | 100%
`LIMIT_HPU_GRAPH=true` | 13/13 | 100%
`USE_FLASH_ATTENTION=true` | 13/13 | 100%
`FLASH_ATTENTION_RECOMPUTE=true` | 13/13 | 100%
`TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true` | 10/13 | 76.9%
`PT_HPU_ENABLE_LAZY_COLLECTIVES=true` | 8/13 | 61.5%
I adapted the code to align with those defaults:

- `HABANA_VISIBLE_DEVICES` is set to `all` (previously unset)
- `OMPI_MCA_btl_vader_single_copy_mechanism` is set to `none` (previously unset)
- `ENABLE_HPU_GRAPH` is unchanged (it was already set to `true`, so I kept it)
- `LIMIT_HPU_GRAPH` is now set to `true` by default (previously `false`)
- `USE_FLASH_ATTENTION` is now set to `true` by default (previously `false`)
- `FLASH_ATTENTION_RECOMPUTE` is now set to `true` by default (previously `false`)
- `PT_HPU_ENABLE_LAZY_COLLECTIVES` is set to `true` when the model is sharded. This is always required: whenever the model is sharded, `PT_HPU_ENABLE_LAZY_COLLECTIVES` needs to be `true`.
I believe `TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN` should be set to `false` and not be included in the examples. Right now, all the examples set this variable to `true`. It was added for benchmarking in huggingface#234, but it should not be the default behavior when running the container: it breaks any generation request that does not set a `max_new_tokens` parameter.
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
If the user does not set `max_new_tokens`, generation keeps going and the user does not get a proper response.
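For illustration (a hypothetical variant of the request above, not taken from the PR), dropping the parameter looks like this; with the ignore-EOS variable set to `true`, such a request would presumably run on until the server's own token limits kick in:

```
# Same request without max_new_tokens; with
# TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true the EOS token is ignored,
# so generation presumably continues until the server's token limits are hit.
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```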
I was mainly commenting about `PT_HPU_ENABLE_LAZY_COLLECTIVES`, as I'm not sure whether it has any impact on single-device configurations. But if the latency is the same with and without it in a single-device deployment, then let's keep it.
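If it helps to verify, a rough single-device A/B check could look like the sketch below. The image tag, flags, and /generate request are copied from the examples in this thread; the device index and token count are arbitrary illustrations, and `time curl` is only a crude latency probe, not a replacement for a proper benchmark.

```
# Terminal 1: launch a single-device server with the variable set
# (repeat the run without the -e PT_HPU_ENABLE_LAZY_COLLECTIVES flag to compare).
docker run -p 8080:80 -v $volume:/data --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=0 \
    -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
    -e HF_TOKEN=$hf_token --cap-add=sys_nice --ipc=host \
    ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model \
    --max-input-tokens 1024 --max-total-tokens 2048

# Terminal 2: time a fixed-length generation and compare across the two runs.
time curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":128}}' \
    -H 'Content-Type: application/json'
```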
Thanks for the review @regisss and @yao-matrix! @regisss, I wanted to refactor the server later (cf. Future Improvements in the PR description), but yeah, it's a great initiative to start it now! I also believe we can cut a lot of the stuff in the tgi-gaudi server. Should we continue here, or would you leave that for another PR? For example:
(Btw, I recommend Meld as a GUI for the diff. You can check the server from TGI (version 2.0.4) and compare it with the tgi-gaudi server to see which files were modified in the fork. I guess you can do that with
@baptistecolle I think it's fine to leave it for another PR as it will be easier to review (this PR is pretty big already hehe). Regarding your questions:
- wip(gaudi): fix typos
- wip(gaudi): refactor version numbers for pytorch and habana software to make it more flexible
- wip(gaudi): debugging the refactored server
- wip(gaudi): delete useless files
- fix(gaudi): server working after refactoring
- fix(gaudi): refactor and implement requested changes
I implemented the new changes, namely removing redundant files from the server and turning the Habana and PyTorch versions into parameters for easier maintenance. More can still be done, but that should happen in a new PR. #3055 (comment)

One small additional refactoring was also done. For example, before, we had a command like this:

```
docker run -p 8080:80 -v $volume:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
    -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    -e HF_TOKEN=$hf_token -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true \
    -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice \
    --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model --sharded true \
    --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048
```

Now it becomes:

```
docker run -p 8080:80 -v $volume:/data --runtime=habana \
    -e HF_TOKEN=$hf_token --cap-add=sys_nice \
    --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model --sharded true \
    --num-shard 8 --max-input-tokens 1024 --max-total-tokens 2048
```

In short, the arguments that are the same across essentially all the commands are now set by default. Again, I performed a sanity check and benchmarked the changes, and the performance is the same as the previous upstream implementation (tokens per second and time to first token).
Huge PR @baptistecolle 🚀 🚀 🚀
LGTM, had some nits.
What does this PR do?
This PR integrates the Intel Gaudi backend into TGI's main codebase. Previously, we maintained a separate fork with Intel for Gaudi devices at https://github.com/huggingface/tgi-gaudi. By incorporating this as a backend within TGI, we significantly improve maintainability and reduce development overhead.
Current Status and Implementation
The current tgi-gaudi fork has some API drift from standard TGI (divergence in the launcher, router, and client code).
This PR enables the Gaudi backend to work seamlessly with unmodified TGI components:
Now the Gaudi backend is only responsible for the Server aspect of TGI.
You can build the image for the Gaudi backend with `make -C backends/gaudi image`. For more information, please refer to the Gaudi backend's README.md. To review the difference between this new server and the old one, the PR is structured with a first commit that imports the server from the tgi-gaudi fork with zero modifications. The following commits then modify this server to make it work with mainline TGI.
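As a rough local workflow (the make target comes from this PR; the resulting image tag is not stated here, so it is looked up rather than assumed):

```
# Build the Gaudi backend image (make target taken from this PR).
make -C backends/gaudi image

# The Makefile's output tag isn't documented in this description,
# so list recent images to find it instead of guessing a name.
docker images | head -n 5
```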
Differences with the TGI-Gaudi fork server
There is one key behavioral difference compared to the tgi-gaudi fork:
The tgi-gaudi fork uses `--max-batch-total-tokens` for warmup, which isn't available in standard TGI. However, since `--max-batch-size` is available during warmup, we can compute the equivalent value:

Migration Example
Instead of passing `--max-batch-total-tokens 8192`, use `--max-batch-size 8`, where `--max-batch-size` (8) = `--max-batch-total-tokens` (8192) / `--max-total-tokens` (1024).
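As a concrete sketch (the numbers mirror the formula above, the command skeleton mirrors the docker examples earlier in this thread, and `<gaudi-backend-image>` is a placeholder, not a real tag):

```
# Old (tgi-gaudi fork): warmup sized via --max-batch-total-tokens.
docker run ... ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id $model \
    --max-total-tokens 1024 --max-batch-total-tokens 8192

# New (Gaudi backend in TGI): use --max-batch-size instead, where 8 = 8192 / 1024.
docker run ... <gaudi-backend-image> --model-id $model \
    --max-total-tokens 1024 --max-batch-size 8
```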
Validation and Performance

We've validated that both non-sharded and sharded meta-llama/Llama-3.1-8B-Instruct models function correctly. Performance benchmarks using https://github.com/huggingface/inference-benchmarker show comparable results:
As expected, performance remains consistent since the underlying code is nearly identical across implementations.
Next Steps
When those next steps are done, tgi-gaudi will be deprecated, and all future development for Intel Gaudi can happen in the Gaudi backend on the main TGI repo.
ghcr.io/huggingface/text-generation-inference:3.1.0-gaudi
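Once that image is published, running it should presumably mirror the simplified command shown earlier in this thread, just with the new image name (a sketch, assuming the launcher flags stay the same):

```
docker run -p 8080:80 -v $volume:/data --runtime=habana \
    -e HF_TOKEN=$hf_token --cap-add=sys_nice --ipc=host \
    ghcr.io/huggingface/text-generation-inference:3.1.0-gaudi \
    --model-id $model --sharded true --num-shard 8 \
    --max-input-tokens 1024 --max-total-tokens 2048
```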
Future Improvements