Adding Triton backend support #537
Conversation
lgtm. just a few comments/questions. Lemme know when it's ready for review.
Took a first pass. Overall, I like the structure that we went with here!
Nits aside, my main comments here are around error handling (and my own questions about what kind of assumptions are fair to make about the input).
let's merge this and use it, then iterate on smaller PRs
🤩 🤩 🤩
Overview
This PR adds support for Triton as a backend for Truss. Specifically, this PR contains the logic for `config.yaml` and `model.py`.
Logic around testing this flow will be in a follow-up PR; the testing suite is significant and requires running tests within the Triton Docker container.
Quickstart
Quickstart repo: https://github.com/aspctu/bert-triton-truss

```sh
# Clone the example repo and generate a Docker build context from the truss
git clone https://github.com/aspctu/bert-triton-truss
truss image build-context ./bert-truss-context ./bert-truss

# Build the image and run it with GPU access; 8080 is the Truss server port
# and 8000 is Triton's default HTTP port
cd ./bert-truss-context
docker build ./
docker run --gpus=all -p8080:8080 -p8000:8000 -it (image id)
```
Follow the README.md in the repo above to invoke the model.
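As a rough sketch of what invocation could look like, assuming the container follows the standard Truss predict endpoint on port 8080 (the payload shape here is an assumption and depends on the `Input` class defined in `model.py`; the repo's README.md is authoritative):

```python
# Hypothetical invocation sketch: the endpoint path follows the standard
# Truss serving convention, but the payload shape is an assumption.
import requests

response = requests.post(
    "http://localhost:8080/v1/models/model:predict",
    json={"inputs": [{"text": "hello world"}]},
)
print(response.json())
```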
Introduction
Triton is a high-performance model-serving backend developed by NVIDIA. For most models (outside of LLMs), it's advantageous to use Triton as the backend server, thanks to various server features that help maximize GPU utilization and memory efficiency.
This PR introduces a simplified developer experience that enables users to tap into some of this functionality within Truss. It's worth noting that a lot of Triton functionality is not supported here (such as decoupled mode or ensemble models).
To enable Triton, a user needs to do a couple of things:
- Update their `config.yaml` to contain the required Triton configuration (automatically done if the truss is created via `truss init`)
- Define classes in their `model.py` that correspond to the `Input` and `Output` of their model (example below)
- Update their `predict` function to accept a `List[Input]` and produce a `List[Output]`
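As a rough illustration of the shape described above, here is a minimal sketch of a conforming `model.py`. The use of pydantic and the specific field names (`text`, `scores`) are assumptions for illustration, not necessarily this PR's actual API; only the `List[Input] -> List[Output]` predict signature comes from the description above.

```python
# Minimal sketch of the model.py structure described above. pydantic and
# the field names are assumptions; only the List[Input] -> List[Output]
# predict signature comes from the PR description.
from typing import List

from pydantic import BaseModel


class Input(BaseModel):
    # Assumed field: a single text input to the BERT model
    text: str


class Output(BaseModel):
    # Assumed field: per-class scores produced by the model
    scores: List[float]


class Model:
    def load(self):
        # Load weights here (e.g. a transformers pipeline); called once
        # at server startup
        ...

    def predict(self, inputs: List[Input]) -> List[Output]:
        # predict receives a list of Input objects and must return one
        # Output per input
        return [Output(scores=[0.0]) for _ in inputs]
```

The list-in/list-out signature presumably exists so that Triton's dynamic batching can group concurrent requests before they reach `predict`.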
TODOs
- Fix `truss image build` failing to do anything