-
Tabby's default implementation (llama.cpp) uses only a single GPU per machine and is known not to deliver the best performance when serving a model with high parallelism. When deploying at this scale, we recommend considering an alternative model backend (e.g., tensorrt-llm, vllm) for the model you serve.

P.S. For enterprise customers, we offer assistance in finding best practices for a specific model/GPU setup. If purchasing a license is an option, please consider booking our office hours at https://calendly.com/tabby_ml/chat-with-tabbyml
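For illustration, here is a minimal sketch of loading a code model in vLLM with tensor parallelism across all eight GPUs. The model name, dtype, and tensor_parallel_size below are assumptions for an 8x V100 box, not a recommendation for any particular deployment; in practice Tabby would be pointed at such a backend through its model configuration rather than this offline API (see the Tabby docs for connecting external backends).

```python
from vllm import LLM, SamplingParams

# Assumptions: 8x V100 and a ~7B code model; adjust to your deployment.
# V100s lack bfloat16 support, so float16 is requested explicitly.
llm = LLM(
    model="deepseek-ai/deepseek-coder-6.7b-base",
    tensor_parallel_size=8,  # shard the model across all 8 GPUs
    dtype="float16",
)

params = SamplingParams(temperature=0.0, max_tokens=64)

# vLLM batches concurrent requests, which is where the throughput win
# over a single-GPU llama.cpp setup comes from.
prompts = [
    "def quicksort(arr):",
    "def binary_search(arr, target):",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```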
-
The Tabby binary service I deployed on Linux shows GPU utilization consistently below 30%, yet code completion is frequently very slow. Does anyone have suggestions for improving efficiency?
My machine configuration: V100 * 8.
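For reference, a minimal sketch of one way to time completions under concurrent load; the port, endpoint path, and payload shape below are assumptions based on Tabby's completion API and should be checked against the deployed version.

```python
import concurrent.futures
import statistics
import time

import requests

# Assumption: Tabby serving locally on its default port with /v1/completions.
TABBY_URL = "http://localhost:8080/v1/completions"
PAYLOAD = {
    "language": "python",
    "segments": {"prefix": "def binary_search(arr, target):\n    ", "suffix": ""},
}


def one_request() -> float:
    """Send one completion request and return its latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(TABBY_URL, json=PAYLOAD, timeout=30)
    resp.raise_for_status()
    return time.perf_counter() - start


def measure(concurrency: int, total: int) -> None:
    """Fire `total` requests with `concurrency` workers and print latency stats."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(total)))
    print(f"concurrency={concurrency} "
          f"p50={statistics.median(latencies):.2f}s "
          f"max={max(latencies):.2f}s")


if __name__ == "__main__":
    for c in (1, 4, 16):
        measure(concurrency=c, total=32)
```

If latency grows sharply with concurrency while GPU utilization stays low, the bottleneck is likely request serialization in the single-GPU llama.cpp backend rather than raw GPU throughput.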