-
Tabby's default implementation (llama.cpp) uses only a single GPU per machine and is known not to deliver the best performance when serving a model with high parallelism. When deploying at this scale, we recommend considering an alternative model backend (e.g., tensorrt-llm, vllm) for the model you serve.

P.S. For enterprise customers, we offer assistance in finding best practices for a specific model/GPU setup. If purchasing a license is an option, please consider booking our office hours at https://calendly.com/tabby_ml/chat-with-tabbyml
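For illustration, here is a minimal sketch of loading a code model in vLLM with tensor parallelism across all eight GPUs. The model name, dtype, and tensor_parallel_size below are assumptions for an 8x V100 box, not a recommendation for any particular deployment; in practice Tabby would be pointed at such a backend through its model configuration rather than this offline API (see the Tabby docs for connecting external backends).

```python
from vllm import LLM, SamplingParams

# Assumptions: 8x V100 and a ~7B code model; adjust to your deployment.
# V100s lack bfloat16 support, so float16 is requested explicitly.
llm = LLM(
    model="deepseek-ai/deepseek-coder-6.7b-base",
    tensor_parallel_size=8,  # shard the model across all 8 GPUs
    dtype="float16",
)

params = SamplingParams(temperature=0.0, max_tokens=64)

# vLLM batches concurrent requests, which is where the throughput win
# over a single-GPU llama.cpp setup comes from.
prompts = [
    "def quicksort(arr):",
    "def binary_search(arr, target):",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```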
-
The Tabby binary service I deployed on Linux shows GPU utilization consistently below 30%, yet code completion is frequently very slow. Does anyone have suggestions for improving efficiency?
My machine configuration: V100 * 8.
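For reference, a minimal sketch of one way to time completions under concurrent load; the port, endpoint path, and payload shape below are assumptions based on Tabby's completion API and should be checked against the deployed version.

```python
import concurrent.futures
import statistics
import time

import requests

# Assumption: Tabby serving locally on its default port with /v1/completions.
TABBY_URL = "http://localhost:8080/v1/completions"
PAYLOAD = {
    "language": "python",
    "segments": {"prefix": "def binary_search(arr, target):\n    ", "suffix": ""},
}


def one_request() -> float:
    """Send one completion request and return its latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(TABBY_URL, json=PAYLOAD, timeout=30)
    resp.raise_for_status()
    return time.perf_counter() - start


def measure(concurrency: int, total: int) -> None:
    """Fire `total` requests with `concurrency` workers and print latency stats."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(total)))
    print(f"concurrency={concurrency} "
          f"p50={statistics.median(latencies):.2f}s "
          f"max={max(latencies):.2f}s")


if __name__ == "__main__":
    for c in (1, 4, 16):
        measure(concurrency=c, total=32)
```

If latency grows sharply with concurrency while GPU utilization stays low, the bottleneck is likely request serialization in the single-GPU llama.cpp backend rather than raw GPU throughput.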