Using llama.cpp with AWS instances #4225
-
Thank you!
-
This is gold. Thank you.
-
Thanks for sharing! Here's a side quest for those of you using llama.cpp via Python bindings and CUDA. This is a minimalistic example of a Docker container you can deploy on smaller cloud providers like VastAI or similar.

Model: https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF

To build the image, see Dockerfile_llamacpp. You can change CUDA's version number as required: literally, just change the number on line 1 to a valid version compatible with your hardware.
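For readers who want to try this, here is a rough sketch of the build and run commands for such an image (the image tag, host port, and volume path are illustrative, not taken from the attached Dockerfile):

```bash
# Build the image from the attached Dockerfile (file name as referenced above)
docker build -f Dockerfile_llamacpp -t llamacpp-python-cuda .

# Run it with GPU access (requires the NVIDIA container toolkit on the host);
# the port mapping and model volume are illustrative
docker run --gpus all -p 8000:8000 -v "$(pwd)/models:/models" llamacpp-python-cuda
```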
app.py
When you run it, you should see something like this at the very top (if verbose is True):
-
I created an example of doing this in CloudFormation, although I geared it for CPU only in order to keep total costs down: https://github.com/openmarmot/aws-cft-llama-cpp. The template creates an EC2 instance, installs llama.cpp and runs the server, and attaches an IAM role with the permissions necessary to enable the AWS web console (Connect) for Linux console access, so you don't have to SSH to the instance to connect.
-
A word of caution on Amazon's cloud: if you run something productive there, you'll find moving out difficult, as everything needs to adapt to their cloud. I'm not saying to stay away; it's a good way to get started and quite cheap when used at low scale or only for a couple of hours.
-
Awesome, thanks!
-
Has anyone tried llama.cpp on the m6g/t4g instances? (CPU-only, ARM; t4g is burstable.) Though they are 5 times cheaper, CPU inference should be many times slower; on the other hand, the NVIDIA T4 is not a very powerful GPU, so it could still make sense to compare. (Beware: there are so many different t/g-4 names.)
-
I am unable to run quantize in the last two lines. I downloaded the repo on 5 December 2023. Could you please guide me in executing the code? I am facing the following error when running:

`./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q8_0.gguf q8_0`
-
Hi there. I wanted to clarify a couple of things about this tutorial.

Running the executables fails with:

TL;DR: Is this tutorial omitting some steps? I'm having some trouble getting it to work with a pretty much identical configuration.
-
I've gone through the process of getting the NVIDIA drivers installed silently on the P3 series, if anyone is still looking for help on this. Note that this is an AWS CloudFormation template that can be run from the AWS CloudFormation console. You can also just grab the Linux shell commands from the userdata area and do it yourself manually if you want. The NVIDIA Linux drivers are ultra picky: this template is designed for Amazon Linux 2 and the P3 series, so it will likely need modification if you are using a different instance or instance type. https://github.com/openmarmot/aws-ec2-nvidia-drivers/blob/main/cft-al2-p3-series.yaml
-
How do you come up with the KV cache size? (1280 MiB)
-
Description
The `llama.cpp` project offers unique ways of utilizing cloud computing resources. Here we will demonstrate how to deploy a `llama.cpp` server on an AWS instance for serving quantized and full-precision F16 models to multiple clients efficiently.

Select an instance
Go to AWS instance listings: https://aws.amazon.com/ec2/pricing/on-demand/
Sort by price and find the cheapest one with an NVIDIA GPU:
Check the specs:

The `g4dn.xlarge` instance has 1x T4 Tensor Core GPU with 16GB VRAM. Here are the NVIDIA specs for convenience:
Start the instance and log in over SSH

Also, make sure to enable inbound connections to port 8888 - we will need it later for the HTTP server.
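If you prefer the AWS CLI over the web console, a rough equivalent of this step looks like the sketch below (the AMI ID, key pair, and security group are placeholders you would substitute with your own):

```bash
# Launch a g4dn.xlarge instance (placeholder AMI, key pair and security group)
aws ec2 run-instances \
    --instance-type g4dn.xlarge \
    --image-id ami-XXXXXXXXXXXXXXXXX \
    --key-name my-key \
    --security-group-ids sg-XXXXXXXX

# Open port 8888 for the HTTP server we will start later
aws ec2 authorize-security-group-ingress \
    --group-id sg-XXXXXXXX \
    --protocol tcp --port 8888 --cidr 0.0.0.0/0

# Log in over SSH (the user name depends on the AMI, e.g. ec2-user for Amazon Linux, ubuntu for Ubuntu)
ssh -i ~/.ssh/my-key.pem ubuntu@<instance-public-ip>
```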
Select a model and prepare `llama.cpp`

We have just 16GB of VRAM to work with, so we likely want to choose a 7B model. Lately, the OpenHermes-2.5-Mistral-7B model has been getting some traction, so let's go with it.
We will clone the latest `llama.cpp` repo, download the model, and convert it to GGUF format:
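A sketch of the usual flow for this step (the Hugging Face repo name and the `huggingface-cli download` step are assumptions on my part; the quantize invocations follow the pattern quoted elsewhere in this thread):

```bash
# Clone and build llama.cpp with CUDA support (same build flag as used later in this post)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CUBLAS=1 make -j

# Fetch the model weights (repo name assumed; requires the huggingface_hub CLI)
pip install -r requirements.txt huggingface_hub
huggingface-cli download teknium/OpenHermes-2.5-Mistral-7B --local-dir ./models/openhermes-7b-v2.5

# Convert to GGUF (F16), then produce the quantized variants used below
python3 convert.py ./models/openhermes-7b-v2.5/ --outtype f16 --outfile ./models/openhermes-7b-v2.5/ggml-model-f16.gguf
./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q8_0.gguf q8_0
./quantize ./models/openhermes-7b-v2.5/ggml-model-f16.gguf ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf q4_k
```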
Do some performance benchmarks

The T4 GPU has just 320GB/s of memory bandwidth, so we cannot expect huge tok/s numbers, but let's work with what we have.
We want to serve requests in parallel, so we need an idea of the types of queries we are going to be processing in order to set up some limits. Let's make the following assumptions:
We assume that at any moment in time there will be a maximum of 4 queries being processed in parallel. Each query can have a maximum individual prompt of 2048 tokens, and each query can generate a maximum of 512 tokens. In order to support this scenario, we need a KV cache of size `4*(2048 + 512) = 10240` tokens (1280 MiB, F16).
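For anyone wondering where the 1280 MiB figure comes from, here is the arithmetic, assuming Mistral-7B's published architecture (32 layers, grouped-query attention with 8 KV heads of size 128, i.e. 1024 elements each for K and V per layer per token, stored in F16):

$$
2 \ (K \text{ and } V) \times 32 \ \text{layers} \times 1024 \ \text{elements} \times 2 \ \text{bytes} = 131072 \ \text{bytes} = 128 \ \text{KiB per token}
$$

$$
10240 \ \text{tokens} \times 128 \ \text{KiB} = 1280 \ \text{MiB}
$$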
Let's benchmark stock `llama.cpp` using the F16 model:
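As a rough way to reproduce single-sequence PP/TG numbers on this instance, one can use the bundled `llama-bench` tool (a simplification of the multi-client setup described above; double-check the flags against `./llama-bench --help` for your build):

```bash
# Prompt processing with a 2048-token prompt, 512 tokens of generation, all layers on the GPU
LLAMA_CUBLAS=1 make -j llama-bench
./llama-bench -m ./models/openhermes-7b-v2.5/ggml-model-f16.gguf -p 2048 -n 512 -ngl 99
```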
We immediately notice that there is not enough VRAM to load both the F16 model and the 10240-token KV cache. This means the maximum number of clients we can serve in this case is just 1. The TG speed is also not great, as expected: ~16 t/s.
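A back-of-the-envelope check of why the F16 configuration overflows the card (7.24B is Mistral-7B's parameter count; the compute-buffer term and the roughly 15 GiB of usable VRAM are my rough estimates, not numbers from the benchmark tables):

$$
7.24\times10^{9} \ \text{weights} \times 2 \ \text{bytes} \approx 13.5 \ \text{GiB}
$$

$$
13.5 \ \text{GiB (weights)} + 1.25 \ \text{GiB (KV cache)} + \text{compute buffers} > \ {\sim}15 \ \text{GiB usable VRAM}
$$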
Let's relax the requirements for a moment and say that the max prompt size is `512` instead of `2048`. This scenario now fits in the available VRAM, and here are the results:

`llama.cpp` supports efficient quantization formats. By using a quantized model, we can reduce the base VRAM required to store the model in memory and thus free some VRAM for a bigger KV cache. This will allow us to serve more clients with the original prompt size of `2048` tokens. Let's repeat the same benchmark using the `Q8_0` and `Q4_K` quantized models:

Using the quantized models and a KV cache of size `4*(2048 + 512) == 10240` tokens, we can now successfully serve 4 clients in parallel and have plenty of VRAM left. The prompt processing speed is not as good as F16, but the text generation is better or similar.
Note that `llama.cpp` supports continuous batching and sharing a common prompt. A sample implementation is demonstrated in the parallel.cpp example. Here is a sample run with the `Q4_K` quantized model, simulating 4 clients in parallel, asking short questions with a shared assistant prompt of 300 tokens, for a total of 64 requests:

`LLAMA_CUBLAS=1 make -j parallel && ./parallel -m ./models/openhermes-7b-v2.5/ggml-model-f16.gguf -n -1 -c 4096 --cont_batching --parallel 4 --sequences 64 --n-gpu-layers 99 -s 1`
Results from `parallel`
Running a demo HTTP server
The `llama.cpp` server example can be built and started like this:
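A sketch of how the server is typically built and launched for the scenario above (the Q4_K file name follows the naming used earlier in this post, and the flag spellings should be double-checked against `./server --help` for your build):

```bash
# Build the server example with CUDA support
LLAMA_CUBLAS=1 make -j server

# 10240 tokens of KV cache = 4 * (2048 + 512), shared across 4 parallel slots,
# with continuous batching and all layers offloaded to the GPU
./server -m ./models/openhermes-7b-v2.5/ggml-model-q4_k.gguf \
    -c 10240 --parallel 4 --cont-batching --n-gpu-layers 99 \
    --host 0.0.0.0 --port 8888
```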
An alternative way for really quick deployment of `llama.cpp` for demo purposes is to use the server-llm.sh helper script:

`bash -c "$(curl -s https://ggml.ai/server-llm.sh)"`
For more info, see: #3868
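To sanity-check the deployment from another machine over the port 8888 we opened earlier, the server's `/completion` endpoint can be queried with plain curl (the instance IP is a placeholder; the JSON fields are the ones documented in the server example's README):

```bash
curl -s http://<instance-public-ip>:8888/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 64}'
```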
Final notes
This was a short walkthrough of how to set up and benchmark `llama.cpp` in the cloud, which I hope will be useful for people looking for a simple and efficient LLM solution. There are many details not covered here, and one needs to understand some of the intricate details of the `llama.cpp` and `ggml` implementations in order to take full advantage of the available compute resources. Knowing when to use a quantized model vs an F16 model, for example, requires an understanding of the existing CUDA kernels and their limitations. The code base is still relatively simple and makes it easy to customize the implementation according to the specific needs of a project. Such customizations can yield significant performance gains compared to the stock `llama.cpp` implementation that is available out-of-the-box from `master`.