server: bench: continuous performance testing #6233
Comments
@ggerganov @ngxson @slaren I'd appreciate your early feedback on the approach before I start implementing too much.
This is honestly so cool, I think it'd be a very worthwhile investment to track performance changes for a small set of select hardware over time. I think we'll see that some small changes affect performance in unexpected ways (both positive and negative).

Only one thing I am wondering right now: do these servers run on some kind of shared hardware? It's incredibly important that everything on the system is in the exact same clean state whenever a test is run. For example, if it's on shared hardware it's possible certain caches are suboptimal, whereas in the opposite case, if the same hardware is run 5x in a row, will the second run be a lot faster due to all sorts of arcane kernel, filesystem, SSD, and driver caches?

I believe I saw a presentation by a C++ benchmarking expert who had developed a script that can reset all this arcane and hidden shared state/caches affecting benchmarking in one go. I'll go look and see if I can find it.
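For illustration, one commonly cited step on Linux is to sync and drop the kernel page/dentry/inode caches before each run. A minimal sketch, assuming the runner is Linux and the harness can run as root; GPU clocks, CPU governor, thermal throttling, etc. would need separate handling:

```python
import subprocess

def drop_linux_fs_caches():
    """Flush dirty pages to disk, then ask the kernel to drop the page cache,
    dentries and inodes (value 3) so each benchmark run starts from a cold
    filesystem cache. Requires root."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```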
Looks good. It would be nice to have other parameters in the matrix, such as different values of
All tests will run on dedicated Azure nodes (thanks @aigrant) that will do just this benchmark. We are starting with a single T4 node, and if this works out, we will add more.
Cool idea, it will be very useful to keep track of llama.cpp's performance compared to "pure" GPU alternatives like TensorRT or exllama.
One thing I think we need to consider though: the proposal here seems to be based on the idea of having a "manager" machine and a "runner" machine. This will not be the case when using a self-hosted runner. You can imagine that GitHub simply sends out SSH commands to the self-hosted runner, so there is only one machine involved. Because of that, Prometheus may not really fit the usage (because everything runs on the same machine).

Also, I think we could start with something more basic than Prometheus, for example just a simple Python script that collects metrics every X seconds. My idea here is to avoid an external dependency from the beginning; we should only add it when we feel it is absolutely needed.

Personally, at my company I often have to work with self-hosted GitLab and self-hosted runners, so I think I can help with setup scripts if needed. Figuring out how to properly set up the self-hosted runner is also a big task I think, so let's focus more on that for now.
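A minimal sketch of that idea, assuming the server exposes a Prometheus-style text endpoint at `/metrics` (as the `--metrics` flag suggests); the URL and polling interval are placeholders:

```python
import time
import urllib.request

METRICS_URL = "http://localhost:8080/metrics"  # placeholder server address

def scrape_once():
    """Fetch the Prometheus text exposition and keep the numeric samples."""
    with urllib.request.urlopen(METRICS_URL) as resp:
        text = resp.read().decode()
    samples = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            pass
    return samples

def collect(interval_s=5, duration_s=600):
    """Poll every `interval_s` seconds and return timestamped samples."""
    series = []
    end = time.time() + duration_s
    while time.time() < end:
        series.append((time.time(), scrape_once()))
        time.sleep(interval_s)
    return series
```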
Yes I understand, but is it a bare-metal server that is completely isolated? Or is it sharing resources on one huge server? Either way, it doesn't matter much what exactly it's running on. My point is that any hidden arcane state needs to be reset before running any benchmark script.
Servers with a T4 GPU are usually "shared CPU but dedicated GPU". I believe that's also the case with other GPUs like the A100 or A10G, but I'm not sure whether the same applies to the H100.
At my company we have GitLab runners plugged into Docker on each machine, so in the end each CI run is isolated (multiple CI jobs can even run in parallel). Even when a CI run fails for some reason, the resources are automatically cleaned up by Docker. I believe the GitHub runner "agent" has the same function.

Edit: but yeah, sometimes it's better to simply reset the machine (maybe via a snapshot), especially when working with benchmarks. We can look into this in the future.
Yes, AFAIK NVIDIA GPU virtualization does not exist on Azure (yet?); it is only possible to partition GPUs, but that is not our case. There is a solution from the vendor, and I also have good feedback on run.ai's fractional GPU sharing for Kubernetes.

@Azeirah In this proposal, all layers will be offloaded to the GPU and only one test will run at a time per runner, so I believe we will not suffer too much from hypervisor throttling.
@ggerganov We need to keep this in mind:
see Self-hosted runner security. So by design we will be using just-in-time runners, and ideally the workflow should be started only by
Rest assured, I will test all this on my private fork first.

EDIT: solution proposed in the summary
@ggerganov what about the defragmentation target for the baseline? Without it, I see a lot of:
With
The thold should be 5-10% (e.g. If you are getting that error, it means your
First workflow ready to receive feedback:
Based on this, we can modify the duration, all parameters, the comment template, frequency, etc. If you agree with the approach, I can later continue by adding models or embeddings.
Hello everyone, the workflow has been deployed for one week now, and some concerns have been identified:
@Azeirah @ngxson Any idea what could cause the discrepancies? Maybe the virtualization has an impact on
@ggerganov In which direction do you want us to go further? Add an A100 test :)? Add embeddings? Other models, MoE-like? Thanks for your feedback.
Regarding the PR comment with benchmark information: I find it a little bit distracting since it pops up in all PRs, even ones unrelated to speed. I think it would be better to implement the long-term plot that you suggested at some point, where we would be able to see the performance metrics as a function of time.

Variations: are we using the same random seed for all runs? AFAICT from

We should add F16 and Q8_0 benchmarks for Phi-2.
Seems interesting. I'm currently limited to working from a mobile phone, so I can't have a look right now. I'll try when I can.
@ggerganov the node seems to be down. Maybe we should configure the runner as a service?
Hm, not sure why it was down - restarted it again. A service could be useful.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Motivation
llama.cpp is under active development: new papers on LLMs are implemented quickly (which is a good thing) and backend device optimizations are continuously added.
All these factors have an impact on server performance, especially the following metrics:
It is important to monitor and control the impact of codebase evolution on these metrics,
example from:
Since #5941 we have a server bench framework; we can now trigger it based on different events:
The approach should be reproducible: use the same hardware architecture and the same model sizes and quants.
It would be nice to follow performance changes on a time-series graph like it is done in Apache Lucene.
Proposed approach
Bench will run on a T4 GPU node in Azure Cloud, so:
- on a GitHub self-hosted runner
- with Prometheus installed
A GitHub workflow will:
- build the server with the `Release` build type and `LLAMA_CUDA` for the `native` CUDA architecture (see the build sketch below)
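As a sketch of that build step (not the actual workflow), assuming the usual CMake options for a `Release` build with `LLAMA_CUDA` and the `native` CUDA architecture:

```python
import subprocess

# Configure and build the server target (option names are assumptions for illustration).
subprocess.run([
    "cmake", "-B", "build",
    "-DCMAKE_BUILD_TYPE=Release",
    "-DLLAMA_CUDA=ON",
    "-DCMAKE_CUDA_ARCHITECTURES=native",
], check=True)
subprocess.run(["cmake", "--build", "build", "--config", "Release", "--target", "server"], check=True)
```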
Technical consideration
One important aspect of this configuration is to make it easy to add more nodes in the future.
If we see that it works and is useful, we can find ways to add more hardware in order to collect metrics for different cases.
All the code used must be stored in the `examples/server/bench` folder.
GitHub Self-Hosted runner security
Self-hosted runner security:
By design, we will be using just-in-time runners:
As GitHub checks can only be run by collaborators and the job runs in a non-root Docker container, I think we are safe.
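For reference, a sketch of how a just-in-time runner could be registered through the GitHub REST API `generate-jitconfig` endpoint; the repository, labels, and token handling below are illustrative assumptions:

```python
import json
import os
import urllib.request

# Hypothetical repository, labels and token source, for illustration only.
API_URL = "https://api.github.com/repos/ggerganov/llama.cpp/actions/runners/generate-jitconfig"

def create_jit_runner(name, labels):
    """Request a single-use (just-in-time) runner configuration from GitHub.
    The returned encoded config is passed to the runner's `run.sh --jitconfig <...>`."""
    payload = json.dumps({
        "name": name,
        "runner_group_id": 1,   # default runner group
        "labels": labels,
        "work_folder": "_work",
    }).encode()
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["encoded_jit_config"]

# Example: a one-shot runner dedicated to the T4 bench node.
# jit_config = create_jit_runner("bench-t4-jit", ["self-hosted", "Linux", "X64"])
```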
Server scenario parameters matrix
In addition, the following parameters will be used (see the launch sketch after this list):
- `--log-disable`: no need to have a log file
- `--metrics`: to allow Prometheus metrics scraping
- `--cont-batching`: probably needs to be enabled by default, see server: enable --cont-batching by default #6229
- `--threads 1`: we will test only with all layers offloaded to the GPU
- `--threads-batch 1`: we will test only with all layers offloaded to the GPU
- `--model ggml-model.gguf`: as we can now download anything from HF
- `--defrag-thold 0.1`
Only the OAI Chat completions endpoint with streaming enabled will be tested for completions.
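For illustration, a minimal streaming request against the OAI-compatible chat completions endpoint; host, port, and prompt are placeholders:

```python
import json
import urllib.request

# Placeholder host/port and prompt.
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps({
        "model": "ggml-model.gguf",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    }).encode(),
    headers={"Content-Type": "application/json"},
)

# The response is a server-sent events stream; each "data:" line carries a JSON chunk.
with urllib.request.urlopen(req) as resp:
    for raw in resp:
        line = raw.decode().strip()
        if line.startswith("data: ") and line != "data: [DONE]":
            chunk = json.loads(line[len("data: "):])
            print(chunk["choices"][0]["delta"].get("content", ""), end="")
```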
Dataset consideration
The following parameters are documented in bench/README.md (see the filtering sketch after this list):
- `SERVER_BENCH_N_PROMPTS`: total prompts to select for the benchmark
- `SERVER_BENCH_MAX_PROMPT_TOKENS`: maximum prompt tokens; longer prompts are filtered out of the dataset
- `SERVER_BENCH_MAX_CONTEXT`: maximum context size of the completion requests used to filter the dataset: prompt + predicted tokens
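A rough sketch of how these environment variables could drive prompt selection; the dataset structure, defaults, and token counts below are placeholders, not the actual bench script:

```python
import os
import random

# Placeholder defaults; the real values live in bench/README.md and the CI workflow.
N_PROMPTS = int(os.environ.get("SERVER_BENCH_N_PROMPTS", "480"))
MAX_PROMPT_TOKENS = int(os.environ.get("SERVER_BENCH_MAX_PROMPT_TOKENS", "1024"))
MAX_CONTEXT = int(os.environ.get("SERVER_BENCH_MAX_CONTEXT", "2048"))

def select_prompts(dataset, n_predict=512):
    """Keep prompts that fit both the prompt-token and total-context budgets,
    then sample N_PROMPTS of them for the benchmark run.

    `dataset` is assumed to be a list of dicts with `text` and `n_tokens` keys.
    """
    eligible = [
        d for d in dataset
        if d["n_tokens"] <= MAX_PROMPT_TOKENS
        and d["n_tokens"] + n_predict <= MAX_CONTEXT
    ]
    return random.sample(eligible, min(N_PROMPTS, len(eligible)))
```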
Selected dataset:
Tasks
- was not so easy actually
- `alsa-utils` in order to prevent: `could not open aplay -l` during installation
- `install-docker.sh` in ggml-ci: ci: add install-docker.sh ggml-org/ci#1
- `--ubatch-size` option in the README: server: docs: `--threads` and `--threads`, `--ubatch-size`, `--log-disable` #6254
- `--defrag-thold` option #6293
option #6293The text was updated successfully, but these errors were encountered: