Description
After noticing a large, clearly visible slowdown in the oobabooga text UI compared to llama.cpp, I wrote a test script to profile llama-cpp-python's high-level API:
```python
from llama_cpp import Llama

llm = Llama(
    model_path="/home/alpha/Storage/AIModels/textui/metharme-7b-4bit-ggml-q4_1/ggml-model-q4_1.bin",
    n_gpu_layers=31,
    n_threads=8,
)
output = llm("""test prompt goes here""", max_tokens=300, stop=[], echo=True)
print(output)
```
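The llama_print_timings output shown below only covers the sampling, prompt eval, and eval loops inside llama.cpp, so as a cross-check it can also help to time the whole call from Python. A minimal sketch of that idea, assuming the high-level API returns an OpenAI-style completion dict with a `usage` field and reusing the same model path and placeholder prompt:

```python
import time

from llama_cpp import Llama

# Same model and settings as the test script above.
llm = Llama(
    model_path="/home/alpha/Storage/AIModels/textui/metharme-7b-4bit-ggml-q4_1/ggml-model-q4_1.bin",
    n_gpu_layers=31,
    n_threads=8,
)

start = time.perf_counter()
output = llm("""test prompt goes here""", max_tokens=300, stop=[], echo=True)
elapsed = time.perf_counter() - start

# Assumption: the high-level API returns an OpenAI-style dict whose "usage"
# entry holds the token counts for this completion.
generated = output["usage"]["completion_tokens"]
print(f"end to end: {elapsed:.2f} s, {generated / elapsed:.2f} tokens/s")
```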
At first glance everything looks fine, with the differences within a margin of error:
llama-cpp-python test script:

```
llama_print_timings:      sample time =    31.18 ms /    89 runs   (    0.35 ms per token)
llama_print_timings: prompt eval time =   689.20 ms /    52 tokens (   13.25 ms per token)
llama_print_timings:        eval time =  5664.22 ms /    88 runs   (   64.37 ms per token)
```

llama.cpp ./main:

```
llama_print_timings:      sample time =    95.47 ms /   145 runs   (    0.66 ms per token)
llama_print_timings: prompt eval time =   686.61 ms /    53 tokens (   12.95 ms per token)
llama_print_timings:        eval time =  9823.06 ms /   144 runs   (   68.22 ms per token)
```
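Converting the reported per-token eval times into throughput makes the comparison concrete; the eval speeds differ by only a few percent, which suggests the slowdown happens outside the regions llama_print_timings measures:

```python
# Throughput implied by the "eval time" lines above (ms per token -> tokens/s).
for name, ms_per_token in [("llama-cpp-python", 64.37), ("llama.cpp ./main", 68.22)]:
    print(f"{name}: {1000.0 / ms_per_token:.1f} tokens/s (eval only)")
# llama-cpp-python: 15.5 tokens/s (eval only)
# llama.cpp ./main: 14.7 tokens/s (eval only)
```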
So I used Nvidia Nsight Systems to profile the generation with `sudo nsys profile --gpu-metrics-device=0 python perftest.py` and then examined the generated reports with `ncu-ui`.
Here is a snapshot of llama.cpp's utilization:
The CPU is fully saturated without interruption. The GPU is not fully utilized, but it is loaded fairly consistently, which is to be expected.
Now, here is the same trace for the current git commit of llama-cpp-python:
There seem to be long pauses where the only thread doing any work is the single Python thread:
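To see what that Python thread is actually doing during those gaps, a CPU-side profile of the same script can complement the nsys trace. A minimal sketch using only the standard library (perftest.py is the test script from above; the .prof filename is arbitrary):

```python
# First collect the profile:  python -m cProfile -o perftest.prof perftest.py
# Then inspect the hottest call paths:
import pstats

stats = pstats.Stats("perftest.prof")
stats.sort_stats("cumulative").print_stats(25)  # top 25 entries by cumulative time
```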
@Firstbober seems to have discovered that the low-level API is faster than the high-level API: #181
And @eiery seems to think that this issue predates the CUDA builds, though their tokens/s measurements don't line up with mine: oobabooga/text-generation-webui#2088 (comment)