Description
After noticing a large, clearly visible slowdown in the oobabooga text UI compared to llama.cpp, I wrote a test script to profile llama-cpp-python's high-level API:
```python
from llama_cpp import Llama

llm = Llama(
    model_path="/home/alpha/Storage/AIModels/textui/metharme-7b-4bit-ggml-q4_1/ggml-model-q4_1.bin",
    n_gpu_layers=31,
    n_threads=8,
)
output = llm("""test prompt goes here""", max_tokens=300, stop=[], echo=True)
print(output)
```
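The llama_print_timings output shown below only covers the sampling, prompt eval, and eval loops inside llama.cpp, so as a cross-check it can also help to time the whole call from Python. A minimal sketch of that idea, assuming the high-level API returns an OpenAI-style completion dict with a `usage` field and reusing the same model path and placeholder prompt:

```python
import time

from llama_cpp import Llama

# Same model and settings as the test script above.
llm = Llama(
    model_path="/home/alpha/Storage/AIModels/textui/metharme-7b-4bit-ggml-q4_1/ggml-model-q4_1.bin",
    n_gpu_layers=31,
    n_threads=8,
)

start = time.perf_counter()
output = llm("""test prompt goes here""", max_tokens=300, stop=[], echo=True)
elapsed = time.perf_counter() - start

# Assumption: the high-level API returns an OpenAI-style dict whose "usage"
# entry holds the token counts for this completion.
generated = output["usage"]["completion_tokens"]
print(f"end to end: {elapsed:.2f} s, {generated / elapsed:.2f} tokens/s")
```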
At first glance everything looks fine, with the differences within a margin of error:
llama-cpp-python test script:

```
llama_print_timings:      sample time =    31.18 ms /    89 runs   (    0.35 ms per token)
llama_print_timings: prompt eval time =   689.20 ms /    52 tokens (   13.25 ms per token)
llama_print_timings:        eval time =  5664.22 ms /    88 runs   (   64.37 ms per token)
```

llama.cpp ./main:

```
llama_print_timings:      sample time =    95.47 ms /   145 runs   (    0.66 ms per token)
llama_print_timings: prompt eval time =   686.61 ms /    53 tokens (   12.95 ms per token)
llama_print_timings:        eval time =  9823.06 ms /   144 runs   (   68.22 ms per token)
```
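Converting the reported per-token eval times into throughput makes the comparison concrete; the eval speeds differ by only a few percent, which suggests the slowdown happens outside the regions llama_print_timings measures:

```python
# Throughput implied by the "eval time" lines above (ms per token -> tokens/s).
for name, ms_per_token in [("llama-cpp-python", 64.37), ("llama.cpp ./main", 68.22)]:
    print(f"{name}: {1000.0 / ms_per_token:.1f} tokens/s (eval only)")
# llama-cpp-python: 15.5 tokens/s (eval only)
# llama.cpp ./main: 14.7 tokens/s (eval only)
```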
So I used Nvidia Nsight Systems to profile the generation with `sudo nsys profile --gpu-metrics-device=0 python perftest.py` and then examined the generated reports with `ncu-ui`.
Here is a snapshot of llama.cpp's utilization:
The CPU is fully saturated without interruption. The GPU is not fully utilized, but it is loaded fairly consistently, which is to be expected.
Now, here is the same trace for the current git commit of llama-cpp-python:
There seem to be long pauses where the only thread doing any work is the single Python thread:
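To see what that Python thread is actually doing during those gaps, a CPU-side profile of the same script can complement the nsys trace. A minimal sketch using only the standard library (perftest.py is the test script from above; the .prof filename is arbitrary):

```python
# First collect the profile:  python -m cProfile -o perftest.prof perftest.py
# Then inspect the hottest call paths:
import pstats

stats = pstats.Stats("perftest.prof")
stats.sort_stats("cumulative").print_stats(25)  # top 25 entries by cumulative time
```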
@Firstbober seems to have discovered that the low-level API is faster than the high-level API: #181
And @eiery seems to think that this issue predates the CUDA builds, though their tokens/s measurements don't line up with mine: oobabooga/text-generation-webui#2088 (comment)