
Performance issues with high level API #232

Closed
@AlphaAtlas

Description


After noticing a large, clearly visible slowdown in the ooba text UI compared to llama.cpp, I wrote a test script to profile llama-cpp-python's high-level API:

from llama_cpp import Llama

# Load the q4_1 7B model with 31 layers offloaded to the GPU and 8 CPU threads,
# then run a single completion so the timings can be compared against ./main.
llm = Llama(model_path="/home/alpha/Storage/AIModels/textui/metharme-7b-4bit-ggml-q4_1/ggml-model-q4_1.bin", n_gpu_layers=31, n_threads=8)
output = llm("""test prompt goes here""", max_tokens=300, stop=[], echo=True)
print(output)

And at first glance everything looks fine, with the reported per-token timings within a margin of error of each other:

llama-cpp-python test script:

llama_print_timings:      sample time =    31.18 ms /    89 runs   (    0.35 ms per token)
llama_print_timings: prompt eval time =   689.20 ms /    52 tokens (   13.25 ms per token)
llama_print_timings:        eval time =  5664.22 ms /    88 runs   (   64.37 ms per token)

llama.cpp ./main:

llama_print_timings:      sample time =    95.47 ms /   145 runs   (    0.66 ms per token)
llama_print_timings: prompt eval time =   686.61 ms /    53 tokens (   12.95 ms per token)
llama_print_timings:        eval time =  9823.06 ms /   144 runs   (   68.22 ms per token)
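
The per-token eval times (64.37 vs 68.22 ms) are within noise, so whatever is slower must be time that llama.cpp's internal timers don't count. Here's a minimal sketch of how to check that, assuming the llm instance from the test script above and the OpenAI-style usage field in the returned completion dict (this snippet is not part of the original report):

import time

start = time.perf_counter()
output = llm("""test prompt goes here""", max_tokens=300, stop=[], echo=True)
elapsed = time.perf_counter() - start

# llama.cpp's timings only cover sampling, prompt eval, and eval; any
# wall-clock time beyond their sum is overhead on the Python side.
n_tokens = output["usage"]["completion_tokens"]
print(f"wall clock: {elapsed * 1000:.0f} ms total, "
      f"{elapsed * 1000 / n_tokens:.2f} ms per generated token")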

So I profiled the generation with Nvidia Nsight Systems (sudo nsys profile --gpu-metrics-device=0 python perftest.py) and then examined the generated reports in nsys-ui.
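
A simpler cross-check that doesn't need Nsight is the standard-library profiler. A minimal sketch, again assuming the llm instance from the test script; heavy cumulative time in the wrapper's own Python functions would corroborate what the timeline shows:

import cProfile
import pstats

# Profile one generation and list the 20 most expensive calls by
# cumulative time.
cProfile.run('llm("""test prompt goes here""", max_tokens=300, stop=[], echo=True)',
             "perftest.prof")
pstats.Stats("perftest.prof").sort_stats("cumulative").print_stats(20)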

Here is a snapshot of llama.cpp's utilization:
[screenshot: llama.cpp CPU/GPU utilization timeline]

The CPU is fully saturated without interruption. The GPU is not fully utilized, but is consistently loaded, as is to be expected.

Now, here is the same generation on the current git commit of llama-cpp-python:
[screenshot: llama-cpp-python CPU/GPU utilization timeline]

There seem to be long pauses where the only thread doing any work is the single Python thread:
[screenshot: zoomed timeline showing the idle gaps]
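
Those gaps can also be made visible without a profiler by streaming the completion and timing the interval between tokens. A minimal sketch, assuming the same llm instance and the high-level API's stream=True mode:

import time

gaps = []
last = time.perf_counter()
for chunk in llm("""test prompt goes here""", max_tokens=300, stop=[], stream=True):
    now = time.perf_counter()
    gaps.append(now - last)
    last = now

# With eval itself at ~64 ms per token, inter-token gaps much larger than
# that would be Python-side work between llama.cpp calls.
print(f"mean gap: {1000 * sum(gaps) / len(gaps):.2f} ms, "
      f"max gap: {1000 * max(gaps):.2f} ms")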

@Firstbober seems to have discovered that the low-level API is faster than the high-level API: #181

And @eiery seems to think that this issue predates the CUDA builds, though their tokens/s measurements don't line up with mine: oobabooga/text-generation-webui#2088 (comment)
