Performance issues with high level API #232
Also, I can upload these profile files if anyone wants to take a peek for themselves, just ask. 👍 The next step is to profile the low level API and profile both APIs with py-spy, but I don't have either set up yet.
@AlphaAtlas thank you for the detailed report. Aside from py-spy, it may also be useful to profile memory allocation / usage. My guess is that it's to do with memory allocations; that's one area where I've probably been a little too lazy / unprincipled. I would guess it has to do with allocating / freeing new candidates arrays between calls to the llama.cpp sampling functions, which should really only be done once.
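As a rough illustration of what "only done once" could look like (a minimal sketch, not the actual llama-cpp-python code; the `TokenData` struct and `Sampler` class are hypothetical stand-ins for llama.cpp's `llama_token_data` candidates buffer):

```python
import ctypes

# Hypothetical struct mirroring llama.cpp's llama_token_data (id, logit, p);
# exact field names and types may differ between versions.
class TokenData(ctypes.Structure):
    _fields_ = [("id", ctypes.c_int32),
                ("logit", ctypes.c_float),
                ("p", ctypes.c_float)]

class Sampler:
    def __init__(self, n_vocab: int):
        # Allocate the candidates buffer once, when the object is created...
        self.candidates = (TokenData * n_vocab)()

    def prepare(self, logits):
        # ...and only refresh its contents on each sample, instead of
        # building a brand new ctypes array on every call.
        for i, logit in enumerate(logits):
            self.candidates[i].id = i
            self.candidates[i].logit = logit
            self.candidates[i].p = 0.0
        return self.candidates
```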
@AlphaAtlas If you compare timings between llama.cpp and llama-cpp-python with CPU only, do you see a difference?
I did some tests with your script as that's a better way to test the lib itself directly. Here are my results; I modified the script to run on 4 threads as that's what my computer can do. I also used a longer prompt to test both ingestion and generation. All tests were done on 7B q4_0 with OpenBLAS.
Original llama.cpp:
llama-cpp-python with script:
I ran both of these examples multiple times and the results above are representative of what I saw on average. What's interesting is that llama-cpp-python is slightly faster in generation, possibly due to the lack of streaming. With the webui, eval time is still on the order of ~350 ms per token.
Here are webui results with streaming turned on and off as a comparison. Without streaming, the webui results match what I'm seeing with @AlphaAtlas's script.
Streaming on:
Streaming off:
I think you have to take the results with a grain of salt, as the timings for me were almost exactly the same even though llama-cpp-python was visibly slower. Maybe llama.cpp is "sleeping" some of the time, and hence that time is not part of its performance metrics? Not sure about the llama.cpp metrics difference... Maybe Python is eating some CPU time doing something, which would be more visible on a 4-thread machine.
@eiery @AlphaAtlas the llama.cpp timings are only the portion of time spent in the llama.cpp eval and sample library functions, not the rest of the program. The flamegraph would give you a better indication of total times, but could be susceptible to errors from the sampling method.
Oh, and here is the full Python profile. You can open it with https://profiler.firefox.com.
@AlphaAtlas based on your latest screenshot, quite a bit of time is being spent on a list comprehension inside of the sampling methods. I've pushed a commit that replaces this list comprehension with a simple assignment to a data structure I create once when the Llama object is initialized. I've profiled the change and it does seem to reduce sampling time in practice. Let me know if it's any better for you.
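The gist of the change, sketched rather than quoted from the actual commit (reusing the hypothetical `TokenData` struct from the sketch above; function and variable names are placeholders):

```python
from typing import List

# Before (illustrative): a fresh list of per-token structs is built on
# every sampling call, which shows up as the hot list comprehension.
def build_candidates_each_call(logits: List[float]):
    return [TokenData(i, logit, 0.0) for i, logit in enumerate(logits)]

# After (illustrative): a buffer created once (e.g. in Llama.__init__) is
# reused, and only its fields are reassigned per call.
def refresh_candidates(candidates, logits: List[float]):
    for i, logit in enumerate(logits):
        candidates[i].id = i
        candidates[i].logit = logit
        candidates[i].p = 0.0
    return candidates
```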
Also, here are the full steps for testing (on Linux) in case anyone wants to try this out.
You can just use functiontrace and a CPU graph if you don't use the Nvidia profiler. |
@AlphaAtlas I think I've narrowed it down now: it looks like there's about a 20ms pause per sample where it's copying the last logits into the candidates token_data_array. Other than that, it looks like the majority of time is spent inside the C++ functions in the shared library. I'll work on trying to remove this delay, though it's a little challenging as I can't just memcpy the logits because the candidate data is an array of structs. Here's the line-by-line profiling I got by using
I think the solution is to just move to using numpy. It introduces an additional dependency, but it should reduce memory usage and speed up a few sections like this.
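A minimal sketch of the kind of numpy approach being described (not the actual implementation; the dtype layout and names here are assumptions): keep the candidates in a structured array whose memory layout matches the C struct, so the per-token Python loop becomes a couple of vectorized assignments.

```python
import numpy as np

n_vocab = 32000  # illustrative vocabulary size

# Structured dtype assumed to mirror llama.cpp's llama_token_data
# layout: {int32 id; float logit; float p;}.
candidates_dtype = np.dtype([("id", np.int32),
                             ("logit", np.float32),
                             ("p", np.float32)])

candidates = np.empty(n_vocab, dtype=candidates_dtype)
candidates["id"] = np.arange(n_vocab, dtype=np.int32)  # filled once

def refresh(logits: np.ndarray) -> None:
    # Bulk-copy the new logits instead of assigning token by token.
    candidates["logit"] = logits
    candidates["p"] = 0.0
```

The structured array's buffer can then be handed to the C sampling functions through ctypes (e.g. via `candidates.ctypes.data_as(...)`) without another copy, assuming the dtype layout really does match the struct.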
Chances are high that whoever uses this package already has numpy installed in their environment anyway.
@AlphaAtlas got the numpy implementation working and it seems to improve the performance as expected. The PR (#277) is still open, but I should be merging it soon.
Any idea when this might be merged into a new release build?
@brandonj60 just merged and published to v0.1.56
@AlphaAtlas do you mind testing with the latest version? There will still be some GPU utilization drop (sampling is not GPU accelerated afaik), but it should generally be faster.
Interestingly, the new numpy version breaks functiontrace:
The gaps in the Nvidia profile are still there, but very small now 👍 I do see the sampling GPU usage drop... maybe it doesn't matter? I will just time llama.cpp vs llama-cpp-python generating a set number of tokens.
I timed fresh builds of the llama-cpp-python script and of llama.cpp. They are both consistently ~28 seconds, with a half-second spread or so... Unless I find some other evidence that llama-cpp-python is slower, I think this issue is thoroughly fixed. 🎉 Thanks 👍
After noticing a large, visually noticeable slowdown in the ooba text UI compared to llama.cpp, I wrote a test script to profile llama-cpp-python's high level API:
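(The original script did not survive here; below is a minimal stand-in along the same lines, assuming the high-level `Llama` completion API — the model path, prompt, and token count are illustrative, not the author's.)

```python
import time
from llama_cpp import Llama

# Illustrative model path and prompt, not the ones from the original script.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

start = time.time()
out = llm("Write a short story about llamas.", max_tokens=128)
print(out["choices"][0]["text"])
print(f"Generated in {time.time() - start:.2f}s")
```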
And at first glance, everything looks fine, with differences within a margin of error:
llama-cpp-python test script:
llama.cpp ./main:
So I used Nvidia nsys to profile the generation with
`sudo nsys profile --gpu-metrics-device=0 python perftest.py`
and then examined the generated reports with ncu-ui.
Here is a snapshot of llama.cpp's utilization:

The CPU is fully saturated without interruption. The GPU is not being fully utilized, but is pretty consistently loaded, as is to be expected.
Now, here is the current git commit of llama-cpp-python:

Seems there are long pauses where the only thread doing any work is the single Python thread:

@Firstbober seems to have discovered that the low level API is faster than the high level API: #181
And @eiery seems to think that this issue predates the CUDA builds, though their token/s measurements don't line up with mine: oobabooga/text-generation-webui#2088 (comment)