GPU NOT used during "normal generation" when ONE LAYER offloaded (But GPU used in prompt evaluation) #3860

Closed
Nate687 opened this issue Oct 30, 2023 · 6 comments


Nate687 commented Oct 30, 2023

Hello,

I am using a GGUF version of dolphin-2.2-mistral-7B-GGUF (dolphin-2.2-mistral-7b.Q8_0.gguf) and have offloaded ONLY ONE LAYER of the model to the GPU using the "n-gpu-layers" option (since my GPU has only 2 GB of VRAM, I cannot offload any more without hitting a CUDA out-of-memory error during inference).
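For reference, the equivalent llama.cpp command line looks roughly like this (a sketch; the build flag shown is the cuBLAS one current at the time, and the prompt is a placeholder):

```bash
# Build llama.cpp with cuBLAS support, then run with a single layer
# offloaded via -ngl (short for --n-gpu-layers).
make LLAMA_CUBLAS=1
./main -m dolphin-2.2-mistral-7b.Q8_0.gguf -ngl 1 -p "Hello, how are you?"
```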

All works well during prompt evaluation, and I can see the GPU being used (around 80% GPU usage is observed).

However, GPU usage immediately drops to 0 and CPU usage reaches 100% (i.e. no GPU usage) during normal generation of the response (i.e. once the prompt has been evaluated and the model's response starts being printed to the screen).

As such, my "prompt eval time" is around 35 ms per token, but the "eval time" is around 500 ms per token.

I may be wrong, but I thought that even though only one layer was offloaded to the GPU, the GPU would still be used during normal generation of the response (to help bring down the "eval time").

Any guidance or tips on what I may be doing wrong would be highly appreciated.

Thanks,
Nate


staviq commented Oct 30, 2023

That is pretty much "normal": prompt processing is GPU accelerated if you compiled llama.cpp with GPU acceleration, but when the actual inference happens, offloading just one layer doesn't add much performance benefit while adding memory-transfer overhead to and from the GPU.

Processing is still mostly done layer by layer, so the GPU works on the first layer, hands the computation back to the CPU, and then sits idle while the rest of the layers are processed on the CPU.

So basically, you want to offload enough layers that the GPU's compute speed at least makes up for the memory-transfer overhead; otherwise there are no gains from using the GPU.

With a 2 GB VRAM GPU, you might actually get better performance by not offloading any layers and just having the prompt processing accelerated, but you need to test that, as it depends a lot on the particular hardware.
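One way to test it is llama-bench, which ships with llama.cpp and accepts a comma-separated list of values for -ngl (a sketch, reusing the model file from the original post):

```bash
# Benchmarks prompt processing (pp) and text generation (tg) at each
# offload count, making the break-even point easy to spot.
./llama-bench -m dolphin-2.2-mistral-7b.Q8_0.gguf -ngl 0,1,4,8
```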


Dampfinchen commented Oct 31, 2023

Yes, I'm experiencing this too. Before, I was getting 250 ms/t generation time; now I'm getting 1300 ms/t with 25 layers offloaded, running a 13B q4k_s model on my 6 GB RTX 2060. Prompt processing is unaffected.

@Dampfinchen

I've fixed my problem by compiling it with AVX2 (it isn't enabled by default anymore).

But yeah, with 1 layer the GPU won't do much. That is normal.


cduk commented Mar 14, 2024

> That is pretty much "normal": prompt processing is GPU accelerated if you compiled llama.cpp with GPU acceleration, but when the actual inference happens, offloading just one layer doesn't add much performance benefit while adding memory-transfer overhead to and from the GPU.

I want to do just this: have the prompt processing GPU accelerated and the rest done by the CPU. Is there anything I have to do explicitly to enable GPU-accelerated prompt processing? I have the number of GPU layers set to zero.
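Concretely, what I'm running looks like this (a sketch, reusing the model file from the original post; going by staviq's comment above, the GPU-enabled build alone should be enough for accelerated prompt processing even with zero layers offloaded):

```bash
# GPU-enabled build; -ngl 0 keeps all layers on the CPU, but large
# prompt batches should still be routed through the GPU for evaluation.
make LLAMA_CUBLAS=1
./main -m dolphin-2.2-mistral-7b.Q8_0.gguf -ngl 0 -f prompt.txt
```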


cduk commented Mar 14, 2024

> I've fixed my problem by compiling it with AVX2 (it isn't enabled by default anymore).

How do you compile it with AVX2?
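For later readers, a sketch of two routes that existed at the time: the Makefile build uses -march=native, which picks AVX2 up automatically on CPUs that support it, while the CMake build exposes an explicit LLAMA_AVX2 option:

```bash
# Makefile route: -march=native enables AVX2 when the host CPU supports it.
make clean && make

# CMake route: turn AVX2 on explicitly.
cmake -B build -DLLAMA_AVX2=ON
cmake --build build --config Release
```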

github-actions bot added the stale label on Apr 14, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
