Finetune GPU Utilization fell to 0% #4016
Comments
I recently had a different issue with CUDA finetuning on Windows, and it was caused by a regression I had picked up. It was fixed by commit 48ade94; could you confirm you have that commit?
@AndrewGodfrey Hi, we were using 57ad015. So, we have the fixed commit.
Ok. I don't think you've hit a bug. Your expectation (that finetune.cpp should fully utilize the GPU) is not yet met by the current code. The current state is captured in this comment by ggerganov, specifically: "when doing finetuning, some of the ops can benefit from the base tensors being already on the GPU although the results from these ops would likely be copied back to the CPU. So this is far from full offloading, but it actually can provide improvement in the speed." I'm curious whether you do see an improvement in speed in this scenario vs. CPU-only finetuning, but either answer would not be a big surprise yet.
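For reference, a minimal way to answer that speed question would be to time the same short finetune run twice, once CPU-only and once with partial offload, and compare the wall-clock times. This is only a sketch, not part of the original report; the flags mirror the command given further down in this issue, while the output filenames and the shortened --adam-iter value are arbitrary choices for illustration.

# CPU-only run (no -ngl):
time ./finetune --model-base ../../../Llama-2-7b-32.gguf --lora-out ./lora-cpu.bin --train-data ../train-sample.txt --threads 12 --adam-iter 5 --batch 4 --ctx 1100 --no-checkpointing --sample-start "<s>"
# Same run with partial offload to the GPU:
time ./finetune --model-base ../../../Llama-2-7b-32.gguf --lora-out ./lora-gpu.bin --train-data ../train-sample.txt --threads 12 --adam-iter 5 --batch 4 --ctx 1100 -ngl 35 --no-checkpointing --sample-start "<s>"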
@AndrewGodfrey Thank you for your reply. I'll close this.
@AndrewGodfrey I think the key issue here is that we have observed the GPU utilization drop to 0% when using the -ngl option. What are your thoughts on this?
What happens when you don’t use -ngl?
@AndrewGodfrey Our team would like to work on CUDA acceleration for finetuning since it's not optimized. Would you please point us in the right direction? How can we go about implementing this?
It works, but it is so slow.
The point of my question is to get an answer to your question: if you know how to be confident that the CPU run is making progress, what happens when you apply that same check to the GPU run?
I don't have that answer; I've been learning about the code myself. There's training-related code that runs on the "CPU" backend, but what I don't know is whether the bulk of that work sits in a small piece that could be easily offloaded, or whether it's much more complicated. There are also bugs in finetune.cpp's logic for calculating buffer sizes (even when running on CPU), and I don't know how many of those would need to be fixed first. Other people who have been on the project longer might know these answers already. P.S. Note that finetune.cpp reuses code that lives in train.cpp, so you would likely be accelerating both use cases at the same time.
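For anyone picking this up, a hypothetical starting point (the file paths are assumptions based on the llama.cpp layout around the time of this issue, not something stated above) might look like:

# Shared training code that finetune.cpp reuses:
#   examples/finetune/finetune.cpp
#   common/train.cpp, common/train.h
# One rough way to see which ops the CUDA backend handles, and which would
# fall back to the CPU backend during training, is to search the CUDA source:
grep -n "GGML_OP_" ggml-cuda.cu | head -n 40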
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
finetune.cpp should fully utilize the GPU. We were trying to finetune the Llama2-7b 16-bit model with a sample dataset of 1000 samples.
Current Behavior
When running finetune with all layers offloaded to the GPU, GPU utilization sat at around 30% at the beginning of each batch and then fell to 0% until the batch finished.
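As a hedged illustration of how this utilization pattern can be observed (the exact tooling is not stated in the report), utilization can be logged once per second with nvidia-smi while finetune is running:

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 > gpu-util.log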
Environment and Context
We use a GCP a2-highgpu-1g instance (1 GPU, 40 GB HBM2, 12 vCPUs, 85 GB memory) with an NVIDIA A100 40 GB GPU.
Linux instance-1 5.10.0-25-cloud-amd64 #1 SMP Debian 5.10.191-1 (2023-08-16) x86_64 GNU/Linux
Failure Information
Steps to Reproduce
Build command
cmake .. -DLLAMA_CUBLAS=ON && cmake --build . --config Release -j8
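Optionally, and purely as an assumption on top of the reported build (not required to reproduce), the CUDA architecture can be pinned to the A100 (compute capability 8.0) so kernels are not built for other targets:

cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80 && cmake --build . --config Release -j8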
Finetune
./finetune --model-base ../../../Llama-2-7b-32.gguf --lora-out ./loraout.bin --train-data ../train-sample.txt --save-every 1 --threads 12 --adam-iter 30 --batch 4 --ctx 1100 -ngl 35 --no-checkpointing --epochs 3 --checkpoint-out ./check-ITERATION.gguf --sample-start "<s>"
Logs