
Weird CUDA memory utilization #16

Open
lwmlyy opened this issue Aug 24, 2023 · 5 comments


@lwmlyy

lwmlyy commented Aug 24, 2023

Hi, I am using the plain python launch to LoRA-finetune Llama-2-70B, and the training runs fine. But it seems a bit weird that the GPU memory utilization is quite low, less than 18 GB. Also, training is relatively slow compared to the llama-recipes codebase.

The command is:
[screenshot of the launch command]

The gpu status during training is:
[screenshot of the GPU status during training]

@arielnlee
Owner

Hi! Thanks for your interest. Have you tried accelerate? That worked for us! The python way also works, but is very slow. Definitely try accelerate, but if you don't want to, I'd at least switch to 4 A100 80GB GPUs.

@lwmlyy
Author

lwmlyy commented Aug 24, 2023 via email

@arielnlee
Owner

First run accelerate config to set up accelerate and then replace python finetune.py with accelerate launch finetune.py. If that doesn't work, I'll be happy to get you a script.

To clarify, python finetune.py will not run as quickly on 4 GPUs as on 8, but when we tried the native python way, 8 GPUs seemed a bit of a waste, since, as you noticed, utilization isn't great.
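For reference, a minimal sketch of that switch, assuming a single-node multi-GPU setup (the config path is the standard accelerate default, and any finetune.py arguments stay the same as in the python launch; nothing here is specific to this repo):

    # one-time interactive setup; answers are saved to
    # ~/.cache/huggingface/accelerate/default_config.yaml
    accelerate config

    # then launch with accelerate instead of plain python,
    # keeping the same finetune.py arguments as before
    accelerate launch finetune.py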

@lwmlyy
Author

lwmlyy commented Aug 25, 2023

First run accelerate config to set up accelerate and then replace python finetune.py with accelerate launch finetune.py. If that doesn't work, I'll be happy to get you a script.

To clarify, python finetune.py will not run as quickly on 4 GPUs as on 8, but when we tried the native python way, 8 GPUs seemed a bit of a waste, since, as you noticed, utilization isn't great.

I just tried running the script with accelerate launch (8×A100-80GB), but it went CUDA OOM during model loading. Any advice?

The accelerate config is as follows:
[screenshot of the accelerate config]

The launch config is as follows:
[screenshot of the launch config]

@moon-fall

moon-fall commented Oct 8, 2023

Same problem. I solved it by reinstalling the python packages with the versions pinned in requirements.txt; I think it is related to the peft package.
But after that I still hit CUDA OOM when cutoff_len is bigger than 1024.
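For anyone hitting the same thing, a minimal sketch of that reinstall, assuming the repository's pinned requirements.txt (these are generic pip commands, not specific to this repo):

    # reinstall the dependencies exactly as pinned in requirements.txt
    pip install --force-reinstall -r requirements.txt

    # confirm which peft version actually got installed
    pip show peft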
