Model loading OOM when using FSDP + QLoRA #31721
Comments
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Not stale but the PR was reverted!
System Info
Baseline: on a single p4de.24xlarge instance (640 GB GPU memory, 1000 GB CPU memory), I am able to use Q(4-bit)LoRA to train a large model with a size close to 300B parameters, with `device_map` set to `auto` (code as below). However, when I use FSDP + QLoRA across 2 p4de.24xlarge instances, model loading goes OOM on the CPU.
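For reference, a minimal sketch of the single-node 4-bit load described above (the model id is a placeholder; the full script is in the attached zip):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets accelerate spread the quantized weights across the
# 8 A100s of one p4de.24xlarge, offloading to CPU only if the GPUs fill up.
model = AutoModelForCausalLM.from_pretrained(
    "my-org/my-300b-model",  # placeholder id
    quantization_config=bnb_config,
    device_map="auto",
)
```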
Can anyone please share some insights? Looking at the `from_pretrained` method's code here and here (in particular `is_quantized` in this line, and this comment), can I get clarification on the following questions? Many thanks.

1. The OOM happens on the CPU, as I didn't see any "not enough CUDA memory" error. For a model that is quantized, when the model is cast to CPU, is only rank 0 doing the job, or is every rank casting it to CPU, making CPU memory explode? The same question applies to a model that is not quantized during loading.
2. For a quantized model, if it is first loaded onto GPU, are all GPUs used to load the model, or only rank 0?
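To frame the question, here is a sketch of the loading pattern I would expect to avoid the CPU blow-up, assuming the `FSDP_CPU_RAM_EFFICIENT_LOADING` switch (exposed in the accelerate FSDP config as `fsdp_cpu_ram_efficient_loading`, and normally set by `accelerate launch` rather than in the script) works the way I think it does: only local rank 0 materializes the checkpoint in CPU RAM, while the other ranks build the model on the meta device. The model id is a placeholder.

```python
import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Normally set via the accelerate FSDP config
# (fsdp_cpu_ram_efficient_loading: true); shown inline only to make the
# intent explicit.
os.environ["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Store the packed 4-bit params in bf16 so FSDP can flat-shard them.
    bnb_4bit_quant_storage=torch.bfloat16,
)

# Under an FSDP-enabled accelerate launch, local rank 0 should load the real
# weights into CPU RAM once per node; the other ranks should get an empty
# (meta-device) copy and receive their shards from rank 0 afterwards.
model = AutoModelForCausalLM.from_pretrained(
    "my-org/my-300b-model",  # placeholder id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```

If instead every rank runs the full CPU materialization, 8 processes per node each holding its own copy of a ~300B checkpoint would easily exceed the 1000 GB of host RAM, which would match the OOM I am seeing.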
Who can help?
@SunMarc @ArthurZucker
Information
Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Here is my code to reproduce the issue: Distributed-finetuning.zip
Expected behavior
Error-free model loading across both instances.