Model loading OOM when using FSDP + QLoRA #31721

Neo9061 · 2024-07-01T03:08:23Z

System Info

Base line. For single instance of p4de.24xlarge (640GB GPU, 1000 GB CPU), i am able to use Q(4-bit)LoRA to train a large model wit size close to 300B. Device_map is set as auto with code as below.

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)

However, when I use FSDP + QLoRA with 2 p4de.24xlarge instances. Model loading went OOM on CPU.

The script is based on this blog and run_fsdp_qlora.py.

Can anyone please share some insights? Looking at the from_pretained method's code here and here. Can I get clarification on the following questions? Many thanks.

For FSDP + QLoRA, during model loading, Please comment if my understanding is correct.

If model is quantized, then the model is loaded on GPU and further casted into CPU, because of is_quantized in this line) and this comment.
If model is not quantized, then the model is directly loaded into CPU.

The OOM happens on CPU as I didn't see any error of "not enough CUDA memory". Thus, for the model that is quantized, when you cast the model into CPU, is only rank 0 doing the job or each of all ranks is casting into CPU leading CPU memory exploding? Same comment for the model that is not quantized during loading.
For quantized model, if you load it firstly into GPU, are you using all GPUs to load the model or using rank 0 to load it?

Who can help?

@SunMarc @ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Distributed-finetuning.zip
Here is my code to reproduce the issue

Expected behavior

error free

The text was updated successfully, but these errors were encountered:

github-actions · 2024-07-31T08:03:57Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

matthewdouglas · 2024-07-31T15:08:29Z

This should be fixed with #32276. Related: #31577

ArthurZucker · 2024-08-27T12:53:38Z

Not stale but the PR was reverted!

amyeroberts added PyTorch FSDP Accelerate PEFT Quantization labels Jul 1, 2024

huggingface deleted a comment from github-actions bot Aug 27, 2024

matthewdouglas mentioned this issue Aug 27, 2024

Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading #33154

Merged

5 tasks

ArthurZucker closed this as completed in #33154 Sep 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model loading OOM when using FSDP + QLoRA #31721

Model loading OOM when using FSDP + QLoRA #31721

Neo9061 commented Jul 1, 2024 •

edited

Loading

github-actions bot commented Jul 31, 2024

matthewdouglas commented Jul 31, 2024

ArthurZucker commented Aug 27, 2024

Model loading OOM when using FSDP + QLoRA #31721

Model loading OOM when using FSDP + QLoRA #31721

Comments

Neo9061 commented Jul 1, 2024 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

github-actions bot commented Jul 31, 2024

matthewdouglas commented Jul 31, 2024

ArthurZucker commented Aug 27, 2024

Neo9061 commented Jul 1, 2024 •

edited

Loading