Loss does not converge with FSDP cpu offloading #360

Closed · hjlee1371 opened this issue Jan 29, 2024 · 2 comments

hjlee1371 commented Jan 29, 2024

System Info

pytorch==2.2.0
transformers==4.36.2
8 A100 80GB gpus

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

When using FSDP cpu offload with --fsdp_config.fsdp_cpu_offload, the loss fails to converge, unlike when offloading is disabled.

My experiments are based on a forked codebase with minimal modifications (https://github.com/hjlee1371/llama-recipes), run with the following script. I finetuned llama-2-7b-hf on the (default) samsum dataset. This may be related to similar issues such as this one.

# Remove --fsdp_config.fsdp_cpu_offload when training without offloading;
# --output_dir is the path where metrics are saved.
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py \
  --enable_fsdp \
  --fsdp_config.fsdp_cpu_offload \
  --model_name $MODEL_PATH \
  --save_metrics \
  --batch_size_training 4 \
  --gradient_accumulation_steps 16 \
  --batching_strategy padding \
  --flop_counter False \
  --profiler False \
  --output_dir model_logs \
  --dist_checkpoint_root_folder model_checkpoints \
  --dist_checkpoint_folder fine-tuned
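
For context, a minimal sketch of what the fsdp_cpu_offload flag usually maps to when the model is wrapped with PyTorch FSDP. This is an illustration of the standard torch.distributed.fsdp API, not the recipe's exact code; the wrap_model helper is hypothetical.

# Hedged sketch: how an fsdp_cpu_offload flag is typically translated into
# PyTorch FSDP arguments. wrap_model is a hypothetical helper, not llama-recipes code.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

def wrap_model(model, fsdp_cpu_offload: bool):
    # With offload_params=True, FSDP keeps the sharded parameters (and their
    # gradients) in CPU memory and copies them to GPU only while each
    # FSDP unit is computing.
    return FSDP(
        model,
        cpu_offload=CPUOffload(offload_params=True) if fsdp_cpu_offload else None,
        device_id=torch.cuda.current_device(),
    )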

Error logs

No errors

Expected behavior

Training w/ and w/o cpu offloading should give the same results.

@HamidShojanazeri (Contributor)

Thanks @hjlee1371 for bringing this to our attention; I believe I can somewhat reproduce your issue.

CPU offload:
avg_train_prep, Value: 1.0844144423802693
avg_train_loss, Value: 0.08095582574605942

No CPU offload:
avg_train_prep, Value: 1.0553292433420818
avg_train_loss, Value: 0.05351858213543892

I am checking with FSDP team on this and will keep you posted.


stgzr commented Jul 3, 2024

I suppose there is a bug in the gradient accumulation implementation.
If the model is wrapped in a DistributedDataParallel module, the gradients are averaged across GPUs when backward is called. So it seems the gradients are being synchronized on every gradient accumulation step, which is a little odd (see the sketch below).
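
For reference, a minimal sketch of the usual way to skip the cross-GPU gradient reduction on intermediate accumulation steps, using the no_sync() context manager that both DistributedDataParallel and FSDP provide. This illustrates the pattern under discussion, not llama-recipes' actual training loop; train_step and its arguments are hypothetical.

# Illustrative gradient-accumulation loop (hypothetical, not the recipe's code).
# model.no_sync() defers gradient synchronization on intermediate micro-steps;
# gradients are only reduced across ranks on the final micro-step's backward.
import contextlib

def train_step(model, micro_batches, optimizer):
    optimizer.zero_grad()
    accum_steps = len(micro_batches)
    for i, batch in enumerate(micro_batches):
        is_last = (i + 1) == accum_steps
        ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with ctx:
            loss = model(**batch).loss / accum_steps  # scale so the sum averages
            loss.backward()
    optimizer.step()

Note that, as I understand it, FSDP's no_sync() accumulates full (unsharded) gradients on each rank, so whether the recipe uses this pattern, and how that interacts with CPU offloading of parameters and gradients, seems worth checking.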

@init27 closed this as completed Nov 19, 2024