Loss does not converge with FSDP cpu offloading #360

Closed · hjlee1371 opened this issue Jan 29, 2024 · 2 comments

hjlee1371 commented Jan 29, 2024

System Info

pytorch==2.2.0
transformers==4.36.2
8 A100 80GB gpus

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

When using FSDP cpu offload with --fsdp_config.fsdp_cpu_offload, the loss fails to converge, unlike when offloading is disabled.

My experiments are based on a forked codebase with minimal modifications (https://github.com/hjlee1371/llama-recipes), run with the following script. I finetuned llama-2-7b-hf on the (default) samsum dataset. This may be related to similar issues such as this one.

# Remove --fsdp_config.fsdp_cpu_offload when training without offloading;
# --output_dir is the path where metrics are saved.
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py \
  --enable_fsdp \
  --fsdp_config.fsdp_cpu_offload \
  --model_name $MODEL_PATH \
  --save_metrics \
  --batch_size_training 4 \
  --gradient_accumulation_steps 16 \
  --batching_strategy padding \
  --flop_counter False \
  --profiler False \
  --output_dir model_logs \
  --dist_checkpoint_root_folder model_checkpoints \
  --dist_checkpoint_folder fine-tuned
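
For context, a minimal sketch of what the fsdp_cpu_offload flag usually maps to when the model is wrapped with PyTorch FSDP. This is an illustration of the standard torch.distributed.fsdp API, not the recipe's exact code; the wrap_model helper is hypothetical.

# Hedged sketch: how an fsdp_cpu_offload flag is typically translated into
# PyTorch FSDP arguments. wrap_model is a hypothetical helper, not llama-recipes code.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

def wrap_model(model, fsdp_cpu_offload: bool):
    # With offload_params=True, FSDP keeps the sharded parameters (and their
    # gradients) in CPU memory and copies them to GPU only while each
    # FSDP unit is computing.
    return FSDP(
        model,
        cpu_offload=CPUOffload(offload_params=True) if fsdp_cpu_offload else None,
        device_id=torch.cuda.current_device(),
    )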

Error logs

No errors

Expected behavior

Training w/ and w/o cpu offloading should give the same results.

@HamidShojanazeri (Contributor)

Thanks @hjlee1371 for bringing this to our attention; I believe I can somewhat reproduce your issue.

CPU offload:
avg_train_prep, Value: 1.0844144423802693
avg_train_loss, Value: 0.08095582574605942

No CPU offload:
avg_train_prep, Value: 1.0553292433420818
avg_train_loss, Value: 0.05351858213543892

I am checking with FSDP team on this and will keep you posted.


stgzr commented Jul 3, 2024

I suppose there is a bug in the gradient accumulation implementation.
If the model is wrapped in a DistributedDataParallel module, the gradients are averaged across GPUs when backward is called. So it seems the gradients are being synchronized on every gradient accumulation step, which is a little odd (see the sketch below).
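
For reference, a minimal sketch of the usual way to skip the cross-GPU gradient reduction on intermediate accumulation steps, using the no_sync() context manager that both DistributedDataParallel and FSDP provide. This illustrates the pattern under discussion, not llama-recipes' actual training loop; train_step and its arguments are hypothetical.

# Illustrative gradient-accumulation loop (hypothetical, not the recipe's code).
# model.no_sync() defers gradient synchronization on intermediate micro-steps;
# gradients are only reduced across ranks on the final micro-step's backward.
import contextlib

def train_step(model, micro_batches, optimizer):
    optimizer.zero_grad()
    accum_steps = len(micro_batches)
    for i, batch in enumerate(micro_batches):
        is_last = (i + 1) == accum_steps
        ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with ctx:
            loss = model(**batch).loss / accum_steps  # scale so the sum averages
            loss.backward()
    optimizer.step()

Note that, as I understand it, FSDP's no_sync() accumulates full (unsharded) gradients on each rank, so whether the recipe uses this pattern, and how that interacts with CPU offloading of parameters and gradients, seems worth checking.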

@init27 closed this as completed Nov 19, 2024