When using FSDP CPU offload with --fsdp_config.fsdp_cpu_offload, the loss fails to converge, unlike when offloading is disabled.
My experiments are based on a forked codebase with minimal modifications (https://github.com/hjlee1371/llama-recipes), run with the following script. I fine-tuned llama-2-7b-hf on the (default) samsum dataset. This may be related to similar issues such as this one.
# Remove --fsdp_config.fsdp_cpu_offload to train without offloading;
# --output_dir is the path where metrics are saved.
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py \
    --enable_fsdp \
    --fsdp_config.fsdp_cpu_offload \
    --model_name $MODEL_PATH \
    --save_metrics \
    --batch_size_training 4 \
    --gradient_accumulation_steps 16 \
    --batching_strategy padding \
    --flop_counter False \
    --profiler False \
    --output_dir model_logs \
    --dist_checkpoint_root_folder model_checkpoints \
    --dist_checkpoint_folder fine-tuned
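For context, --fsdp_config.fsdp_cpu_offload maps onto PyTorch FSDP's CPU offload option. Below is a minimal sketch of that wrapping, assuming the torch.distributed.fsdp API; the helper name wrap_model is illustrative and not the actual llama-recipes code.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import CPUOffload

def wrap_model(model: torch.nn.Module, cpu_offload_enabled: bool):
    # With offloading enabled, FSDP keeps sharded parameters (and their
    # gradients) on CPU and copies them to GPU only for compute.
    cpu_offload = CPUOffload(offload_params=True) if cpu_offload_enabled else None
    return FSDP(model, cpu_offload=cpu_offload)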
Error logs
No errors
Expected behavior
Training with and without CPU offloading should give the same results.
I suspect there is a bug in the gradient accumulation implementation.
When the model is wrapped in a DistributedDataParallel module, calling backward averages the gradients across GPUs. It looks like the gradients are being synchronized on every gradient accumulation step rather than only on the last one, which is a little odd.
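For reference, the usual way to avoid that per-micro-batch synchronization is the no_sync() context manager exposed by both DistributedDataParallel and FSDP. Here is a minimal sketch of a gradient accumulation step using it; get_batches, accum_steps, and the loss handling are illustrative and not the actual llama-recipes training loop.

from contextlib import nullcontext

def accumulation_step(model, optimizer, get_batches, accum_steps):
    optimizer.zero_grad()
    for i, batch in enumerate(get_batches(accum_steps)):
        is_last = (i == accum_steps - 1)
        # Skip the cross-GPU gradient reduction on all but the last
        # micro-batch; the reduction then happens once, in the final backward.
        ctx = nullcontext() if is_last else model.no_sync()
        with ctx:
            loss = model(**batch).loss / accum_steps
            loss.backward()
    optimizer.step()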
System Info
pytorch==2.2.0
transformers==4.36.2
8x A100 80GB GPUs