Describe the bug
When training CIFAR10 with BF16, following deepspeedai/DeepSpeedExamples#651, I encounter an accuracy loss under the following conditions:
- BF16 training
- ZeRO stage 0 or 1
- world_size >= 2
To Reproduce
Steps to reproduce the behavior:
- Apply PR "Enable non-CUDA device for CIFAR10 and HelloDeepSpeed training example" (DeepSpeedExamples#651)
- Go to DeepSpeedExamples/training/cifar/
- Edit ds_config.json to change 'fp16' to 'bf16'
- Run run_ds.sh with multiple accelerators on the system
- Observe accuracy loss with multiple ranks: with 2 ranks accuracy is 52%, well below the 57% of a single rank; with 8 ranks it drops to around 43%
- The 2-rank result was observed on two CUDA cards; the 8-rank result on a WIP CPU training branch. It should also be reproducible on an 8-CUDA-card system
- When the data type is changed to fp32 (on the CPU system), there is no accuracy issue with 2 or 8 ranks. With ZeRO stage 2 there is no accuracy issue either
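For reference, the change in step 3 amounts to replacing the fp16 section of ds_config.json with a bf16 one. A minimal sketch (the `bf16.enabled` key is DeepSpeed's documented switch; the other values shown are illustrative, not the exact contents of the example's config):

```json
{
  "train_batch_size": 16,
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 0
  }
}
```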
Expected behavior
BF16 multi-rank training accuracy should match single-rank accuracy.
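One plausible mechanism for the multi-rank gap, sketched below as an assumption rather than a confirmed root cause: bfloat16 keeps only 8 mantissa bits, so if the cross-rank gradient reduction is performed (or its result stored) in bf16, small per-rank differences are rounded away, and the error grows with the number of ranks. The helper names `to_bf16` and `allreduce_mean` are hypothetical, and `to_bf16` truncates rather than using bf16's round-to-nearest-even, which is a simplification:

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating a float32 to its upper 16 bits.
    (Simplification: real bf16 hardware rounds to nearest even.)"""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def allreduce_mean(grads, cast):
    """Average one gradient value across ranks, applying `cast` to each
    partial sum to mimic a reduction carried out in that precision."""
    acc = 0.0
    for g in grads:
        acc = cast(acc + cast(g))
    return cast(acc / len(grads))

# Hypothetical per-rank gradients for a single parameter (8 ranks).
grads = [0.1, 0.1001, 0.0999, 0.1002, 0.0998, 0.1, 0.1001, 0.0999]

exact = sum(grads) / len(grads)
fp32_mean = allreduce_mean(grads, lambda v: v)   # full-precision reduce
bf16_mean = allreduce_mean(grads, to_bf16)       # low-precision reduce

print(f"exact mean  : {exact:.6f}")
print(f"fp32 reduce : {fp32_mean:.6f}  error {abs(fp32_mean - exact):.2e}")
print(f"bf16 reduce : {bf16_mean:.6f}  error {abs(bf16_mean - exact):.2e}")
```

Running this shows the bf16-path mean deviating from the exact mean by roughly the bf16 quantization step near 0.1 (about 4e-4), while the full-precision path matches it, which is consistent with fp32 and ZeRO stage 2 (which partitions rather than replicates the reduction state) not showing the issue.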
ds_report output
[2023-07-18 12:16:47,983] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2023-07-18 12:16:48,129] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/gma/anaconda3/envs/dscpu/lib/python3.11/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/gma/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.4+046afced, 046afced, master
deepspeed wheel compiled w. ...... torch 2.0
Screenshots
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
GroundTruth: cat ship ship plane
GroundTruth: cat ship ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
Predicted: cat plane ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
GroundTruth: cat ship ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
Predicted: cat plane ship plane
Predicted: cat plane ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
System info (please complete the following information):
- OS: Linux 6.4.0-rc2-2023-05-17-intel-next+ #1 SMP PREEMPT_DYNAMIC Wed May 17 15:36:48 PDT 2023 x86_64 GNU/Linux
- Hardware: 1 machine with 2 RTX 3090 cards / 1 machine with 2 SPR 48-core CPUs configured with SNC4
- Python version: 3.11.3
Launcher context
With DeepSpeed launcher
Docker context
No.