Description
Hi DeepSpeed team,
I am running DeepSpeedExamples/BingBertSquad on a machine with 2 GPUs. I followed the instructions at https://www.deepspeed.ai/tutorials/bert-finetuning/ and can reproduce the results when I run run_squad_baseline.sh.
However, after I changed the deepspeed_bsz24_config.json file, I got the following warnings and the loss was always 'loss=nan'. I get the same result with the original config file.
[INFO] [deepspeed_utils.py:118:_handle_overflow] rank 0 detected overflow nan in tensor 0:0 shape torch.Size([30528, 1024]) | 3/29324 [00:00<2:33:39, 3.18it/s]
[2020-08-20 14:38:13,808] [INFO] [zero_optimizer_stage1.py:621:step] [deepspeed] OVERFLOW! Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
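For reference, the log shows the loss scale starting at 4294967296 (2^32) and being halved after the overflow. Below is a minimal sketch of that halving arithmetic only. The function name and the `min_scale` floor are hypothetical and are not DeepSpeed's API; the sketch assumes the scale is divided by 2 on every overflowed step.

```python
def halve_on_overflow(scale: float, overflow_steps: int, min_scale: float = 1.0) -> float:
    """Return the loss scale after `overflow_steps` consecutive overflowed steps.

    Assumption: each overflow halves the scale, and the scale never drops
    below `min_scale` (a hypothetical floor chosen for this illustration).
    """
    for _ in range(overflow_steps):
        scale = max(scale / 2.0, min_scale)
    return scale


initial_scale = 2.0 ** 32  # 4294967296, the "attempted loss scale" in the log

# One overflow reproduces the value reported in the log line.
print(halve_on_overflow(initial_scale, 1))  # 2147483648.0
```

Under this assumption, a scale that starts at 2^32 needs many skipped steps before it reaches a range where fp16 gradients stop overflowing, so a few OVERFLOW messages at the start of training are expected. A loss that stays NaN is a separate symptom.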
The config file looks like this:
{
  "train_batch_size": 12,
  "train_micro_batch_size_per_gpu": 3,
  "steps_per_print": 10,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-5,
      "weight_decay": 0.0,
      "bias_correction": false
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
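As a sanity check on the batch settings, here is a small sketch that works out the gradient accumulation steps implied by this config on 2 GPUs. It relies on the relation train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs. The helper name `implied_grad_accum_steps` is hypothetical and is not part of DeepSpeed.

```python
import json

# The batch-related and fp16 fields from the config above.
CONFIG_TEXT = """
{
  "train_batch_size": 12,
  "train_micro_batch_size_per_gpu": 3,
  "fp16": {"enabled": true},
  "zero_optimization": {"stage": 1}
}
"""


def implied_grad_accum_steps(config: dict, world_size: int) -> int:
    """Return the gradient accumulation steps implied by the batch settings.

    Raises ValueError if the total batch size is not a multiple of
    micro-batch size times the number of GPUs.
    """
    total = config["train_batch_size"]
    micro = config["train_micro_batch_size_per_gpu"]
    per_step = micro * world_size  # samples processed per forward/backward pass
    if total % per_step != 0:
        raise ValueError(
            f"train_batch_size={total} is not divisible by "
            f"micro_batch * world_size = {per_step}"
        )
    return total // per_step


config = json.loads(CONFIG_TEXT)
print(implied_grad_accum_steps(config, world_size=2))  # 2
```

With 2 GPUs the settings are consistent (12 = 3 × 2 × 2), so the batch sizes themselves do not look like the cause of the NaN loss.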
Could you help me fix it?
Thanks!
Tony