Warning: NaN or Inf found in input tensor when running DeepSpeedExamples/BingBertSquad. #324

@TonyTangYu

Hi DeepSpeed team,

I am running DeepSpeedExamples/BingBertSquad on a machine with 2 GPUs. I followed the tutorial at https://www.deepspeed.ai/tutorials/bert-finetuning/ and can reproduce the expected results when I run run_squad_baseline.sh.

However, after I changed the deepspeed_bsz24_config.json file, training gave me the following warning and the loss was always 'loss=nan'. The original, unmodified config file gives the same result.

[INFO] [deepspeed_utils.py:118:_handle_overflow] rank 0 detected overflow nan in tensor 0:0 shape torch.Size([30528, 1024]) | 3/29324 [00:00<2:33:39, 3.18it/s]
[2020-08-20 14:38:13,808] [INFO] [zero_optimizer_stage1.py:621:step] [deepspeed] OVERFLOW! Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
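
For context, the log above shows the loss scale going from 4294967296 (2^32) to 2147483648.0 (2^31) after one overflow, which is consistent with a dynamic loss scaler that halves the scale whenever a step overflows. The following is a minimal sketch of that back-off, assuming a simplified scaler; `simulate_overflow_backoff` is a hypothetical helper for illustration and is not DeepSpeed's implementation.

```python
def simulate_overflow_backoff(initial_scale, target_scale, factor=2.0):
    """Count consecutive overflowing steps needed to shrink the scale to target_scale.

    Simplified model (an assumption): every overflow skips the optimizer step
    and divides the loss scale by `factor`.
    """
    scale = float(initial_scale)
    skipped = 0
    while scale > target_scale:
        scale /= factor  # overflow detected: skip the step, reduce the scale
        skipped += 1
    return skipped, scale


# Starting from the scale in the log (2**32), how many skipped steps
# would it take to reach 2**16 if every step overflowed?
skipped, scale = simulate_overflow_backoff(2 ** 32, 2 ** 16)
print(skipped, scale)  # 16 65536.0
```

Under this model a handful of skipped steps at the start of fp16 training is expected; a loss that stays at nan is a separate symptom that this sketch does not explain.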

The config file is like this:

{
  "train_batch_size": 12,
  "train_micro_batch_size_per_gpu": 3,
  "steps_per_print": 10,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-5,
      "weight_decay": 0.0,
      "bias_correction": false
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
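
As a sanity check on the batch settings above, DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu times the gradient accumulation steps times the number of GPUs. The sketch below works that arithmetic out for the 2-GPU setup described in this issue; `gradient_accumulation_steps` is a hypothetical helper written for illustration, not a DeepSpeed function.

```python
def gradient_accumulation_steps(train_batch_size, micro_batch_per_gpu, num_gpus):
    """Derive the accumulation steps implied by the batch settings.

    Uses the relation:
    train_batch_size = micro_batch_per_gpu * accumulation_steps * num_gpus
    """
    samples_per_step = micro_batch_per_gpu * num_gpus
    if train_batch_size % samples_per_step != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"micro_batch_per_gpu * num_gpus = {samples_per_step}"
        )
    return train_batch_size // samples_per_step


# Values taken from the config above; num_gpus=2 matches the machine described.
config = {"train_batch_size": 12, "train_micro_batch_size_per_gpu": 3}
steps = gradient_accumulation_steps(
    config["train_batch_size"],
    config["train_micro_batch_size_per_gpu"],
    num_gpus=2,
)
print(steps)  # 2
```

With these values the batch settings are consistent (2 accumulation steps), so they do not by themselves look like the source of the nan loss.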

Could you help me fix it?
Thanks!

Tony
