"Add bfloat16 data type" causes BERT fp16 model training to crash (#27205)

This issue is caused by #25402.

**System information**
1) PaddlePaddle version: develop, commit ID: 95e1434bb2fc8fd43a519cfa60ae36845a0cf2ef
2) GPU: V100 16G, CUDA 10.1, cuDNN 7.6.5
3) CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
4) OS: Ubuntu 16.04.6
5) Python: 3.7
6) Compile Docker image: paddlepaddle/paddle_manylinux_devel:cuda10.1_cudnn7
7) Runtime Docker image: paddlepaddle/paddle:latest-gpu-cuda10.1-cudnn7

**build command**
  * step 1
``` shell
export CMAKE_BUILD_TYPE=Release
export PYTHON_ABI=cp37-cp37m
export PADDLE_VERSION=0.0.0
export WITH_DOC=OFF
export WITH_AVX=ON
export WITH_GPU=ON
export WITH_TEST=OFF
export RUN_TEST=OFF
export WITH_GOLANG=OFF
export WITH_SWIG_PY=ON
export WITH_PYTHON=ON
export WITH_C_API=OFF
export WITH_STYLE_CHECK=OFF
export WITH_TESTING=OFF
export CMAKE_EXPORT_COMPILE_COMMANDS=ON
export WITH_MKL=ON
export BUILD_TYPE=Release
export WITH_DISTRIBUTE=ON
export WITH_FLUID_ONLY=OFF
export CMAKE_VERBOSE_MAKEFILE=OFF
```
 * step 2
``` shell
bash paddle/script/paddle_build.sh build 
```
**running**
 * model
    [/models/PaddleNLP/pretrain_language_models/BERT/](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/pretrain_language_models/BERT)
 * command
``` shell
CUDA_VISIBLE_DEVICES=1 python -u run_classifier.py \
    --task_name mnli \
    --use_cuda true \
    --do_train true \
    --do_val False \
    --do_test False \
    --batch_size 8 \
    --in_tokens False \
    --init_pretraining_params /path/to/uncased_L-24_H-1024_A-16/params \
    --data_dir /path/to/MNLI \
    --vocab_path /path/to/uncased_L-24_H-1024_A-16/vocab.txt \
    --checkpoints /path/to/save \
    --save_steps 1000 \
    --weight_decay 0.01 \
    --warmup_proportion 0.1 \
    --validation_steps 1000 \
    --epoch 2 \
    --is_profiler=0 \
    --max_iter=1500 \
    --max_seq_len 128 \
    --bert_config_path /path/to/uncased_L-24_H-1024_A-16/bert_config.json \
    --learning_rate 5e-5 \
    --skip_steps 100 \
    --random_seed 1
```

**error message**
``` shell
Cast parameters to float16 data format.
Traceback (most recent call last):
  File "run_classifier.py", line 451, in <module>
    main(args)
  File "run_classifier.py", line 290, in main
    use_fp16=args.use_fp16)
  File "/models/PaddleNLP/pretrain_language_models/BERT/utils/init.py", line 60, in init_pretraining_params
    cast_fp32_to_fp16(exe, main_program)
  File "/models/PaddleNLP/pretrain_language_models/BERT/utils/init.py", line 33, in cast_fp32_to_fp16
    param_t.set(np.float16(data).view(np.uint16), exe.place)
paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
1   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
2   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
ExternalError:  Cuda error(1), invalid argument.
  [Advise: The device function being invoked (usually via cudaLaunchKernel()) was not previously configured via the cudaConfigureCall() function.] (at /paddle/paddle/fluid/platform/gpu_info.cc:291)
```
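
For context, the failing line in the traceback (`param_t.set(np.float16(data).view(np.uint16), exe.place)`) hands the fp16 parameters to the tensor as a `uint16` view of the raw bits. The NumPy-only sketch below shows what that view contains and how the same values look as bfloat16 bits. It does not call Paddle and is not a confirmed diagnosis. The helper names `fp32_to_fp16_bits` and `fp32_to_bf16_bits` are hypothetical, and the bfloat16 conversion uses plain truncation of the float32 bits, which may differ from the rounding Paddle uses.

``` python
import numpy as np


def fp32_to_fp16_bits(data):
    """Cast to float16 and reinterpret the raw bits as uint16 (as in the traceback)."""
    return np.float16(data).view(np.uint16)


def fp32_to_bf16_bits(data):
    """Illustrative bfloat16 encoding: keep the upper 16 bits of each float32 (truncation)."""
    as_u32 = np.asarray(data, dtype=np.float32).view(np.uint32)
    return (as_u32 >> 16).astype(np.uint16)


data = np.array([1.0, -2.0, 0.5], dtype=np.float32)
fp16_bits = fp32_to_fp16_bits(data)
bf16_bits = fp32_to_bf16_bits(data)

# Both arrays have dtype uint16, so the dtype alone cannot tell a consumer
# whether the payload is float16 or bfloat16.
print([hex(b) for b in fp16_bits])  # float16 bit patterns
print([hex(b) for b in bf16_bits])  # bfloat16 bit patterns
```

Because both encodings travel as `uint16`, code that receives such an array has to decide how to interpret it. Whether #25402 changed that interpretation for `param_t.set` is an assumption to verify against the PR.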
