"Add bfloat16 data type" causes BERT fp16 model training to crash #27205

@hysunflower

Description

This issue is caused by #25402.

System information
1) PaddlePaddle version: develop, commit ID: 95e1434bb2fc8fd43a519cfa60ae36845a0cf2ef
2) GPU: V100 16G, CUDA 10.1, cuDNN 7.6.5
3) CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
4) OS: Ubuntu 16.04.6
5) Python 3.7
6) Compile docker image: paddlepaddle/paddle_manylinux_devel:cuda10.1_cudnn7
7) Runtime docker image: paddlepaddle/paddle:latest-gpu-cuda10.1-cudnn7

Build command

  • step 1
export CMAKE_BUILD_TYPE=Release
export PYTHON_ABI=cp37-cp37m
export PADDLE_VERSION=0.0.0
export WITH_DOC=OFF
export WITH_AVX=ON
export WITH_GPU=ON
export WITH_TEST=OFF
export RUN_TEST=OFF
export WITH_GOLANG=OFF
export WITH_SWIG_PY=ON
export WITH_PYTHON=ON
export WITH_C_API=OFF
export WITH_STYLE_CHECK=OFF
export WITH_TESTING=OFF
export CMAKE_EXPORT_COMPILE_COMMANDS=ON
export WITH_MKL=ON
export BUILD_TYPE=Release
export WITH_DISTRIBUTE=ON
export WITH_FLUID_ONLY=OFF
export CMAKE_VERBOSE_MAKEFILE=OFF
  • step 2
bash paddle/script/paddle_build.sh build 

Run command

CUDA_VISIBLE_DEVICES=1 python -u run_classifier.py \
    --task_name mnli \
    --use_cuda true \
    --do_train true \
    --do_val False \
    --do_test False \
    --batch_size 8 \
    --in_tokens False \
    --init_pretraining_params /path/to/uncased_L-24_H-1024_A-16/params \
    --data_dir /path/to/MNLI \
    --vocab_path /path/to/uncased_L-24_H-1024_A-16/vocab.txt \
    --checkpoints /path/to/save \
    --save_steps 1000 \
    --weight_decay 0.01 \
    --warmup_proportion 0.1 \
    --validation_steps 1000 \
    --epoch 2 \
    --is_profiler=0 \
    --max_iter=1500 \
    --max_seq_len 128 \
    --bert_config_path /path/to/uncased_L-24_H-1024_A-16/bert_config.json \
    --learning_rate 5e-5 \
    --skip_steps 100 \
    --random_seed 1

Error message

Cast parameters to float16 data format.
Traceback (most recent call last):
  File "run_classifier.py", line 451, in <module>
    main(args)
  File "run_classifier.py", line 290, in main
    use_fp16=args.use_fp16)
  File "/models/PaddleNLP/pretrain_language_models/BERT/utils/init.py", line 60, in init_pretraining_params
    cast_fp32_to_fp16(exe, main_program)
  File "/models/PaddleNLP/pretrain_language_models/BERT/utils/init.py", line 33, in cast_fp32_to_fp16
    param_t.set(np.float16(data).view(np.uint16), exe.place)
paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
1   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
2   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
ExternalError:  Cuda error(1), invalid argument.
  [Advise: The device function being invoked (usually via cudaLaunchKernel()) was not previously configured via the cudaConfigureCall() function.] (at /paddle/paddle/fluid/platform/gpu_info.cc:291)
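
For context on the failing line: `cast_fp32_to_fp16` converts each parameter with `np.float16(data).view(np.uint16)`, so the tensor receives the raw 16-bit patterns of the half-precision values rather than floats. The sketch below illustrates that reinterpretation using only the Python standard library. The helper names `fp16_bits` and `bits_to_fp16` are hypothetical, and the snippet neither touches Paddle nor reproduces the CUDA error; it only shows what kind of data the `set` call is given.

```python
import struct


def fp16_bits(value: float) -> int:
    """Round `value` to IEEE 754 half precision and return its raw 16-bit pattern.

    Hypothetical helper; mirrors np.float16(x).view(np.uint16) for one scalar.
    """
    packed = struct.pack("<e", value)       # "e" = half-precision float
    return struct.unpack("<H", packed)[0]   # "H" = unsigned 16-bit integer


def bits_to_fp16(bits: int) -> float:
    """Inverse of fp16_bits: reinterpret a 16-bit pattern as a half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


# Illustrative values only, not taken from the BERT checkpoint.
weights = [1.0, -2.0, 0.5]
raw = [fp16_bits(w) for w in weights]
print([hex(b) for b in raw])
```

The integers in `raw` carry no numeric meaning on their own; they decode back to the original values only if the receiving side treats the buffer as float16.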
