I am continuing pretraining CKBERT on my own domain corpus. With a large corpus (12 GB), the machine automatically reboots after training has been running for a while; with a small corpus (2 GB) the problem does not occur.
Watching memory usage during training, I found that memory consumption grows steadily as training progresses and eventually exhausts all available RAM.
Has anyone run into the same problem? Any reply would be greatly appreciated!
Here are my training parameters:
export CUDA_VISIBLE_DEVICES=0,1

gpu_number=2
negative_e_number=4
negative_e_length=16

python -m torch.distributed.launch --nproc_per_node=$gpu_number --master_port=52349 $base_dir/main.py \
  --mode=train \
  --worker_gpu=$gpu_number \
  --tables=$local_train_file, \
  --learning_rate=1e-3 \
  --epoch_num=1 \
  --logging_steps=100 \
  --save_checkpoint_steps=1000 \
  --sequence_length=512 \
  --train_batch_size=4 \
  --checkpoint_dir=$checkpoint_dir \
  --app_name=language_modeling \
  --use_amp \
  --save_all_checkpoints \
  --user_defined_parameters="pretrain_model_name_or_path=alibaba-pai/pai-ck_bert-base-zh external_mask_flag=True contrast_learning_flag=True negative_e_number=${negative_e_number} negative_e_length=${negative_e_length} kg_path=${local_kg}"
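For reference, this is roughly how I tracked the host-RAM growth during training. It is a minimal sketch using psutil; the `log_memory` helper and the call site shown in the comment are my own additions for monitoring, not part of the EasyNLP code.

```python
import os
import psutil

# Handle to the current training process (one per worker under torch.distributed.launch).
_process = psutil.Process(os.getpid())

def log_memory(step: int) -> None:
    """Print the resident set size (RSS) of this process in GB."""
    rss_gb = _process.memory_info().rss / 1024 ** 3
    print(f"step {step}: RSS = {rss_gb:.2f} GB")

# Called at each logging interval inside the training loop, e.g.:
# if step % 100 == 0:
#     log_memory(step)
```

Logged this way, the RSS value keeps climbing over the course of the run with the 12 GB corpus until the machine runs out of memory and reboots.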