I am continuing pretraining CKBERT on my own domain corpus. With a large corpus (12 GB), the machine automatically reboots after training has been running for a while; with a small corpus (2 GB) the problem does not occur.
Watching memory usage during training, I found that memory consumption grows steadily as training progresses and eventually exhausts all available RAM.
Has anyone run into the same problem? Any reply would be greatly appreciated!
Here are my training parameters:
export CUDA_VISIBLE_DEVICES=0,1

gpu_number=2
negative_e_number=4
negative_e_length=16

python -m torch.distributed.launch --nproc_per_node=$gpu_number --master_port=52349 $base_dir/main.py \
  --mode=train \
  --worker_gpu=$gpu_number \
  --tables=$local_train_file, \
  --learning_rate=1e-3 \
  --epoch_num=1 \
  --logging_steps=100 \
  --save_checkpoint_steps=1000 \
  --sequence_length=512 \
  --train_batch_size=4 \
  --checkpoint_dir=$checkpoint_dir \
  --app_name=language_modeling \
  --use_amp \
  --save_all_checkpoints \
  --user_defined_parameters="pretrain_model_name_or_path=alibaba-pai/pai-ck_bert-base-zh external_mask_flag=True contrast_learning_flag=True negative_e_number=${negative_e_number} negative_e_length=${negative_e_length} kg_path=${local_kg}"
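For reference, this is roughly how I tracked the host-RAM growth during training. It is a minimal sketch using psutil; the `log_memory` helper and the call site shown in the comment are my own additions for monitoring, not part of the EasyNLP code.

```python
import os
import psutil

# Handle to the current training process (one per worker under torch.distributed.launch).
_process = psutil.Process(os.getpid())

def log_memory(step: int) -> None:
    """Print the resident set size (RSS) of this process in GB."""
    rss_gb = _process.memory_info().rss / 1024 ** 3
    print(f"step {step}: RSS = {rss_gb:.2f} GB")

# Called at each logging interval inside the training loop, e.g.:
# if step % 100 == 0:
#     log_memory(step)
```

Logged this way, the RSS value keeps climbing over the course of the run with the 12 GB corpus until the machine runs out of memory and reboots.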