Distributed training for transformer core dumps #11387
Comments
Core dump info with Paddle version 19fd071
More debug info with Paddle version c36dd3b
Checked offline: the core dump was caused by setting the wrong trainer_id.
Paddle version: 19fd071
Transformer training script: PaddlePaddle/models#982
Command to reproduce the core dump:
python -u train.py --src_vocab_fpath /paddle/dataset/nist06n/cn_30001.dict --trg_vocab_fpath /paddle/dataset/nist06n/en_30001.dict --train_file_pattern '/paddle/train/part-*' --use_token_batch True --batch_size 1024 --pool_size 10000 --shuffle True --shuffle_batch True --sort_type pool --special_token '_GO' '_EOS' '_UNK'
Environment variables for the pserver:
export PADDLE_PSERVERS=127.0.0.1
export POD_IP=127.0.0.1
export PADDLE_TRAINERS_NUM=2
export PADDLE_TRAINER_ID=0
export TRAINING_ROLE=PSERVER
export PADDLE_IS_LOCAL=0
export PADDLE_PORT=6176
Environment variables for the two trainers (one block per trainer, each starting with CUDA_VISIBLE_DEVICES):
export CUDA_VISIBLE_DEVICES=3
export TRAINING_ROLE=TRAINER
export PADDLE_PSERVERS=127.0.0.1
export POD_IP=127.0.0.1
export PADDLE_TRAINERS_NUM=2
export PADDLE_TRAINER_ID=1
export PADDLE_IS_LOCAL=0
export PADDLE_PORT=6176
export CUDA_VISIBLE_DEVICES=2
export TRAINING_ROLE=TRAINER
export PADDLE_PSERVERS=127.0.0.1
export POD_IP=127.0.0.1
export PADDLE_TRAINERS_NUM=2
export PADDLE_TRAINER_ID=2
export PADDLE_IS_LOCAL=0
export PADDLE_PORT=6176
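The diagnosis in the comments (a wrong trainer_id) fits the settings above: with PADDLE_TRAINERS_NUM=2, the two trainers use PADDLE_TRAINER_ID=1 and 2. Assuming trainer IDs are zero-based and must be smaller than PADDLE_TRAINERS_NUM (the issue does not state this explicitly), a pre-flight check along these lines could catch the misconfiguration before training starts. This is only a sketch; the function name check_trainer_id is hypothetical and not part of Paddle.

```python
import os


def check_trainer_id(env=None):
    """Return the trainer id if it lies in [0, PADDLE_TRAINERS_NUM).

    Assumption (not confirmed by the issue): trainer ids are zero-based,
    so valid values for N trainers are 0 .. N-1.
    """
    env = os.environ if env is None else env
    trainers = int(env["PADDLE_TRAINERS_NUM"])
    trainer_id = int(env["PADDLE_TRAINER_ID"])
    if not 0 <= trainer_id < trainers:
        raise ValueError(
            "PADDLE_TRAINER_ID=%d is out of range for PADDLE_TRAINERS_NUM=%d; "
            "expected a value from 0 to %d" % (trainer_id, trainers, trainers - 1)
        )
    return trainer_id
```

Calling check_trainer_id() at the top of the trainer branch in train.py would reject the second trainer's settings shown above (ID 2 with 2 trainers). Under the same assumption, the two trainers would use IDs 0 and 1.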