
Error when run with multiple GPUs #4

Open
goodbai-nlp opened this issue Sep 19, 2019 · 1 comment

@goodbai-nlp

Hi,

I often run into the following error when starting multi-GPU training.

Traceback (most recent call last):
  File "train.py", line 116, in <module>
    main(opt)
  File "train.py", line 44, in main
    p.join()
  File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
TypeError: signal_handler() takes 1 positional argument but 3 were given

The parameters I used are:

CUDA_VISIBLE_DEVICES=0,1 python3 train.py \
                        -data $data_prefix \
                        -save_model $model_dir \
                        -world_size 2 \
                        -gpu_ranks 0 1 \
                        -save_checkpoint_steps 5000 \
                        -valid_steps 5000 \
                        -report_every 20 \
                        -keep_checkpoint 50 \
                        -seed 3435 \
                        -train_steps 300000 \
                        -warmup_steps 16000 \
                        --share_decoder_embeddings \
                        -share_embeddings \
                        --position_encoding \
                        --optim adam \
                        -adam_beta1 0.9 \
                        -adam_beta2 0.98 \
                        -decay_method noam \
                        -learning_rate 0.5 \
                        -max_grad_norm 0.0 \
                        -batch_size 4096 \
                        -batch_type tokens \
                        -normalization tokens \
                        -dropout 0.3 \
                        -label_smoothing 0.1 \
                        -max_generator_batches 100 \
                        -param_init 0.0 \
                        -param_init_glorot \
                        -valid_batch_size 8

I got this error on Ubuntu 16.04, Python 3.6, PyTorch 1.0.1. Can someone help me understand the cause? I would really appreciate your help, thank you!

@Amazing-J
Owner

Aha, I see. This is a bug in our code when running on multiple GPUs. There is a "def signal_handler()" function in "train.py" that you need to change to "def signal_handler(self, signalnum, stackframe)". We normally use a single GPU for training.
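To add some context on why that signature is needed (a minimal sketch, not the project's exact code; the ErrorHandler class name and the use of SIGUSR1 below are illustrative assumptions): signal.signal() invokes the registered handler with two positional arguments, the signal number and the current stack frame, and a bound method additionally receives self. A handler defined as "def signal_handler(self)" therefore fails with exactly the TypeError shown in the traceback once a child process raises the signal.

import signal

class ErrorHandler:
    """Minimal sketch: register a bound method as a signal handler."""

    def __init__(self):
        # signal.signal() calls the handler as handler(signalnum, stackframe);
        # because this is a bound method, `self` is passed implicitly as well,
        # so the method must accept three parameters in total.
        signal.signal(signal.SIGUSR1, self.signal_handler)

    # Buggy version: def signal_handler(self)
    #   -> TypeError: signal_handler() takes 1 positional argument but 3 were given
    # Fixed version:
    def signal_handler(self, signalnum, stackframe):
        raise Exception("A child training process terminated unexpectedly.")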
