ValueError while using --optimize_on_cpu #23

Closed

rsanjaykamath opened this issue Nov 15, 2018 · 3 comments
@rsanjaykamath

Traceback (most recent call last):
File "./run_squad.py", line 990, in <module>
main()
File "./run_squad.py", line 922, in main
is_nan = set_optimizer_params_grad(param_optimizer, model.named_parameters(), test_nan=True)
File "./run_squad.py", line 691, in set_optimizer_params_grad
if test_nan and torch.isnan(param_model.grad).sum() > 0:
File "/people/sanjay/anaconda2/envs/bert_pytorch/lib/python3.5/site-packages/torch/functional.py", line 289, in isnan
raise ValueError("The argument is not a tensor", str(tensor))
ValueError: ('The argument is not a tensor', 'None')

Command:
CUDA_VISIBLE_DEVICES=0 python ./run_squad.py \
  --vocab_file bert_large/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file bert_large/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint bert_large/uncased_L-24_H-1024_A-16/pytorch_model.bin \
  --do_lower_case \
  --do_train \
  --do_predict \
  --train_file squad_dir/train-v1.1.json \
  --predict_file squad_dir/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir outputs \
  --train_batch_size 4 \
  --gradient_accumulation_steps 2 \
  --optimize_on_cpu

The error appears only when using --optimize_on_cpu; training works fine without the argument.

GPU: a single Nvidia GTX 1080 Ti.

PS: I can only fit a train_batch_size of 4 in the memory of a single GPU.

@thomwolf
Member

thomwolf commented Nov 15, 2018

Thanks! I pushed a fix for that; you can try it again. You should also be able to increase the batch size a bit.
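For reference, the ValueError comes from passing a .grad that is still None to torch.isnan. A minimal sketch of a None-safe version of the check in set_optimizer_params_grad is shown below; it is illustrative only and assumes the function's general structure in run_squad.py, not necessarily the exact commit that was pushed:

```python
import torch

def set_optimizer_params_grad(named_params_optimizer, named_params_model, test_nan=False):
    """Copy gradients from the (GPU) model parameters to the CPU copies used by the optimizer.

    Sketch of a None-safe variant of the check from the traceback: parameters whose
    .grad is still None are skipped instead of being passed to torch.isnan.
    """
    is_nan = False
    for (name_opti, param_opti), (name_model, param_model) in zip(named_params_optimizer,
                                                                   named_params_model):
        if name_opti != name_model:
            raise ValueError("name_opti != name_model: {} {}".format(name_opti, name_model))
        if param_model.grad is None:
            # No gradient was computed for this parameter; nothing to test or copy.
            param_opti.grad = None
            continue
        if test_nan and torch.isnan(param_model.grad).sum() > 0:
            is_nan = True
        if param_opti.grad is None:
            param_opti.grad = torch.empty_like(param_opti.data)
        param_opti.grad.data.copy_(param_model.grad.data)
    return is_nan
```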

By the way, the real batch size used on the GPU is train_batch_size / gradient_accumulation_steps, so 2 in your case. I think you should be able to go to 3 with --optimize_on_cpu (see the sketch below).

The recommended batch_size to get good results (EM, F1) with BERT-large on SQuAD is 24. You can try the following possibilities to reach this batch_size:

  • keeping the same 'real batch size' you currently have but a bigger overall batch size: --train_batch_size 24 --gradient_accumulation_steps 12
  • trying a 'real batch size' of 3 with optimization on CPU: --train_batch_size 24 --gradient_accumulation_steps 8 --optimize_on_cpu
  • switching to fp16 (implies optimization on CPU): --train_batch_size 24 --gradient_accumulation_steps 6 (or 4) --fp16

If your GPU supports fp16, the last solution should be the fastest; otherwise the second should be the fastest. The first solution works out of the box and gives better results (EM, F1), but you won't get any speed-up.
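To make the batch-size arithmetic concrete, here is a minimal, self-contained sketch of gradient accumulation using the numbers from the second option; the tiny nn.Linear model, SGD optimizer, and random data are placeholders, not the actual run_squad.py objects:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

train_batch_size = 24                          # effective batch size the optimizer sees
gradient_accumulation_steps = 8
per_gpu_batch_size = train_batch_size // gradient_accumulation_steps  # 3 examples per forward pass

model = nn.Linear(10, 1)                       # stand-in for the BERT model
optimizer = torch.optim.SGD(model.parameters(), lr=3e-5)
data, targets = torch.randn(240, 10), torch.randn(240, 1)   # dummy dataset
loader = DataLoader(TensorDataset(data, targets), batch_size=per_gpu_batch_size)

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale the loss so the accumulated gradient averages over train_batch_size examples.
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                       # one parameter update per 24 examples
        optimizer.zero_grad()
```

Only per_gpu_batch_size examples ever sit on the GPU at once, which is why a larger --train_batch_size fits as long as --gradient_accumulation_steps scales with it.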

@thomwolf
Member

Should be fixed now. Don't hesitate to re-open an issue if needed. Thanks for the feedback!

@rsanjaykamath
Author

Yes, it works now!

With

--train_batch_size 24 --gradient_accumulation_steps 8 --optimize_on_cpu

I get {"exact_match": 83.78429517502366, "f1": 90.75733469379139} which is pretty close.

Thanks for this amazing work!
