NaN or Inf found in input tensor & KeyError: 'loss' #9

Open
cooelf opened this issue Jun 20, 2019 · 1 comment
cooelf commented Jun 20, 2019

Hi, I tried the baseline model but ran into some issues, which I'm reporting here for help.

Firstly, some dataset names produced by the download script don't match the names the baseline scripts expect (a minimal rename sketch follows the list):
TriviaQA-web.jsonl.gz -> TriviaQA.jsonl.gz
NaturalQuestionsShort.jsonl.gz -> NaturalQuestions.jsonl.gz
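For anyone hitting the same mismatch, here is a minimal rename sketch. It assumes the data/train and data/in-domain layout from my command below, and only the two pairs above:

```python
import os

# Names produced by the download script -> names the baseline scripts expect.
RENAMES = {
    "TriviaQA-web.jsonl.gz": "TriviaQA.jsonl.gz",
    "NaturalQuestionsShort.jsonl.gz": "NaturalQuestions.jsonl.gz",
}

for split_dir in ("data/train", "data/in-domain"):
    for src, dst in RENAMES.items():
        src_path = os.path.join(split_dir, src)
        if os.path.exists(src_path):
            os.rename(src_path, os.path.join(split_dir, dst))
```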

I then changed the names in the sample BERT-Large script to match the downloaded files, but during validation some warnings were raised, followed by an exception.

The exception:

0%| | 0/1 [00:00<?, ?it/s]
EM: 69.6104, f1: 77.3335, qas_used_fraction: 1.0000, loss: 3.5823 ||: : 21it [00:10, 2.03it/s]
EM: 68.3673, f1: 75.7397, qas_used_fraction: 1.0000, loss: 3.8775 ||: : 42it [00:21, 1.99it/s]
EM: 66.5335, f1: 75.1974, qas_used_fraction: 1.0000, loss: 4.1714 ||: : 62it [00:31, 1.97it/s]
EM: 66.6667, f1: 75.2976, qas_used_fraction: 1.0000, loss: 4.2777 ||: : 82it [00:42, 1.94it/s]
Traceback (most recent call last):
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 21, in <module>
    run()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
    args.cache_prefix)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
    cache_directory, cache_prefix)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model
    metrics = trainer.train()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 493, in train
    val_loss, num_batches = self._validation_loss()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
    loss = self.batch_loss(batch_group, for_training=False)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in data_parallel
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in <listcomp>
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
KeyError: 'loss'
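For context, the list comprehension at allennlp/training/util.py line 336 assumes every GPU replica's output dict contains a 'loss' key, so the KeyError means at least one replica returned an output without one during validation. A minimal sketch of the kind of guard that would avoid the crash (my own illustration, not the maintainers' actual fix):

```python
from torch.nn.parallel import gather

def gather_losses(outputs, target_device):
    # Collect 'loss' only from replicas that actually produced one,
    # instead of crashing with KeyError as in training/util.py:336.
    losses = [output["loss"].unsqueeze(0) for output in outputs if "loss" in output]
    if not losses:
        return None  # caller must decide how to handle a loss-less batch group
    return gather(losses, target_device, 0)
```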

The command I ran:
python -m allennlp.run train MRQA_BERTLarge.jsonnet -s Models/baseline -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': 'data/train/SQuAD.jsonl.gz,data/train/NewsQA.jsonl.gz,data/train/HotpotQA.jsonl.gz,data/train/SearchQA.jsonl.gz,data/train/TriviaQA.jsonl.gz,data/train/NaturalQuestions.jsonl.gz', 'validation_data_path': 'data/in-domain/SQuAD.jsonl.gz,data/in-domain/NewsQA.jsonl.gz,data/in-domain/HotpotQA.jsonl.gz,data/in-domain/SearchQA.jsonl.gz,data/in-domain/TriviaQA.jsonl.gz,data/in-domain/NaturalQuestions.jsonl.gz', 'iterator':{'batch_size':6},'trainer': {'cuda_device': [0,1,2,3], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '145000'}}}" --include-package mrqa_allennlp
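As an aside, here is the back-of-envelope arithmetic behind the t_total override (assuming, on my part, one optimizer step per batch of 6 instances; with multi-GPU batch grouping the true step count may be lower):

```python
num_datasets = 6
sample_size = 75_000   # 'sample_size' override per training dataset
batch_size = 6         # iterator 'batch_size' override
num_epochs = 2

steps_per_epoch = num_datasets * sample_size // batch_size  # 75,000
t_total = steps_per_epoch * num_epochs                      # 150,000
print(t_total)  # 150000 -- close to the 145,000 passed to bert_adam above
```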

The warnings, with the surrounding log context:
2019-06-19 17:24:01,568 - INFO - allennlp.common.params - trainer.patience = 10
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.validation_metric = +f1
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.shuffle = True
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.num_epochs = 2
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.cuda_device = [0, 1, 2, 3]
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_norm = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_clipping = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.learning_rate_scheduler = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.momentum_scheduler = None
2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.type = bert_adam
2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.parameter_groups = None
2019-06-19 17:24:04,943 - INFO - allennlp.training.optimizers - Number of trainable parameters: 335143963
2019-06-19 17:24:04,943 - INFO - allennlp.common.params - trainer.optimizer.infer_type_and_cast = True
2019-06-19 17:24:04,943 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.t_total = 145000
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.warmup = 0.1
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.lr = 3e-05
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.num_serialized_models_to_keep = 20
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.keep_serialized_model_every_num_seconds = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.model_save_interval = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.summary_interval = 100
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.histogram_interval = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_parameter_statistics = True
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_learning_rate = False
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.log_batch_size_period = None
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Beginning training.
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Epoch 0/1
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 46176.068
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 2246
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 11
2019-06-19 17:24:05,761 - INFO - allennlp.training.trainer - Training
2019-06-20 01:12:28,016 - INFO - allennlp.training.trainer - Validating
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
[... the same warning is printed 18 times in total ...]
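As far as I can tell, that warning string comes from tensorboardX when a non-finite scalar (presumably the validation loss, once the 'loss' key goes missing) is written to TensorBoard. A minimal guard of the kind that would silence it (a hypothetical helper of my own, not AllenNLP's code):

```python
import math
from typing import Optional

from tensorboardX import SummaryWriter

def log_scalar(writer: SummaryWriter, name: str, value: Optional[float], step: int) -> None:
    # tensorboardX prints "NaN or Inf found in input tensor." for non-finite
    # values, so only finite scalars are written here.
    if value is not None and math.isfinite(value):
        writer.add_scalar(name, value, step)
```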

alontalmor (Collaborator) commented Jun 26, 2019

Hi, thanks for reaching out!

The baseline has a caching mechanism, so you can point the data paths at the web URLs directly; the training files don't need to be downloaded beforehand (we will make this clear in the README).

Regarding the error, I've pushed a fix, but it will take some time to retrain and verify that the error doesn't recur.
