NaN or Inf found in input tensor & KeyError: 'loss' #9

Open
cooelf opened this issue Jun 20, 2019 · 1 comment
cooelf commented Jun 20, 2019

Hi, I tried the baseline model but ran into some issues, which I'm reporting here for help.

Firstly, some dataset names produced by the download script don't match the names the baseline scripts expect (a minimal rename sketch follows the list):
TriviaQA-web.jsonl.gz -> TriviaQA.jsonl.gz
NaturalQuestionsShort.jsonl.gz -> NaturalQuestions.jsonl.gz
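For anyone hitting the same mismatch, here is a minimal rename sketch. It assumes the data/train and data/in-domain layout from my command below, and only the two pairs above:

```python
import os

# Names produced by the download script -> names the baseline scripts expect.
RENAMES = {
    "TriviaQA-web.jsonl.gz": "TriviaQA.jsonl.gz",
    "NaturalQuestionsShort.jsonl.gz": "NaturalQuestions.jsonl.gz",
}

for split_dir in ("data/train", "data/in-domain"):
    for src, dst in RENAMES.items():
        src_path = os.path.join(split_dir, src)
        if os.path.exists(src_path):
            os.rename(src_path, os.path.join(split_dir, dst))
```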

I then changed the names in the sample BERT-Large script to match the downloaded files, but during validation some warnings were raised, followed by an exception.

The exception:

0%| | 0/1 [00:00<?, ?it/s]
EM: 69.6104, f1: 77.3335, qas_used_fraction: 1.0000, loss: 3.5823 ||: : 21it [00:10, 2.03it/s]
EM: 68.3673, f1: 75.7397, qas_used_fraction: 1.0000, loss: 3.8775 ||: : 42it [00:21, 1.99it/s]
EM: 66.5335, f1: 75.1974, qas_used_fraction: 1.0000, loss: 4.1714 ||: : 62it [00:31, 1.97it/s]
EM: 66.6667, f1: 75.2976, qas_used_fraction: 1.0000, loss: 4.2777 ||: : 82it [00:42, 1.94it/s]
Traceback (most recent call last):
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 21, in <module>
    run()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
    args.cache_prefix)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
    cache_directory, cache_prefix)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model
    metrics = trainer.train()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 493, in train
    val_loss, num_batches = self._validation_loss()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
    loss = self.batch_loss(batch_group, for_training=False)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in data_parallel
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in <listcomp>
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
KeyError: 'loss'
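For context, the list comprehension at allennlp/training/util.py line 336 assumes every GPU replica's output dict contains a 'loss' key, so the KeyError means at least one replica returned an output without one during validation. A minimal sketch of the kind of guard that would avoid the crash (my own illustration, not the maintainers' actual fix):

```python
from torch.nn.parallel import gather

def gather_losses(outputs, target_device):
    # Collect 'loss' only from replicas that actually produced one,
    # instead of crashing with KeyError as in training/util.py:336.
    losses = [output["loss"].unsqueeze(0) for output in outputs if "loss" in output]
    if not losses:
        return None  # caller must decide how to handle a loss-less batch group
    return gather(losses, target_device, 0)
```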

The command I ran:
python -m allennlp.run train MRQA_BERTLarge.jsonnet -s Models/baseline -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': 'data/train/SQuAD.jsonl.gz,data/train/NewsQA.jsonl.gz,data/train/HotpotQA.jsonl.gz,data/train/SearchQA.jsonl.gz,data/train/TriviaQA.jsonl.gz,data/train/NaturalQuestions.jsonl.gz', 'validation_data_path': 'data/in-domain/SQuAD.jsonl.gz,data/in-domain/NewsQA.jsonl.gz,data/in-domain/HotpotQA.jsonl.gz,data/in-domain/SearchQA.jsonl.gz,data/in-domain/TriviaQA.jsonl.gz,data/in-domain/NaturalQuestions.jsonl.gz', 'iterator':{'batch_size':6},'trainer': {'cuda_device': [0,1,2,3], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '145000'}}}" --include-package mrqa_allennlp
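As an aside, here is the back-of-envelope arithmetic behind the t_total override (assuming, on my part, one optimizer step per batch of 6 instances; with multi-GPU batch grouping the true step count may be lower):

```python
num_datasets = 6
sample_size = 75_000   # 'sample_size' override per training dataset
batch_size = 6         # iterator 'batch_size' override
num_epochs = 2

steps_per_epoch = num_datasets * sample_size // batch_size  # 75,000
t_total = steps_per_epoch * num_epochs                      # 150,000
print(t_total)  # 150000 -- close to the 145,000 passed to bert_adam above
```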

The warnings, with the surrounding log context:
2019-06-19 17:24:01,568 - INFO - allennlp.common.params - trainer.patience = 10
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.validation_metric = +f1
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.shuffle = True
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.num_epochs = 2
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.cuda_device = [0, 1, 2, 3]
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_norm = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_clipping = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.learning_rate_scheduler = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.momentum_scheduler = None
2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.type = bert_adam
2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.parameter_groups = None
2019-06-19 17:24:04,943 - INFO - allennlp.training.optimizers - Number of trainable parameters: 335143963
2019-06-19 17:24:04,943 - INFO - allennlp.common.params - trainer.optimizer.infer_type_and_cast = True
2019-06-19 17:24:04,943 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.t_total = 145000
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.warmup = 0.1
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.lr = 3e-05
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.num_serialized_models_to_keep = 20
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.keep_serialized_model_every_num_seconds = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.model_save_interval = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.summary_interval = 100
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.histogram_interval = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_parameter_statistics = True
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_learning_rate = False
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.log_batch_size_period = None
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Beginning training.
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Epoch 0/1
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 46176.068
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 2246
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 11
2019-06-19 17:24:05,761 - INFO - allennlp.training.trainer - Training
2019-06-20 01:12:28,016 - INFO - allennlp.training.trainer - Validating
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
[... the same warning is printed 18 times in total ...]
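As far as I can tell, that warning string comes from tensorboardX when a non-finite scalar (presumably the validation loss, once the 'loss' key goes missing) is written to TensorBoard. A minimal guard of the kind that would silence it (a hypothetical helper of my own, not AllenNLP's code):

```python
import math
from typing import Optional

from tensorboardX import SummaryWriter

def log_scalar(writer: SummaryWriter, name: str, value: Optional[float], step: int) -> None:
    # tensorboardX prints "NaN or Inf found in input tensor." for non-finite
    # values, so only finite scalars are written here.
    if value is not None and math.isfinite(value):
        writer.add_scalar(name, value, step)
```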

alontalmor (Collaborator) commented Jun 26, 2019

Hi, thanks for reaching out!

The baseline has a caching mechanism, so you can point the data paths at the web URLs directly; the training files don't need to be downloaded beforehand (we will make this clear in the README).

Regarding the error, I've pushed a fix, but it will take some time to retrain and verify that the error doesn't recur.
