Hi, I tried the baseline model but ran into some issues and would like to ask for help.
First, some dataset names differ between the download script and the baseline scripts:
TriviaQA-web.jsonl.gz -> TriviaQA.jsonl.gz
NaturalQuestionsShort.jsonl.gz -> NaturalQuestions.jsonl.gz
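The two renames above can be written down as a small lookup table. This is only a sketch: it reads the left-hand names as the ones used in the baseline scripts and the right-hand names as the ones the download script writes, and the helper names `resolve_dataset_path` and `resolve_data_paths` are hypothetical, not part of the baseline.

```python
from pathlib import PurePosixPath

# Name used in the baseline scripts -> name written by the download script,
# as listed in this report.
SCRIPT_TO_DOWNLOADED = {
    "TriviaQA-web.jsonl.gz": "TriviaQA.jsonl.gz",
    "NaturalQuestionsShort.jsonl.gz": "NaturalQuestions.jsonl.gz",
}


def resolve_dataset_path(path: str) -> str:
    """Swap the file name for its downloaded name; other names pass through.

    Intended for local POSIX-style paths only (hypothetical helper).
    """
    p = PurePosixPath(path)
    return str(p.with_name(SCRIPT_TO_DOWNLOADED.get(p.name, p.name)))


def resolve_data_paths(comma_separated: str) -> str:
    """Apply resolve_dataset_path to every entry of a comma-separated list."""
    return ",".join(resolve_dataset_path(part) for part in comma_separated.split(","))
```

Applied to a comma-separated `train_data_path` value, this leaves entries such as `SQuAD.jsonl.gz` untouched and rewrites only the two mismatched names.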
Then I changed the names in the sample BERT large script to the actual downloaded ones. But during validation, some warnings were raised, followed by an exception.
The exception:
0%| | 0/1 [00:00<?, ?it/s]
EM: 69.6104, f1: 77.3335, qas_used_fraction: 1.0000, loss: 3.5823 ||: : 21it [00:10, 2.03it/s]
EM: 68.3673, f1: 75.7397, qas_used_fraction: 1.0000, loss: 3.8775 ||: : 42it [00:21, 1.99it/s]
EM: 66.5335, f1: 75.1974, qas_used_fraction: 1.0000, loss: 4.1714 ||: : 62it [00:31, 1.97it/s]
EM: 66.6667, f1: 75.2976, qas_used_fraction: 1.0000, loss: 4.2777 ||: : 82it [00:42, 1.94it/s]
Traceback (most recent call last):
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 21, in <module>
    run()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
    args.cache_prefix)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
    cache_directory, cache_prefix)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model
    metrics = trainer.train()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 493, in train
    val_loss, num_batches = self._validation_loss()
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
    loss = self.batch_loss(batch_group, for_training=False)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in data_parallel
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
  File "/home/coulson/anaconda3/envs/mrc/lib/python3.6/site-packages/allennlp/training/util.py", line 336, in <listcomp>
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
KeyError: 'loss'
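The traceback ends in `output['loss']` inside a list comprehension over `outputs`, so at least one of the per-device output dictionaries had no `'loss'` key during validation. Why the key was missing is not stated in this report. The sketch below reproduces that failure mode with plain dictionaries standing in for model outputs and shows one way to report which replica is affected; `replicas_missing_loss` and `collect_losses` are hypothetical names, not AllenNLP functions.

```python
from typing import Any, Dict, List, Sequence


def replicas_missing_loss(outputs: Sequence[Dict[str, Any]]) -> List[int]:
    """Return the indices of per-device output dicts that lack a 'loss' entry."""
    return [i for i, output in enumerate(outputs) if "loss" not in output]


def collect_losses(outputs: Sequence[Dict[str, Any]]) -> List[Any]:
    """Collect the 'loss' values, failing with a descriptive error.

    A bare output['loss'] lookup raises KeyError: 'loss' without saying
    which replica is at fault; this names the replica and its keys.
    """
    missing = replicas_missing_loss(outputs)
    if missing:
        keys = {i: sorted(outputs[i].keys()) for i in missing}
        raise ValueError(f"replicas without 'loss': {missing}; keys present: {keys}")
    return [output["loss"] for output in outputs]
```

A check like this, run on the validation outputs, would show whether the missing key is tied to particular batches or devices.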
The warning is:
2019-06-19 17:24:01,568 - INFO - allennlp.common.params - trainer.patience = 10
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.validation_metric = +f1
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.shuffle = True
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.num_epochs = 2
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.cuda_device = [0, 1, 2, 3]
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_norm = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.grad_clipping = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.learning_rate_scheduler = None
2019-06-19 17:24:01,569 - INFO - allennlp.common.params - trainer.momentum_scheduler = None
2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.type = bert_adam
2019-06-19 17:24:04,942 - INFO - allennlp.common.params - trainer.optimizer.parameter_groups = None
2019-06-19 17:24:04,943 - INFO - allennlp.training.optimizers - Number of trainable parameters: 335143963
2019-06-19 17:24:04,943 - INFO - allennlp.common.params - trainer.optimizer.infer_type_and_cast = True
2019-06-19 17:24:04,943 - INFO - allennlp.common.params - Converting Params object to dict; logging of default values will not occur when dictionary parameters are used subsequently.
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - CURRENTLY DEFINED PARAMETERS:
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.t_total = 145000
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.warmup = 0.1
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.optimizer.lr = 3e-05
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.num_serialized_models_to_keep = 20
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.keep_serialized_model_every_num_seconds = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.model_save_interval = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.summary_interval = 100
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.histogram_interval = None
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_parameter_statistics = True
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.should_log_learning_rate = False
2019-06-19 17:24:04,944 - INFO - allennlp.common.params - trainer.log_batch_size_period = None
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Beginning training.
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Epoch 0/1
2019-06-19 17:24:04,949 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 46176.068
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 2246
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 11
2019-06-19 17:24:05,758 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 11
2019-06-19 17:24:05,761 - INFO - allennlp.training.trainer - Training
2019-06-20 01:12:28,016 - INFO - allennlp.training.trainer - Validating
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
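The warning says that some value contained NaN or Inf, but the log does not identify which one or where the message is emitted. As a rough aid for narrowing it down, the sketch below checks a dictionary of scalar values (for example, metrics about to be logged) for non-finite entries. The function name `non_finite_keys` and the idea of checking logged scalars are assumptions for illustration, not something taken from the baseline or from AllenNLP.

```python
import math
from typing import Dict, List


def non_finite_keys(scalars: Dict[str, float]) -> List[str]:
    """Return the names of values that are NaN, +Inf, or -Inf, sorted by name."""
    return sorted(name for name, value in scalars.items() if not math.isfinite(value))
```

Calling this on the per-batch metrics just before they are logged would show whether the loss or another quantity is the one that becomes non-finite.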
The baseline has a caching mechanism, so you can use the web URLs directly and the training files don't need to be downloaded beforehand (we will make this clear in the README).
Regarding the error, I've issued a fix, but it will take some time to retrain and check that the error does not recur at any point.
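For readers unfamiliar with the idea, the caching behaviour mentioned in the reply can be pictured roughly as follows: a URL is mapped to a file in a local cache directory, and the download happens only when that file does not exist yet. This is a generic sketch under that assumption, not the baseline's actual code; `cached_file`, `_download`, and the `fetch` parameter are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def _download(url: str) -> bytes:
    """Fetch the raw bytes behind a URL."""
    with urlopen(url) as response:
        return response.read()


def cached_file(url: str, cache_dir: str,
                fetch: Callable[[str], bytes] = _download) -> Path:
    """Return a local copy of `url`, downloading it only on the first call.

    The cache file is named after the SHA-256 hash of the URL, so the same
    URL always maps to the same local path.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        target.write_bytes(fetch(url))
    return target
```

With a scheme like this, passing a URL as a data path costs one download on the first run and none afterwards.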
The command I ran is:
python -m allennlp.run train MRQA_BERTLarge.jsonnet -s Models/baseline -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': 'data/train/SQuAD.jsonl.gz,data/train/NewsQA.jsonl.gz,data/train/HotpotQA.jsonl.gz,data/train/SearchQA.jsonl.gz,data/train/TriviaQA.jsonl.gz,data/train/NaturalQuestions.jsonl.gz', 'validation_data_path': 'data/in-domain/SQuAD.jsonl.gz,data/in-domain/NewsQA.jsonl.gz,data/in-domain/HotpotQA.jsonl.gz,data/in-domain/SearchQA.jsonl.gz,data/in-domain/TriviaQA.jsonl.gz,data/in-domain/NaturalQuestions.jsonl.gz', 'iterator':{'batch_size':6},'trainer': {'cuda_device': [0,1,2,3], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '145000'}}}" --include-package mrqa_allennlp
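Because the `-o` override string in the command above is long and heavily quoted, it can be easier to build it from a Python dictionary and serialize it. The sketch below does that with the same values as the command. It assumes that `-o` accepts strict JSON (double-quoted) as well as the single-quoted form used above; `build_train_args` is a hypothetical helper.

```python
import json
import shlex
from typing import List

DATASETS = ["SQuAD", "NewsQA", "HotpotQA", "SearchQA", "TriviaQA", "NaturalQuestions"]


def data_paths(directory: str) -> str:
    """Comma-separated list of the six dataset files under `directory`."""
    return ",".join(f"{directory}/{name}.jsonl.gz" for name in DATASETS)


def build_train_args() -> List[str]:
    """Argument list equivalent to the command in this report."""
    overrides = {
        "dataset_reader": {"sample_size": 75000},
        "validation_dataset_reader": {"sample_size": 1000},
        "train_data_path": data_paths("data/train"),
        "validation_data_path": data_paths("data/in-domain"),
        "iterator": {"batch_size": 6},
        "trainer": {
            "cuda_device": [0, 1, 2, 3],
            # Kept as strings, as in the original command.
            "num_epochs": "2",
            "optimizer": {"type": "bert_adam", "lr": 3e-05,
                          "warmup": 0.1, "t_total": "145000"},
        },
    }
    return ["python", "-m", "allennlp.run", "train", "MRQA_BERTLarge.jsonnet",
            "-s", "Models/baseline",
            "-o", json.dumps(overrides),
            "--include-package", "mrqa_allennlp"]


if __name__ == "__main__":
    # Print a shell-safe version of the command.
    print(" ".join(shlex.quote(arg) for arg in build_train_args()))
```

Building the overrides this way keeps the dataset list in one place, so a renamed file only has to be changed once.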