some questions about training #3

Open
jessapinkman opened this issue Mar 20, 2024 · 5 comments

@jessapinkman

jessapinkman commented Mar 20, 2024

Hi,

When I try to run the main script to train the model, I get the following problem:
Traceback (most recent call last):
File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 272, in
model, train_metrics, dev_loader = run_training_loop(params, outf=f, serialdir=serialdir, config=CONFIG)
File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 131, in run_training_loop
metrics = trainer.train()
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
metrics, epoch = self._try_train()
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
train_metrics = self._train_epoch(epoch)
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 515, in _train_epoch
raise ValueError("nan loss encountered")
ValueError: nan loss encountered

It looks like the data contains some invalid values. How should I handle this?

Also, here is my dataset directory layout. Is it correct?

├─ data
│ ├─ daic
│ │ ├─ 300_TRANSCRIPT.csv
│ │ ├─ 301_TRANSCRIPT.csv
│ │ └─ 304_TRANSCRIPT.csv
│ ├─ dailydialog
│ │ ├─ .DS_Store
│ │ ├─ dialogues_act.txt
│ │ ├─ dialogues_emotion.txt
│ │ ├─ dialogues_text.txt
│ │ ├─ dialogues_topic.txt
│ │ ├─ ijcnlp_dailydialog
│ │ │ ├─ .DS_Store
│ │ │ ├─ dialogues_act.txt
│ │ │ ├─ dialogues_emotion.txt
│ │ │ ├─ dialogues_text.txt
│ │ │ ├─ dialogues_topic.txt
│ │ │ ├─ readme.txt
│ │ │ ├─ test.zip
│ │ │ ├─ train.zip
│ │ │ └─ validation.zip
│ │ ├─ readme.txt
│ │ ├─ test
│ │ │ ├─ dialogues_act_test.txt
│ │ │ ├─ dialogues_emotion_test.txt
│ │ │ └─ dialogues_test.txt
│ │ ├─ train
│ │ │ ├─ dialogues_act_train.txt
│ │ │ ├─ dialogues_emotion_train.txt
│ │ │ └─ dialogues_train.txt
│ │ └─ validation
│ │ ├─ dialogues_act_validation.txt
│ │ ├─ dialogues_emotion_validation.txt
│ │ └─ dialogues_validation.txt
│ └─ ijcnlp_dailydialog.zip

@jessapinkman
Author

By the way, could you send me the whole DAIC dataset (the xxx_TRANSCRIPT files)? Downloading so many .zip files one by one is too time-consuming.

My email is: pinkman@stu.xjtu.edu.cn

Thanks!

@jessapinkman
Author

@chuyuanli I would appreciate it if you could help me.

@chuyuanli
Owner

Hello, thanks for your interest.
It is not clear how you ran into that issue. My guess is that the gold labels are missing or not in the correct format. Check the input and output formats carefully; labels should be converted into integers, for instance.
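For instance, a minimal sketch of such an integer conversion (the label names below are hypothetical; the project's actual label inventory is defined in dataset_reader.py):

```python
# Hypothetical sketch: map string labels to integer ids before batching.
# These names are illustrative only; the real inventory lives in
# dataset_reader.py.
EMOTION_TO_ID = {"no_emotion": 0, "anger": 1, "disgust": 2,
                 "fear": 3, "happiness": 4, "sadness": 5, "surprise": 6}

def encode_labels(labels, mapping, pad_to=None, pad_value=-1):
    """Map string labels to ints, padding short sequences with pad_value."""
    ids = [mapping[label] for label in labels]
    if pad_to is not None:
        ids += [pad_value] * (pad_to - len(ids))
    return ids

print(encode_labels(["anger", "happiness"], EMOTION_TO_ID, pad_to=5))
# [1, 4, -1, -1, -1]
```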

About the DAIC data, you need to submit a request and then you can download the files.
An example of the data repository layout is given in this repo. Your directory looks fine. You can also check the code in dataset_reader.py for details.

Hope this helps.

@jessapinkman
Author

Thank you for your reply.

To start training the model quickly, I only downloaded part of the DAIC dataset. After I applied for access, they gave me a URL, but I have to download each sample one by one, which is time-consuming. Do you still have the entire dataset? I only need the text file for each sample.
[screenshot: the download URL provided after the DAIC request]

The "nan loss" problem occurred while loading the dataset. I just ran the main.py file, and training never seemed to fully start. I printed the shapes and values of label_act, label_emo, label_phq, and label_topic: every label tensor has the same shape as the predictions tensor without the num_classes dimension. I did not find any "nan" missing values in the data, but there are some negative integers ("-1"). Could this be the cause of the error? Here is the output from my console:

(mtl) C:\Users\jessa\Desktop\MTL4Depr-master>D:/conda/envs/mtl/python.exe c:/Users/jessa/Desktop/MTL4Depr-master/src/main.py
11118 1000 1000
3 2 5
building vocab: 100%|##########| 11121/11121 [00:00<00:00, 29514.59it/s]
Building the model...
D:\conda\envs\mtl\lib\site-packages\torch\cuda\memory.py:278: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
0%| | 0/696 [00:00<?, ?it/s]tensor([[ 0, 1, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 0, 1, 0, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, -1, -1,
-1, -1],
[ 2, 3, 1, 0, 1, 0, 1, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 0, 2, 3, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 2, 3, 2, 3, 0, 1, 2, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 2, 3,
0, 0],
[ 1, 0, 1, 0, 2, 2, 3, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 2, 2, 3, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 1, 0, 1, 0, 1, 0, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 1, 0, 2, 3, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 2, 0, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1]], device='cuda:0') tensor([[ 6, 6, 4, 6, 4, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 4, 4, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 6, 0, 0, 0, 0, 0, 4, 0, 0, 4, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 5, 6, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4],
[ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 6, 4, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1]], device='cuda:0') tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
device='cuda:0') tensor([4, 7, 4, 0, 4, 4, 4, 4, 0, 4, 5, 3, 7, 0, 4, 7], device='cuda:0')
0%| | 0/696 [00:00<?, ?it/s]
Traceback (most recent call last):
File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 273, in
model, train_metrics, dev_loader = run_training_loop(params, outf=f, serialdir=serialdir, config=CONFIG)
File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 131, in run_training_loop
metrics = trainer.train()
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
metrics, epoch = self._try_train()
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
train_metrics = self._train_epoch(epoch)
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 515, in _train_epoch
raise ValueError("nan loss encountered")
ValueError: nan loss encountered
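The -1 values in those label tensors look like padding rather than corrupt data. Assuming the task losses use PyTorch's cross-entropy with ignore_index=-1 (an assumption based on the printouts, not the repository's code), padded positions are normally excluded from the loss like this:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 5)                    # 6 positions, 5 classes
labels = torch.tensor([0, 1, 0, -1, -1, -1])  # trailing -1s are padding

# Positions labelled -1 contribute nothing to the mean loss.
loss = F.cross_entropy(logits, labels, ignore_index=-1)

# Equivalent by hand: average over the valid positions only.
valid = labels != -1
manual = F.cross_entropy(logits[valid], labels[valid])
print(torch.allclose(loss, manual))  # True
```

So -1 labels are harmless on their own; the mean is simply taken over the non-padded positions.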

@jessapinkman
Author

jessapinkman commented Mar 24, 2024

Actually, when I set has_emo / has_topic / has_act = False, the model trains fine, but when any one of these three parameters is set to True, the error is raised (ValueError: "nan loss encountered"). I checked the data, and it is caused by the missing phq label in the DailyDialog dataset. In dataset_reader.py you fill in missing values with -1, but when all labels are -1, the loss value is "nan". How can I fix it? @chuyuanli
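That failure mode can be reproduced in a few lines: when every target equals the ignore index, the mean loss is 0/0 = nan. The guard below is only one possible sketch of a fix (skipping the task loss for batches with no valid labels), not the repository's actual solution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)      # 4 positions, 5 classes
targets = torch.full((4,), -1)  # every label is the padding value

# All targets ignored -> mean is 0 valid losses / 0 weights = nan.
raw_loss = F.cross_entropy(logits, targets, ignore_index=-1)
print(torch.isnan(raw_loss))    # tensor(True)

# Sketch of a guard: contribute nothing when a batch has no valid labels.
valid = targets != -1
if valid.any():
    loss = F.cross_entropy(logits[valid], targets[valid])
else:
    loss = logits.new_zeros(())  # zero scalar; batch adds no gradient
print(loss)                      # tensor(0.)
```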
