some questions about training #3

Open
jessapinkman opened this issue Mar 20, 2024 · 5 comments

@jessapinkman

jessapinkman commented Mar 20, 2024

Hi,

When I try to run the main script to train the model, I get the following problem:
Traceback (most recent call last):
File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 272, in
model, train_metrics, dev_loader = run_training_loop(params, outf=f, serialdir=serialdir, config=CONFIG)
File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 131, in run_training_loop
metrics = trainer.train()
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
metrics, epoch = self._try_train()
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
train_metrics = self._train_epoch(epoch)
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 515, in _train_epoch
raise ValueError("nan loss encountered")
ValueError: nan loss encountered

It looks like the data contains some invalid values. How should I handle this?

Also, here is my dataset directory layout. Is it correct?

├─ data
│ ├─ daic
│ │ ├─ 300_TRANSCRIPT.csv
│ │ ├─ 301_TRANSCRIPT.csv
│ │ └─ 304_TRANSCRIPT.csv
│ ├─ dailydialog
│ │ ├─ .DS_Store
│ │ ├─ dialogues_act.txt
│ │ ├─ dialogues_emotion.txt
│ │ ├─ dialogues_text.txt
│ │ ├─ dialogues_topic.txt
│ │ ├─ ijcnlp_dailydialog
│ │ │ ├─ .DS_Store
│ │ │ ├─ dialogues_act.txt
│ │ │ ├─ dialogues_emotion.txt
│ │ │ ├─ dialogues_text.txt
│ │ │ ├─ dialogues_topic.txt
│ │ │ ├─ readme.txt
│ │ │ ├─ test.zip
│ │ │ ├─ train.zip
│ │ │ └─ validation.zip
│ │ ├─ readme.txt
│ │ ├─ test
│ │ │ ├─ dialogues_act_test.txt
│ │ │ ├─ dialogues_emotion_test.txt
│ │ │ └─ dialogues_test.txt
│ │ ├─ train
│ │ │ ├─ dialogues_act_train.txt
│ │ │ ├─ dialogues_emotion_train.txt
│ │ │ └─ dialogues_train.txt
│ │ └─ validation
│ │ ├─ dialogues_act_validation.txt
│ │ ├─ dialogues_emotion_validation.txt
│ │ └─ dialogues_validation.txt
│ └─ ijcnlp_dailydialog.zip

@jessapinkman
Author

By the way, could you send me the whole DAIC dataset (the xxx_TRANSCRIPT files)? Downloading so many .zip files one by one is too time-consuming.

My email is: pinkman@stu.xjtu.edu.cn

Thanks!

@jessapinkman
Author

@chuyuanli I would appreciate it if you could help me.

@chuyuanli
Owner

Hello, thanks for your interest.
It is not clear how you ran into that issue. My guess is that the gold labels are missing or not in the correct format. Check the input and output formats carefully; labels should be converted into integers, for instance.
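For instance, a minimal sketch of such an integer conversion (the label names below are hypothetical; the project's actual label inventory is defined in dataset_reader.py):

```python
# Hypothetical sketch: map string labels to integer ids before batching.
# These names are illustrative only; the real inventory lives in
# dataset_reader.py.
EMOTION_TO_ID = {"no_emotion": 0, "anger": 1, "disgust": 2,
                 "fear": 3, "happiness": 4, "sadness": 5, "surprise": 6}

def encode_labels(labels, mapping, pad_to=None, pad_value=-1):
    """Map string labels to ints, padding short sequences with pad_value."""
    ids = [mapping[label] for label in labels]
    if pad_to is not None:
        ids += [pad_value] * (pad_to - len(ids))
    return ids

print(encode_labels(["anger", "happiness"], EMOTION_TO_ID, pad_to=5))
# [1, 4, -1, -1, -1]
```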

About the DAIC data, you need to submit a request and then you can download the files.
An example of the data repository layout is given in this repo. Your directory looks fine. You can also check the code in dataset_reader.py for details.

Hope this helps.

@jessapinkman
Author

Thank you for your reply.

To start training the model quickly, I only downloaded part of the DAIC dataset. After I applied for access, they gave me a URL, but I have to download each sample one by one, which is time-consuming. Do you still have the entire dataset? I only need the text file for each sample.
[screenshot: the download URL provided after the DAIC request]

The "nan loss" problem occurred while loading the dataset. I just ran the main.py file, and training never seemed to fully start. I printed the shapes and values of label_act, label_emo, label_phq, and label_topic: every label tensor has the same shape as the predictions tensor without the num_classes dimension. I did not find any "nan" missing values in the data, but there are some negative integers ("-1"). Could this be the cause of the error? Here is the output from my console:

(mtl) C:\Users\jessa\Desktop\MTL4Depr-master>D:/conda/envs/mtl/python.exe c:/Users/jessa/Desktop/MTL4Depr-master/src/main.py
11118 1000 1000
3 2 5
building vocab: 100%|##########| 11121/11121 [00:00<00:00, 29514.59it/s]
Building the model...
D:\conda\envs\mtl\lib\site-packages\torch\cuda\memory.py:278: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
0%| | 0/696 [00:00<?, ?it/s]tensor([[ 0, 1, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 0, 1, 0, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, -1, -1,
-1, -1],
[ 2, 3, 1, 0, 1, 0, 1, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 0, 2, 3, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 2, 3, 2, 3, 0, 1, 2, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 2, 3,
0, 0],
[ 1, 0, 1, 0, 2, 2, 3, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 2, 2, 3, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 1, 0, 1, 0, 1, 0, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 1, 0, 2, 3, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 2, 0, 1, 0, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1]], device='cuda:0') tensor([[ 6, 6, 4, 6, 4, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 4, 4, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 6, 0, 0, 0, 0, 0, 4, 0, 0, 4, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 5, 6, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4],
[ 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 6, 4, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1],
[ 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1]], device='cuda:0') tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
device='cuda:0') tensor([4, 7, 4, 0, 4, 4, 4, 4, 0, 4, 5, 3, 7, 0, 4, 7], device='cuda:0')
0%| | 0/696 [00:00<?, ?it/s]
Traceback (most recent call last):
File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 273, in
model, train_metrics, dev_loader = run_training_loop(params, outf=f, serialdir=serialdir, config=CONFIG)
File "c:\Users\jessa\Desktop\MTL4Depr-master\src\main.py", line 131, in run_training_loop
metrics = trainer.train()
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
metrics, epoch = self._try_train()
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
train_metrics = self._train_epoch(epoch)
File "D:\conda\envs\mtl\lib\site-packages\allennlp\training\gradient_descent_trainer.py", line 515, in _train_epoch
raise ValueError("nan loss encountered")
ValueError: nan loss encountered
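The -1 values in those label tensors look like padding rather than corrupt data. Assuming the task losses use PyTorch's cross-entropy with ignore_index=-1 (an assumption based on the printouts, not the repository's code), padded positions are normally excluded from the loss like this:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 5)                    # 6 positions, 5 classes
labels = torch.tensor([0, 1, 0, -1, -1, -1])  # trailing -1s are padding

# Positions labelled -1 contribute nothing to the mean loss.
loss = F.cross_entropy(logits, labels, ignore_index=-1)

# Equivalent by hand: average over the valid positions only.
valid = labels != -1
manual = F.cross_entropy(logits[valid], labels[valid])
print(torch.allclose(loss, manual))  # True
```

So -1 labels are harmless on their own; the mean is simply taken over the non-padded positions.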

@jessapinkman
Author

jessapinkman commented Mar 24, 2024

Actually, when I set has_emo / has_topic / has_act = False, the model trains fine, but when any one of these three parameters is set to True, the error is raised (ValueError: "nan loss encountered"). I checked the data, and it is caused by the missing phq label in the DailyDialog dataset. In dataset_reader.py you fill in missing values with -1, but when all labels are -1, the loss value is "nan". How can I fix it? @chuyuanli
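That failure mode can be reproduced in a few lines: when every target equals the ignore index, the mean loss is 0/0 = nan. The guard below is only one possible sketch of a fix (skipping the task loss for batches with no valid labels), not the repository's actual solution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)      # 4 positions, 5 classes
targets = torch.full((4,), -1)  # every label is the padding value

# All targets ignored -> mean is 0 valid losses / 0 weights = nan.
raw_loss = F.cross_entropy(logits, targets, ignore_index=-1)
print(torch.isnan(raw_loss))    # tensor(True)

# Sketch of a guard: contribute nothing when a batch has no valid labels.
valid = targets != -1
if valid.any():
    loss = F.cross_entropy(logits[valid], targets[valid])
else:
    loss = logits.new_zeros(())  # zero scalar; batch adds no gradient
print(loss)                      # tensor(0.)
```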
