I'm attempting to run the training script for the GPT-2 CALM model on the ClubFloyd dataset, following the instructions from your EMNLP 2020 paper. I've set up my environment as recommended but am running into problems with the training process.
Environment:
Python version: 3.6.15
Operating System: Ubuntu 20.04
GPU: Nvidia Titan RTX
Dependencies: torch==1.4, transformers==2.5.1, jericho, fasttext, wandb, importlib_metadata
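(For reference, a minimal check along these lines, using plain importlib_metadata and torch calls and nothing specific to the CALM codebase, can confirm that the pinned versions are the ones actually installed and that the GPU is visible:)

```python
# Environment sanity check: confirm the installed package versions match the
# pinned ones above and that PyTorch can see the GPU.
import torch
from importlib_metadata import version

for pkg in ["torch", "transformers", "jericho", "fasttext", "wandb"]:
    print(f"{pkg}: {version(pkg)}")

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```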
Issue:
The training doesn't behave as expected: the model overfits the training data while validation performance barely improves or gets worse, even after adjusting hyperparameters such as batch size and GPU count.
Attempts:
| Params | Iteration | Train Acc | Val Acc | Train Loss | Val Loss |
|---|---|---|---|---|---|
| num GPU = 1, batch size = 1 | 1 | 0.14 | 0.15 | 2.38 | 2.35 |
| | 2 | 0.18 | 0.14 | 2.01 | 2.35 |
| | 3 | 0.22 | 0.15 | 1.80 | 2.43 |
| | 4 | 0.26 | 0.14 | 1.63 | 2.56 |
| | 5 | 0.30 | 0.14 | 1.50 | 2.71 |
| num GPU = 3, batch size = 1 | 1 | 0.13 | 0.14 | 0.79 | 2.30 |
| | 2 | 0.17 | 0.14 | 0.67 | 2.26 |
| | 3 | 0.20 | 0.15 | 0.61 | 2.28 |
| | 4 | 0.22 | 0.15 | 0.57 | 2.33 |
| | 5 | 0.25 | 0.14 | 0.53 | 2.38 |
| num GPU = 1, batch size = 15 | 1 | 0.10 | 0.13 | 0.18 | 2.32 |
| | 2 | 0.13 | 0.13 | 0.15 | 2.28 |
| | 3 | 0.15 | 0.14 | 0.14 | 2.27 |
| | 4 | 0.17 | 0.14 | 0.13 | 2.27 |
| | 5 | 0.18 | 0.14 | 0.13 | 2.31 |
| num GPU = 3, batch size = 15 | 1 | 0.10 | 0.12 | 0.06 | 2.35 |
| | 2 | 0.12 | 0.13 | 0.05 | 2.30 |
| | 3 | 0.14 | 0.13 | 0.05 | 2.29 |
| | 4 | 0.15 | 0.14 | 0.05 | 2.28 |
| | 5 | 0.16 | 0.13 | 0.05 | 2.27 |
| num GPU = 8, batch size = 12 | 1 | 0.09 | 0.11 | 0.03 | 2.41 |
| | 2 | 0.12 | 0.12 | 0.03 | 2.34 |
| | 3 | 0.13 | 0.13 | 0.02 | 2.31 |
| | 4 | 0.14 | 0.13 | 0.02 | 2.29 |
| | 5 | 0.14 | 0.14 | 0.02 | 2.29 |
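(For context, this is the kind of validation-loss bookkeeping I have in mind when I say validation performance isn't improving. A minimal sketch, assuming a standard PyTorch/Hugging Face training loop; `train_one_epoch` and `evaluate` are caller-supplied placeholders, not functions from the CALM codebase:)

```python
# Minimal sketch of validation-based checkpoint selection / early stopping.
# `train_one_epoch` and `evaluate` are placeholders supplied by the caller,
# not functions from the CALM codebase.
import math

def train_with_early_stopping(model, train_one_epoch, evaluate, num_epochs, patience=2):
    best_val_loss = math.inf
    bad_epochs = 0
    for epoch in range(1, num_epochs + 1):
        train_loss = train_one_epoch(model)   # returns mean training loss
        val_loss = evaluate(model)            # returns mean validation loss
        print(f"epoch {epoch}: train loss {train_loss:.2f}, val loss {val_loss:.2f}")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            bad_epochs = 0
            model.save_pretrained("best_checkpoint")  # keep the best-val model
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                print("validation loss stopped improving; stopping early")
                break
    return best_val_loss
```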
Request:
Do you have any ideas why these training runs might not be converging, whether due to a hardware difference, a hyperparameter difference, or something else?
Thank you for your time.
What do you mean by "not converging"? Also, if I remember correctly, the CALM model doesn't need a near-zero training loss, or even a converging training loss, to function. Maybe just follow the codebase, run the RL experiments, and see the scores? The train/test losses of LMs are not that valuable.
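For example, a rough qualitative check (just a sketch, not code from the repo; the checkpoint path and the prompt format are placeholders you would adapt to your ClubFloyd serialization) is to sample a few action candidates from the fine-tuned GPT-2 and see whether they look like sensible game commands, regardless of where the loss curves end up:

```python
# Rough sanity check: sample a few action candidates from the fine-tuned GPT-2
# and eyeball them. The checkpoint path and prompt format are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

checkpoint = "path/to/finetuned-calm-gpt2"  # placeholder path
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
model = GPT2LMHeadModel.from_pretrained(checkpoint)
model.eval()

prompt = "You are in a dimly lit kitchen. A closed cabinet hangs above the sink. >"
input_ids = torch.tensor([tokenizer.encode(prompt)])

def sample_continuation(input_ids, max_new_tokens=8, top_k=40):
    """Manual top-k sampling loop, compatible with older transformers versions."""
    ids = input_ids.clone()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids)[0]                 # (1, seq_len, vocab_size)
            topk_vals, topk_idx = torch.topk(logits[0, -1], top_k)
            probs = torch.softmax(topk_vals, dim=-1)
            next_id = topk_idx[torch.multinomial(probs, 1)]
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0, input_ids.shape[1]:].tolist())

for i in range(5):
    print(f"candidate {i + 1}: {sample_continuation(input_ids).strip()}")
```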