I'm attempting to run the training script for the GPT-2 CALM model on the ClubFloyd dataset, following the instructions from your EMNLP 2020 paper. I've set up my environment as recommended but am running into problems with the training process.
Environment:
Python version: 3.6.15
Operating System: Ubuntu 20.04
GPU: Nvidia Titan RTX
Dependencies: torch==1.4, transformers==2.5.1, jericho, fasttext, wandb, importlib_metadata
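(For reference, a minimal check along these lines, using plain importlib_metadata and torch calls and nothing specific to the CALM codebase, can confirm that the pinned versions are the ones actually installed and that the GPU is visible:)

```python
# Environment sanity check: confirm the installed package versions match the
# pinned ones above and that PyTorch can see the GPU.
import torch
from importlib_metadata import version

for pkg in ["torch", "transformers", "jericho", "fasttext", "wandb"]:
    print(f"{pkg}: {version(pkg)}")

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```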
Issue:
The training doesn't behave as expected: the model overfits the training data while validation performance barely improves or gets worse, even after adjusting hyperparameters such as batch size and GPU count.
Attempts:
| Params | Iteration | Train Acc | Val Acc | Train Loss | Val Loss |
|---|---|---|---|---|---|
| num GPU = 1, batch size = 1 | 1 | 0.14 | 0.15 | 2.38 | 2.35 |
| | 2 | 0.18 | 0.14 | 2.01 | 2.35 |
| | 3 | 0.22 | 0.15 | 1.80 | 2.43 |
| | 4 | 0.26 | 0.14 | 1.63 | 2.56 |
| | 5 | 0.30 | 0.14 | 1.50 | 2.71 |
| num GPU = 3, batch size = 1 | 1 | 0.13 | 0.14 | 0.79 | 2.30 |
| | 2 | 0.17 | 0.14 | 0.67 | 2.26 |
| | 3 | 0.20 | 0.15 | 0.61 | 2.28 |
| | 4 | 0.22 | 0.15 | 0.57 | 2.33 |
| | 5 | 0.25 | 0.14 | 0.53 | 2.38 |
| num GPU = 1, batch size = 15 | 1 | 0.10 | 0.13 | 0.18 | 2.32 |
| | 2 | 0.13 | 0.13 | 0.15 | 2.28 |
| | 3 | 0.15 | 0.14 | 0.14 | 2.27 |
| | 4 | 0.17 | 0.14 | 0.13 | 2.27 |
| | 5 | 0.18 | 0.14 | 0.13 | 2.31 |
| num GPU = 3, batch size = 15 | 1 | 0.10 | 0.12 | 0.06 | 2.35 |
| | 2 | 0.12 | 0.13 | 0.05 | 2.30 |
| | 3 | 0.14 | 0.13 | 0.05 | 2.29 |
| | 4 | 0.15 | 0.14 | 0.05 | 2.28 |
| | 5 | 0.16 | 0.13 | 0.05 | 2.27 |
| num GPU = 8, batch size = 12 | 1 | 0.09 | 0.11 | 0.03 | 2.41 |
| | 2 | 0.12 | 0.12 | 0.03 | 2.34 |
| | 3 | 0.13 | 0.13 | 0.02 | 2.31 |
| | 4 | 0.14 | 0.13 | 0.02 | 2.29 |
| | 5 | 0.14 | 0.14 | 0.02 | 2.29 |
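(For context, this is the kind of validation-loss bookkeeping I have in mind when I say validation performance isn't improving. A minimal sketch, assuming a standard PyTorch/Hugging Face training loop; `train_one_epoch` and `evaluate` are caller-supplied placeholders, not functions from the CALM codebase:)

```python
# Minimal sketch of validation-based checkpoint selection / early stopping.
# `train_one_epoch` and `evaluate` are placeholders supplied by the caller,
# not functions from the CALM codebase.
import math

def train_with_early_stopping(model, train_one_epoch, evaluate, num_epochs, patience=2):
    best_val_loss = math.inf
    bad_epochs = 0
    for epoch in range(1, num_epochs + 1):
        train_loss = train_one_epoch(model)   # returns mean training loss
        val_loss = evaluate(model)            # returns mean validation loss
        print(f"epoch {epoch}: train loss {train_loss:.2f}, val loss {val_loss:.2f}")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            bad_epochs = 0
            model.save_pretrained("best_checkpoint")  # keep the best-val model
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                print("validation loss stopped improving; stopping early")
                break
    return best_val_loss
```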
Request:
Do you have any ideas why these training runs might not be converging, whether due to a hardware difference, a hyperparameter difference, or something else?
Thank you for your time.
What do you mean by "not converging"? Also, if I remember correctly, the CALM model doesn't need a near-zero training loss, or even a converging training loss, to function. Maybe just follow the codebase, run the RL experiments, and see the scores? The train/test losses of LMs are not that valuable.
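For example, a rough qualitative check (just a sketch, not code from the repo; the checkpoint path and the prompt format are placeholders you would adapt to your ClubFloyd serialization) is to sample a few action candidates from the fine-tuned GPT-2 and see whether they look like sensible game commands, regardless of where the loss curves end up:

```python
# Rough sanity check: sample a few action candidates from the fine-tuned GPT-2
# and eyeball them. The checkpoint path and prompt format are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

checkpoint = "path/to/finetuned-calm-gpt2"  # placeholder path
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
model = GPT2LMHeadModel.from_pretrained(checkpoint)
model.eval()

prompt = "You are in a dimly lit kitchen. A closed cabinet hangs above the sink. >"
input_ids = torch.tensor([tokenizer.encode(prompt)])

def sample_continuation(input_ids, max_new_tokens=8, top_k=40):
    """Manual top-k sampling loop, compatible with older transformers versions."""
    ids = input_ids.clone()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids)[0]                 # (1, seq_len, vocab_size)
            topk_vals, topk_idx = torch.topk(logits[0, -1], top_k)
            probs = torch.softmax(topk_vals, dim=-1)
            next_id = topk_idx[torch.multinomial(probs, 1)]
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0, input_ids.shape[1]:].tolist())

for i in range(5):
    print(f"candidate {i + 1}: {sample_continuation(input_ids).strip()}")
```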