alternating 1 bad 1 good during training #37

Open
tig3rmast3r opened this issue Jul 25, 2024 · 4 comments
tig3rmast3r commented Jul 25, 2024

Hi Hugo,
I'm encountering a very strange behavior during training: the validation loss alternates between a higher and a lower value on each cycle. For example, here are the validation losses from my latest training:
5.92
5.89
5.93
5.88
5.92
5.88
5.91
5.86
5.90

The next one will probably be "good".
Note that the learning rate is fixed, since I'm using the RLROP scheduler.
At first I thought something was wrong with the AudioDataset shuffle from audiotools, so I disabled shuffling for the validation set and forced a reshuffle after each validation cycle, using the timestamp as the seed to make sure the ordering is different every cycle, but the loss still alternates one good and one bad.
The dataset/train loss follows the same pattern whether or not I reshuffle, so I'm wondering if there is something else I'm not aware of, or if something doesn't work as expected during the shuffle.
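For reference, the timestamp-seeded reshuffle described above can be sketched like this (a minimal stand-in operating on a plain list of file paths; `timestamp_reshuffle` is a hypothetical helper, not part of audiotools):

```python
import random
import time

def timestamp_reshuffle(files):
    # Seed a fresh RNG with the current timestamp so that each
    # validation cycle produces a different ordering of the same files.
    rng = random.Random(int(time.time()))
    shuffled = list(files)
    rng.shuffle(shuffled)
    return shuffled
```

The returned list is a permutation of the input, so no chunks are dropped or duplicated by the reshuffle itself.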

In vampnet.yml I have the following settings:

AudioDataset.without_replacement: true
AudioLoader.shuffle: true
val/AudioLoader.shuffle: false
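For context on these keys: in argbind-style configs, a scope-prefixed key such as `val/AudioLoader.shuffle` overrides the unscoped value only inside that scope. A minimal sketch of that resolution rule (a hypothetical helper for illustration, not argbind's actual implementation):

```python
def resolve(config, scope, key):
    # A scoped key like "val/AudioLoader.shuffle" wins over the
    # unscoped "AudioLoader.shuffle" when resolving inside that scope.
    return config.get(f"{scope}/{key}", config.get(key))

cfg = {
    "AudioLoader.shuffle": True,
    "val/AudioLoader.shuffle": False,
}
```

So with the settings above, the train loader shuffles while the val loader does not, which matches the console output below.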

One training cycle is exactly one epoch (90k+ chunks).

Here is what I noticed in the console output:

AudioLoader(
  # scope = train
  sources : list = ['/home/tig3mast3r/vampnet/superbig']
  weights : NoneType = None
  relative_path : str = 
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 3879310638
)
build_transform(
  # scope = train
)
AudioDataset(
  # scope = train
  n_examples : int = 100000000
  duration : float = 10.0
  offset : NoneType = None
  loudness_cutoff : float = -30.0
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x737d1793ea10>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)

The output says shuffling is enabled on the AudioLoader, but shuffle_loaders on the AudioDataset is False. I don't know if that's related.
What else could I look for? It shouldn't behave like this, assuming the training data is served in random order.
Thanks

@hugofloresgarcia
Owner

Hi!

Does this behavior occur across different random seeds too? E.g. if you restarted training using a different random seed, would you notice this same loss pattern again? Also, are the actual loss values (not the pattern) different every time you reshuffle the data or are they the same?

Also, what does the train/val loss curve look like across all iterations? Would you mind sharing some loss plots from TensorBoard?

@tig3rmast3r
Author

  • Yes, it does occur even if I reshuffle the train dataset after every validation step; I use the timestamp as the seed, so it is different each time.
  • I created an external reshuffle to test whether something was wrong with audiotools: after every validation step I did a random name swap among the train dataset file names. It still behaves the same way.
  • The val losses are slightly different each time; even with shuffling disabled on the validation dataset, I always get slightly different results when I validate the same checkpoint several times.
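One way to check whether this run-to-run validation variance comes from unseeded randomness inside the validation loop (e.g. randomly drawn mask ratios) rather than from data ordering is to re-seed with a fixed value right before each validation pass. A minimal stdlib sketch of the idea (in the real loop you would also seed numpy and torch; `val_pass` just simulates the loop's random draws):

```python
import random

def val_pass(seed, n_batches=4):
    # Re-seeding with a fixed value before validation makes any
    # randomness inside the loop (simulated here as random draws)
    # identical across repeated validations of the same checkpoint.
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(n_batches)]
```

If pinning the seed makes repeated validations of one checkpoint bit-identical, the remaining alternation must come from the model/optimizer state rather than the data pipeline.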

About TensorBoard: unfortunately I always delete the logs folder for archived trainings, but I can provide the CSV below for the latest training I did on my PC. Wherever there is a double value for the same iteration, it's because there was a resume.
I restarted the training iteration count at some points (kept only weights.pth).
With larger batch sizes (16 or 24, when I use vast.ai and multi-GPU) I've sometimes noticed bigger patterns, like 3 bad and 1 good.
I used RLROP for the learning-rate decay curve in this training, but I've also noticed this pattern in the past with Noam.
outputfix.csv

Lastly, just for testing, I resumed the above training after changing the batch size from 4 to 2; the issue is now much more evident. Here's the log (there's a resume at iteration 98922):

[03:44:14] Loading checkpoint from 3020sven-24-b2/latest                           decorators.py:220
[03:50:21] Saving to /home/tig3mast3r/vampnet11/vampnet                            decorators.py:220
           Best model so far                                                       decorators.py:220
[03:50:27] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
           ┃                             Iteration 0                             ┃                  
           ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                  
                                            train                                                   
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             accuracy-0-0.5/top1/masked            │   0.031342   │   0.031342                      
             accuracy-0-0.5/top1/unmasked          │   0.021346   │   0.021346                      
             accuracy-0-0.5/top25/masked           │   0.261384   │   0.261384                      
             accuracy-0-0.5/top25/unmasked         │   0.292282   │   0.292282                      
             accuracy-0.5-1.0/top1/masked          │   0.127371   │   0.127371                      
             accuracy-0.5-1.0/top1/unmasked        │   0.041262   │   0.041262                      
             accuracy-0.5-1.0/top25/masked         │   0.563686   │   0.563686                      
             accuracy-0.5-1.0/top25/unmasked       │   0.316748   │   0.316748                      
             loss                                  │   5.581545   │   5.581545                      
             other/batch_size                      │   2.000000   │   2.000000                      
             other/grad_norm                       │   1.059813   │   1.059813                      
             other/learning_rate                   │   0.000090   │   0.000090                      
             time/train_loop                       │ 125.917442   │ 125.917442                      
                                                   ╵              ╵                                 
                                             val                                                    
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             loss                                  │   5.014778   │   5.811567                      
             accuracy-0-0.5/top1/unmasked          │        nan   │   0.031041                      
             accuracy-0-0.5/top1/masked            │        nan   │   0.043669                      
             accuracy-0-0.5/top25/unmasked         │        nan   │   0.278061                      
             accuracy-0-0.5/top25/masked           │        nan   │   0.273002                      
             accuracy-0.5-1.0/top1/unmasked        │   0.023385   │   0.024148                      
             accuracy-0.5-1.0/top1/masked          │   0.134094   │   0.174240                      
             accuracy-0.5-1.0/top25/unmasked       │   0.289532   │   0.263240                      
             accuracy-0.5-1.0/top25/masked         │   0.537803   │   0.587320                      
             time/val_loop                         │  48.461155   │   0.191444                      
                                                   ╵              ╵                                 
           ⠏ Iteration (train) 1/2473050                         0:06:06 / -:--:--                  
           ⠏ Iteration (val)   0/583                             0:00:00 / 0:00:00                  
[06:53:40] Saving to /home/tig3mast3r/vampnet11/vampnet                            decorators.py:220
           Best model so far                                                       decorators.py:220
[06:53:51] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
           ┃                           Iteration 49461                           ┃                  
           ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                  
                                            train                                                   
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             accuracy-0-0.5/top1/masked            │   0.056538   │   0.055277                      
             accuracy-0-0.5/top1/unmasked          │   0.027350   │   0.043594                      
             accuracy-0-0.5/top25/masked           │   0.359402   │   0.304963                      
             accuracy-0-0.5/top25/unmasked         │   0.305983   │   0.344635                      
             accuracy-0.5-1.0/top1/masked          │        nan   │   0.169889                      
             accuracy-0.5-1.0/top1/unmasked        │        nan   │   0.031789                      
             accuracy-0.5-1.0/top25/masked         │        nan   │   0.578831                      
             accuracy-0.5-1.0/top25/unmasked       │        nan   │   0.295991                      
             loss                                  │   5.743196   │   5.374013                      
             other/batch_size                      │   2.000000   │   2.000000                      
             other/grad_norm                       │   1.216359   │   1.266699                      
             other/learning_rate                   │   0.000090   │   0.000090                      
             time/train_loop                       │   0.179107   │   0.178402                      
                                                   ╵              ╵                                 
                                             val                                                    
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             loss                                  │   4.348414   │   5.526126                      
             accuracy-0-0.5/top1/unmasked          │        nan   │   0.029400                      
             accuracy-0-0.5/top1/masked            │        nan   │   0.041464                      
             accuracy-0-0.5/top25/unmasked         │        nan   │   0.275722                      
             accuracy-0-0.5/top25/masked           │        nan   │   0.264649                      
             accuracy-0.5-1.0/top1/unmasked        │   0.020939   │   0.022673                      
             accuracy-0.5-1.0/top1/masked          │   0.262431   │   0.147841                      
             accuracy-0.5-1.0/top25/unmasked       │   0.230964   │   0.251378                      
             accuracy-0.5-1.0/top25/masked         │   0.664365   │   0.545866                      
             time/val_loop                         │   0.056837   │   0.108232                      
                                                   ╵              ╵                                 
           ⠏ Iteration (train) 49462/2473050                   3:09:29 / 148:48:13                  
           ⠏ Iteration (val)   0/583                           0:00:00 / 0:00:00                    
[09:57:01] Saving to /home/tig3mast3r/vampnet11/vampnet                            decorators.py:220
[09:57:06] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
           ┃                           Iteration 98922                           ┃                  
           ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                  
                                            train                                                   
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             accuracy-0-0.5/top1/masked            │   0.117355   │   0.054780                      
             accuracy-0-0.5/top1/unmasked          │   0.035052   │   0.043131                      
             accuracy-0-0.5/top25/masked           │   0.510193   │   0.303911                      
             accuracy-0-0.5/top25/unmasked         │   0.385567   │   0.334277                      
             accuracy-0.5-1.0/top1/masked          │   0.125369   │   0.198964                      
             accuracy-0.5-1.0/top1/unmasked        │   0.024364   │   0.032152                      
             accuracy-0.5-1.0/top25/masked         │   0.505900   │   0.618184                      
             accuracy-0.5-1.0/top25/unmasked       │   0.309322   │   0.299244                      
             loss                                  │   5.131485   │   5.686758                      
             other/batch_size                      │   2.000000   │   2.000000                      
             other/grad_norm                       │   1.511617   │   1.245741                      
             other/learning_rate                   │   0.000090   │   0.000090                      
             time/train_loop                       │   0.179003   │   0.178563                      
                                                   ╵              ╵                                 
                                             val                                                    
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             loss                                  │   6.730142   │   5.826073                      
             accuracy-0-0.5/top1/unmasked          │   0.000000   │   0.029817                      
             accuracy-0-0.5/top1/masked            │   0.002178   │   0.042216                      
             accuracy-0-0.5/top25/unmasked         │   0.000000   │   0.281199                      
             accuracy-0-0.5/top25/masked           │   0.063589   │   0.266826                      
             accuracy-0.5-1.0/top1/unmasked        │        nan   │   0.023898                      
             accuracy-0.5-1.0/top1/masked          │        nan   │   0.178145                      
             accuracy-0.5-1.0/top25/unmasked       │        nan   │   0.262936                      
             accuracy-0.5-1.0/top25/masked         │        nan   │   0.589535                      
             time/val_loop                         │   0.057392   │   0.109555                      
                                                   ╵              ╵                                 
           ⠙ Iteration (train) 98923/2473050 ╸                 6:12:45 / 144:34:17                  
           ⠙ Iteration (val)   0/583                           0:00:00 / 0:00:00                    
[04:08:00] Loading checkpoint from 3020sven-24-b2/latest                           decorators.py:220
[04:14:59] Loading checkpoint from 3020sven-24-b2/latest                           decorators.py:220
[04:20:40] Saving to /home/tig3mast3r/vampnet11/vampnet                            decorators.py:220
[04:20:44] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
           ┃                           Iteration 98922                           ┃                  
           ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                  
                                            train                                                   
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             accuracy-0-0.5/top1/masked            │   0.083383   │   0.083383                      
             accuracy-0-0.5/top1/unmasked          │   0.019704   │   0.019704                      
             accuracy-0-0.5/top25/masked           │   0.442342   │   0.442342                      
             accuracy-0-0.5/top25/unmasked         │   0.256158   │   0.256158                      
             accuracy-0.5-1.0/top1/masked          │   0.073171   │   0.073171                      
             accuracy-0.5-1.0/top1/unmasked        │   0.070388   │   0.070388                      
             accuracy-0.5-1.0/top25/masked         │   0.432927   │   0.432927                      
             accuracy-0.5-1.0/top25/unmasked       │   0.447816   │   0.447816                      
             loss                                  │   5.470975   │   5.470975                      
             other/batch_size                      │   2.000000   │   2.000000                      
             other/grad_norm                       │   1.042029   │   1.042029                      
             other/learning_rate                   │   0.000090   │   0.000090                      
             time/train_loop                       │ 112.012772   │ 112.012772                      
                                                   ╵              ╵                                 
                                             val                                                    
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             loss                                  │   5.033357   │   5.823171                      
             accuracy-0-0.5/top1/unmasked          │        nan   │   0.030750                      
             accuracy-0-0.5/top1/masked            │        nan   │   0.042924                      
             accuracy-0-0.5/top25/unmasked         │        nan   │   0.276792                      
             accuracy-0-0.5/top25/masked           │        nan   │   0.270614                      
             accuracy-0.5-1.0/top1/unmasked        │   0.018931   │   0.023358                      
             accuracy-0.5-1.0/top1/masked          │   0.134807   │   0.172273                      
             accuracy-0.5-1.0/top25/unmasked       │   0.276169   │   0.260654                      
             accuracy-0.5-1.0/top25/masked         │   0.526391   │   0.584483                      
             time/val_loop                         │  44.559895   │   0.184471                      
                                                   ╵              ╵                                 
           ⠏ Iteration (train) 98923/2473050 ╸                   0:05:38 / -:--:--                  
           ⠏ Iteration (val)   0/583                             0:00:00 / 0:00:00                  
[07:23:45] Saving to /home/tig3mast3r/vampnet11/vampnet                            decorators.py:220
[07:23:49] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
           ┃                          Iteration 148383                           ┃                  
           ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                  
                                            train                                                   
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             accuracy-0-0.5/top1/masked            │   0.040847   │   0.054950                      
             accuracy-0-0.5/top1/unmasked          │   0.059829   │   0.043299                      
             accuracy-0-0.5/top25/masked           │   0.314819   │   0.303498                      
             accuracy-0-0.5/top25/unmasked         │   0.427350   │   0.342117                      
             accuracy-0.5-1.0/top1/masked          │        nan   │   0.169778                      
             accuracy-0.5-1.0/top1/unmasked        │        nan   │   0.031349                      
             accuracy-0.5-1.0/top25/masked         │        nan   │   0.578397                      
             accuracy-0.5-1.0/top25/unmasked       │        nan   │   0.292735                      
             loss                                  │   5.940801   │   5.377327                      
             other/batch_size                      │   2.000000   │   2.000000                      
             other/grad_norm                       │   1.548968   │   1.267261                      
             other/learning_rate                   │   0.000090   │   0.000090                      
             time/train_loop                       │   0.178283   │   0.178510                      
                                                   ╵              ╵                                 
                                             val                                                    
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             loss                                  │   4.350002   │   5.526492                      
             accuracy-0-0.5/top1/unmasked          │        nan   │   0.030643                      
             accuracy-0-0.5/top1/masked            │        nan   │   0.041276                      
             accuracy-0-0.5/top25/unmasked         │        nan   │   0.284138                      
             accuracy-0-0.5/top25/masked           │        nan   │   0.264920                      
             accuracy-0.5-1.0/top1/unmasked        │   0.022208   │   0.022672                      
             accuracy-0.5-1.0/top1/masked          │   0.261050   │   0.148421                      
             accuracy-0.5-1.0/top25/unmasked       │   0.229695   │   0.251670                      
             accuracy-0.5-1.0/top25/masked         │   0.679558   │   0.545708                      
             time/val_loop                         │   0.059183   │   0.108220                      
                                                   ╵              ╵                                 
           ⠸ Iteration (train) 148384/2473050 ╸                3:08:43 / 141:29:35                  
           ⠸ Iteration (val)   0/583                           0:00:00 / 0:00:00                    
[10:26:51] Saving to /home/tig3mast3r/vampnet11/vampnet                            decorators.py:220
[10:26:55] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
           ┃                          Iteration 197844                           ┃                  
           ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                  
                                            train                                                   
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             accuracy-0-0.5/top1/masked            │   0.107989   │   0.054989                      
             accuracy-0-0.5/top1/unmasked          │   0.016495   │   0.044490                      
             accuracy-0-0.5/top25/masked           │   0.479890   │   0.304197                      
             accuracy-0-0.5/top25/unmasked         │   0.274227   │   0.336267                      
             accuracy-0.5-1.0/top1/masked          │   0.146755   │   0.199176                      
             accuracy-0.5-1.0/top1/unmasked        │   0.031780   │   0.032188                      
             accuracy-0.5-1.0/top25/masked         │   0.573746   │   0.618924                      
             accuracy-0.5-1.0/top25/unmasked       │   0.309322   │   0.299661                      
             loss                                  │   5.113540   │   5.685070                      
             other/batch_size                      │   2.000000   │   2.000000                      
             other/grad_norm                       │   1.068045   │   1.274413                      
             other/learning_rate                   │   0.000081   │   0.000081                      
             time/train_loop                       │   0.178242   │   0.178444                      
                                                   ╵              ╵                                 
                                             val                                                    
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             loss                                  │   6.739600   │   5.825361                      
             accuracy-0-0.5/top1/unmasked          │   0.000000   │   0.028623                      
             accuracy-0-0.5/top1/masked            │   0.000871   │   0.042174                      
             accuracy-0-0.5/top25/unmasked         │   0.000000   │   0.264988                      
             accuracy-0-0.5/top25/masked           │   0.060105   │   0.266872                      
             accuracy-0.5-1.0/top1/unmasked        │        nan   │   0.023251                      
             accuracy-0.5-1.0/top1/masked          │        nan   │   0.178347                      
             accuracy-0.5-1.0/top25/unmasked       │        nan   │   0.255530                      
             accuracy-0.5-1.0/top25/masked         │        nan   │   0.590395                      
             time/val_loop                         │   0.058676   │   0.107920                      
                                                   ╵              ╵                                 
           ⠇ Iteration (train) 197845/2473050 ━                6:11:49 / 138:37:12                  
           ⠇ Iteration (val)   0/583                           0:00:00 / 0:00:00                    
[13:30:11] Saving to /home/tig3mast3r/vampnet11/vampnet                            decorators.py:220
[13:30:15] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
           ┃                          Iteration 247305                           ┃                  
           ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                  
                                            train                                                   
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             accuracy-0-0.5/top1/masked            │        nan   │   0.055067                      
             accuracy-0-0.5/top1/unmasked          │        nan   │   0.043970                      
             accuracy-0-0.5/top25/masked           │        nan   │   0.304022                      
             accuracy-0-0.5/top25/unmasked         │        nan   │   0.345185                      
             accuracy-0.5-1.0/top1/masked          │   0.223970   │   0.170392                      
             accuracy-0.5-1.0/top1/unmasked        │   0.032293   │   0.031604                      
             accuracy-0.5-1.0/top25/masked         │   0.698482   │   0.579534                      
             accuracy-0.5-1.0/top25/unmasked       │   0.319666   │   0.293879                      
             loss                                  │   4.319188   │   5.374097                      
             other/batch_size                      │   2.000000   │   2.000000                      
             other/grad_norm                       │   1.391422   │   1.309109                      
             other/learning_rate                   │   0.000081   │   0.000081                      
             time/train_loop                       │   0.176921   │   0.179162                      
                                                   ╵              ╵                                 
                                             val                                                    
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             loss                                  │   6.733348   │   5.532683                      
             accuracy-0-0.5/top1/unmasked          │   0.000000   │   0.029171                      
             accuracy-0-0.5/top1/masked            │   0.004367   │   0.041208                      
             accuracy-0-0.5/top25/unmasked         │   0.300000   │   0.280217                      
             accuracy-0-0.5/top25/masked           │   0.080349   │   0.265034                      
             accuracy-0.5-1.0/top1/unmasked        │        nan   │   0.022550                      
             accuracy-0.5-1.0/top1/masked          │        nan   │   0.146544                      
             accuracy-0.5-1.0/top25/unmasked       │        nan   │   0.250341                      
             accuracy-0.5-1.0/top25/masked         │        nan   │   0.543372                      
             time/val_loop                         │   0.058948   │   0.108363                      
                                                   ╵              ╵                                 
           ⠋ Iteration (train) 247306/2473050 ━╸               9:15:09 / 136:07:23                  
           ⠋ Iteration (val)   0/583                           0:00:00 / 0:00:00                    
[16:33:36] Saving to /home/tig3mast3r/vampnet11/vampnet                            decorators.py:220
[16:33:41] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
           ┃                          Iteration 296766                           ┃                  
           ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛                  
                                            train                                                   
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             accuracy-0-0.5/top1/masked            │   0.006087   │   0.055028                      
             accuracy-0-0.5/top1/unmasked          │        nan   │   0.044229                      
             accuracy-0-0.5/top25/masked           │   0.106957   │   0.304107                      
             accuracy-0-0.5/top25/unmasked         │        nan   │   0.338769                      
             accuracy-0.5-1.0/top1/masked          │   0.333333   │   0.200244                      
             accuracy-0.5-1.0/top1/unmasked        │   0.018735   │   0.032348                      
             accuracy-0.5-1.0/top25/masked         │   0.769697   │   0.619920                      
             accuracy-0.5-1.0/top25/unmasked       │   0.242623   │   0.300489                      
             loss                                  │   6.452039   │   5.684814                      
             other/batch_size                      │   2.000000   │   2.000000                      
             other/grad_norm                       │   0.435003   │   1.324875                      
             other/learning_rate                   │   0.000073   │   0.000073                      
             time/train_loop                       │   0.179577   │   0.179045                      
                                                   ╵              ╵                                 
                                             val                                                    
                                                   ╷              ╷                                 
             key                                   │ value        │ mean                            
           ╶───────────────────────────────────────┼──────────────┼──────────────╴                  
             loss                                  │   4.054376   │   5.821749                      
             accuracy-0-0.5/top1/unmasked          │        nan   │   0.030596                      
             accuracy-0-0.5/top1/masked            │        nan   │   0.042155                      
             accuracy-0-0.5/top25/unmasked         │        nan   │   0.277808                      
             accuracy-0-0.5/top25/masked           │        nan   │   0.269246                      
             accuracy-0.5-1.0/top1/unmasked        │   0.016103   │   0.023027                      
             accuracy-0.5-1.0/top1/masked          │   0.270023   │   0.176346                      
             accuracy-0.5-1.0/top25/unmasked       │   0.224906   │   0.253517                      
             accuracy-0.5-1.0/top25/masked         │   0.750572   │   0.587035                      
             time/val_loop                         │   0.057713   │   0.108714                      
                                                   ╵              ╵                                 
           ⠏ Iteration (train) 296767/2473050 ━╸              12:18:35 / 132:37:49                  
           ⠏ Iteration (val)   0/583                          0:00:00  / 0:00:00                    

Another strange thing: if I redo a validation on the checkpoint that reported a 5.52 loss, I get roughly 5.8 instead.
It looks like the "good" results are somewhat "fake".

@tig3rmast3r
Author

image
Here's a graph from the CSV (with redundant rows removed).
Below is another graph from an older training run; the issue almost disappears with larger batch sizes.
image
And another one:
image

@tig3rmast3r
Author

tig3rmast3r commented Aug 28, 2024

About the validation loss, I did some more tests: the values reported during training can't be used as an exact metric to compare results.
If I redo a validation (resuming with --resume, keeping only weights.pth and changing no other parameter), repeated runs give the same values within ±0.001, and those values differ from the ones reported by the train loop.
Even with a higher batch size (I rented a 4x 4090 machine and resumed the training above, so batch size 20), which minimizes the alternation issue, the results still differ.
Here's a comparison between the values reported during training and the values from a comparable standalone validation:

| step | val during training | comparable val | delta  |
|------|---------------------|----------------|--------|
| 74k  | 5.741               | 5.757          | -0.016 |
| 79k  | 5.759               | 5.761          | -0.002 |
| 84k  | 5.734               | 5.757          | -0.023 |
| 108k | 5.760               | 5.752          | +0.008 |
| 123k | 5.724               | 5.747          | -0.023 |
| 133k | 5.737               | 5.748          | -0.011 |
| 138k | 5.737               | 5.744          | -0.007 |
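As a sanity check on the table above, a few lines of Python make the mismatch explicit: the checkpoint with the lowest in-training validation loss is not the one with the lowest standalone loss (values are copied from the table; the variable names are mine):

```python
# (step, in-training val loss, standalone "comparable" val loss) from the table above
rows = [
    (74_000, 5.741, 5.757),
    (79_000, 5.759, 5.761),
    (84_000, 5.734, 5.757),
    (108_000, 5.760, 5.752),
    (123_000, 5.724, 5.747),
    (133_000, 5.737, 5.748),
    (138_000, 5.737, 5.744),
]

# argmin over each metric
best_in_training = min(rows, key=lambda r: r[1])[0]  # 123000
best_standalone = min(rows, key=lambda r: r[2])[0]   # 138000

print(best_in_training, best_standalone)
```

The two selections disagree, which is exactly the "wrong best checkpoint" problem described below the table.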

So while the values reported during training appear to go up and down, the truth is that the loss is less jumpy, and the "good" values are not actually that good.
This also means the "best" checkpoint detected by the train loop may not be the right one:
in the table above the train loop still marks 123k as the best, but by the standalone measurement it has already been beaten by 138k.
It would be great to have a fixed validation setup during training, detached from the train settings and unaffected by dropout, the current learning rate, or other parameters, so the results are more consistent.
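A minimal sketch of what such a detached validation could look like, assuming a generic PyTorch training loop (this is not vampnet's actual API — `deterministic_validation`, `val_loader`, and `loss_fn` are illustrative names): fix the RNG seed so any random masking is identical for every checkpoint, switch to `eval()` so dropout is disabled, and restore the training RNG state afterwards so the train loop is unaffected:

```python
import torch

def deterministic_validation(model, val_loader, loss_fn, device="cpu", seed=0):
    """Run validation with a fixed seed and dropout disabled, so the
    resulting number is directly comparable across checkpoints."""
    rng_state = torch.get_rng_state()  # save the training RNG state
    torch.manual_seed(seed)            # fixed seed -> same random masks each run
    model.eval()                       # disables dropout / batchnorm updates
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in val_loader:
            out = model(x.to(device))
            total += loss_fn(out, y.to(device)).item()
            n += 1
    model.train()                      # hand the model back to the train loop
    torch.set_rng_state(rng_state)     # restore the training RNG state
    return total / n
```

Calling this with the same checkpoint twice should return the same loss to within floating-point tolerance, which matches the ±0.001 reproducibility you saw with standalone `--resume` runs.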

Lastly, this behavior affects both the train and val loops; I always see almost the same delta between the train and val losses.
