- Num epochs
- Max number of Epochs (15) with a very quick StopFunction (0.1 diff, patience 2) was reached only:
- With Hidden Width = 10 (clearly sub-optimal)
- With accuracy not approaching great results (range 0.8-0.9)
- With large batch size (1000) - probably with a smaller batch size the best result was reached quicker
- Mostly with larger num layers (5) - does this mean that runs with fewer layers gave up sooner, or that they reached their best sooner?
- Number of epochs > 10, but < 15
- No great results .88-.95
- Mostly hidden width = 10 (a few with hidden width 64 gave relatively good results)
- Mostly large batch size (1000), a few with small batch size (100) finished relatively quickly (11 epochs)
- 3 epochs only with worst result (0.1)
- 4-6
- some of the best results in all categories
- Most batch size 100 (smaller)
- Most num layers 5 (larger)
- Most Hidden width larger 64 (with smaller much worse results)
- 7-10
- Some of the best accuracy, but not best time
- Thoughts
- Seems that more epochs are not needed
- With the given hyperparameters, could even use fewer epochs, but since the StopFunction will be made slower, need to leave this number and check again whether it is enough / too much
- Average epoch time
- Range 6 - 11
- Worse with smaller batch size, better with larger batch size
- < 7
- Larger batch size (1000)
- Usually worse total times and larger number of iterations
- > 9
- Small batch sizes (100)
- Large number of layers (5)
- Some of the best accuracies
- Thoughts:
- Not a helpful metric, since it correlates highly with batch size - fewer, slower epochs with a smaller batch size could be better
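One way to see why epoch time tracks batch size: the number of weight updates per epoch is the training-set size divided by the batch size. A minimal sketch, assuming a hypothetical training set of 50,000 examples (the notes do not state the split size); `steps_per_epoch` is an illustrative helper, not part of the original code:

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of gradient updates needed to see every example once."""
    return math.ceil(num_examples / batch_size)


TRAIN_SIZE = 50_000  # hypothetical value, not taken from the notes

for batch_size in (100, 1000):
    # Smaller batches mean many more updates per epoch, hence slower epochs.
    print(batch_size, steps_per_epoch(TRAIN_SIZE, batch_size))
```

With these assumed numbers, batch size 100 needs ten times as many updates per epoch as batch size 1000, so a slow epoch says little about the quality of the run.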
- Hidden width [10,64]
- 10
- Never gives good accuracy
- Sometimes gives good times, but not with good accuracy
- Thoughts:
- Need to enlarge, see if more than 64 is better, or less
- Batch size [100,1000]
- 1000 - never gives best results (perhaps because of quick StopFunction)
- Thoughts:
- 100 seems better than 1000, need to try something in-between
- Num layers [4, 5]
- Best accuracy results both in 4 and 5, but more in 5
- Time-wise, 5 seems to be better
- Thoughts:
- Can't give up on 4 completely
- Some time in the future, need to try 3 to see if it's much worse than 4
- 6 might possibly be better - need to try
- Accuracy
- < .9
- Large batch size (1000)
- Small hidden width size (10)
- A lot of sigmoid functions
- Also usually large times
- [.978 - .987] - best
- All batch size 100
- All width 64
- Usually comes with pretty good times too
- 4-8 epochs
- Shortest time - 30 sec, longest - 90 sec
- > .95, and <.978
- Large number of batch size 1000 !
- All width 64
- Usually comes with pretty good times too
- 4-9 epochs
- Thoughts:
- Can dismiss width 10
- Batch size 100 seems better, but can't completely dismiss 1000
- Train time
- 21 - 107 seconds range
- > 100
- 15 epochs
- Batch size 1000
- Hidden width 10
- None of the best times
- [84, 100]
- Mostly large number of epochs (13-15)
- Mostly batch size 1000
- Mostly not great accuracies
- 21 - didn't learn
- [30-34]
- With batch size 100 and Hidden width 64
- some of the best accuracies too and some of the best efficiencies
- smallest number of epochs (4-5)
- [34-50]
- Mostly batch size 100
- Good results with hidden width 64
- Mostly 4-7 epochs
- Good ones 7-8 seconds per epoch
- Efficiency (accuracy / time)
- 0.005 to .032
- [0.031, 0.032]
- Batch size 100
- Num layers varies
- Funcs relu, tanh, [tanh]
- Number of epochs - 4
- [0.025, 0.026]
- Batch size 100
- Num layers - 5 (longer to reach than varying 4 above)
- Hidden width - 64
- Functions - relu first, but sometimes also tanh
- < .01
- Batch size 1000
- Usually num layers = 5, and not 4
- Hidden width = 10
- Usually 15 epochs
- Num layers - not clear what's better - 5 sometimes gives better results but is usually slower. Try 3 to prove that it's worse, try 6 to prove that it's worse than 5 because of the time it takes
- Batch size - 100 is usually better than 1000 but not by knockout - try something in the middle, and lower than 100
- Hidden width - 64 is clearly better than 10, need to try higher and lower than 64
- Hidden funcs - 'relu' and 'tanh' should not be removed, possibly need to add 'softmax'
- ACCURACY_IMPROVEMENT_DELTA, ACCURACY_IMPROVEMENT_PATIENCE - 0.01 and 2. Patience leave 2 since cycles take a lot of time. Better to have delta 0.001 at least, but will take much longer, so for brute solutions, leave 0.01
- MAX_NUM_EPOCHS - was 15, leave 15 for if doing delta 0.001, can put 10 if for now staying with delta 0.01
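The trade-off between delta 0.01 and delta 0.001 can be sketched in plain Python. This is a minimal illustration, assuming the StopFunction stops once `patience` consecutive epochs fail to beat the best accuracy so far by more than the delta (the same idea as `min_delta` and `patience` in `tf.keras.callbacks.EarlyStopping`). The function `stop_epoch` and the accuracy curve are hypothetical, not taken from the original code or runs:

```python
def stop_epoch(accuracies, min_delta, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs fail to improve on
    the best accuracy seen so far by more than `min_delta`.
    """
    best = float("-inf")
    waited = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc - best > min_delta:
            best = acc
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(accuracies)  # ran into the epoch cap (MAX_NUM_EPOCHS)


# Hypothetical validation-accuracy curve, one value per epoch.
curve = [0.90, 0.94, 0.955, 0.960, 0.963, 0.965, 0.967,
         0.9685, 0.9690, 0.9693, 0.9695]

print(stop_epoch(curve, min_delta=0.01, patience=2))   # 5
print(stop_epoch(curve, min_delta=0.001, patience=2))  # 10
```

On this made-up curve the stricter delta doubles the number of epochs and ends at a higher accuracy, which is the cost/benefit described above.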
- Layers
- Layer 3
- Some of the worst accuracies (92-96)
- didn't use all 10 epochs
- Some of the best efficiencies because of good times
- Layer 4
- Not great accuracies
- Layer 5
- Some of the better accuracies and efficiencies
- Layer 6
- Mixed - sometimes best, sometimes worst in all categories
- Accuracy
- Best - Some 4, some 5, some 6 layers, none 3
- Top 1/3 - 5 and 6
- Top 10% - all 6
- 4 appears in top 40% - .011 difference from best
- Efficiency (accuracy / time)
- Top 10% already has 6 and 5
- Top 27% has 4 already, but not with best result
- With the given other parameters, 5 seems best, but 4 and 6 should be left in the trials (also interesting to check if 3 and 7 for sure worse)
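The "top 10% / top 40%" comparisons above can be reproduced with a small helper. A minimal sketch, assuming each run is stored as a dict; the helper name `top_fraction_counts` and all rows are hypothetical illustration data, not the actual results:

```python
from collections import Counter


def top_fraction_counts(rows, metric, param, fraction):
    """Count the values of `param` among the best `fraction` of rows by `metric`."""
    ranked = sorted(rows, key=lambda r: r[metric], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return Counter(r[param] for r in ranked[:keep])


# Hypothetical runs, for illustration only.
rows = [
    {"num_layers": 6, "accuracy": 0.987},
    {"num_layers": 5, "accuracy": 0.986},
    {"num_layers": 6, "accuracy": 0.985},
    {"num_layers": 4, "accuracy": 0.976},
    {"num_layers": 5, "accuracy": 0.975},
    {"num_layers": 4, "accuracy": 0.972},
    {"num_layers": 6, "accuracy": 0.970},
    {"num_layers": 3, "accuracy": 0.960},
    {"num_layers": 3, "accuracy": 0.950},
    {"num_layers": 6, "accuracy": 0.930},
]

print(top_fraction_counts(rows, "accuracy", "num_layers", 0.1))
print(top_fraction_counts(rows, "accuracy", "num_layers", 0.4))
```

The same helper works for efficiency or any other metric column by changing the `metric` argument.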
- Accuracy
- 100s - vast majority in top half, a few from 500 and 1000 either take longer, or not the top accuracy
- Efficiency (accuracy/time):
- Top half are 100s with best accuracies, or 500 / 1000 not with best accuracy
- 100 seems better than alternatives, look around 100
- Batch size 500 is never best both in accuracy and efficiency together
- Top half in Accuracy and Efficiency (accuracy/time):
- Batch sizes 50, 100, 250 have all some of the best results
- 250 is not in both accuracy and time results
- Need to rerun with not such a quick StopFunction - at least 0.001, and 15 iterations
- Can drop 500, can introduce a number between 50 and 100, and 100 and 250
- Also noticed that my functions didn't give all options, rerunning also for that reason
Conclusions 5 - batch size: [50,75,100,170,250] + much slower StopFunction + all function variations
- Seems to have some overfitting on validate accuracy
- For some of the best results, 25 epochs were not enough
- Batch
- 50 - some of the best accuracies and efficiencies
- 75 - some of the best accuracies and efficiencies
- 100 - same
- 170 & 250 - some of the best accuracies, but not the most efficient
- Accuracy
- absolutely best is with 100
- some of the best in each one of the categories: 50,75,100,170,250
- Efficiency
- Some of the best in all categories besides 250, but accuracies are not the best
- Accuracy product
- Best .9945
- Some of the best in all categories, including 250
- Accuracy product per time
- Some of the best in all categories, besides 250, but accuracies not good
- Best both accuracy product and Accuracy Product per Time
- Accuracy Product - around .99 (.9905-.9912) with best being .9945
- Add 2 new metrics: product of accuracies and efficiency of product (product of accuracies / time) - to make sure there is no overfitting on validate accuracy
- Enlarge number of epochs to 30, since for some of the best results 25 epochs were not enough
- The whole range seems to produce some of the best results, with batch 250 possibly less so. Perhaps run 150 as default, and 50 and 250 as extras from now on
- Next step: try different number of widths: 25, 50, 75
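The two new metrics can be written down directly. A minimal sketch: the function names are illustrative, and the example values 0.9998 (validate) and 0.9995 (train) are the ones reported later in these notes for the best Conclusions 8 run; the overfitting example and the 200-second time are hypothetical:

```python
import math


def accuracy_product(*accuracies: float) -> float:
    """Product of accuracies (e.g. train * validate); low if any one is low."""
    return math.prod(accuracies)


def product_per_time(train_seconds: float, *accuracies: float) -> float:
    """Efficiency of the product: accuracy product divided by training time."""
    return accuracy_product(*accuracies) / train_seconds


# Validate 0.9998 and train 0.9995, as reported for the best run.
print(round(accuracy_product(0.9998, 0.9995), 4))  # 0.9993

# Hypothetical overfit-to-validation case: the product exposes the weak side.
print(round(accuracy_product(0.999, 0.95), 4))

# Hypothetical training time of 200 seconds.
print(product_per_time(200.0, 0.9998, 0.9995))
```

Because the product is dragged down by whichever accuracy is lowest, a run that looks good only on validation no longer ranks near the top.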
Conclusions 6 - hidden width size: [25,50,75] + much slower StopFunction + all function variations
- Epochs - max 25 - Some of the best results reached maximum of 25 epochs, but also many of the bad ones
- Width 25 - some of the worst in all parameters
- Width 50 - some of the best in accuracies, but not so much efficiency
- Width 75 - best accuracies, some of the best efficiencies
- Accuracy - Most of best width 75, some 50
- Product accuracy - best 0.9956 - vast majority 75
- Width 75 seems best, need to check if going up helps (100?)
- For now setting on 75
- Next run: try if 4 layers give drastically worse results
- Num epochs - seems that 25 was enough
- Got 0.9964 product accuracy - so 4 layers seems more than enough
- Leave 4 layers for now, play with other parameters - fine tune other parameters on 4 layers
Conclusions 8 - 4 layers with different options of all 4 activation functions, batch_sizes = [100, 150, 200], hidden_widths = [60, 75, 100]
- Number of epochs - 25
- 1 of the best ones didn't finish, but most didn't need more, and there seem to be plenty of similar ones that did finish
- A third almost reached 25 epochs, so it's good we stopped at 25, otherwise it would take much longer
- Activation functions
- softmax is not helpful ever
- sigmoid is never first, but in some cases not a lot behind. Can for now remove as first, since plenty similar
- Accuracies Product - best .9993 (Validate 0.9998, Train 0.9995)
- Top 5% - .9981-.9993
- Different batch sizes [100, 150, 200], but more 200, and where not 200, there is another similar result in 200
- Hidden widths all largest 100
- First activation function is always relu / tanh
- Best: Batch size 200, Hidden funcs - ('tanh', 'relu'), Hidden width - 100
- If would continue:
- Width - possibly more than 100 could do better?
- Batch size - 200 is best or same, perhaps try higher?
- Remove softmax function altogether, remove sigmoid as first
- Need to check with endless iterations, but batch size largest. Possibly same or better result, but longer?
- Need to check with different seed to make sure what was chosen was not luck for the specific split
- Need to check without batch sizes at all but with endless epochs to see if always get better and consistent results
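The search space for Conclusions 8 can be enumerated as a full grid. A minimal sketch, assuming the layer count includes the input and output layers, so a 4-layer network has 2 hidden activation functions (this matches the 2-element tuples such as ('tanh', 'relu') reported for 4 layers). `build_grid` is a hypothetical helper, not the original code:

```python
import itertools

ACTIVATIONS = ("relu", "tanh", "sigmoid", "softmax")
BATCH_SIZES = (100, 150, 200)
HIDDEN_WIDTHS = (60, 75, 100)


def build_grid(num_layers):
    """Return every (activation tuple, batch size, hidden width) combination."""
    # Assumption: num_layers counts input and output, so hidden = num_layers - 2.
    num_hidden = num_layers - 2
    func_combos = itertools.product(ACTIVATIONS, repeat=num_hidden)
    return [
        {"hidden_funcs": funcs, "batch_size": batch, "hidden_width": width}
        for funcs, batch, width in itertools.product(
            func_combos, BATCH_SIZES, HIDDEN_WIDTHS
        )
    ]


grid = build_grid(4)
print(len(grid))  # 16 activation pairs * 3 batch sizes * 3 widths = 144
```

Dropping softmax altogether, as suggested above, would shrink the 4-layer grid from 144 to 81 combinations.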
Conclusions 9 - running single function that was found to be the best numerous times with different seeds to check if it's consistent
- Ran 5 times: Accuracies Product was .995 or above in 4/5 runs (best .9993), but .98 in the remaining run
- Check if it's a matter of batching, and whether it's better perhaps not to batch to get best results
- Took very very long, and results (at least with current StopFunction) are very bad - accuracy 0.92
- Decided that batches are needed, even if they are not very small
- Try with even slower StopFunction (0.0001 and not 0.001 delta that was used till now)
- Need to check again if reading the data, or preparing the data causes the fluctuation between results
- Try with much larger batches than 100 or 150 used till now - perhaps could have better results without paying too much in time
Conclusions 11 - running with even slower StopFunction (0.0001 and not 0.001 delta that was used till now)
- Number of epochs needed most: 24, usually lower
- Product accuracy - between .9946 and .9993 (average of 5: .9977)
- Training time - 2-3 minutes
- Extra time / extra number of epochs done due change of StopFunction is not drastic, but results improved, although they are not exactly the same
- Previous conclusion: - Need to check again if reading the data, or preparing the data causes the fluctuation between results
Conclusions 12 - reading the data moved to be performed every time in the loop, see if it makes differences larger (if data read from tensorflow_datasets different / different order every time)
- Accuracies Product - between .9976 and .9999, so it didn't help to move getting data inside the loop to get more different results
- Accuracies Product - one of them was .9999 (99.99%), which would mean making a mistake on 7 out of 70,000 results
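The "7 out of 70,000" figure is simple arithmetic on the error rate. A minimal sketch; note that it treats .9999 as a single accuracy over all 70,000 examples, whereas the reported number is a product of accuracies, so this is only an approximation. `expected_errors` is an illustrative name:

```python
def expected_errors(accuracy: float, num_examples: int) -> int:
    """Approximate number of misclassified examples at a given accuracy."""
    return round((1.0 - accuracy) * num_examples)


print(expected_errors(0.9999, 70_000))  # 7
```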
- Previous conclusion: Try with much larger batches than 100 or 150 used till now - perhaps could have better results without paying too much in time
Conclusions 13 - working with batch number 500 and not 200 as till now - see if makes time worse/better, and if results stay the same
- Accuracies Product - 4/5 with .9999, and one with .9967 (and even that one had Validate Best of .9988 (.999 rounded) and Accuracy Train of .9979 (.998 rounded))
- Accuracies Validate - 4/5 with 1.0000! and Accuracy Train .9998 or .9999 (rounded to 1.000)
- Number of epochs - 22 to 33, for 4 better models around 32 on average
- Time - 3 minutes for the worse model, and ~3.5 minutes for the better ones - worth it to get a bit higher accuracy
- Try with batch size 1000
- Good but not as good results - accuracies products around .999 but not .9999 as before
- Try with 500, and then go back to 200 if 500 also not as good
- Good but not as good results - accuracies products around .999 but not .9999 as before
- Times - around same as 200
- Go back to 200 batch size
Default in tf.keras.optimizers.Adam is 0.001, tried with 0.0001
- Took longer - 400-600 seconds (vs 200-300 with 0.001)
- Many more epochs - 55-80
- Accuracy was not great - product .98-.99 (stop function stops before reached the best result?)
- Try with larger rate to see if will give same results as 0.001 but quicker. Try 0.02 as suggested in the lecture
- Results - accuracies are not good - product .92
- Go back to much lower learning rates - try 0.005
- Results - still too fast - accuracy bad: product .98
- Time is better 1.5 minutes vs. 3-3.5 minutes
- Go back to much lower learning rates - try 0.002 - twice as much as default
- Results - still too fast - accuracy worse: product .995, but not .999
- Time took - slightly better than default learning rate 0.001
- Go back to much lower learning rates - go back to 0.001
- Much slower (700 sec instead of 300)
- Results not as good (accuracy product ~ .97-.985 instead of .995)
- Try with batch size 400 (twice as much as current 200)
- Same or faster (200-350 sec instead of 300-400)
- Accuracies same or better (accuracy product ~ .9995+ instead of .9985-.9995)
- Consider changing to this later, try 100 first
- Faster (100-200 sec instead of 300-400)
- Accuracies worse (accuracy product ~ .995+ instead of .9985-.9995)
- Change to 300
- Added testing - it gives lower results than expected - .97-.98, while validation and train accuracy is close to .9999
- Based on that, stopped restoring the weights from the best validation result and kept the last weights instead - simpler, and will make it possible to find the best accuracy and patience
- Test accuracy: best .979, range .977-.979
- Product accuracy: best: .979, range .975-.979
- Validate and train accuracies: All .999 and up
- Train time - 205 seconds average, 170-230
- Num epochs 23-26
- Need another cycle with a lot of different parameters
- Test accuracy: best .981, range .977-.981 (same or better than 100)
- Product accuracy: best: .976, range .966-.976 (same or worse than 100)
- Validate and train accuracies: worse, .997-.998 (worse than 100)
- Train time - 145 seconds average (less than 205 of 100)
- Num epochs 13-16 - much less than 23-26 of 100
- 500 doesn't seem to be better, but can't be ruled out altogether. Try with 200
- Test accuracy: best .983, range .977-.983 (probably better than 100)
- Product accuracy: best: .983, range .968-.983 (same or better than 100)
- Validate and train accuracies: worse, .997-.998 (worse than 100)
- Train time - 210 seconds average (a bit more than 100)
- Num epochs 20-26 - less than 23-26 of 100
- Seems makes test better while train/validate worse. Leave 200
- Goal: double check that softmax doesn't help with 4 layers
- Test Accuracy is some of the worst with softmax functions
- However, that could be because our StopFunction is still too strict
- Remove for now softmax, work on making the StopFunction better, then try again with softmax
- Goal: check how to improve stop function - let it run for longer
- val_accuracy keeps on being the same, while val_loss and loss are still improving
- Start using val_loss as the stop function. Need to play with patience. Does it depend also on how fast we are learning? For example smaller batch size or larger learning rate = need to wait less to stop
- Try with val_loss with any delta, and with patience 3
- Patience 3 is too small - stopping too early
- Try with patience 5
- Patience 5 seems much better, but having no limit on loss causes loss to decrease for a very long time, but then just up again, and that's the final result we get
- Test accuracy ~ .983
- Limit delta to 0.00001
- 2/3 stopped too early
- 1/3 stopped too late, when there was a jump to a bigger loss
- Need more patience, but need to return to best result
- Product accuracy slightly better
- Test accuracy slightly better (2/3 similar .981, but 1 better - .983 instead of .974)
- Train time longer - 440 instead of 300
- Validate and train accuracies - all 1, while in patience 5 without returning .999-1
- More expensive, but gives better results
- Introduce finding best relevant results
- Goal - Stop slightly earlier, not to take that much time
- Test accuracy worse than patience 10 (.98-.983)
- Train time - average 273 - shorter than patience 10 with 0.00001
- Validate and train accuracies - .995 and above - worse, not 1 as we got with 10 and 0.00001
- Try 10 patience, but delta 0.0001 and not 0.00001
- Test accuracy .982 consistently - similar to others before
- Train time - average 390 - saving 50 seconds compared to val_loss delta 0.00001
- Validate and train accuracies - mostly 1 (besides one case of .999)
- Validate loss - average .0002
- Train loss - average .0005
- Staying with this configuration for now (val_loss delta 0.0001 and patience 10)
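The point about needing more patience while still returning to the best result can be sketched as a loss-based stop rule that also tracks the best epoch. This is a minimal illustration, not the original StopFunction; the function name and the loss curve are hypothetical. In Keras the corresponding options would be `monitor='val_loss'`, `min_delta`, `patience` and `restore_best_weights=True` on `tf.keras.callbacks.EarlyStopping`:

```python
def early_stop_on_loss(val_losses, min_delta, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    An epoch counts as an improvement only if it lowers the best
    validation loss so far by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss - loss > min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical curve: steady decrease, then a jump to a bigger loss.
curve = [0.30, 0.10, 0.05, 0.02, 0.010, 0.006, 0.004, 0.003,
         0.0032, 0.02, 0.015, 0.012, 0.011]

stop, best = early_stop_on_loss(curve, min_delta=0.0001, patience=3)
print(stop, best)                        # 11 8
print(curve[stop - 1], curve[best - 1])  # 0.015 0.003
```

On this made-up curve, keeping the last weights would leave the model at loss 0.015, while going back to the best epoch recovers 0.003, which is why longer patience only pays off together with returning to the best result.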
- Much worse - not getting close to best result
- Try with 0.0001
- Test accuracy .9802 consistently
- Train time 1516 - much higher
- Validate and train accuracy - 1
- Accuracy is actually worse, so staying now with learning rate 0.001, but going forward to get better results, consider lowering the learning rate from 0.001
Conclusions 36 - different functions for 4 layers (all 4 activation functions, with everything else set to best so far)
- Activation functions
- Softmax function clearly doesn't add anything - all results with it are bad
- 1st sigmoid - not great results, although when not combined with softmax it is not far behind the other good ones
- 1st relu/tanh - best, without major difference of what the function is (but not softmax)
- Absolutely best in every category: (not by much): ('relu', 'tanh'):
- test accuracy .9828
- test loss 0.0784
- time 200 on a very fast computer (like 380 on mine)
- Not include softmax in future tests, at least for 4 layers. Sigmoid is probably not critical, so can leave just relu and tanh, but can definitely help in 5 layers
- Run different parameters only with 'relu' and 'tanh' on 4 layers
Conclusions 37 - different functions for 5 layers (all 4 activation functions, with everything else set to best so far)
- Activation functions:
- Softmax - doesn't seem to add much and is usually worse; in the one case where it does well, a different function in its place usually does the same or better
- Sigmoid - gives same or better, when taken with relu and tanh together
- Test accuracy
- Best - .9842 test accuracy - ('relu', 'relu', 'tanh') with test loss .0812
- A lot of the rest of top 10% are combinations of all 3 functions or 2/3, or even only relu
- Test loss
- Best - .0715 test loss (with accuracy .9836) - ('tanh', 'tanh', 'sigmoid')
- Loss * time
- A few of the best:
- 2 above - ('relu', 'relu', 'tanh') with 17.4, and ('tanh', 'tanh', 'sigmoid') with 16.9
- ('tanh', 'relu', 'tanh') one of the best in all categories - test accuracy 0.9833, test loss 0.08, loss * time = 16.9
- Best 4 functions:

| Hidden funcs | Test Accuracy | Test Loss | Loss * Time |
| --- | --- | --- | --- |
| ('tanh', 'tanh', 'sigmoid') | 0.9836 | 0.0715 | 16.9334 |
| ('tanh', 'relu', 'tanh') | 0.9833 | 0.08 | 16.9342 |
| ('relu', 'tanh', 'sigmoid') | 0.9833 | 0.0808 | 18.627 |
| ('relu', 'relu', 'tanh') | 0.9842 | 0.0812 | 17.4403 |

- Play around with configuration ('tanh', 'tanh', 'sigmoid'), perhaps with smaller learning rate, since seems train and validation loss were large, perhaps with smaller learning rate will do better
- Invest in different parameters for main combinations of tanh and relu
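The four best 5-layer configurations can be ranked on each metric with a few lines of Python. A minimal sketch using the numbers listed above; the tuple layout and variable names are illustrative, and the implied training time is only approximate because the reported losses are rounded:

```python
# (hidden_funcs, test_accuracy, test_loss, loss_times_time), as listed above.
results = [
    (("tanh", "tanh", "sigmoid"), 0.9836, 0.0715, 16.9334),
    (("tanh", "relu", "tanh"), 0.9833, 0.0800, 16.9342),
    (("relu", "tanh", "sigmoid"), 0.9833, 0.0808, 18.6270),
    (("relu", "relu", "tanh"), 0.9842, 0.0812, 17.4403),
]

best_accuracy = max(results, key=lambda r: r[1])   # higher is better
best_loss = min(results, key=lambda r: r[2])       # lower is better
best_loss_time = min(results, key=lambda r: r[3])  # lower is better

print("accuracy:", best_accuracy[0])
print("loss:", best_loss[0])
print("loss * time:", best_loss_time[0])

# Approximate training time implied by loss * time / loss.
for funcs, _, loss, loss_time in results:
    print(funcs, round(loss_time / loss))
```

No single configuration wins on all three metrics, which is consistent with keeping several tanh / relu combinations in the next round.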
- Test accuracy - around .981 - not as good as 0.001
- Stay with learning rate of 0.001
Conclusions 39 - 5 layer - 'tanh', 'tanh', 'sigmoid' with a lot of patience and delta 0.00001 instead of 0.0001
- Test accuracy consistently 0.983
- Test loss 0.095 on average
- Train time 680 on average
- Giving even more patience doesn't necessarily help - got .9836 in previous results
- Compare to doing the same with 4 layer 'tanh', 'relu'
Conclusions 40 - 4 layer - 'tanh', 'relu', 'sigmoid' with a lot of patience and delta 0.00001 instead of 0.0001
- Test accuracy consistently 0.982
- Test loss 0.096 on average
- Train time 710 on average
- Giving even more patience doesn't necessarily help - got same in previous results
- With given parameters this is the best option for 4 layers
Conclusions 41 - 4 layers with different batch sizes, hidden widths and activation functions
- Test accuracy:
- Best 0.9847: batch size 200, hidden funcs ('relu', 'relu'), hidden width 200
- Top 10% - relu always first
- Top 10% - Batch sizes vary
- Top 10% - Hidden width - 300 almost everywhere
- Top 10%:

| Batch size | Hidden funcs | Hidden width | Test Accuracy |
| --- | --- | --- | --- |
| 200 | ('relu', 'relu') | 200 | 0.9847 |
| 200 | ('relu', 'relu') | 300 | 0.9843 |
| 200 | ('relu', 'sigmoid') | 300 | 0.9841 |
| 300 | ('relu', 'sigmoid') | 300 | 0.9839 |
| 400 | ('relu', 'tanh') | 300 | 0.9839 |

- Test loss
- Best: 0.0708 (average of best accuracies is .08): batch size 300, hidden funcs ('tanh', 'tanh'), hidden width 300, test accuracy 0.9831
- Batch size and hidden width vary
- Try higher hidden width than 300
- Try around the following values (accuracy .9847): batch size 200, hidden funcs ('relu', 'relu'), hidden width 200
- relu relu with 200 200 is the best option - test accuracy 0.9840 (rest are a bit less)
- So far for 4 layers relu relu with 200 200 is the best option (accuracy .9840)
- Wait for results of 4 layers with higher width than 300
- Wait for results of 5 layers with different parameters
- Goal: Saw that it's one of the best options in 5 layers, so tried it locally
- Test accuracy: .982-.983 - not as good as what I have with 4 layers
- So far the option with 4 layers is best
- relu relu slightly better with width 400 and batch 400 (.985 test accuracy)
- Up relu relu width and batch to 400, try higher on both
- Best test accuracy - top 5%
- Batch sizes vary
- Widths - mostly 300, the ones with 200, usually 300 is similar
- Best: 300 ('tanh', 'relu', 'relu') 300 with test accuracy of 0.9862 on AWS, however locally got only .982/.983
- 2nd best: 300 ('relu', 'tanh', 'sigmoid') 300 accuracy 0.9857
- Consider width > 300
- Consider 2nd best above
- 3 tries gave very different results of test accuracy - .9823 - .9854
- Can't consistently use this over 4 layers that gives more consistently .984
- Best test accuracy: 400 batch, 450 width: test accuracy 0.9848
- 2nd best: 350 batch, 400 width: test accuracy 0.9845
- 3rd best: 450 batch, 450 width
- Try batch 400, 450 with width 500
- Local test (not in AWS) - not exactly the same results in both
- Accuracy: .983-.984, 2/3 .984
- Compare to tanh, relu local 200, 200
- On AWS:
- 450 for both seems the best - test accuracy .9848. But some are also good with 400, 450, 500, so staying in the middle
- Change batch and width to 450
- Local test (not in AWS)
- Goal: compare to relu relu local
- Accuracy - on average .983 (.982-.984)
- relu relu seems better
- AWS run
- Top 5%:

| Hidden width | Batch size | Hidden funcs | Test Accuracy |
| --- | --- | --- | --- |
| 400 | 200 | ('relu', 'tanh', 'sigmoid') | 0.9852 |
| 400 | 300 | ('relu', 'relu', 'relu') | 0.9851 |
| 400 | 300 | ('tanh', 'relu', 'sigmoid') | 0.9856 |
| 400 | 400 | ('tanh', 'relu', 'tanh') | 0.9859 |

- Best: 400 400 ('tanh', 'relu', 'tanh') 0.9859, see next run locally
- Second - ('tanh', 'relu', 'sigmoid') that was tried and didn't give consistent better results locally
- sigmoid as 3rd with relu,tanh or tanh,relu gives great results, but doesn't seem extremely consistent
- Test locally the best run, compare to 4 layers best solution
- Local run
- Average test accuracy of .985, somewhat consistent. Possibly slightly better by .001 than 4 layer solution.
- Time: 410 local
- Test loss average: 0.675
- Not worth going to a 5 layer slower less stable solution for .001. Staying with 4 layers
- Local run
- Goal compare to 5 layers above
- Average test accuracy: .9835 (.983-.984)
- Time: 345 local
- Test loss: 0.08
- 5 layers model above seems to be slightly better, slower and less consistent. Staying with 4 layers.
- Local run
- Goal - try to have width at least as number of inputs (actually exactly in this case)
- Average test accuracy: .9843 (.983-.984)
- Time: 485 average local
- Test loss: 0.08
- Doesn't seem to improve by much from 450 width to 784. Leave 450, but as an option for running different options, leave in
- Local run
- Goal - try to have width at least as number of inputs (actually exactly in this case)
- Average test accuracy: .982 - worse than 450
- Time: 450 average local
- Test loss: 0.08
- Doesn't seem to improve by much from 450 width to 784. Leave 450, but as an option for running different options, leave in
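One way to put the width 450 vs. 784 comparison in context is to count trainable parameters. A minimal sketch, assuming fully connected layers with biases, 784 inputs (the width-equals-inputs case above), 10 output classes, and 2 hidden layers for the 4-layer model; the output size and the helper `dense_param_count` are assumptions, not taken from the original code:

```python
def dense_param_count(num_inputs, hidden_widths, num_outputs):
    """Weights plus biases of a fully connected network."""
    sizes = [num_inputs, *hidden_widths, num_outputs]
    return sum(
        fan_in * fan_out + fan_out  # weight matrix + bias vector per layer
        for fan_in, fan_out in zip(sizes, sizes[1:])
    )


NUM_INPUTS = 784   # from the notes: width 784 equals the number of inputs
NUM_OUTPUTS = 10   # assumption: 10 output classes

for width in (450, 784):
    print(width, dense_param_count(NUM_INPUTS, [width, width], NUM_OUTPUTS))
# 450 -> 560710 parameters, 784 -> 1238730 parameters
```

Under these assumptions, going from width 450 to 784 more than doubles the parameter count, which fits the longer training time without a clear accuracy gain.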