- Used a polynomial kernel of the form k(x, x') = (<x, x'>)^p, where p is the degree of the polynomial.
- Suggested learning rate is eta = 1/sqrt(i), where i is the index of the current sample while iterating through the training data.
- The matrix Delta gives the penalty you pay for a wrong prediction.
- Case 1 (Delta1): you pay 0 for each correct prediction and 1 for each wrong one.
- Case 2 (Delta2): you pay 0 for each correct prediction, 1 for each wrong prediction between classes whose digits are one apart (e.g. you predicted "2" and the correct label is "3"), and 2 in all other cases.
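The kernel, learning rate, and the two penalty matrices described above can be sketched as follows. This is a minimal illustration, not the code used for the reported results: the function names, the default of 10 digit classes, and the choice to treat "0" and "9" as not adjacent (no wraparound) are all assumptions.

```python
import numpy as np


def polynomial_kernel(x, x_prime, p=2):
    """k(x, x') = (<x, x'>)^p, with p the degree of the polynomial."""
    return np.dot(x, x_prime) ** p


def learning_rate(i):
    """eta = 1/sqrt(i); i is assumed to be a 1-based sample index."""
    return 1.0 / np.sqrt(i)


def make_delta(num_classes=10, case=1):
    """Build the penalty matrix Delta[true_label, predicted_label].

    case=1: 0 on the diagonal, 1 everywhere else.
    case=2: 0 on the diagonal, 1 for digits one apart, 2 otherwise.
    Assumption: no wraparound, so 0 and 9 are not treated as adjacent.
    """
    labels = np.arange(num_classes)
    dist = np.abs(labels[:, None] - labels[None, :])
    if case == 1:
        return (dist > 0).astype(int)
    return np.minimum(dist, 2)


delta1 = make_delta(case=1)
delta2 = make_delta(case=2)
```

With these definitions, `delta2[3, 2]` is 1 (predicted "2", correct label "3"), while `delta2[3, 7]` is 2.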
- 0/1 loss without kernels for Delta1 = 8.52%
- 0/1 loss without kernels for Delta2 = 8.31%
- Hence, without kernels, penalizing distant mistakes more heavily (Delta2 instead of Delta1) reduced the 0/1 loss slightly. With kernels, the choice of Delta made no difference.
- Also attached without_Kernels_confusionMatrix_Delta1.png and without_Kernels_confusionMatrix_Delta2.png.
- With kernels, the 0/1 loss is the same for both Deltas: 3.75%. The choice of Delta did not matter.
- The loss when run on 100 samples was 37%, on 1000 samples 14%, and on 10000 samples 6.33%. The loss dropped very quickly at first, and adding more samples mattered little beyond a certain number.
- Also attached confusion_matrix.png, which is the same for both Deltas.
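For reference, the 0/1 loss and confusion matrices reported above could be computed from true and predicted labels along these lines. This is a hedged sketch: the helper names are hypothetical, and the labels in the example are made-up toy data, not the results from this experiment.

```python
import numpy as np


def zero_one_loss(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))


def confusion_matrix(y_true, y_pred, num_classes=10):
    """cm[t, p] counts samples with true label t that were predicted as p."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when (t, p) pairs repeat.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


# Toy labels for illustration only.
y_true = [2, 3, 3, 7]
y_pred = [2, 3, 4, 1]

loss = zero_one_loss(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
```

Off-diagonal entries of `cm` show which digits are confused with each other, which is what the attached confusion matrix images visualize.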