Illustration of convergence results of advanced learning methods

emmakopp1/MASH_Advanced_learning

Link to the book containing the figures (Francis Bach) : https://www.di.ens.fr/~fbach/ltfp_book.pdf


Figure 3.1 : Linear Least-squares regression
Setting : Ordinary Least Squares

Polynomial regression with a varying number of observations.
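
The sketch below (not the repository's script) reproduces the spirit of this setting: ordinary least squares on polynomial features of the input, fitted for several sample sizes n. The sine target function, noise level, and polynomial degree are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
degree = 5                                    # assumed polynomial degree
f_star = lambda t: np.sin(2 * np.pi * t)      # assumed target function

x_test = np.linspace(0, 1, 200)
Phi_test = np.vander(x_test, degree + 1, increasing=True)

for n in (10, 30, 100):                       # varying number of observations
    x = rng.uniform(0, 1, n)
    y = f_star(x) + 0.2 * rng.normal(size=n)
    Phi = np.vander(x, degree + 1, increasing=True)   # features 1, x, ..., x^degree
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ordinary least squares
    print(n, np.mean((Phi_test @ theta - f_star(x_test)) ** 2))
```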

Figure 3.2 : Linear Least-squares regression
Setting : Ordinary Least Squares

Convergence rate for polynomial regression with error bars (obtained from 32 replications by adding/subtracting standard deviations), plotted on a logarithmic scale, with fixed design (left plot). The large error bars for small n in the right plot are due to the lower error bar being negative before taking the logarithm.
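
A minimal sketch of how such a convergence plot can be produced, under the same illustrative assumptions as above: for each n, the excess error of the fixed-design OLS fit is averaged over 32 replications, and the mean ± one standard deviation is meant to be shown on a log-log scale.

```python
import numpy as np

rng = np.random.default_rng(0)
degree, n_rep = 5, 32
f_star = lambda t: np.sin(2 * np.pi * t)
ns = np.array([16, 32, 64, 128, 256, 512])

errors = np.zeros((len(ns), n_rep))
for i, n in enumerate(ns):
    x = np.linspace(0, 1, n)                          # fixed design
    Phi = np.vander(x, degree + 1, increasing=True)
    for r in range(n_rep):
        y = f_star(x) + 0.2 * rng.normal(size=n)
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        errors[i, r] = np.mean((Phi @ theta - f_star(x)) ** 2)

mean, std = errors.mean(axis=1), errors.std(axis=1)
# To be plotted on a log-log scale; the lower bar mean - std can be negative,
# which explains the large error bars after taking the logarithm.
for n, m, s in zip(ns, mean, std):
    print(n, m, m - s, m + s)
```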


Figure p.90 : Optimization for machine learning
Setting : Ordinary Least Squares

We consider two quadratic optimization problems in dimension d = 1000, with two different decays of the eigenvalues (λ_k) of the Hessian matrix H, one in 1/k (in blue below) and one in 1/k² (in red below), and we plot the performance in function values, both on a semi-logarithmic scale (left) and on a log-log scale (right).
For the slow decay (blue), we see linear convergence kicking in, while for the fast decay (red), the rate in 1/t dominates.
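
A minimal sketch of this experiment, assuming plain gradient descent with step-size 1/L (L the largest eigenvalue) on f(x) = ½ Σ_k λ_k x_k²; the caption only specifies the quadratic problems, so the algorithm, starting point, and iteration count below are assumptions.

```python
import numpy as np

d, n_iter = 1000, 2000
for decay in (1, 2):                          # eigenvalues λ_k = 1/k (blue) or 1/k² (red)
    lam = 1.0 / np.arange(1, d + 1) ** decay
    step = 1.0 / lam.max()                    # assumed step-size 1/L
    x = np.ones(d)                            # assumed starting point
    f_values = []
    for t in range(n_iter):
        x = x - step * lam * x                # gradient of f(x) = 1/2 Σ_k λ_k x_k²
        f_values.append(0.5 * np.sum(lam * x ** 2))
    print(decay, f_values[0], f_values[-1])
```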

Figure p.108 : Optimization for machine learning
Setting : SVM

We consider a simple binary classification problem with linear predictors and features with ℓ2-norm bounded by R. We consider the hinge loss with a squared ℓ2-regularizer μ ∥·∥².
We measure the excess training objective.
We consider two values of μ, and compare the two step-sizes γ_t = 1/(R² t) and γ_t = 1/(μ t).
We see that for large enough μ, the strongly-convex step-size is better; this is not the case for small μ.
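
A minimal sketch of the comparison, assuming stochastic subgradient descent on the regularized hinge-loss objective; the data-generating process, problem sizes, iteration count, and the single value of μ used below are illustrative assumptions (the figure itself compares two values of μ).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 1000, 20, 1e-1                     # illustrative sizes and regularization
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ‖x_i‖ ≤ R = 1
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))
R2 = 1.0

def objective(w):
    # Regularized training objective: average hinge loss + μ‖w‖².
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + mu * w @ w

for label, gamma in (("1/(R² t)", lambda t: 1.0 / (R2 * t)),
                     ("1/(μ t)", lambda t: 1.0 / (mu * t))):
    w = np.zeros(d)
    for t in range(1, 5001):
        i = rng.integers(n)                   # pick one example at random
        margin = y[i] * (X[i] @ w)
        grad = 2.0 * mu * w                   # gradient of the regularizer
        if margin < 1.0:
            grad -= y[i] * X[i]               # subgradient of the hinge loss
        w -= gamma(t) * grad
    print(label, objective(w))
```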

Figure 6.2 : Local Averaging Methods
Setting : Partition estimator

Regressograms in $d = 1$ dimension, with three values of |J| (the number of sets in the partition).
We can observe both underfitting (|J| too small) and overfitting (|J| too large).
Note that the target function f∗ is piecewise affine, and that on the affine parts the estimator is far from linear; that is, the estimator cannot take advantage of extra regularity.
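
A minimal sketch of a regressogram in d = 1: partition [0, 1] into |J| equal cells and predict the average response in each cell. The piecewise-affine target |x − 1/2|, the noise level, and the values of |J| are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
f_star = lambda t: np.abs(t - 0.5)            # piecewise-affine target
y = f_star(x) + 0.1 * rng.normal(size=n)

def regressogram(x_train, y_train, x_test, n_cells):
    # Partition [0, 1] into n_cells equal intervals and predict the average
    # response within each cell; empty cells fall back to the global mean.
    cell_train = np.minimum((x_train * n_cells).astype(int), n_cells - 1)
    cell_test = np.minimum((x_test * n_cells).astype(int), n_cells - 1)
    means = np.full(n_cells, y_train.mean())
    for j in range(n_cells):
        mask = cell_train == j
        if mask.any():
            means[j] = y_train[mask].mean()
    return means[cell_test]

x_test = np.linspace(0, 1, 500)
for n_cells in (2, 10, 100):                  # |J| too small, intermediate, too large
    y_hat = regressogram(x, y, x_test, n_cells)
    print(n_cells, np.mean((y_hat - f_star(x_test)) ** 2))
```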

Figure 6.3 : Local Averaging Methods
Setting : Nearest neighbors.

k-nearest-neighbor regression in $d = 1$ dimension, with three values of k (the number of neighbors). We can observe both underfitting (k too large) and overfitting (k too small).
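
A minimal sketch of k-nearest-neighbor regression in d = 1, for several values of k, using the same illustrative data-generating process as in the regressogram sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
f_star = lambda t: np.abs(t - 0.5)
y = f_star(x) + 0.1 * rng.normal(size=n)

def knn_predict(x_train, y_train, x_test, k):
    # For each test point, average the responses of its k nearest neighbors.
    dists = np.abs(x_test[:, None] - x_train[None, :])
    neighbors = np.argpartition(dists, k - 1, axis=1)[:, :k]
    return y_train[neighbors].mean(axis=1)

x_test = np.linspace(0, 1, 500)
for k in (1, 10, 100):                        # overfitting, intermediate, underfitting
    y_hat = knn_predict(x, y, x_test, k)
    print(k, np.mean((y_hat - f_star(x_test)) ** 2))
```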

Figure 6.4 : Local Averaging Methods
Setting : Nadaraya-Watson

Nadaraya-Watson regression in $d = 1$ dimension, with three values of h (the bandwidth), for the Gaussian kernel.
We can observe both underfitting (h too large) and overfitting (h too small).
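
A minimal sketch of Nadaraya-Watson regression with a Gaussian kernel in d = 1, for several bandwidths h, again with the same illustrative data-generating process.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
f_star = lambda t: np.abs(t - 0.5)
y = f_star(x) + 0.1 * rng.normal(size=n)

def nadaraya_watson(x_train, y_train, x_test, h):
    # Kernel-weighted average of the responses, with a Gaussian kernel of bandwidth h.
    weights = np.exp(-0.5 * ((x_test[:, None] - x_train[None, :]) / h) ** 2)
    return (weights @ y_train) / weights.sum(axis=1)

x_test = np.linspace(0, 1, 500)
for h in (0.01, 0.05, 0.5):                   # overfitting, intermediate, underfitting
    y_hat = nadaraya_watson(x, y, x_test, h)
    print(h, np.mean((y_hat - f_star(x_test)) ** 2))
```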
