Illustration of convergence results of advanced learning methods

emmakopp1/MASH_Advanced_learning

Link to the book containing the figures (Francis Bach) : https://www.di.ens.fr/~fbach/ltfp_book.pdf


Figure 3.1 : Linear Least-squares regression
Setting : Ordinary Least Squares

Polynomial regression with a varying number of observations.
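
The sketch below (not the repository's script) reproduces the spirit of this setting: ordinary least squares on polynomial features of the input, fitted for several sample sizes n. The sine target function, noise level, and polynomial degree are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
degree = 5                                    # assumed polynomial degree
f_star = lambda t: np.sin(2 * np.pi * t)      # assumed target function

x_test = np.linspace(0, 1, 200)
Phi_test = np.vander(x_test, degree + 1, increasing=True)

for n in (10, 30, 100):                       # varying number of observations
    x = rng.uniform(0, 1, n)
    y = f_star(x) + 0.2 * rng.normal(size=n)
    Phi = np.vander(x, degree + 1, increasing=True)   # features 1, x, ..., x^degree
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ordinary least squares
    print(n, np.mean((Phi_test @ theta - f_star(x_test)) ** 2))
```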

Figure 3.2 : Linear Least-squares regression
Setting : Ordinary Least Squares

Convergence rate for polynomial regression with error bars (obtained from 32 replications by adding/subtracting standard deviations), plotted on a logarithmic scale, with fixed design (left plot). The large error bars for small n in the right plot are due to the lower error bar being negative before taking the logarithm.
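
A minimal sketch of how such a convergence plot can be produced, under the same illustrative assumptions as above: for each n, the excess error of the fixed-design OLS fit is averaged over 32 replications, and the mean ± one standard deviation is meant to be shown on a log-log scale.

```python
import numpy as np

rng = np.random.default_rng(0)
degree, n_rep = 5, 32
f_star = lambda t: np.sin(2 * np.pi * t)
ns = np.array([16, 32, 64, 128, 256, 512])

errors = np.zeros((len(ns), n_rep))
for i, n in enumerate(ns):
    x = np.linspace(0, 1, n)                          # fixed design
    Phi = np.vander(x, degree + 1, increasing=True)
    for r in range(n_rep):
        y = f_star(x) + 0.2 * rng.normal(size=n)
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        errors[i, r] = np.mean((Phi @ theta - f_star(x)) ** 2)

mean, std = errors.mean(axis=1), errors.std(axis=1)
# To be plotted on a log-log scale; the lower bar mean - std can be negative,
# which explains the large error bars after taking the logarithm.
for n, m, s in zip(ns, mean, std):
    print(n, m, m - s, m + s)
```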


Figure p.90 : Optimization for machine learning
Setting : Ordinary Least Squares

We consider two quadratic optimization problems in dimension d = 1000, with two different decays of the eigenvalues (λ_k) of the Hessian matrix H, one in 1/k (in blue below) and one in 1/k² (in red below), and we plot the performance in function values, both on a semi-logarithmic scale (left) and on a log-log scale (right).
For the slow decay (blue), we see linear convergence kicking in, while for the fast decay (red), the rate in 1/t dominates.
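
A minimal sketch of this experiment, assuming plain gradient descent with step-size 1/L (L the largest eigenvalue) on f(x) = ½ Σ_k λ_k x_k²; the caption only specifies the quadratic problems, so the algorithm, starting point, and iteration count below are assumptions.

```python
import numpy as np

d, n_iter = 1000, 2000
for decay in (1, 2):                          # eigenvalues λ_k = 1/k (blue) or 1/k² (red)
    lam = 1.0 / np.arange(1, d + 1) ** decay
    step = 1.0 / lam.max()                    # assumed step-size 1/L
    x = np.ones(d)                            # assumed starting point
    f_values = []
    for t in range(n_iter):
        x = x - step * lam * x                # gradient of f(x) = 1/2 Σ_k λ_k x_k²
        f_values.append(0.5 * np.sum(lam * x ** 2))
    print(decay, f_values[0], f_values[-1])
```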

Figure p.108 : Optimization for machine learning
Setting : SVM

We consider a simple binary classification problem with linear predictors and features with ℓ2-norm bounded by R. We consider the hinge loss with a squared ℓ2-regularizer μ ∥·∥².
We measure the excess training objective.
We consider two values of μ, and compare the two step-sizes γ_t = 1/(R² t) and γ_t = 1/(μ t).
We see that for large enough μ, the strongly-convex step-size is better; this is not the case for small μ.
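
A minimal sketch of the comparison, assuming stochastic subgradient descent on the regularized hinge-loss objective; the data-generating process, problem sizes, iteration count, and the single value of μ used below are illustrative assumptions (the figure itself compares two values of μ).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 1000, 20, 1e-1                     # illustrative sizes and regularization
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ‖x_i‖ ≤ R = 1
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))
R2 = 1.0

def objective(w):
    # Regularized training objective: average hinge loss + μ‖w‖².
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + mu * w @ w

for label, gamma in (("1/(R² t)", lambda t: 1.0 / (R2 * t)),
                     ("1/(μ t)", lambda t: 1.0 / (mu * t))):
    w = np.zeros(d)
    for t in range(1, 5001):
        i = rng.integers(n)                   # pick one example at random
        margin = y[i] * (X[i] @ w)
        grad = 2.0 * mu * w                   # gradient of the regularizer
        if margin < 1.0:
            grad -= y[i] * X[i]               # subgradient of the hinge loss
        w -= gamma(t) * grad
    print(label, objective(w))
```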

Figure 6.2 : Local Averaging Methods
Setting : Partition estimator

Regressograms in $d = 1$ dimension, with three values of |J| (the number of sets in the partition).
We can observe both underfitting (|J| too small) and overfitting (|J| too large).
Note that the target function f∗ is piecewise affine, and that on the affine parts the estimator is far from linear; that is, the estimator cannot take advantage of extra regularity.
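
A minimal sketch of a regressogram in d = 1: partition [0, 1] into |J| equal cells and predict the average response in each cell. The piecewise-affine target |x − 1/2|, the noise level, and the values of |J| are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
f_star = lambda t: np.abs(t - 0.5)            # piecewise-affine target
y = f_star(x) + 0.1 * rng.normal(size=n)

def regressogram(x_train, y_train, x_test, n_cells):
    # Partition [0, 1] into n_cells equal intervals and predict the average
    # response within each cell; empty cells fall back to the global mean.
    cell_train = np.minimum((x_train * n_cells).astype(int), n_cells - 1)
    cell_test = np.minimum((x_test * n_cells).astype(int), n_cells - 1)
    means = np.full(n_cells, y_train.mean())
    for j in range(n_cells):
        mask = cell_train == j
        if mask.any():
            means[j] = y_train[mask].mean()
    return means[cell_test]

x_test = np.linspace(0, 1, 500)
for n_cells in (2, 10, 100):                  # |J| too small, intermediate, too large
    y_hat = regressogram(x, y, x_test, n_cells)
    print(n_cells, np.mean((y_hat - f_star(x_test)) ** 2))
```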

Figure 6.3 : Local Averaging Methods
Setting : Nearest neighbors.

k-nearest-neighbor regression in $d = 1$ dimension, with three values of k (the number of neighbors). We can observe both underfitting (k too large) and overfitting (k too small).
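
A minimal sketch of k-nearest-neighbor regression in d = 1, for several values of k, using the same illustrative data-generating process as in the regressogram sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
f_star = lambda t: np.abs(t - 0.5)
y = f_star(x) + 0.1 * rng.normal(size=n)

def knn_predict(x_train, y_train, x_test, k):
    # For each test point, average the responses of its k nearest neighbors.
    dists = np.abs(x_test[:, None] - x_train[None, :])
    neighbors = np.argpartition(dists, k - 1, axis=1)[:, :k]
    return y_train[neighbors].mean(axis=1)

x_test = np.linspace(0, 1, 500)
for k in (1, 10, 100):                        # overfitting, intermediate, underfitting
    y_hat = knn_predict(x, y, x_test, k)
    print(k, np.mean((y_hat - f_star(x_test)) ** 2))
```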

Figure 6.4 : Local Averaging Methods
Setting : Nadaraya-Watson

Nadaraya-Watson regression in $d = 1$ dimension, with three values of h (the bandwidth), for the Gaussian kernel.
We can observe both underfitting (h too large) and overfitting (h too small).
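
A minimal sketch of Nadaraya-Watson regression with a Gaussian kernel in d = 1, for several bandwidths h, again with the same illustrative data-generating process.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
f_star = lambda t: np.abs(t - 0.5)
y = f_star(x) + 0.1 * rng.normal(size=n)

def nadaraya_watson(x_train, y_train, x_test, h):
    # Kernel-weighted average of the responses, with a Gaussian kernel of bandwidth h.
    weights = np.exp(-0.5 * ((x_test[:, None] - x_train[None, :]) / h) ** 2)
    return (weights @ y_train) / weights.sum(axis=1)

x_test = np.linspace(0, 1, 500)
for h in (0.01, 0.05, 0.5):                   # overfitting, intermediate, underfitting
    y_hat = nadaraya_watson(x, y, x_test, h)
    print(h, np.mean((y_hat - f_star(x_test)) ** 2))
```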
