Unsupervised Learning translation [webpage]
1. Unsupervised Learning cheatsheet
⟶
2. Introduction to Unsupervised Learning
⟶
3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.
⟶
4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:
⟶
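For reference, the inequality in question is the standard statement of Jensen's inequality (it is not reproduced in the text above):

```latex
E[f(X)] \geqslant f(E[X])
```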
5. Clustering
⟶
6. Expectation-Maximization
⟶
7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:
⟶
8. [Setting, Latent variable z, Comments]
⟶
9. [Mixture of k Gaussians, Factor analysis]
⟶
10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:
⟶
11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:
⟶
12. M-step: Use the posterior probabilities Qi(z(i)) as cluster-specific weights on data points x(i) to separately re-estimate each cluster model as follows:
⟶
13. [Gaussians initialization, Expectation step, Maximization step, Convergence]
⟶
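A minimal NumPy sketch of the E-step and M-step above for a mixture of k Gaussians; the initialization scheme, the use of scipy.stats.multivariate_normal, and all variable names are illustrative assumptions rather than part of the original cheatsheet:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100):
    """EM for a mixture of k Gaussians; X has shape (m, n)."""
    m, n = X.shape
    # Gaussians initialization: random means, identity covariances, uniform weights
    rng = np.random.default_rng(0)
    mu = X[rng.choice(m, k, replace=False)]
    Sigma = np.stack([np.eye(n)] * k)
    phi = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: posterior probability Q_i(z_i = j) for every point and cluster
        Q = np.column_stack([
            phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)
        ])
        Q /= Q.sum(axis=1, keepdims=True)

        # M-step: re-estimate each cluster with the posteriors as weights
        Nj = Q.sum(axis=0)                      # effective cluster sizes
        phi = Nj / m
        mu = (Q.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (Q[:, j, None] * diff).T @ diff / Nj[j]
    return phi, mu, Sigma
```

Each iteration fills the posterior matrix Q (E-step) and then re-estimates the mixture weights, means and covariances with those posteriors as weights (M-step).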
14. k-means clustering
⟶
15. We note c(i) the cluster of data point i and μj the center of cluster j.
⟶
16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:
⟶
17. [Means initialization, Cluster assignment, Means update, Convergence]
⟶
18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:
⟶
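A possible NumPy sketch of the loop above, with the distortion function J computed at the end; the random initialization and convergence test are illustrative choices, not prescribed by the cheatsheet:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on X of shape (m, n); returns assignments, centroids and distortion."""
    rng = np.random.default_rng(seed)
    # Means initialization: pick k random points as centroids mu_1, ..., mu_k
    mu = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment: c_i = argmin_j ||x_i - mu_j||^2
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Means update: mu_j = mean of the points currently assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):   # convergence: centroids stop moving
            break
        mu = new_mu
    # Distortion function J(c, mu) = sum_i ||x_i - mu_{c_i}||^2
    J = ((X - mu[c]) ** 2).sum()
    return c, mu, J
```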
19. Hierarchical clustering
⟶
20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.
⟶
21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:
⟶
22. [Ward linkage, Average linkage, Complete linkage]
⟶
23. [Minimize within-cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]
⟶
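As one illustration (assuming scikit-learn is available), the three linkages in the table correspond directly to the linkage parameter of sklearn.cluster.AgglomerativeClustering; the toy dataset below is only for demonstration:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # toy data

# Same agglomerative procedure, three different objective functions
for linkage in ("ward", "average", "complete"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, labels[:10])
```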
24. Clustering assessment metrics
⟶
25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.
⟶
26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:
⟶
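A small NumPy sketch of that definition, s = (b − a) / max(a, b), for a single sample; in practice one would typically rely on sklearn.metrics.silhouette_score, and the helper below is only illustrative:

```python
import numpy as np

def silhouette_sample(x, same_cluster, nearest_cluster):
    """Silhouette coefficient s = (b - a) / max(a, b) for a single sample x.

    same_cluster    : array of the other points in x's own cluster
    nearest_cluster : array of the points in the next nearest cluster
    """
    a = np.linalg.norm(same_cluster - x, axis=1).mean()     # mean intra-cluster distance
    b = np.linalg.norm(nearest_cluster - x, axis=1).mean()  # mean distance to next cluster
    return (b - a) / max(a, b)
```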
27. Calinski-Harabasz index ― By noting k the number of clusters, Bk and Wk the between- and within-clustering dispersion matrices respectively, defined as
⟶
28. the Calinski-Harabasz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the denser and better separated the clusters are. It is defined as follows:
⟶
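A hedged NumPy sketch using the traces of Bk and Wk; scikit-learn exposes the same quantity as sklearn.metrics.calinski_harabasz_score, and the helper below is illustrative only:

```python
import numpy as np

def calinski_harabasz(X, labels):
    """s(k) = [Tr(B_k) / Tr(W_k)] * (N - k) / (k - 1); higher is better."""
    N, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    tr_B = tr_W = 0.0
    for j in np.unique(labels):
        Xj = X[labels == j]
        mu_j = Xj.mean(axis=0)
        tr_B += len(Xj) * ((mu_j - overall_mean) ** 2).sum()   # between-cluster dispersion
        tr_W += ((Xj - mu_j) ** 2).sum()                       # within-cluster dispersion
    return (tr_B / tr_W) * (N - k) / (k - 1)
```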
29. Dimension reduction
⟶
30. Principal component analysis
⟶
31. It is a dimension reduction technique that finds the variance-maximizing directions onto which to project the data.
⟶
32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called an eigenvector, such that we have:
⟶
33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:
⟶
34. diagonal
⟶
35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.
⟶
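A quick numerical check of these definitions with NumPy, using a small symmetric matrix whose values are purely illustrative:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # symmetric, hence diagonalizable by an orthogonal U

eigvals, U = np.linalg.eigh(A)        # columns of U are orthonormal eigenvectors
Lambda = np.diag(eigvals)

# Az = lambda*z for each eigenpair, and A = U Lambda U^T (spectral theorem)
assert np.allclose(A @ U[:, 0], eigvals[0] * U[:, 0])
assert np.allclose(A, U @ Lambda @ U.T)

# Principal eigenvector: the one associated with the largest eigenvalue
principal = U[:, np.argmax(eigvals)]
```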
36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:
⟶
37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.
⟶
38. Step 2: Compute Σ = (1/m) ∑i=1..m x(i)x(i)T ∈ Rn×n, which is symmetric with real eigenvalues.
⟶
39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.
⟶
40. Step 4: Project the data on spanR(u1,...,uk).
⟶
41. This procedure maximizes the variance among all k-dimensional spaces.
⟶
42. [Data in feature space, Find principal components, Data in principal components space]
⟶
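A compact NumPy sketch of the four steps above; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def pca(X, k):
    """PCA following the four steps above; X has shape (m, n), returns an (m, k) projection."""
    # Step 1: normalize the data to zero mean and unit standard deviation
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: Sigma = (1/m) sum_i x_i x_i^T, symmetric with real eigenvalues
    Sigma = X.T @ X / len(X)
    # Step 3: the k orthogonal eigenvectors with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    U_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Step 4: project the data on span(u_1, ..., u_k)
    return X @ U_k
```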
43. Independent component analysis
⟶
44. It is a technique meant to find the underlying generating sources.
⟶
45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:
⟶
46. The goal is to find the unmixing matrix W=A−1.
⟶
47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:
⟶
48. Write the probability of x=As=W−1s as:
⟶
49. Write the log-likelihood given our training data {x(i),i∈[[1,m]]}, by noting g the sigmoid function, as:
⟶
50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:
⟶
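A hedged NumPy sketch of that stochastic gradient ascent rule, in its standard form W := W + α[(1 − 2g(Wx(i)))x(i)T + (WT)−1]; the learning rate and number of passes over the data are illustrative assumptions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def bell_sejnowski_ica(X, alpha=0.001, n_epochs=10):
    """Stochastic gradient ascent on the unmixing matrix W; X has shape (m, n)."""
    m, n = X.shape
    W = np.eye(n)
    for _ in range(n_epochs):
        for x in X:                                # one update per training example x_i
            # W := W + alpha * [ (1 - 2 g(W x)) x^T + (W^T)^{-1} ]
            Wx = W @ x
            W += alpha * (np.outer(1.0 - 2.0 * sigmoid(Wx), x) + np.linalg.inv(W.T))
    return W                                       # estimated sources: X @ W.T
```

Each update nudges W toward making the recovered components Wx(i) independent; the recovered sources for the whole dataset are then obtained as X @ W.T.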
51. The Machine Learning cheatsheets are now available in [target language].
⟶
52. Original authors
⟶
53. Translated by X, Y and Z
⟶
54. Reviewed by X, Y and Z
⟶
55. [Introduction, Motivation, Jensen's inequality]
⟶
56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]
⟶
57. [Dimension reduction, PCA, ICA]
⟶