
Unsupervised Learning translation [webpage]


1. Unsupervised Learning cheatsheet


2. Introduction to Unsupervised Learning


3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.


4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:
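
For a convex f, the inequality reads as follows (this is the standard statement; for a concave function the inequality is reversed, which is the form used to lower-bound the log-likelihood in EM):

```latex
\mathbb{E}[f(X)] \geqslant f\left(\mathbb{E}[X]\right)
```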


5. Clustering


6. Expectation-Maximization


7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:


8. [Setting, Latent variable z, Comments]


9. [Mixture of k Gaussians, Factor analysis]


10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:


11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:


12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:


13. [Gaussians initialization, Expectation step, Maximization step, Convergence]
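
As an illustration of the E and M steps above, here is a minimal NumPy sketch of EM for a one-dimensional mixture of k Gaussians; the posterior matrix Q plays the role of the cluster-specific weights in the M-step. The variable names (phi, mu, sigma2, Q) and the fixed iteration count are our own choices, not part of the cheatsheet.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """EM for a 1-D mixture of k Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m = x.shape[0]
    phi = np.full(k, 1.0 / k)                  # mixture weights P(z = j)
    mu = rng.choice(x, size=k, replace=False)  # Gaussians initialization
    sigma2 = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: posterior Q[i, j] = P(z(i) = j | x(i)) for each point and cluster
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / sigma2
                 - 0.5 * np.log(2 * np.pi * sigma2) + np.log(phi))
        Q = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        Q /= Q.sum(axis=1, keepdims=True)
        # M-step: re-estimate each cluster with Q as per-point weights
        w = Q.sum(axis=0)                      # effective number of points per cluster
        phi = w / m
        mu = (Q * x[:, None]).sum(axis=0) / w
        sigma2 = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / w
    return phi, mu, sigma2

# Example usage on synthetic data drawn from two Gaussians
x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gmm_1d(x, k=2))
```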


14. k-means clustering


15. We denote by c(i) the cluster of data point i and by μj the center of cluster j.


16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:


17. [Means initialization, Cluster assignment, Means update, Convergence]
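
A minimal NumPy sketch of these steps (means initialization, cluster assignment, means update, convergence check); the function name and the np.allclose stopping criterion are illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on the rows of X (shape m x n), illustrative sketch."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(X.shape[0], size=k, replace=False)]   # means initialization
    for _ in range(n_iter):
        # Cluster assignment: c(i) = argmin_j ||x(i) - mu_j||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # Means update: mu_j = mean of the points currently assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                          # convergence
            break
        mu = new_mu
    distortion = ((X - mu[c]) ** 2).sum()   # J(c, mu), the distortion from the next item
    return c, mu, distortion
```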


18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:
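
Written out, the distortion is the total squared distance of each point to its assigned centroid; it decreases at every iteration of k-means, so it can be monitored for convergence:

```latex
J(c,\mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2
```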


19. Hierarchical clustering


20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.


21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:


22. [Ward linkage, Average linkage, Complete linkage]


23. [Minimize within-cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]
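
These three linkages correspond to the linkage parameter of scikit-learn's AgglomerativeClustering; a short sketch, assuming scikit-learn is installed and using a randomly generated array X as stand-in data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(100, 2)  # stand-in data: m=100 points in R^2

# One model per linkage criterion from the table above
for linkage in ("ward", "average", "complete"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, np.bincount(labels))
```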


24. Clustering assessment metrics


25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground-truth labels, as we did in the supervised learning setting.


26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:
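
With a and b defined as above, the coefficient is commonly written as:

```latex
s = \frac{b - a}{\max(a, b)}
```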


27. Calinski-Harabasz index ― By noting k the number of clusters, Bk and Wk the between- and within-clustering dispersion matrices respectively, defined as


28. the Calinski-Harabasz index s(k) indicates how well a clustering model defines its clusters: the higher the score, the denser and better separated the clusters are. It is defined as follows:
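
A reconstruction of the usual definitions, assuming m points in total, cluster j with mj points and centroid μj, and global centroid μ:

```latex
B_k = \sum_{j=1}^{k} m_j (\mu_j - \mu)(\mu_j - \mu)^T,
\qquad
W_k = \sum_{i=1}^{m} \left(x^{(i)} - \mu_{c^{(i)}}\right)\left(x^{(i)} - \mu_{c^{(i)}}\right)^T,
\qquad
s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{m-k}{k-1}
```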


29. Dimension reduction


30. Principal component analysis


31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.


32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called an eigenvector, such that we have:
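
The defining relation, in the notation above, is:

```latex
Az = \lambda z
```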


33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:


34. diagonal
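
With U orthogonal and Λ = diag(λ1,...,λn) diagonal, the decomposition reads:

```latex
A = U \Lambda U^T
```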


35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.


36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:


37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.


38. Step 2: Compute Σ = (1/m) ∑i=1,...,m x(i)x(i)T ∈ Rn×n, which is symmetric with real eigenvalues.


39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors associated with the k largest eigenvalues.


40. Step 4: Project the data on span_R(u1,...,uk).


41. This procedure maximizes the variance among all k-dimensional spaces.


42. [Data in feature space, Find principal components, Data in principal components space]
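
A NumPy sketch of the four steps above (normalize, build Σ, take the top-k eigenvectors, project); the function name pca and the use of np.linalg.eigh are our own choices.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (shape m x n) onto the k principal components."""
    # Step 1: normalize each feature to mean 0 and standard deviation 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    m = X.shape[0]
    # Step 2: Sigma = (1/m) sum_i x(i) x(i)^T, an n x n symmetric matrix
    Sigma = X.T @ X / m
    # Step 3: eigenvectors of the k largest eigenvalues (eigh sorts eigenvalues ascending)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Step 4: project the data on span(u1, ..., uk)
    return X @ U

# Example: reduce stand-in 10-dimensional data to 2 dimensions
X = np.random.rand(500, 10)
print(pca(X, k=2).shape)   # (500, 2)
```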


43. Independent component analysis


44. It is a technique meant to find the underlying generating sources.


45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a non-singular mixing matrix A as follows:


46. The goal is to find the unmixing matrix W=A−1.


47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:


48. Write the probability of x=As=W−1s as:
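
Under the independence assumption, denoting wj the j-th row of W and ps the density of each source, this probability is commonly written as the following reconstruction of the standard formula:

```latex
p(x) = \prod_{j=1}^{n} p_s\left(w_j^T x\right) \cdot |W|
```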


49. Write the log-likelihood given our training data {x(i), i∈[[1,m]]}, noting g the sigmoid function, as:
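
Taking g as the CDF of each source, so that g' is its density, the log-likelihood takes the following standard form (our reconstruction under that sigmoid-CDF assumption):

```latex
\ell(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log g'\left(w_j^T x^{(i)}\right) + \log |W| \right)
```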


50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:
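
A NumPy sketch of this update, following the standard form W <- W + α((1 - 2g(W x(i))) x(i)T + (W^T)^-1) with g the sigmoid; the learning rate alpha, the epoch loop, and the identity initialization of W are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sga(X, alpha=1e-3, n_epochs=10, seed=0):
    """Stochastic gradient ascent for the unmixing matrix W (illustrative sketch).

    X has shape (m, n): m training examples x(i) in R^n.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.eye(n)                      # start from the identity as unmixing matrix
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            x = X[i]                   # one training example x(i)
            # W <- W + alpha * ((1 - 2 g(W x)) x^T + (W^T)^{-1})
            W += alpha * (np.outer(1.0 - 2.0 * sigmoid(W @ x), x)
                          + np.linalg.inv(W.T))
    return W
```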


51. The Machine Learning cheatsheets are now available in [target language].


52. Original authors


53. Translated by X, Y and Z


54. Reviewed by X, Y and Z


55. [Introduction, Motivation, Jensen's inequality]


56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]


57. [Dimension reduction, PCA, ICA]