
Unsupervised Learning translation [webpage]


1. Unsupervised Learning cheatsheet


2. Introduction to Unsupervised Learning


3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.


4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:
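
For a convex f, the inequality reads as follows (this is the standard statement; for a concave function the inequality is reversed, which is the form used to lower-bound the log-likelihood in EM):

```latex
\mathbb{E}[f(X)] \geqslant f\left(\mathbb{E}[X]\right)
```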


5. Clustering


6. Expectation-Maximization


7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:


8. [Setting, Latent variable z, Comments]


9. [Mixture of k Gaussians, Factor analysis]


10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:


11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:


12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:


13. [Gaussians initialization, Expectation step, Maximization step, Convergence]
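
As an illustration of the E and M steps above, here is a minimal NumPy sketch of EM for a one-dimensional mixture of k Gaussians; the posterior matrix Q plays the role of the cluster-specific weights in the M-step. The variable names (phi, mu, sigma2, Q) and the fixed iteration count are our own choices, not part of the cheatsheet.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """EM for a 1-D mixture of k Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m = x.shape[0]
    phi = np.full(k, 1.0 / k)                  # mixture weights P(z = j)
    mu = rng.choice(x, size=k, replace=False)  # Gaussians initialization
    sigma2 = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: posterior Q[i, j] = P(z(i) = j | x(i)) for each point and cluster
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / sigma2
                 - 0.5 * np.log(2 * np.pi * sigma2) + np.log(phi))
        Q = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        Q /= Q.sum(axis=1, keepdims=True)
        # M-step: re-estimate each cluster with Q as per-point weights
        w = Q.sum(axis=0)                      # effective number of points per cluster
        phi = w / m
        mu = (Q * x[:, None]).sum(axis=0) / w
        sigma2 = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / w
    return phi, mu, sigma2

# Example usage on synthetic data drawn from two Gaussians
x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gmm_1d(x, k=2))
```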


14. k-means clustering


15. We denote by c(i) the cluster of data point i and by μj the center of cluster j.


16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:


17. [Means initialization, Cluster assignment, Means update, Convergence]
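
A minimal NumPy sketch of these steps (means initialization, cluster assignment, means update, convergence check); the function name and the np.allclose stopping criterion are illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on the rows of X (shape m x n), illustrative sketch."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(X.shape[0], size=k, replace=False)]   # means initialization
    for _ in range(n_iter):
        # Cluster assignment: c(i) = argmin_j ||x(i) - mu_j||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # Means update: mu_j = mean of the points currently assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                          # convergence
            break
        mu = new_mu
    distortion = ((X - mu[c]) ** 2).sum()   # J(c, mu), the distortion from the next item
    return c, mu, distortion
```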


18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:
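
Written out, the distortion is the total squared distance of each point to its assigned centroid; it decreases at every iteration of k-means, so it can be monitored for convergence:

```latex
J(c,\mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2
```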


19. Hierarchical clustering


20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.


21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:


22. [Ward linkage, Average linkage, Complete linkage]


23. [Minimize within-cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]
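
These three linkages correspond to the linkage parameter of scikit-learn's AgglomerativeClustering; a short sketch, assuming scikit-learn is installed and using a randomly generated array X as stand-in data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(100, 2)  # stand-in data: m=100 points in R^2

# One model per linkage criterion from the table above
for linkage in ("ward", "average", "complete"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, np.bincount(labels))
```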


24. Clustering assessment metrics


25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground-truth labels, as we did in the supervised learning setting.


26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:
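
With a and b defined as above, the coefficient is commonly written as:

```latex
s = \frac{b - a}{\max(a, b)}
```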


27. Calinski-Harabasz index ― By noting k the number of clusters, Bk and Wk the between- and within-clustering dispersion matrices respectively, defined as


28. the Calinski-Harabasz index s(k) indicates how well a clustering model defines its clusters: the higher the score, the denser and better separated the clusters are. It is defined as follows:
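
A reconstruction of the usual definitions, assuming m points in total, cluster j with mj points and centroid μj, and global centroid μ:

```latex
B_k = \sum_{j=1}^{k} m_j (\mu_j - \mu)(\mu_j - \mu)^T,
\qquad
W_k = \sum_{i=1}^{m} \left(x^{(i)} - \mu_{c^{(i)}}\right)\left(x^{(i)} - \mu_{c^{(i)}}\right)^T,
\qquad
s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{m-k}{k-1}
```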


29. Dimension reduction


30. Principal component analysis


31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.


32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called an eigenvector, such that we have:
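
The defining relation, in the notation above, is:

```latex
Az = \lambda z
```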


33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:


34. diagonal
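
With U orthogonal and Λ = diag(λ1,...,λn) diagonal, the decomposition reads:

```latex
A = U \Lambda U^T
```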


35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.


36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:


37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.


38. Step 2: Compute Σ = (1/m) ∑i=1,...,m x(i)x(i)T ∈ Rn×n, which is symmetric with real eigenvalues.


39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors associated with the k largest eigenvalues.


40. Step 4: Project the data on span_R(u1,...,uk).


41. This procedure maximizes the variance among all k-dimensional spaces.


42. [Data in feature space, Find principal components, Data in principal components space]
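
A NumPy sketch of the four steps above (normalize, build Σ, take the top-k eigenvectors, project); the function name pca and the use of np.linalg.eigh are our own choices.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (shape m x n) onto the k principal components."""
    # Step 1: normalize each feature to mean 0 and standard deviation 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    m = X.shape[0]
    # Step 2: Sigma = (1/m) sum_i x(i) x(i)^T, an n x n symmetric matrix
    Sigma = X.T @ X / m
    # Step 3: eigenvectors of the k largest eigenvalues (eigh sorts eigenvalues ascending)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Step 4: project the data on span(u1, ..., uk)
    return X @ U

# Example: reduce stand-in 10-dimensional data to 2 dimensions
X = np.random.rand(500, 10)
print(pca(X, k=2).shape)   # (500, 2)
```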


43. Independent component analysis


44. It is a technique meant to find the underlying generating sources.


45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a non-singular mixing matrix A as follows:


46. The goal is to find the unmixing matrix W=A−1.


47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:


48. Write the probability of x=As=W−1s as:
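
Under the independence assumption, denoting wj the j-th row of W and ps the density of each source, this probability is commonly written as the following reconstruction of the standard formula:

```latex
p(x) = \prod_{j=1}^{n} p_s\left(w_j^T x\right) \cdot |W|
```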


49. Write the log-likelihood given our training data {x(i), i∈[[1,m]]}, noting g the sigmoid function, as:
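
Taking g as the CDF of each source, so that g' is its density, the log-likelihood takes the following standard form (our reconstruction under that sigmoid-CDF assumption):

```latex
\ell(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log g'\left(w_j^T x^{(i)}\right) + \log |W| \right)
```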


50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:
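
A NumPy sketch of this update, following the standard form W <- W + α((1 - 2g(W x(i))) x(i)T + (W^T)^-1) with g the sigmoid; the learning rate alpha, the epoch loop, and the identity initialization of W are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sga(X, alpha=1e-3, n_epochs=10, seed=0):
    """Stochastic gradient ascent for the unmixing matrix W (illustrative sketch).

    X has shape (m, n): m training examples x(i) in R^n.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.eye(n)                      # start from the identity as unmixing matrix
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            x = X[i]                   # one training example x(i)
            # W <- W + alpha * ((1 - 2 g(W x)) x^T + (W^T)^{-1})
            W += alpha * (np.outer(1.0 - 2.0 * sigmoid(W @ x), x)
                          + np.linalg.inv(W.T))
    return W
```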


51. The Machine Learning cheatsheets are now available in [target language].


52. Original authors


53. Translated by X, Y and Z


54. Reviewed by X, Y and Z


55. [Introduction, Motivation, Jensen's inequality]


56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]


57. [Dimension reduction, PCA, ICA]