Clustering in ML #6

Open
wants to merge 10 commits into
base: master
643 changes: 643 additions & 0 deletions index.ipynb

Binary file added notebooks/clustering/Images/clustering.jpg
348 changes: 348 additions & 0 deletions notebooks/clustering/index.ipynb
@@ -0,0 +1,348 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"direction:ltr;line-height:300%;\">\n",
"\t<font face=\"Arial\" size=5>\n",
"\t\t<div align=center>\n",
"\t\t\t\t<p></p>\n",
"\t\t\t\t<p></p>\n",
"In the Name of God\n",
" <p></p>\n",
"<br>\n",
" Sharif University of Technology\n",
" <br>\n",
"Computer Engineering Department\n",
" <p></p>\n",
"Artificial Intelligence Course\n",
" <br />\n",
"\t\t\t<br />\n",
" MohammadHossein Rohban\n",
" <br />\n",
"Fall 2021\n",
" </div>\n",
"\t\t<hr/>\n",
"\t\t\t<div align=center>\n",
"Clustering in Machine Learning\n",
" </div>\n",
"\t\t<br />\n",
"\t\t<div align=center>\n",
"Sepehr Amini Afshar\n",
" </div>\n",
"\t\t<hr />\n",
"\t\t<style type=\"text/css\" scoped>\n",
" p{\n",
" border: 1px solid #a2a9b1;background-color: #f8f9fa;display: inline-block;\n",
" };\n",
" </style>\n",
"\t\t<div>\n",
"\t\t\t<h3>Table of Contents</h3>\n",
"\t\t\t<ul style=\"margin-right: 0;\">\n",
"\t\t\t\t<li>\n",
" <a href=\"#sec_intro\">\n",
" Introduction\n",
" </a>\n",
" </li>\n",
" <li>\n",
"\t\t\t\t\t<a href=\"#clustering\">\n",
" Clustering\n",
" </a>\n",
"\t\t\t\t</li>\n",
" <li>\n",
"\t\t\t\t\t<a href=\"#part-base-clust\">\n",
" Partition-Based Clustering\n",
" </a>\n",
"\t\t\t\t</li>\n",
" <li>\n",
"\t\t\t\t\t<a href=\"#hier-clust\">\n",
" Hierarchical Clustering\n",
" </a>\n",
"\t\t\t\t</li>\n",
" <li>\n",
"\t\t\t\t\t<a href=\"#dbscan-clust\">\n",
" Density-Based Clustering\n",
" </a>\n",
"\t\t\t\t</li>\n",
" <li>\n",
"\t\t\t\t\t<a href=\"#em-gmm-clust\">\n",
" Expectation Maximization (EM) clustering using Gaussan Mixture Models (GMM)\n",
" </a>\n",
"\t\t\t\t</li>\n",
" <li>\n",
"\t\t\t\t\t<a href=\"#sec_refs\">\n",
" References\n",
" </a>\n",
"\t\t\t\t</li>\n",
"\t\t\t</ul>\n",
"\t\t</div>\n",
"\t</font>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p></p>\n",
"<br />\n",
"<div id=\"sec_intro\" style=\"direction:ltr;line-height:300%;\">\n",
"\t<font face=\"Arial\" size=5>\n",
"\t\t<font color=#888888 size=6>\n",
"Introduction\n",
" </font>\n",
"\t\t<p></p>\n",
"\t\t<hr>\n",
" <a href=\"https://www.geeksforgeeks.org/clustering-in-machine-learning/\">Clustering</a> is basically a type of <a href=\"https://www.geeksforgeeks.org/supervised-unsupervised-learning/\">unsupervised learning</a> method. It is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them. <br>\n",
" in this notebook we will introduce some algorithms for clustering and note some of their characteristics.\n",
" <br>\n",
" <br/>\n",
" <img src=\"images/clustering.jpg\" height=800 width=700>\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p></p>\n",
"<br />\n",
"<div id=\"clustering\" style=\"direction:ltr;line-height:300%;\">\n",
"\t<font face=\"Arial\" size=5>\n",
"\t\t<font color=#888888 size=6>\n",
" Clustering\n",
" </font>\n",
"\t\t<p></p>\n",
"\t\t<hr>\n",
"There are different approaches to clustering. We could differentiate these sub-categories as follows.\n",
" <ul>\n",
" <li> Porbablistic methods\n",
" <li> Non-Probablistic methods \n",
" <ul>\n",
" <li> Partition-based clustering\n",
" <li> Hierarchical clustering\n",
" <li> Density-based clustering\n",
" <ul/>\n",
" <ul/>\n",
" <br>\n",
" <br/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p></p>\n",
"<br />\n",
"<div id=\"part-base-clust\" style=\"direction:ltr;line-height:300%;\">\n",
"\t<font face=\"Arial\" size=5>\n",
"\t\t<font color=#888888 size=6>\n",
"Partition-Based Clustering\n",
" </font>\n",
"\t\t<p></p>\n",
"\t\t<hr>\n",
"Partition-based clustering techniques try to create partitions of data based on a distance measurement applied to data points. The most common algorithm of this approach is <a href=\"https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1\">k-means</a> clustering. <br>\n",
" K-means clustering tries to minimize distances within a cluster and maximize the distance between different clusters. K-means algorithm is not capable of determining the number of clusters. We need to define it when creating the KMeans object which may be a challenging task. <br>\n",
" K-means is an iterative process. It is built on <a href=\"https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm\">expectation-maximization algorithm</a>. After number of clusters are determined, it works by executing the following steps:\n",
" <ol>\n",
" <li> Randomly select centroids (center of cluster).\n",
" <li> Calculate distance of all data points to all centroids.\n",
" <li> Assign each data point to the closest centroid. data points assigned to a centroid form a cluster.\n",
" <li> Find the new centroids of each cluster by taking the mean of all data points in the cluster.\n",
" </ol>\n",
" Repeat 2-4 until convergence. Convergence could happen when the centroids stop changing or when the total cost function reaches a minimum threshold.\n",
"\n",
"**Pros And cons** \n",
"\n",
"K-Means has the advantage that it’s pretty fast, as all we’re really doing is computing the distances between points and group centers; very few computations! It thus has a linear complexity O(n). Although it may sound fast, but as the number of data points grow larger this time becomes very noticable. because ve run O(n) computation at each iteration. but the number of iterations may vary. <br>\n",
"On the other hand, K-Means has a couple of disadvantages. Firstly, you have to select how many groups/classes there are. This isn’t always trivial and ideally with a clustering algorithm we’d want it to figure those out for us because the point of it is to gain some insight from the data. K-means also starts with a random choice of cluster centers and therefore it may yield different clustering results on different runs of the algorithm. Thus, the results may not be repeatable and lack consistency. Other cluster methods are more consistent. It is also sensitive to outliers.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p></p>\n",
"<br />\n",
"<div id=\"hier-clust\" style=\"direction:ltr;line-height:300%;\">\n",
"\t<font face=\"Arial\" size=5>\n",
"\t\t<font color=#888888 size=6>\n",
"Hierarchical clustering\n",
" </font>\n",
"\t\t<p></p>\n",
"\t\t<hr>\n",
" In this approach we can start with a single cluster and dividing it to smaller clusters, or assume that each point is a cluster and combining smaller clusters to reach a single cluster. These mehtods are Divisive clustering and Agglomerative clustering respectively. <br> \n",
" \n",
"Hierarchical clustering is useful and gives better results if the underlying data has some sort of hierarchy.\n",
"<br>\n",
" \n",
"In agglomerative algorithm, it is not wise to combine all data points into one cluster. The termination constraint could be reducing number of clusters to a certain number or define a distance threshold and when two clusters have a bigger distance than the threshold we do not combine those clusters. \n",
"\n",
"Some common applications of hierarchical clustering:\n",
"<ul>\n",
"<li>Genetic or other biological data can be used to create a dendrogram to represent mutation or evolution levels. <a href=\"https://en.wikipedia.org/wiki/Phylogenetic_tree\">Phylogenetic trees </a> are used to show evolutionary relationships based on similarities and differences.\n",
"<li>Hierarchical clustering is also used for <a href=\"https://www2.cs.sfu.ca/~ester/papers/Encyclopedia.pdf\">grouping text documents</a>.\n",
"<li>Another common use case of hierarchical clustering is <a href=\"https://www.sciencedirect.com/science/article/abs/pii/S0020025515002790\">social network analysis</a>.\n",
"<li>Hierarchical clustering is also used for <a href=\"https://link.springer.com/article/10.1007/s10994-020-05905-4\">outlier detection</a>.\n",
"</ul>\n",
"\n",
"**Pros and Cons** <br>\n",
"You do not have to specify the number of clusters beforehand. \n",
"and it always generates the same clusters. K-means clustering may result in different clusters depending on the how the centroids (center of cluster) are initiated.\n",
"But It is a slower algorithm compared to k-means. Hierarchical clustering takes long time to run especially for large data sets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p></p>\n",
"<br />\n",
"<div id=\"dbscan-clust\" style=\"direction:ltr;line-height:300%;\">\n",
"\t<font face=\"Arial\" size=5>\n",
"\t\t<font color=#888888 size=6>\n",
"Density-Based clustering\n",
" </font>\n",
"\t\t<p></p>\n",
"\t\t<hr>\n",
" The noted methods are efficient when our clusters shape are relatively simple and without outliers. They do not have a mechanism to distinguish outliers. in some tasks e.g. anomaly detection we require to find outliers. Density-based methods will find dense regions and outliers. DBSCAN (density-based spatial clustering of applications with noise) works as follows:\n",
"\n",
" \n",
"We define $eps$ and $minPts$ for the algorithm. $eps$ is maximum distance between two neighbor points. $minPts$ are minimum number of points to define a cluster.\n",
" \n",
"There are some key concept for this algorithm.\n",
" \n",
"* Core Point: a point is core point if there are at least $minPts$ points (including the point itself) in its surrounding area with radius $eps$.\n",
"\n",
"* Border Point: a point which is reachable from a core point and have less than $minPts$ point in its surrounding area.\n",
" \n",
"* Outlier Point: a point which is not a core point and not reachable from any core point. \n",
" \n",
"After determining the $eps$ and $minPts$ the algorithm goes as follows.\n",
"\n",
"1. start from a random point. label it visited. check if it is a core point or not. if it is a core point we have a cluster which for example we call it A. then all the neighbors of the point goes into the queue and we porform 2 for them. if the poinst isn't a core point. mark it as a noise. a noise doesn't belong to any cluster.\n",
" \n",
"2. pop a point from the queue. label it visited. assign it to current cluster (in this example A). add all of its unvisited neighbors to the queue. After the queue finished. select a unvisited point and perform 1.\n",
" \n",
"**Pros And Cons**\n",
"\n",
"This method does not require to specify number of clusters beforehand and also performs well with arbitrary shapes clusters.\n",
"as we described DBSCAN is robust to outliers and able to detect the outliers. but determining the best $eps$ and $minPts$ isn't always a trivial task. and because these two parameters are constants the algorithm cannot perform well on the clustering tasks which the clusters with much different densities."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p></p>\n",
"<br />\n",
"<div id=\"em-gmm-clust\" style=\"direction:ltr;line-height:300%;\">\n",
"\t<font face=\"Arial\" size=5>\n",
"\t\t<font color=#888888 size=6>\n",
"Expectation Maximization (EM) clustering using Gaussan Mixture Models (GMM)\n",
" </font>\n",
"\t\t<p></p>\n",
"\t\t<hr> \n",
"This methods is a probablistic method. One of K-means problem is using mean for deciding the center of the cluster. because of this assumption we are only able to find clusters which their shape is simple circles. GMM is more flexible than K-Means. it assumes that each cluster is a gaussian model in that space. then it tries to find the best mean and standard deviation for each cluster. It does it in an iterative manner. It performs two steps iteratively. Expectation step and maximization step. in Expectation step it tries to findout how likely each point belongs to each cluster. and in maximization step it updates the mean and standard deviation according to the new data points.\n",
" \n",
"The math work behind the above explanation goes as below.\n",
" \n",
"The E step:\n",
" \n",
"$$ p(x) = \\sum_{k=1}^{K} \\pi_k N(x| \\mu_k, {\\sum}_k) $$\n",
" \n",
"The above formula assumes that each data point is a weighted sum of gaussian (Normal) distributions. the number of $k$ is predefined and it defines the number of clusters.\n",
" \n",
"$$ \\gamma(z_{nk}) = \\frac{\\pi_k N(x_n|\\mu_k, {\\sum}_k)}{\\sum_{j=1}^{K}\\pi_j N(x_n|\\mu_j, {\\sum}_j)} $$\n",
" \n",
"In above formula $z_{nk}$ is a latent variable which implies if the $x_n$ belongs to ${cluster}_k$. the above formula shows the responsibility of the $kth$ cluster for the $nth$ data point. \n",
" \n",
"The M step:\n",
"\n",
"After finding the posterior of all datapoints we need to estimate the mean and standard deviation for each of out distributions.\n",
" \n",
"$$ N_k = \\sum_{n=1}^{N} \\gamma(z_{nk}) $$\n",
"\n",
"$$ \\pi_k^{new} = \\frac{N_k}{N}$$ where N is total number of data points.\n",
" \n",
"$$ \\mu_k^{new} = \\frac{1}{N_k} \\sum_{n=1}^{N} \\gamma(z_{nk})x_n $$\n",
"$$ {\\sum}_k^{new} = \\frac{1}{N_k} \\sum_{n=1}^{N} \\gamma(z_{nk})(x_n - \\mu_k^{new})(x_n - \\mu_k^{new})^T $$\n",
"\n",
"All of thses steps are taken to maximize the marginal log likelihood defined as below.\n",
"$$ln p(X|\\mu,\\sum ,\\pi) = \\sum_{n=1}^N ln \\{ \\sum_{k=1}^K \\pi_k N(x_n|\\mu_k , {\\sum}_k) \\} $$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p></p>\n",
"<br/>\n",
"<div id=\"sec_refs\" style=\"direction:ltr;line-height:300%;\">\n",
"\t<font face=\"Arial\" size=5>\n",
"\t\t<font color=#888888 size=6>\n",
"References\n",
" </font>\n",
"\t\t<hr> \n",
" <ul>\n",
" <li>\n",
"<a href=\"https://www.geeksforgeeks.org/clustering-in-machine-learning/\">Clustering in Machine Learning </a> \n",
" </li>\n",
" <li>\n",
"<a href=\"https://www.geeksforgeeks.org/supervised-unsupervised-learning/\">Supervised And Unsupervised Learning </a>\n",
" </li>\n",
" <li>\n",
"<a href=\"https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1\"> Understanding K-means Clustering in Machine Learning</a>\n",
" </li>\n",
" <li>\n",
"<a href=\"https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm\">Expectation-Maximization Algorithm</a>\n",
" </li>\n",
" <li>\n",
"<a href=\"https://towardsdatascience.com/gaussian-mixture-modelling-gmm-833c88587c7f\">Gaussian Mixture Modeling</a>\n",
" </li> \n",
" <li>\n",
"<a href=\"https://machinelearningmastery.com/expectation-maximization-em-algorithm/\">A Gentle Introduction to Expectation-Maximization (EM Algorithm)</a>\n",
" </li> \n",
" <li>\n",
"<a href=\"https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68\">The 5 Clustering Algorithms That Data Scientists Need To Know</a>\n",
" </li>\n",
" <li>\n",
"<a href=\"https://towardsdatascience.com/top-machine-learning-algorithms-for-clustering-a09c6771805\">Top Machine Learning Algorithms for Clustering</a>\n",
" </li>\n",
" </ul> \n",
"\t</font>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
30 changes: 30 additions & 0 deletions notebooks/clustering/metadata.yml
@@ -0,0 +1,30 @@
title: Clustering in Machine Learning

meta:
- name: keywords
content: Artificial Intelligence, Machine Learning, Clustering

header:
title: Clustering in Machine Learning
description: |
In this notebook we introduce clustering and explain its main variants.
authors:
label:
position: top
text: Authors
kind: people
content:
- name: Sepehr Amini Afshar
role: Author
contact:
- link: https://github.com/sepehrAmini
icon: fab fa-github
- link: https://www.linkedin.com/in/sepehr-amini-afshar-2201381b8
icon: fab fa-linkedin
- link: mailto:sepehraminiafshar@gmail.com
icon: fas fa-envelope

comments:
label: false
kind: comments

4 changes: 3 additions & 1 deletion notebooks/index.yml
@@ -1,2 +1,4 @@
- notebook: notebooks/logic_programming
- notebook: notebooks/hmm_speech_recognition
- notebook: notebooks/search_in_continuous_space
- notebook: notebooks/hmm_speech_recognition
- notebook: notebooks/clustering