14_regression #25
Open: PouyaEsmaili wants to merge 21 commits into sut-ai:master from PouyaEsmaili:master
ab86c82 Add main structure (PouyaEsmaili)
fd438fe Add introduction (PouyaEsmaili)
511d89f Add linear regression (PouyaEsmaili)
e70aa02 Update index.yml (PouyaEsmaili)
2dd3978 Update index.yml (PouyaEsmaili)
a7a279c Adds general contents structure
de6b19e Finish linear regression (PouyaEsmaili)
ed943d3 Add polynomials (PouyaEsmaili)
032ea31 Adds Logistic Regression v1.0
c235ecc Merge branch 'master' of https://github.com/PouyaEsmaili/notes
1bc6dbb updates logistic regression to v2.0
34c3cd0 Adds table of contents v1.0
baa7991 Add examples and Refactor (PouyaEsmaili)
aa33bdf Add conclusion (PouyaEsmaili)
f4a5e1c Fix dictations (PouyaEsmaili)
e0da212 Update config files (PouyaEsmaili)
f578746 Merge branch 'sut-ai:master' into master (PouyaEsmaili)
ca7af34 Fix review requests (PouyaEsmaili)
7787d7b Fix review requests (PouyaEsmaili)
9e4a90f Added hyperlinks to References
646112c Revert table of contents to the correct form (PouyaEsmaili)
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,220 @@ | ||
# Regression | ||
|
||
# Table of Contents | ||
|
||
- [Introduction](#introduction) | ||
- [1 - Linear Regression](#1---linear-regression) | ||
- [Loss Function](#loss-function) | ||
- [Mean Squared Error (MSE)](#mean-squared-error-mse) | ||
- [Finding $\hat{w}$](#finding-hatw) | ||
- [Gradient Descent](#gradient-descent) | ||
- [Normal Equation](#normal-equation) | ||
- [2 - Learning Curves Using Polynomials](#2---learning-curves-using-polynomials) | ||
- [Reduction to Linear Regression](#reduction-to-linear-regression) | ||
- [Overfitting](#overfitting) | ||
- [Using Validation Set (Held-Out Data)](#using-validation-set-held-out-data) | ||
- [Regularization](#regularization) | ||
- [3 - Logistic Regression](#3---logistic-regression) | ||
- [How to Calculate Probabilities](#how-to-calculate-probabilities) | ||
- [Defining Line _l_ and Cost Function $J(\theta)$](#defining-line-l-and-cost-function-jtheta) | ||
- [Training](#training) | ||
- [Conclusion](#conclusion) | ||
- [References](#references) | ||
|
||
# Introduction | ||
In regression, we have some data points, each with an assigned number or label, and our goal is to predict the number or label of an unseen data point after learning from the data we already have. We assume each data point is a vector $x$ and we want to predict $f(x)$. A first idea is interpolation, which yields a high-degree polynomial that fits the training data perfectly. The problem is that interpolation leads to overfitting, so the error on unseen data will be large. In regression, we instead look for the best curve of lower degree. This leaves some training error, but the test error decreases because we avoid overfitting. | ||
|
||
# 1 - Linear Regression | ||
|
||
Here, we want to assign $f(x)$ to each data point $x$. In linear regression, we assume $f$ is a linear function. We can define $f$ as | ||
|
||
$$ | ||
|
||
f_w(x) = w^T x = \begin{bmatrix} | ||
w_0 & w_1 & \cdots & w_n | ||
\end{bmatrix} . \begin{bmatrix} | ||
1 \\ x_1 \\ \vdots \\ x_n | ||
\end{bmatrix}. | ||
$$ | ||
We set $x_0 = 1$ in $x$ so that $w_0$ acts as the bias term of our function. The original data points therefore live in an $n$-dimensional space. | ||
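
As a minimal sketch of the definition above, the following NumPy snippet evaluates $f_w(x) = w^T x$ for several data points at once. The helper names `add_bias` and `predict`, as well as the example numbers, are assumptions made for illustration and do not come from the lecture.

```python
import numpy as np

def add_bias(X):
    """Prepend a column of ones so that w[0] plays the role of the bias w_0."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w, X):
    """Evaluate f_w(x) = w^T x for every row x of X (x_0 = 1 is added here)."""
    return add_bias(X) @ w

# Hypothetical example: three data points with two features each.
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
w = np.array([0.5, 2.0, -1.0])  # [w_0, w_1, w_2]
y_w = predict(w, X)             # one prediction per data point
```

Stacking the predictions in one vector matches the definition of $y_w$ used in the rest of the note.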
|
||
We also define $y_w$ as | ||
|
||
$$ | ||
y_w = \begin{bmatrix} | ||
f_w(x^{(1)}) \\ f_w(x^{(2)}) \\ \vdots \\ f_w(x^{(m)}) | ||
\end{bmatrix} | ||
$$ | ||
where $x^{(i)}$ is the $i$-th data point and $m$ is the number of data points. | ||
|
||
## Loss Function | ||
After defining $f$, we need to find the best function. By defining a loss function, we can minimize the loss by changing $w$. With $L$ as our loss function, the best $f$ is | ||
|
||
$$ | ||
\hat{w} = \arg\min_w L(y_w, \hat{y}) \rightarrow f_{best}(x) = f_{\hat{w}}(x) | ||
$$ | ||
|
||
where $\hat{y}$ is the vector of given (target) numbers, one for each data point. | ||
|
||
### Mean Squared Error (MSE) | ||
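The main loss function we use is the squared error. Following common usage we call it the mean squared error (MSE), although the form below drops the $\frac{1}{m}$ averaging factor and scales by $\frac{1}{2}$ instead; a positive constant factor does not change the minimizer. It is defined as | ||
|
||
$$ | ||
MSE(y_w, \hat{y}) = \frac{1}{2} \sum_{i=1}^m \left[ y_w^{(i)} - \hat{y}^{(i)} \right]^2 . | ||
$$ | ||
The main reason for using this function is that its gradient is easy to calculate, so we can use gradient descent to find $\hat{w}$. | ||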
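
The loss can be computed in a couple of lines. The sketch below keeps the factor $\frac{1}{2}$ and the plain sum used in the formula above; the function name `half_sse` and the sample values are hypothetical.

```python
import numpy as np

def half_sse(y_pred, y_true):
    """Loss from the text: one half of the sum of squared errors."""
    residual = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return 0.5 * np.sum(residual ** 2)

# Made-up predictions and targets for three data points.
loss = half_sse([0.5, -0.5, 7.5], [1.0, 0.0, 7.0])
```

Dividing the result by the number of data points would give an averaged version of the same loss with the same minimizer.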
|
||
## Finding $\hat{w}$ | ||
|
||
### Gradient Descent | ||
We want to use gradient descent to find $\hat{w}$. First, we need to calculate $\nabla_w L(y_w, \hat{y})$ because it's used in the gradient descent method. The partial derivatives for MSE follow from the chain rule with $\partial y_w^{(i)} / \partial w_j = x_j^{(i)}$: | ||
|
||
$$ | ||
\frac{\partial MSE}{\partial w_j} = \sum_{i = 1}^m x_j^{(i)} \left[ y_w^{(i)} - \hat{y}^{(i)} \right] | ||
$$ | ||
|
||
So for the gradient, we have: | ||
|
||
$$ | ||
\nabla_w MSE = \begin{bmatrix} | ||
\frac{\partial MSE}{\partial w_0} \\ \vdots \\ \frac{\partial MSE}{\partial w_n} | ||
\end{bmatrix} | ||
$$ | ||
|
||
Now, we can find $\hat{w}$ by the gradient descent algorithm as | ||
|
||
$$ | ||
w^{(t+1)} = w^{(t)} - \eta \nabla_{w} L(w^{(t)}) | ||
$$ | ||
|
||
where $\eta$ is the learning rate and $t$ indexes the iteration (not a data point). | ||
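
The update rule can be sketched as follows. This is an illustrative implementation, not the course's reference code: the gradient is written in matrix form as $X^T(Xw - \hat{y})$, where the rows of `X` are the data points including $x_0 = 1$, and the step size, iteration count, and synthetic data are assumptions chosen so that the loop converges.

```python
import numpy as np

def gradient(w, X, y_true):
    """Gradient of 0.5 * sum((X @ w - y_true)^2) with respect to w."""
    return X.T @ (X @ w - y_true)

def gradient_descent(X, y_true, eta=0.01, n_iters=5000):
    """Repeat w <- w - eta * gradient, starting from the zero vector."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w - eta * gradient(w, X, y_true)
    return w

# Synthetic, noise-free data from y = 1 + 2x (made up for illustration).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])  # first column is x_0 = 1
y_true = 1.0 + 2.0 * x
w_hat = gradient_descent(X, y_true)        # should approach [1, 2]
```

If $\eta$ is chosen too large the iterates diverge, so in practice the learning rate is tuned or the loss is averaged over the data points.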
|
||
### Normal Equation | ||
If we define the matrix $X$ whose rows are the data points, | ||
|
||
$$ | ||
X = \begin{bmatrix} | ||
(x^{(1)})^T \\ \vdots \\ (x^{(m)})^T | ||
\end{bmatrix} | ||
$$ | ||
|
||
and solve the equation $\nabla_w MSE = 0$, we get the normal equation | ||
$$ | ||
\hat{w} = (X^TX)^{-1} X^T \hat{y}. | ||
$$ | ||
However, to use this equation, the columns of $X$ (our features) must be linearly independent. Otherwise, $X^TX$ is singular and $(X^TX)^{-1}$ is not defined. In that case, we can use the pseudo-inverse of $X^TX$ instead. Since inverting the $(n+1) \times (n+1)$ matrix $X^TX$ takes roughly cubic time in the number of features, we usually prefer gradient descent when there are many features. | ||
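
A small sketch of the normal equation in NumPy is shown below. It uses `np.linalg.pinv`, so it also covers the case where $X^TX$ is singular; the function name `normal_equation` and the toy data are assumptions for illustration.

```python
import numpy as np

def normal_equation(X, y_true):
    """Return (X^T X)^+ X^T y, using the pseudo-inverse of X^T X."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y_true

# Toy data generated from y = 1 + 2x; rows of X are [x_0 = 1, x_1].
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y_true = 1.0 + 2.0 * x
w_hat = normal_equation(X, y_true)
```

In practice `np.linalg.lstsq(X, y_true, rcond=None)` solves the same least-squares problem without forming $X^TX$ explicitly, which is usually better conditioned.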
|
||
# 2 - Learning Curves Using Polynomials | ||
|
||
In this section, we want to find the best $P(x)$, where $x$ is a real number and $P$ is an $n$-th degree polynomial. | ||
|
||
## Reduction to Linear Regression | ||
|
||
We can define | ||
$$ | ||
z = \begin{bmatrix} | ||
x^0 \\ x^1 \\ \vdots \\ x^n | ||
\end{bmatrix}, | ||
$$ | ||
then use linear regression to find $f_w(z) = w^T z$. By definition, $P(x) = f_w(z) = \sum_{j=0}^n w_j x^j$. However, if $n$ is too large, overfitting might happen since we get closer and closer to interpolation. | ||
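
The reduction can be sketched with `np.vander`, which builds the vector $z$ for every data point at once. The helper names `poly_features` and `fit_polynomial` and the sample data are hypothetical; the fit itself uses NumPy's least-squares solver rather than any method specific to the lecture.

```python
import numpy as np

def poly_features(x, n):
    """Map each scalar x to z = [x^0, x^1, ..., x^n]."""
    return np.vander(np.asarray(x, dtype=float), N=n + 1, increasing=True)

def fit_polynomial(x, y_true, n):
    """Least-squares fit of an n-th degree polynomial via linear regression on z."""
    Z = poly_features(x, n)
    w, *_ = np.linalg.lstsq(Z, y_true, rcond=None)
    return w

# Noise-free samples of P(x) = 1 - 2x + 3x^2 (made up for illustration).
x = np.array([-1.0, 0.0, 1.0, 2.0])
y_true = 1.0 - 2.0 * x + 3.0 * x ** 2
w_hat = fit_polynomial(x, y_true, n=2)  # coefficients ordered as [w_0, w_1, w_2]
```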
|
||
## Overfitting | ||
To prevent overfitting, we must choose $n$ carefully. Since $n$ is a hyperparameter here, we can use a validation set to select it. | ||
|
||
The example below demonstrates how $n$ changes the curve we find and when overfitting happens. The green curve is our goal. The red curve is the fitted polynomial of degree $M$ (here $M$ plays the role of $n$). | ||
|
||
![M = 1](images/1.png) | ||
|
||
![M = 3](images/3.png) | ||
![M = 9](images/9.png) | ||
|
||
### Using Validation Set (Held-Out Data) | ||
We set aside a part of the training data and don't use it for training. Then, by calculating the loss function over this held-out data, we can choose $n$. Since the model hasn't seen these data points, the loss on them estimates how well the model generalizes, so choosing $n$ this way reduces overfitting. | ||
|
||
We can see in the above examples that $M = 3$ gives us the best curve. If we compare the validation loss for $M=3$ and $M=9$, the loss will be less for $M=3$ since it's closer to the green curve. | ||
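
A rough sketch of this selection procedure is given below. The data are synthetic (noisy samples of $\sin(2\pi x)$, an assumption for illustration and not necessarily the curve shown in the figures), and the split sizes and candidate degrees are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy samples of sin(2*pi*x).
x = rng.uniform(0.0, 1.0, size=40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Hold out the last 10 points; they are never used for fitting.
x_train, y_train = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]

def features(x, n):
    """Polynomial features [x^0, ..., x^n] for each sample."""
    return np.vander(x, N=n + 1, increasing=True)

def validation_loss(n):
    """Fit degree n on the training part, then measure the loss on held-out data."""
    w, *_ = np.linalg.lstsq(features(x_train, n), y_train, rcond=None)
    residual = features(x_val, n) @ w - y_val
    return 0.5 * np.sum(residual ** 2)

losses = {n: validation_loss(n) for n in range(10)}
best_n = min(losses, key=losses.get)  # degree with the smallest validation loss
```

The selected degree depends on the random split, which is why the held-out set should be large enough to give a stable estimate.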
|
||
### Regularization | ||
Although using a validation set is a good way to prevent overfitting, the model may still fit the training points too closely instead of capturing the underlying curve. Regularization tries to make $w$ smaller. The intuition is that large coefficients in $w$ appear when the curve bends sharply to pass through every training point. Penalizing large coefficients gives a smoother curve that may lie further from individual points but does a better job of predicting unseen data. | ||
|
||
We only define $\ell_2$ regularization since it's easier to derive and understand. The loss function is | ||
|
||
$$ | ||
\tilde{E}(y_w, \hat{y}) = \frac{1}{2} \sum_{i=1}^m \left[ y_w^{(i)} - \hat{y}^{(i)} \right]^2 + \frac{\lambda}{2}\lVert w \rVert^2 | ||
$$ | ||
where $\lambda \ge 0$ is a hyperparameter that controls how strongly large values in $w$ are penalized. | ||
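
Setting the gradient of this regularized loss to zero gives the closed form $w = (Z^TZ + \lambda I)^{-1} Z^T \hat{y}$, where the rows of $Z$ are the feature vectors. The sketch below applies it to 9th-degree polynomial features; the function name `ridge_fit`, the synthetic data, and the two values of $\lambda$ are assumptions for illustration.

```python
import numpy as np

def ridge_fit(Z, y_true, lam):
    """Minimize 0.5*||Z w - y||^2 + 0.5*lam*||w||^2 in closed form."""
    n_params = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(n_params), Z.T @ y_true)

# Ten noisy samples of sin(2*pi*x), made up for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y_true = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
Z = np.vander(x, N=10, increasing=True)  # 9th-degree polynomial features

w_small_lam = ridge_fit(Z, y_true, lam=1e-8)  # nearly unregularized
w_large_lam = ridge_fit(Z, y_true, lam=1.0)   # strongly regularized
```

Comparing `np.linalg.norm` of the two weight vectors shows the shrinking effect: a larger $\lambda$ yields a smaller $\lVert w \rVert$. Note that this form also penalizes the bias $w_0$, exactly as the loss above does.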
|
||
In the following examples, we use regularization with 9th-degree polynomials. We can compare these with the overfitted example above. We can also see how important the choice of $\lambda$ is, so a validation set is still useful for choosing $\lambda$. | ||
|
||
![Regularization 18](images/r18.png) | ||
![Regularization 0](images/r0.png) | ||
|
||
# 3 - Logistic Regression | ||
|
||
There are some regression algorithms that can be used for classification (and vice versa). _Logistic Regression_ is one of these algorithms. It estimates the probability that an instance belongs to a particular class. We can use this probability for classification: if the probability is higher than 0.5, we predict that the instance belongs to that class. | ||
|
||
## How to Calculate Probabilities | ||
|
||
Let's start with a simple example. We want to distinguish flowers that are _Iris Virginica_ from those that are _not Iris Virginica_. This classification is based on Petal Width and Petal Height. Also, assume that there is some line _l_ (the black line in the image below) that separates these 2 types of flowers. | ||
|
||
![image info](./images/Logistic%20Regression.jpg) | ||
|
||
Now, to calculate the probabilities, we use the intuition that the further a point is from _l_, the more confident we are about which class it belongs to. | ||
To be more exact, suppose we want to find the probability of a given flower being _Iris Virginica_. Let _t_ be the signed distance between the flower and the line: positive if the flower is in the _Iris Virginica_ region and negative if it is in the _not Iris Virginica_ region. The higher the value of _t_, the higher the probability of the flower being _Iris Virginica_. So we need a function $\sigma(\cdot)$ that, given the signed distance _t_, returns the probability _p_. | ||
The logistic function is a common choice for this purpose. It is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1. | ||
|
||
Logistic function: | ||
$$ | ||
\sigma(t) = \frac{1}{1 + \exp(-t)} | ||
$$ | ||
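
The logistic function is a one-liner in NumPy. The sketch below is illustrative; the name `sigmoid` and the sample inputs are assumptions.

```python
import numpy as np

def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t)), applied element-wise."""
    t = np.asarray(t, dtype=float)
    return 1.0 / (1.0 + np.exp(-t))

# Signed distances on both sides of the line and exactly on it.
p = sigmoid(np.array([-2.0, 0.0, 2.0]))
```

A point on the line ($t = 0$) gets probability 0.5, and $\sigma(-t) = 1 - \sigma(t)$, so the two sides of the line are treated symmetrically. For very negative inputs `np.exp(-t)` can overflow with a warning; `scipy.special.expit` is a numerically robust alternative.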
|
||
In the image above, each colored line marks the points that share the same predicted probability. For example, all the points on the green line have a probability of 0.9. These probabilities are calculated using the logistic function. | ||
The image below shows the value of the logistic function for different inputs: | ||
|
||
![image info](./images/Logistic%20function.jpg) | ||
|
||
## Defining Line _l_ and Cost Function $J(\theta)$ | ||
|
||
Now that we know how to calculate the probabilities, the problem is reduced to finding the best line _l_ that separates 2 classes. For this, first, we need to define the line and second, we need to define a cost function $J(\theta)$ and then we should find the line _l_ such that it minimizes $J(\theta)$. | ||
|
||
We define line _l_ with parameters $\theta$, just as we defined $f_w$ with $w$ before (including $x_0 = 1$ for the bias). With this definition, the signed distance between $x$ and _l_ is, up to a positive scaling factor that depends only on $\theta$, | ||
$$ | ||
t_{\theta}(x) = x^{T}\theta | ||
$$ | ||
Now we can define the probability of a given $x$ with respect to line _l_ with parameters $\theta$: | ||
|
||
$$ | ||
\hat{P} = \sigma(t_{\theta}(x)) = \sigma(x^{T}\theta) | ||
$$ | ||
|
||
We consider that our dataset has _N_ data points $x^{i}$ with labels $y^{i} \in \{0 , 1\}$. | ||
The cost function for just a single data point can be defined as: | ||
|
||
$$ | ||
J(\theta) = -y\log(\hat{P}) - (1 - y)\log(1 - \hat{P}) | ||
$$ | ||
where $\hat{P}$ is the probability predicted for this data point given the line _l_ with parameters $\theta$. This cost is small when $\hat{P}$ is close to the true label $y$ and grows without bound as $\hat{P}$ approaches the wrong label. | ||
|
||
And for _N_ data points, we can define the cost function as the average: | ||
|
||
$$ | ||
J(\theta) = -\frac{1}{N}\sum_{i = 1}^{N}\left[ y^{i}\log(\hat{P}^{i}) + (1 - y^{i})\log(1 - \hat{P}^{i}) \right] | ||
$$ | ||
where $\hat{P}^{i}$ is the predicted probability and $y^{i}$ is the label for the $i^{th}$ data point, given the line _l_ with parameters $\theta$. | ||
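
The cost over the whole dataset can be sketched as below, averaging the per-point cost over all $N$ data points. The function name `log_loss`, the tiny dataset, and the two parameter vectors are hypothetical; the first column of `X` is the bias feature $x_0 = 1$.

```python
import numpy as np

def log_loss(theta, X, y):
    """Average cross-entropy J(theta) for labels y in {0, 1}."""
    p_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Four made-up points on a line; negative x is class 0, positive x is class 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

J_zero = log_loss(np.zeros(2), X, y)           # every prediction is 0.5
J_good = log_loss(np.array([0.0, 2.0]), X, y)  # boundary at x = 0
```

With $\theta = 0$ every predicted probability is 0.5, so the cost equals $\log 2$; a $\theta$ that separates the classes gives a lower cost.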
|
||
## Training | ||
|
||
Unfortunately, there is no known closed-form equation to compute the value of $\theta$ that minimizes this cost function. But the good news is that this cost function is convex! So we can use gradient descent to find the minimum. Because of convexity, gradient descent is guaranteed to approach the global minimum, provided the learning rate is not too large and we run enough iterations. | ||
We can compute the partial derivative with respect to the $j^{th}$ parameter of $\theta$ as follows: | ||
|
||
$$ | ||
\displaystyle \frac{\partial}{\partial\theta_{j}}J(\theta) = \frac{1}{N}\sum_{i = 1}^{N}(\sigma(\theta^{T}x^{i}) - y^{i})x^{i}_{j} | ||
$$ | ||
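
Training can be sketched as plain batch gradient descent. In matrix form the gradient of the average cross-entropy cost is $\frac{1}{N} X^T(\sigma(X\theta) - y)$, which is what the loop below uses. The function name `fit_logistic`, the learning rate, the iteration count, and the small non-separable dataset are all assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.1, n_iters=2000):
    """Batch gradient descent on the average cross-entropy cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta = theta - eta * grad
    return theta

# Made-up, non-separable 1-D data; the first column is the bias feature x_0 = 1.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -0.2, 0.3])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0])
X = np.column_stack([np.ones_like(x), x])

theta_hat = fit_logistic(X, y)
predictions = (sigmoid(X @ theta_hat) >= 0.5).astype(float)  # threshold at 0.5
```

The data are deliberately not perfectly separable: on separable data the cost has no finite minimizer and $\theta$ keeps growing, which is one more reason to add regularization in practice.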
|
||
# Conclusion | ||
|
||
Regression is used in a variety of AI problems. We can use it to find the best line or curve to predict a function, or to find the best hyperplane dividing two classes in a classification problem. In this lecture note, we discussed how to solve these problems and how to prevent overfitting. However, for more complex functions or more classes, you might want to take a look at neural networks. | ||
|
||
# References | ||
- Artificial Intelligence Course at Sharif University of Technology (Fall, 2021) | ||
- Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems - 2nd Edition | ||
- [*Linear Regression* from Wikipedia.org](https://en.wikipedia.org/wiki/Linear_regression) | ||
- [*Compute the gradient of mean square error* from math.stackexchange.com](https://math.stackexchange.com/questions/1962877/compute-the-gradient-of-mean-square-error) | ||
- [*Normal Equation in Python: The Closed-Form Solution for Linear Regression* from towardsdatascience.com](https://towardsdatascience.com/normal-equation-in-python-the-closed-form-solution-for-linear-regression-13df33f9ad71) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
title: Regression | ||
|
||
header: | ||
title: Regression | ||
|
||
authors: | ||
label: | ||
position: top | ||
text: Authors | ||
kind: people | ||
content: | ||
- name: Mehrshad Mirmohammadi | ||
role: Author | ||
contact: | ||
- icon: fab fa-github | ||
link: https://github.com/Helium-5 | ||
- icon: fas fa-envelope | ||
link: mailto:mehrshad.mirmohammadi@gmail.com | ||
|
||
- name: Nazanin Azarian | ||
role: Author | ||
contact: | ||
- icon: fab fa-github | ||
link: https://github.com/Nazhixx | ||
- icon: fas fa-envelope | ||
link: mailto:badi.mojgan@gmail.com | ||
|
||
- name: Pouya Esmaili | ||
role: Author | ||
contact: | ||
- icon: fab fa-github | ||
link: https://github.com/PouyaEsmaili | ||
- icon: fas fa-envelope | ||
link: mailto:poesmaili@gmail.com | ||
|
||
- name: Nima Jamali | ||
role: Supervisor | ||
contact: | ||
- icon: fas fa-envelope | ||
link: mailto:nimxj4141@gmail.com |
Title of the page should be in another format. Change it and put the name of authors inside, too.
I'm not sure how it should be. Comparing with other notebooks, the title seems fine. @nimajam41