- Logistic Regression Mind Map
- About Logistic Regression
- Logistic regression using geometric intuition
- Weight vector
- L2 regularization; overfitting and underfitting
- L1 regularization & sparsity
- Logistic regression using Loss minimization
- Hyperparameter
- Column/feature standardization
- Feature importance and Model interpretation
- Collinearity or multicollinearity
- Real world cases
- What if data is not linearly separable?
- Multinomial Logistic Regression
- Advantages of Logistic Regression
- Disadvantages of Logistic Regression
- Generalized linear models (GLM)
- Acknowledgements
- Connect with me
- Classification technique
- Simple algorithm
- Logistic regression calculates the probability that a given point belongs to a specific class. If the probability is more than 50%, the point is assigned to that class; otherwise it is assigned to the other class (see the sketch after this list).
- Logistic regression has multiple interpretations:
- Geometric intuition (easy and simple)
- Probability
- Loss function
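Below is a minimal sketch of the 50% rule described above; it assumes scikit-learn and a synthetic dataset (`make_classification`), which are not part of the original notes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:5])[:, 1]            # P(y = 1) for the first five points
manual_labels = (proba > 0.5).astype(int)         # assign class 1 when probability exceeds 50%
print(np.round(proba, 3), manual_labels, clf.predict(X[:5]))
# manual thresholding at 0.5 matches clf.predict for binary classification
```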
Assumption made by LR: the data is linearly or almost linearly separable.
The task of LR is to find the w and b corresponding to the plane (π) that separates the positive points from the negative points.
Diving into the math.
The geometric formulation is the following optimization problem (with class labels y_i ∈ {-1, +1}):

$$w^* = \underset{w}{\operatorname{argmax}} \sum_{i=1}^{n} \frac{y_i \,\left(w^T x_i + b\right)}{\lVert w \rVert}$$

The above equation reads: find the optimal w that maximizes the sum of the signed distances of the training points from the plane.
In the above equation we use the signed distance to find the optimal w, and the signed distance is very sensitive to outliers: a single extreme point can drag us toward the wrong w.
So we have to modify the formulation; instead of the raw signed distance, let's pass it through a function that dampens the effect of outliers.
One such function is the sigmoid function.
- Why sigmoid ❓
- The sigmoid function tapers off (saturates) for large values such as outliers, while for small values it behaves almost linearly
- Provides nice probabilistic interpretation
- The sigmoid function’s range is bounded between 0 and 1. Thus it’s useful in calculating the probability for the Logistic function
- Its derivative is easier to compute than that of many other functions, which is useful during gradient descent
- It is a simple way of introducing non-linearity to the model.
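The following small numpy sketch (my own illustration, not from the notes) shows the tapering behaviour: large signed distances, such as outliers, saturate near 0 or 1, while small distances stay in the near-linear region around 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Signed distances of a few points from the plane; the ±20 values play the role of outliers.
signed_distances = np.array([-20.0, -2.0, -0.5, 0.0, 0.5, 2.0, 20.0])
print(np.round(sigmoid(signed_distances), 3))
# -> [0.    0.119 0.378 0.5   0.622 0.881 1.   ]
# After squashing, the outliers at ±20 barely differ from the points at ±2,
# so they no longer dominate the sum in the optimization problem.
```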
Now the optimization problem looks like (absorbing b into w):

$$w^* = \underset{w}{\operatorname{argmax}} \sum_{i=1}^{n} \sigma\!\left(y_i\, w^T x_i\right) = \underset{w}{\operatorname{argmax}} \sum_{i=1}^{n} \frac{1}{1 + \exp\!\left(-y_i\, w^T x_i\right)}$$

Let's apply a monotonic function (the logarithm) to the objective to make it easier to work with; a monotonically increasing function does not change where the maximum is attained. After taking the log and flipping the sign, the equation looks like:

$$w^* = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\, w^T x_i\right)\right)$$

The simple rule from optimization used here: for a monotonically increasing g, argmax f(x) = argmax g(f(x)), and argmax f(x) = argmin (-f(x)).
This is the optimization problem of logistic regression.
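A small numpy sketch (assumed setup, labels in {-1, +1}) of this objective, just to make the formula concrete:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w^T x_i)) over the training set."""
    margins = y * (X @ w)                     # y_i * w^T x_i
    return np.sum(np.log1p(np.exp(-margins)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ true_w)                       # labels in {-1, +1}

print(logistic_loss(true_w, X, y))            # small: w agrees with the labels
print(logistic_loss(-true_w, X, y))           # large: w disagrees with the labels
```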
The w* obtained from this optimization problem is the weight vector.
w is a d-dimensional vector: every feature f_i has a corresponding weight w_i associated with it.
Interpretation of w:
- when w_i is positive and the feature value x_qi increases, P(y_q = +1) increases
- when w_i is negative and the feature value x_qi increases, P(y_q = +1) decreases, i.e. P(y_q = -1) increases
Therefore, if the weight corresponding to feature f_i is positive, then for any query point x_q, increasing the value of that feature increases the probability of the point belonging to the positive class, and vice versa.
Without any constraint, the weights tend toward positive and negative infinity, because pushing the sigmoid outputs to exactly 0 or 1 best fits the training data (i.e. the model overfits).
- Regularization
The L2 regularization term does not let w tend to infinity; the remaining part of the optimization problem is known as the loss term:

$$w^* = \underset{w}{\operatorname{argmin}} \; \underbrace{\sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\, w^T x_i\right)\right)}_{\text{loss term}} \; + \; \lambda\, \lVert w \rVert_2^2$$
Lambda is the hyperparameter in logistic regression
- when lambda = 0: no regularization; the model is prone to overfitting (high variance)
- when lambda is very large: the influence of the loss term is reduced, i.e. we barely use the training data to find the best w; the model is prone to underfitting (high bias)
Min ( loss function over training data + regularization )
- Using Cross validation
- k-fold CV
- simple CV
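A hedged sketch of k-fold cross-validation for picking the regularization strength; note that scikit-learn's `LogisticRegression` exposes `C = 1/lambda`, so small C means strong regularization (the dataset and candidate values here are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Try a few candidate values of C = 1/lambda and keep the one with the best
# mean 5-fold cross-validation accuracy.
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C={C:<6} mean CV accuracy = {scores.mean():.3f}")
```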
An alternative to L2 regularization is L1 regularization.
- The L1 regularization term also does not let w tend to infinity
- The remaining part of the optimization problem is known as the loss term
- Lambda is the hyperparameter in logistic regression
L1 and L2 regularization serve the same purpose, but L1 regularization has one major advantage, i.e. sparsity (see the sketch below).
- The solution to logistic regression is said to be sparse if many w_i's in the weight vector are zero.
- When we use L1 regularization in logistic regression, the weights of all the less important features become exactly zero.
- When we use L2 regularization, the w_i's become small values but not necessarily zero.
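A small comparison sketch (synthetic data, my own example) showing the sparsity effect: with an L1 penalty many weights are exactly zero, with an L2 penalty they only shrink:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("zero weights with L1:", np.sum(l1.coef_ == 0))   # typically many zeros
print("zero weights with L2:", np.sum(l2.coef_ == 0))   # typically none
```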
Another alternative to L2 regularization is the elastic net.
The elastic net uses both the L1 and the L2 penalties:

$$w^* = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\, w^T x_i\right)\right) + \lambda_1\, \lVert w \rVert_1 + \lambda_2\, \lVert w \rVert_2^2$$
Loss minimization interpretation:
- Loss function as logistic loss gives Logistic regression
- Loss function as hinge loss gives SVM
- Loss function as exponential loss gives Adaboost
- Loss function as squared loss gives Linear regression
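To make the list above concrete, here is an illustrative numpy sketch (my own, not from the notes) of the four losses evaluated on a few margins z = y_i * w^T x_i; swapping the loss function is what changes the model:

```python
import numpy as np

z = np.linspace(-2, 2, 5)                        # margins: negative = misclassified

logistic_loss    = np.log1p(np.exp(-z))          # -> Logistic Regression
hinge_loss       = np.maximum(0.0, 1 - z)        # -> SVM
exponential_loss = np.exp(-z)                    # -> AdaBoost
squared_loss     = (1 - z) ** 2                  # -> Linear Regression (for y in {-1, +1})

for name, loss in [("logistic", logistic_loss), ("hinge", hinge_loss),
                   ("exponential", exponential_loss), ("squared", squared_loss)]:
    print(f"{name:<12}", np.round(loss, 3))
```

All four penalize points on the wrong side of the plane (negative margins) heavily and points on the correct side lightly, which is why they lead to closely related linear models.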
Lambda in the optimization problem is the hyperparameter of logistic regression.
- when lambda = 0; overfit
- when lambda = infinity; underfit
- Grid search
- Random search
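The two search strategies side by side, as a hedged sketch (again using C = 1/lambda and a placeholder dataset); grid search tries every value on a fixed grid, random search samples candidates from a distribution:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

grid = GridSearchCV(model, {"C": [0.01, 0.1, 1, 10, 100]}, cv=5).fit(X, y)
rand = RandomizedSearchCV(model, {"C": loguniform(1e-3, 1e3)},
                          n_iter=20, cv=5, random_state=0).fit(X, y)

print("grid search   best:", grid.best_params_, round(grid.best_score_, 3))
print("random search best:", rand.best_params_, round(rand.best_score_, 3))
```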
In logistic regression it is mandatory to perform column/feature standardization before training the model, because we are computing distances between the line/plane and the query points; features on larger scales would otherwise dominate those distances.
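A minimal sketch of what standardization does to each column (the dataset is just an example with features on very different scales):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Before: columns live on very different scales.
print(np.round(X.mean(axis=0)[:3], 2), np.round(X.std(axis=0)[:3], 2))

# After: every column has mean 0 and standard deviation 1, so no single feature
# dominates the distance between the plane and a query point.
X_std = StandardScaler().fit_transform(X)
print(np.round(X_std.mean(axis=0)[:3], 2), np.round(X_std.std(axis=0)[:3], 2))
```

In a real pipeline the scaler is fit on the training split only, so the same means and standard deviations are reused at prediction time.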
Once we have computed the optimal weight vector w, we can determine feature importance.
- Pick the features whose weights have the largest absolute values in the w vector.
Building on feature importance, another benefit of logistic regression is model interpretability (a ranking sketch follows below).
- Linear/hyperplane
- Data is linearly/almost linearly separable
- Top values of w weight vector
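A sketch of this idea (assumed dataset, standard scikit-learn API) that ranks features by the absolute value of their learned weights; the features are standardized first so the weights are comparable, and the ranking is only trustworthy when the features are not collinear/multicollinear:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(data.data, data.target)

weights = pipe.named_steps["logisticregression"].coef_[0]
order = np.argsort(np.abs(weights))[::-1]        # largest |w_i| first
for i in order[:5]:
    print(f"{data.feature_names[i]:<25} w = {weights[i]:+.3f}")
```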
- Perform up/down sampling (for imbalanced data; see the sketch after this list)
- Outliers have less impact on logistic regression because of the sigmoid function
- Alternatively, remove the outliers and then train
- Traditional strategies (reference)
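A hedged sketch of up-sampling the minority class with `sklearn.utils.resample` (the class sizes here are made up for illustration):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_major = rng.normal(0, 1, size=(900, 2))        # 900 majority-class points
X_minor = rng.normal(3, 1, size=(100, 2))        # 100 minority-class points

# Sample the minority class with replacement until both classes have 900 points.
X_minor_up = resample(X_minor, replace=True, n_samples=900, random_state=0)

X = np.vstack([X_major, X_minor_up])
y = np.hstack([np.zeros(900), np.ones(900)])
print(X.shape, np.bincount(y.astype(int)))       # balanced: 900 vs 900
```

Down-sampling is the mirror image: draw only as many majority points as there are minority points.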
- Binary, linearly separable data
- Low latency
- Fast to train
- Works fairly well on large datasets
- Worst case: when the data is not linearly separable
Logistic regression works well if the data is linearly separable. What if the data is not linearly separable?
- When the data is not linearly separable, we have to perform a feature transformation so that the features are mapped from the original feature space to a transformed feature space in which they become (almost) linearly separable (see the sketch below).
- We get to know which transformation to use through learning/practising/experience.
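A minimal sketch (toy data) of such a transformation: concentric circles are not linearly separable in the original space, but adding squared/interaction features makes them separable for logistic regression:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

raw = LogisticRegression().fit(X, y)
transformed = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print("original feature space accuracy:   ", raw.score(X, y))          # close to 0.5
print("transformed feature space accuracy:", transformed.score(X, y))  # close to 1.0
```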
- Many times, there are classification problems where the number of classes is greater than 2.
- We can extend Logistic regression for multi-class classification.
- The logic is simple: we train a logistic model for each class and calculate the probability h_θ(x) that a given point belongs to that class.
- Once we have trained the model for all the classes, we predict a new point's class by choosing the class for which the probability h_θ(x) is maximum (see the sketch after this list).
- Although we have libraries that we can use to perform multinomial logistic regression, we rarely use logistic regression for classification problems where the number of classes is more than 2.
- There are many other classification models for such scenarios.
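A short sketch of multiclass prediction with scikit-learn (the iris dataset here is just an example): one probability is produced per class and the predicted class is the argmax:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:3])                       # one probability per class, rows sum to 1
print(np.round(proba, 3))
print(clf.predict(X[:3]), np.argmax(proba, axis=1))    # prediction = class with maximum probability
```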
- It is very simple and easy to implement.
- The output (a probability) is more informative than that of many other classification algorithms
- It expresses the relationship between independent and dependent variables
- Very effective with linearly separable data
- Not effective with data that is not linearly separable
- Not as powerful as other classification models
- Multiclass classification is much easier to do with other algorithms than with logistic regression
- It can only predict categorical outcomes
- Logistic regression can be extended to the family of generalized linear models (GLMs).
- From the probability perspective, logistic regression is the combination of Gaussian naive Bayes with a Bernoulli random-variable assumption on the class labels.
- From the above statement, when we change the Bernoulli to a multinomial distribution, we get multinomial logistic regression, which is used for multiclass classification.
- Similarly, there are linear regression (Gaussian assumption) and Poisson regression (Poisson assumption); a compact summary follows below.
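A compact summary (my own notation) of the GLM view sketched above: each model pairs a distribution for y with a link between its mean and the linear term w^T x:

```latex
\begin{align*}
\text{Logistic regression:} \quad & y \sim \mathrm{Bernoulli}(p), & \log\tfrac{p}{1-p} &= w^\top x \\
\text{Multinomial logistic regression:} \quad & y \sim \mathrm{Multinomial}(p_1,\dots,p_k), & p_j &= \frac{e^{w_j^\top x}}{\sum_{c=1}^{k} e^{w_c^\top x}} \\
\text{Linear regression:} \quad & y \sim \mathcal{N}(\mu,\sigma^2), & \mu &= w^\top x \\
\text{Poisson regression:} \quad & y \sim \mathrm{Poisson}(\lambda), & \log\lambda &= w^\top x
\end{align*}
```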
- Google Images
- Appliedai
- Ineuron
- Other google sites