- About NB Model
- Naive Bayes Mind Map
- Fundamentals of probability
- NB working
- NB Classifier on text data
- Limitations of NB
- Failure of NB
- Numerical Stability
- Bias-Variance trade-off for NB
- Feature Importance in NB
- Interpretability in NB
- Imbalanced data in NB
- Outliers in NB
- Missing values in NB
- Can NB do multi-class classification?
- Can NB handle large dimensional data?
- Best & worst case of NB
- Advantages
- Disadvantages
- Acknowledgements
- Connect with me
- Based on fundamentals of probability
- Classification algorithm
- Simplistic & unsophisticated algorithm
Derive Bayes theorem from conditional probability
According to Wikipedia, in probability theory and statistics, **Bayes' theorem** (alternatively *Bayes' law* or *Bayes' rule*) describes the probability of an event, based on prior knowledge of conditions that might be related to the event. Mathematically, it can be written as:
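P(A|B) = P(B|A) ∗ P(A) / P(B)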
Where A and B are events and P(B)≠0
- P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
- P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
- P(A) and P(B) are the probabilities of observing A and B respectively; they are known as the marginal probability.
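To see where this comes from, here is the standard derivation from the definition of conditional probability:

P(A|B) = P(A ∩ B) / P(B) and P(B|A) = P(A ∩ B) / P(A)

Both expressions contain the joint probability P(A ∩ B), so P(A|B) ∗ P(B) = P(B|A) ∗ P(A); dividing both sides by P(B) gives Bayes' theorem, P(A|B) = P(B|A) ∗ P(A) / P(B).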
The problem statement:
There are two machines which manufacture bulbs. Machine 1 produces 30 bulbs per hour and machine 2 produces 20 bulbs per hour. Out of all bulbs produced, 1% turn out to be defective. Out of all the defective bulbs, the share of each machine is 50%. What is the probability that a bulb produced by machine 2 is defective?
We can write the information given above in mathematical terms as:
The probability that a bulb was made by Machine 1, P(M1)=30/50=0.6
The probability that a bulb was made by Machine 2, P(M2)=20/50=0.4
The probability that a bulb is defective, P(Defective)=1%=0.01
The probability that a defective bulb came out of Machine 1, P(M1 | Defective)=50%=0.5
The probability that a defective bulb came out of Machine 2, P(M2 | Defective)=50%=0.5
Now, we need to calculate the probability that a bulb produced by machine 2 is defective, i.e., P(Defective | M2). Using Bayes' theorem above, it can be written as:
P(Defective|M2) = P(M2|Defective) ∗ P(Defective) / P(M2)
Substituting the values, we get:
P(Defective|M2) = 0.5 ∗ 0.01 / 0.4 = 0.0125
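A quick sanity check of this arithmetic in Python (the variable names are purely illustrative):

```python
# Worked example: P(Defective | M2) via Bayes' theorem
p_m2 = 20 / 50              # P(M2): share of bulbs made by machine 2
p_defective = 0.01          # P(Defective): overall defect rate
p_m2_given_defective = 0.5  # P(M2 | Defective): machine 2's share of the defective bulbs

p_defective_given_m2 = p_m2_given_defective * p_defective / p_m2
print(p_defective_given_m2)  # 0.0125
```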
- When a new word appears in a test query that was not present in the training data, the likelihood of that word cannot be computed.
- What are the possible ways to handle this?
- How about dropping the new word? Dropping it is equivalent to saying its likelihood is equal to 1.
- If the likelihood is assigned 0, the entire Bayes-theorem multiplication collapses to 0.
- Therefore, neither 1 nor 0 makes sense!
- We need a better scheme to handle this, i.e., Laplace smoothing / additive smoothing.
- https://en.wikipedia.org/wiki/Additive_smoothing
- alpha is the hyperparameter. When alpha is very large, the smoothed likelihood is pushed towards approximately 1/2 (half) for a binary word feature.
- A value of 1/2 is better, because it is reasonable to say "fifty-fifty" about a word never seen during training.
- It is called smoothing because, as the value of alpha increases, the likelihood estimates are pulled towards this value gradually rather than abruptly; this gradual flattening is the smoothing.
- Note: Laplace smoothing is applied not only at test time but also during training (see the sketch below).
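A minimal sketch of additive (Laplace) smoothing, following the formula on the Wikipedia page linked above; the counts below are made up purely for illustration:

```python
def smoothed_likelihood(count, total, k, alpha=1.0):
    """Additive (Laplace) smoothing: (count + alpha) / (total + alpha * k),
    where k is the number of distinct values the feature can take."""
    return (count + alpha) / (total + alpha * k)

# Binary word feature (word present / absent), so k = 2; counts are hypothetical
print(smoothed_likelihood(0, 1000, k=2, alpha=1))       # unseen word: small but non-zero
print(smoothed_likelihood(0, 1000, k=2, alpha=10_000))  # very large alpha: ~0.5, i.e. "half"
```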
- When the dimensionality of the training data is large, multiplying many probabilities that all lie between 0 and 1 produces an extremely small number (e.g., 0.0004 or far smaller), which can underflow.
- To avoid this numerical stability issue in NB, operate on log probabilities instead of the raw small values.
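A small demonstration of why log probabilities help; the 500 identical likelihoods of 0.01 are, of course, invented:

```python
import numpy as np

# Multiplying many probabilities in (0, 1) underflows to 0.0 in float64,
# while summing their logs stays comfortably representable.
probs = np.full(500, 0.01)      # 500 hypothetical likelihoods of 0.01 each
print(np.prod(probs))           # 0.0  -> underflow
print(np.sum(np.log(probs)))    # about -2302.6, no underflow
```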
- High Bias --> Underfitting
- High Variance --> Overfitting
alpha in Laplace smoothing
- When alpha is 0:
- for rare words in the training data, a (possibly tiny) probability is computed
- but if those rare words are removed from the training data, a probability of 0 is returned
- so a small change in the data causes a large change in the model's predictions
- this is high variance and results in overfitting
- When alpha is very large, e.g., alpha = 10,000:
- the likelihood of every word is approximately 1/2 (half)
- every prediction tends towards 1/2, so the model cannot distinguish between the classes
- this is high bias and results in underfitting
Therefore, the bias-variance trade-off depends on the value of alpha.
- Hyperparameter tuning of alpha is done using cross-validation (see the sketch below)
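A minimal sketch of tuning alpha with cross-validation in scikit-learn; the tiny review dataset is invented just to make the example runnable:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled reviews (1 = positive, 0 = negative)
texts = ["great phenomenal movie", "terrific acting", "great plot", "phenomenal cast",
         "terrible boring movie", "worst plot ever", "boring acting", "terrible cast"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
param_grid = {"multinomialnb__alpha": [0.001, 0.01, 0.1, 1, 10, 100]}

# Cross-validated grid search over alpha (cv=2 only because the toy dataset is tiny)
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```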
- In NB, the likelihood of every word is computed
- Sort the word likelihoods in descending order and pick the top "n" words as the most important features
- For the predicted class, these "n" words (e.g., word1, word2, ..., wordn) can be given as evidence
- Example: classification of positive and negative reviews
- words such as "phenomenal", "great" and "terrific" have high likelihoods for the positive class
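A rough sketch of reading feature importance off a trained scikit-learn MultinomialNB via its feature_log_prob_ attribute; the four toy reviews are invented:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great phenomenal movie", "terrific acting", "terrible boring movie", "worst plot ever"]
labels = [1, 1, 0, 0]   # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

words = vectorizer.get_feature_names_out()
n = 3
for class_index, class_label in enumerate(model.classes_):
    # feature_log_prob_[c, w] is log P(word w | class c); sort descending, keep top n
    top = np.argsort(model.feature_log_prob_[class_index])[::-1][:n]
    print(class_label, [words[i] for i in top])
```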
- The class prior of the majority/dominating class gives it an advantage when the posterior probabilities are compared
- Hence the majority class tends to be predicted at prediction time
- NB is therefore affected by imbalanced data
- Upsampling
- Downsampling
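One possible way to upsample the minority class, sketched with sklearn.utils.resample on invented data (alternatively, MultinomialNB accepts a class_prior argument to override the priors learned from imbalanced data):

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 90 samples of class 0, 10 samples of class 1
X = np.random.rand(100, 5)
y = np.array([0] * 90 + [1] * 10)

# Upsample the minority class with replacement until it matches the majority class
X_up, y_up = resample(X[y == 1], y[y == 1], replace=True, n_samples=90, random_state=42)

X_balanced = np.vstack([X[y == 0], X_up])
y_balanced = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_balanced))  # [90 90]
```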
- Words that occur very rarely in the training data can be treated as outliers
- When such an outlier (unseen or rare word) shows up at test time, Laplace smoothing takes care of it
- Ignore rare words, e.g., by dropping words below a minimum document frequency (sketched below)
- Use Laplace smoothing
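One simple way to ignore rare words is CountVectorizer's min_df parameter; the three toy documents here are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 drops any word that appears in fewer than 2 documents
vectorizer = CountVectorizer(min_df=2)
texts = ["great great movie", "great movie acting", "zzyzx once watched"]
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())  # rare words such as "zzyzx" are dropped
```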
- Text data: missing values do not really arise for text data
- Categorical features: treat "NAN" as a category of its own
- Numerical features: impute with the mean, median, etc.
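A short sketch of these two imputation strategies with pandas and scikit-learn's SimpleImputer; the tiny DataFrame is invented:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "colour": ["red", None, "blue"],   # categorical feature with a missing value
    "price": [10.0, None, 30.0],       # numerical feature with a missing value
})

# Categorical: treat the missing value as its own "NAN" category
df["colour"] = df["colour"].fillna("NAN")

# Numerical: impute with the mean (median is another common choice)
df[["price"]] = SimpleImputer(strategy="mean").fit_transform(df[["price"]])
print(df)
```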
- NB supports multi-class classification.
- It computes the posterior probability for every class and predicts the class with the maximum posterior
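A minimal illustration with scikit-learn's GaussianNB on the 3-class iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)   # 3 classes
model = GaussianNB().fit(X, y)

# predict_proba returns one posterior probability per class;
# predict simply picks the class with the maximum posterior
print(model.predict_proba(X[:1]).round(3))
print(model.predict(X[:1]))
```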
- NB does text classification, i.e., it handles high-dimensional data
- So NB is able to handle large-dimensional data
Note: make sure to use log probabilities in order to avoid numerical underflow/stability issues
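A small sketch with scikit-learn's MultinomialNB, which accepts the sparse high-dimensional matrices produced by CountVectorizer and exposes log probabilities directly via predict_log_proba (the toy reviews are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great phenomenal movie", "terrific acting", "terrible boring movie", "worst plot ever"]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(texts)   # sparse matrix, one dimension per word
model = MultinomialNB().fit(X, labels)

# Log posteriors avoid the underflow that multiplying raw probabilities would cause
print(model.predict_log_proba(X[:1]))
```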
- When the conditional independence assumption holds, NB performs well
- When the assumption is violated, NB performance degrades
- Even when the assumption holds only partially, NB often still works fairly well
- email/review classification: high-dimensional data
- NB works well, and
- NB is the baseline/benchmark model
- Not much memory is needed,
- because only the prior and likelihood probabilities are stored
- NB is all about counting (training is essentially just counting occurrences)
- Naive Bayes is extremely fast for both training and prediction, since there is no iterative optimisation: training reduces to counting.
- Naive Bayes provides a direct probabilistic prediction.
- Naive Bayes is often easy to interpret.
- Naive Bayes has few (if any) parameters to tune
- The algorithm assumes that the features are independent, which is rarely the case in practice.
- Zero frequency: if a category of a categorical variable is never seen in the training data, the model assigns that category a zero probability and a prediction cannot be made (Laplace smoothing mitigates this).
- Google Images
- Appliedai
- Ineuron
- Other Google sites