Is smoothing really needed for prob calc in bayes_classifier? #12
Comments
Rephrased my question; also found some discussion here: http://stats.stackexchange.com/a/108990
Hi, it is correct that if you are evaluating a single unknown feature, the system will always pick the same class for it. In general that is the majority class, but as you point out, smoothing might change that. Laplacian smoothing is a very poor smoothing technique, but smoothing in general hinges on how much probability mass to allocate to unseen events. Not using it at test time seems to miss the point, but with a very poor smoothing algorithm you might come out ahead, yes. I hope to add Good-Turing smoothing at some point (and don't cry too much about it: www.csie.ntu.edu.tw/~b92b02053/print/good-turing-smoothing-without.pdf). By the way, you can disable smoothing by setting epsilon to zero when using the classifier (in test mode). If this answers your question, consider closing this issue, as it does not affect the implementation of the algorithms in the code.
@DrDub thanks for your reply. I set smoothing to 0.01 and kept the different training sets balanced; overall it works well. Still, looking forward to the new Good-Turing smoothing. 👍
@DrDub +1 for the PDF. It looks very interesting; I'll read it as soon as I have some free time.
Thanks for creating NaturalNode!
I am using your Bayes classifier in my project, and while looking into the implementation I found that it adds smoothing when calculating the probabilities.
This smoothing of unknown words in the test set will skew the probability towards whichever class has the fewest features. For instance:
say smoothing === 1, class A has 2 features and class B has 3; (0 + 1) / 2 is bigger than (0 + 1) / 3, so A again wins.
I understand it may be good to have smoothing on the training set, but is it really necessary for the test set? Why not just discard the tokens which are not in classFeatures[label]?