
Is smoothing really needed for prob calc in bayes_classifier? #12

Open
jiabinf opened this issue Oct 18, 2016 · 4 comments

Comments


jiabinf commented Oct 18, 2016

Thanks for creating NaturalNode!

I am using your Bayes classifier in my project. Looking into the implementation, I found that it applies smoothing when calculating the probabilities.

This smoothing of unknown words in the test set skews the probability towards whichever class has the smallest feature total. For instance:

say smoothing === 1, class A has 2 features and class B has 3; for an unknown token, (0 + 1) / 2 is bigger than (0 + 1) / 3, so A wins.

I understand that it may be good to have smoothing for the training set, but is it really necessary for the test set? Why not just discard the tokens which are not in classFeatures[label]? (A sketch of that alternative follows the snippet below.)

    while(i--) {
        if(observation[i]) {
            // falls back to the smoothing value for tokens never seen with this label
            var count = this.classFeatures[label][i] || this.smoothing;
            // numbers are tiny, add logs rather than take product
            prob += Math.log(count / this.classTotals[label]);
        }
    }
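
For comparison, here is a minimal sketch of the no-smoothing variant I mean, assuming the same classFeatures/classTotals structure as above (illustrative only, not a patch):

    while(i--) {
        if(observation[i]) {
            var count = this.classFeatures[label][i];
            // discard tokens never seen with this label instead of smoothing them
            if(count) {
                prob += Math.log(count / this.classTotals[label]);
            }
        }
    }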
jiabinf changed the title from "About bayes_classifier#probabilityOfClass" to "Is smoothing really needed for prob calc in bayes_classifier?" on Oct 18, 2016

jiabinf commented Oct 18, 2016

Rephrased my question; also found some discussion here: http://stats.stackexchange.com/a/108990


DrDub commented Nov 3, 2016

Hi, it is correct that if you are evaluating a single unknown feature, the system will always pick the same class for it. In general that is the majority class, but as you point out, smoothing might change that.

In general, Laplacian smoothing is a very poor smoothing technique, but smoothing as such hinges on how much probability mass to allocate to unseen events. Not using it during testing seems to miss the point, though with a very poor smoothing algorithm you might come out ahead, yes. I hope to add Good-Turing smoothing at some point (and don't cry too much about it: www.csie.ntu.edu.tw/~b92b02053/print/good-turing-smoothing-without.pdf )
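
To make the probability-mass point concrete, here is a rough sketch of additive smoothing with the vocabulary size in the denominator; the function and numbers are illustrative only, not the library's code:

    // P(token | label) with additive smoothing: every token, seen or not,
    // gets (count + k) / (total + k * vocabularySize), so k reserves
    // probability mass for unseen events in proportion to vocabulary size.
    function additiveSmoothedProb(count, total, k, vocabularySize) {
        return (count + k) / (total + k * vocabularySize);
    }

    // e.g. k = 1, class A total = 2, class B total = 3, vocabulary of 10 tokens:
    // unseen token in A: (0 + 1) / (2 + 10) ≈ 0.083
    // unseen token in B: (0 + 1) / (3 + 10) ≈ 0.077
    // far milder than the 1/2 vs 1/3 skew from your example.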

By the way, you can disable smoothing by setting epsilon to zero when using the classifier (in test mode).
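
Something like the following should work, assuming the smoothing value is exposed as a property on the underlying classifier object; the exact property path may vary between versions, so treat this as an untested sketch:

    var natural = require('natural');

    var classifier = new natural.BayesClassifier();
    classifier.addDocument('cheap pills fast', 'spam');
    classifier.addDocument('meeting agenda for monday', 'ham');
    classifier.train();

    // assumed property path: zero out smoothing before classifying (test mode)
    classifier.classifier.smoothing = 0;
    console.log(classifier.classify('cheap meeting'));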

If this answers your question, consider closing this issue, as it does not affect the implementation of the algorithms in the code.


jiabinf commented Dec 27, 2016

@DrDub thanks for your reply. I set smoothing to 0.01 and keep the different training sets balanced; overall it works well.

Still, looking forward to the new Good Turing smoothing. 👍


ghost commented Mar 22, 2018

@DrDub +1 for the PDF. It looks very interesting, I'll read it as soon as I have some free time.
