Directly applying a naive Bayes classifier gave 0.837 accuracy. - 2014.01.28
For example, use _RARE_ to replace all words that occur fewer than 5 times, and use _DIGITS_ to replace all words that are digits, such as 2014.
This improved the accuracy to 0.848. - 2014.01.29
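
The replacement step above could be sketched roughly as follows. This is a minimal illustration, not the original code: the function names build_vocab and normalize, the list-of-tokens input format, and the choice to check for digits before checking for rarity are all assumptions.

```python
from collections import Counter

RARE = "_RARE_"
DIGITS = "_DIGITS_"


def build_vocab(docs, min_count=5):
    """Return the set of tokens seen at least min_count times in docs.

    docs is assumed to be an iterable of token lists (hypothetical format).
    """
    counts = Counter(tok for doc in docs for tok in doc)
    return {tok for tok, c in counts.items() if c >= min_count}


def normalize(tokens, vocab):
    """Map digit tokens to _DIGITS_ and out-of-vocabulary tokens to _RARE_."""
    out = []
    for tok in tokens:
        if tok.isdigit():
            # Assumption: digits are replaced first, regardless of frequency.
            out.append(DIGITS)
        elif tok not in vocab:
            # Covers both rare training words and words never seen in training.
            out.append(RARE)
        else:
            out.append(tok)
    return out


# Toy data: "zebra" occurs once, every other token at least 5 times.
docs = [["the", "year", "2014"]] * 5 + [["the", "zebra"]]
vocab = build_vocab(docs)
print(normalize(["the", "zebra", "2014"], vocab))
# ['the', '_RARE_', '_DIGITS_']
```

The vocabulary should be built from the training data only and then reused for the test data, so that unseen test words also fall into the _RARE_ bucket.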
Such that differences of log probabilities that are too small (large in absolute value) get weakened. This improved the accuracy to 0.872. - 2014.01.30
Improved accuracy to 0.879. - 2014.01.31
Improved accuracy to 0.880. - 2014.01.31
- Adjusting the _RARE_ threshold to values other than 5 did not help.
- Adding a Chinese surname lexicon did not improve the model.
- Trying a clustering method. Currently, KMeans (k = v/10) gets 0.84 accuracy, where v is the size of the vocabulary.