NewAlexandria edited this page Aug 14, 2016 · 1 revision

Even if you have no desire to understand the probabilistic engine under the hood, Naive Bayes is easy to use, fast, and often competitive in accuracy with other classifiers. Using it is a two-step process:

  1. Train the classifier by providing it with sets of tokens (e.g., words) accompanied by a class (e.g., 'SPAM')
  2. Run the trained classifier on unclassified (i.e., unlabelled) tokens and it will predict a class

Quickstart

gem install nbayes

Training

After that, it's time to begin training the classifier:

# load the gem and create a new classifier instance
require 'nbayes'
nbayes = NBayes::Base.new
# train it - notice split method used to tokenize text (more on that below)
nbayes.train( "You need to buy some Viagra".split(/\s+/), 'SPAM' )
nbayes.train( "This is not spam, just a letter to Bob.".split(/\s+/), 'HAM' )
nbayes.train( "Hey Oasic, Do you offer consulting?".split(/\s+/), 'HAM' )
nbayes.train( "You should buy this stock".split(/\s+/), 'SPAM' )

Finally, let's use it to classify a document:

# tokenize message
tokens = "Now is the time to buy Viagra cheaply and discreetly".split(/\s+/)
result = nbayes.classify(tokens)
# print likely class (SPAM or HAM)
p result.max_class
# print probability of message being SPAM
p result['SPAM']
# print probability of message being HAM
p result['HAM']
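
The split(/\s+/) calls above keep case and punctuation attached, so "spam," and "Bob." are treated as different tokens from "spam" and "bob". Since tokenization is left to you, here is a minimal sketch of a normalizing tokenizer. The `tokenize` helper is hypothetical and not part of nbayes; the regular expression is just one reasonable choice for English text.

```ruby
# Hypothetical helper, not part of the nbayes gem.
# Lowercases the text, then extracts runs of letters, digits, and apostrophes,
# which drops surrounding punctuation and whitespace.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

tokens = tokenize("This is not spam, just a letter to Bob.")
# tokens is ["this", "is", "not", "spam", "just", "a", "letter", "to", "bob"]
```

The resulting array can be passed to train or classify in place of the split(/\s+/) output, as long as the same tokenizer is used for both training and classification.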

But that's not all! I'm claiming that this is a full-featured Naive Bayes implementation, so I had better back that up with information about all the goodies. Here we go:

Features

  1. Works with all types of tokens, not just text. Of course, because of this, we leave tokenization up to you.
  2. Disk-based persistence
  3. Allows the prior distribution on classes to be assumed uniform (optional)
  4. Outputs probabilities for every class, instead of just the class with the highest probability
  5. Customizable constant value for Laplacian smoothing
  6. Optional and customizable purging of low-frequency tokens (for performance)
  7. Optional binarized mode to reduce the impact of repeated words
  8. Uses log probabilities to avoid underflow
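
To make a few of these features concrete, here is a small self-contained sketch of how Laplacian smoothing, log probabilities, binarized mode, and a uniform prior typically fit together in a Naive Bayes classifier. It is an illustration only: `TinyBayes` and its option names are hypothetical and this is not the nbayes source code, so the gem's internals and exact results may differ.

```ruby
# Illustrative sketch only; TinyBayes is a hypothetical class, not nbayes code.
class TinyBayes
  def initialize(k: 1, binarized: false, assume_uniform: false)
    @k = k                            # Laplacian smoothing constant
    @binarized = binarized            # count each token once per document
    @assume_uniform = assume_uniform  # ignore class frequencies in the prior
    @counts = Hash.new { |h, c| h[c] = Hash.new(0) } # class => token => count
    @docs = Hash.new(0)               # class => number of training documents
    @vocab = {}                       # every token seen during training
  end

  def train(tokens, klass)
    tokens = tokens.uniq if @binarized
    tokens.each do |t|
      @counts[klass][t] += 1
      @vocab[t] = true
    end
    @docs[klass] += 1
  end

  # Returns a hash of class => probability; the values sum to 1.
  def classify(tokens)
    tokens = tokens.uniq if @binarized
    total_docs = @docs.values.sum.to_f
    log_scores = @docs.keys.to_h do |klass|
      prior = @assume_uniform ? 1.0 / @docs.size : @docs[klass] / total_docs
      denom = (@counts[klass].values.sum + @k * @vocab.size).to_f
      # Summing logs instead of multiplying probabilities avoids underflow.
      score = Math.log(prior)
      tokens.each do |t|
        score += Math.log((@counts[klass].fetch(t, 0) + @k) / denom)
      end
      [klass, score]
    end
    # Normalize in log space: subtract the max score before exponentiating.
    max = log_scores.values.max
    weights = log_scores.transform_values { |s| Math.exp(s - max) }
    sum = weights.values.sum
    weights.transform_values { |w| w / sum }
  end
end

nb = TinyBayes.new
nb.train(%w[buy viagra now], 'SPAM')
nb.train(%w[letter to bob], 'HAM')
result = nb.classify(%w[buy viagra])
# result['SPAM'] is about 0.8 and result['HAM'] about 0.2
```

With k set to 1, a token never seen in a class still gets a small non-zero probability, so a single unfamiliar word cannot drive a class's score to zero.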