Judgr (pronounced "judger") is a naïve Bayes classifier library written in Clojure that features multivariate classification, support for cross-validation, and more.
- Multivariate classification
- Biased and unbiased class probability
- Configurable Laplace Smoothing
- Configurable threshold validation
- K-fold cross-validation
- Precision, Recall, Specificity, Accuracy, and F1 score
Add the following dependency to your `project.clj` file:

```clojure
[judgr "0.3.0"]
```
The first step is to instantiate a classifier from the default settings:
```clojure
user=> (use '[judgr.core]
            '[judgr.settings])
nil
user=> (def classifier (classifier-from settings))
#'user/classifier
```
Now you can start training the classifier with `(.train! classifier item :class)`:

```clojure
(.train! classifier "How are you?" :positive)
(.train! classifier "Burn in hell!" :negative)
(.train! classifier ...)
```
If you want to train all examples of a given class at once, there's also `(.train-all! classifier items :class)`:

```clojure
(def positive-items ["How are you?" ...])
(def negative-items ["Burn in hell!" ...])

(.train-all! classifier positive-items :positive)
(.train-all! classifier negative-items :negative)
```
Or train all examples of different classes:

```clojure
(.train-all! classifier [{:item "How are you?" :class :positive}
                         {:item "Burn in hell!" :class :negative}])
```
The default classifier stores its data in memory and extracts words from the given text using a Porter stemmer. Also, items can be classified as either `:positive` or `:negative`. If your problem requires different settings, take a look at the Extending The Classifier section below.
After some training, you should be able to use the classifier to guess which class an item falls into:
```clojure
user=> (.classify classifier "Long time, no see.")
:positive
user=> (.classify classifier "Go to hell.")
:negative
```
It's also possible to get the probabilities for all classes:
```clojure
user=> (.probabilities classifier "Long time, no see.")
{:negative 0.38461539149284363, :positive 0.6153846383094788}
```
It's not trivial to measure how well the classifier generalizes to examples it has never seen. Fortunately, there's a common technique for evaluating an algorithm's performance known as cross-validation: the data is split into k folds, and the classifier is repeatedly trained on k-1 folds and evaluated on the remaining one.

The output of a k-fold cross-validation run is a confusion matrix.
```clojure
user=> (use 'judgr.cross-validation)
nil
user=> (def conf-matrix (k-fold-crossval 2 classifier))
#'user/conf-matrix
user=> conf-matrix
{:positive {:positive 102
            :negative 3}
 :negative {:positive 7
            :negative 186}}
```
This confusion matrix tells you, for each known class, how many items were predicted correctly and how many were predicted as belonging to another class. For example, of all items known to be `:positive`, 102 were flagged correctly and 3 were incorrectly flagged as `:negative`.
Although this helps, it's often more convenient to summarize performance as a single score.

Accuracy is the proportion of predictions that the classifier got right:
```clojure
user=> (accuracy conf-matrix)
144/149
```
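For the confusion matrix above, that's (102 + 186) correct predictions out of 102 + 3 + 7 + 186 = 298 items: 288/298 = 144/149, or roughly 96.6%.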
If accuracy is low, there are other metrics that can help you identify what's wrong.
Precision measures how accurate the classifier is when it predicts a specific class:
```clojure
user=> (precision :positive conf-matrix)
102/109
user=> (precision :negative conf-matrix)
62/63
```
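For `:positive`, that's the number of items correctly flagged as `:positive` divided by all items flagged as `:positive`: 102/(102 + 7) = 102/109.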
Recall is a measure of the ability of a model to select instances of a certain class from a data set. It is commonly also called Sensitivity, and corresponds to the true positive rate:
```clojure
user=> (recall :positive conf-matrix)
34/35
user=> (recall :negative conf-matrix)
186/193
user=> (sensitivity :negative conf-matrix)
186/193
```
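For `:positive`, that's the number of `:positive` items flagged correctly divided by all items that are actually `:positive`: 102/(102 + 3) = 102/105 = 34/35.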
Specificity indicates the ability of a model to identify negative results, that is, the proportion of negative instances predicted as negative:
```clojure
user=> (specificity :positive conf-matrix)
186/193
user=> (specificity :negative conf-matrix)
34/35
```
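With only two classes, the specificity of one class is simply the recall of the other; for `:positive`, that's 186/(186 + 7) = 186/193.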
The F1 score is the harmonic mean of the precision and recall of a given class:
```clojure
user=> (f1-score :positive conf-matrix)
102/107
user=> (f1-score :negative conf-matrix)
186/191
```
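Equivalently, F1 = 2 × precision × recall / (precision + recall); for `:positive`, that works out to 2 × 102 / (2 × 102 + 7 + 3) = 204/214 = 102/107.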
There are several ways to change the way the classifier works.
Change the `[:classes]` setting to the classes you want to use. For example, if you are building a spam classifier:
```clojure
(use 'judgr.settings)

(def my-settings
  (update-settings settings
                   [:classes] [:ham :spam]
                   [:classifier :default :thresholds] {:ham 1.2
                                                       :spam 2.5}))
```
Note that we also specified thresholds for the new classes.
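With those settings in place, you can create and train the classifier exactly as shown earlier. A minimal sketch, where the example items and the resulting class are made up:

```clojure
(use 'judgr.core)

(def spam-classifier (classifier-from my-settings))

(.train! spam-classifier "Meeting at 10am tomorrow" :ham)
(.train! spam-classifier "You won a FREE prize, click here!" :spam)

(.classify spam-classifier "Claim your free prize now")
;; => :spam (or :unknown, depending on the training data and thresholds)
```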
We provide simple implementations for English (default) and Brazilian Portuguese, based on the work done in Apache Lucene.
The first thing you have to do is create a type that extends the `FeatureExtractor` protocol:
```clojure
(ns your-ns
  (:use [judgr.extractor.base]))

(deftype CustomExtractor [settings]
  FeatureExtractor
  (extract-features [fe item]
    ;; Feature extraction logic here
    ))
```
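For illustration, the extraction logic could be as simple as lowercasing the text and splitting it on whitespace. This is only a sketch: the built-in extractor uses a Porter stemmer, and the exact shape judgr expects for the returned features is an assumption here:

```clojure
(require '[clojure.string :as str])

(deftype CustomExtractor [settings]
  FeatureExtractor
  (extract-features [fe item]
    ;; naive tokenization: lowercase the text and split it on whitespace
    (str/split (str/lower-case item) #"\s+")))
```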
Finally, define a new method for the `extractor-from` multimethod that knows how to create a new instance of `CustomExtractor`:
```clojure
(ns your-ns
  (:use [judgr.core]))

(defmethod extractor-from :custom [settings]
  (CustomExtractor. settings))
```
To use the new extractor, just create a new settings map with the `[:extractor :type]` setting set to `:custom`, the same key used in `defmethod`:
```clojure
user=> (use 'judgr.settings)
nil
user=> (def my-settings
         (update-settings settings
                          [:extractor :type] :custom))
#'user/my-settings
user=> (extractor-from my-settings)
#<CustomExtractor ...>
```
In-memory integration is enabled by default.
There are ready-to-use integration packages for other databases:
The procedure is similar to the one shown in the Providing Your Own Feature Extractor section. First, create a new type that extends the `FeatureDB` protocol:
```clojure
(ns your-ns
  (:use [judgr.db.base]))

(deftype CustomDB [settings]
  FeatureDB
  (add-item! [db item class]
    ;; ...
    )
  ;; Implement the other methods
  )
```
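For illustration only, `add-item!` could append the item to an in-memory atom. Everything beyond the method's signature (the `store` atom, the map shape, and the return value) is an assumption of this sketch, and the remaining `FeatureDB` methods still need real implementations:

```clojure
;; a single in-memory store, just for this sketch
(defonce store (atom []))

(deftype CustomDB [settings]
  FeatureDB
  (add-item! [db item class]
    (let [data {:item item :class class}]
      (swap! store conj data)
      data))
  ;; Implement the other methods
  )
```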
Then, define a new method for the `db-from` multimethod that knows how to create a new instance of `CustomDB`:
```clojure
(ns your-ns
  (:use [judgr.core]))

(defmethod db-from :custom [settings]
  (CustomDB. settings))
```
To use the new database layer, just create a new settings map with the `[:database :type]` setting set to `:custom`, the same key used in `defmethod`:
```clojure
user=> (use 'judgr.settings)
nil
user=> (def my-settings
         (update-settings settings
                          [:database :type] :custom))
#'user/my-settings
user=> (db-from my-settings)
#<CustomDB ...>
```
There's a default classifier implementation that should be enough for most cases since it is already fairly configurable.
If threshold validation is enabled, i.e. the `[:classifier :default :threshold?]` setting is `true`, an item will only be flagged as a class if that class's probability is at least X times the second highest probability. The threshold X for each class can be configured in the `[:classifier :default :thresholds]` setting:
```clojure
(use 'judgr.settings)

(def my-settings
  (update-settings settings
                   [:classifier :default :threshold?] true
                   [:classifier :default :thresholds] {:positive 1 :negative 2}))
```
With these settings, if the probabilities for an item are `{:positive 0.45 :negative 0.55}`, the highest probability (0.55 for `:negative`) falls short of its threshold of 2 × 0.45 = 0.90, so the item is flagged with the value defined in the `[:classifier :default :unknown-class]` setting, which is `:unknown` by default.
Smoothing is enabled by default; it's useful for dealing with unknown features without returning a flat zero probability.
You can adjust the `[:classifier :default :smoothing-factor]` setting to control the smoothing intensity, although the default value is usually good enough:
```clojure
(use 'judgr.settings)

(def my-settings
  (update-settings settings
                   [:classifier :default :smoothing-factor] 0.7))
```
Although it's not recommended, you can turn smoothing off by changing the `[:classifier :default :smoothing-factor]` setting to zero.
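In a typical naïve Bayes implementation, a smoothing factor k turns the per-feature probability into something like P(feature|class) = (count(feature, class) + k) / (count(class) + k × number-of-distinct-features), so features never seen in a class get a small non-zero probability instead of zero. Judgr's exact formula may differ slightly, but the idea is the same.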
By default, class probabilities are calculated in a biased fashion, that is, based on the number of items flagged in each class. For example, assuming smoothing is disabled, if there are no items flagged as `:negative`, then P(negative) = 0. Similarly, if 3 out of 10 items are `:negative`, then P(negative) = 3/10.
If the `[:classifier :default :unbiased?]` setting is set to `true`, the probability P(any_class) = 1/(number_of_classes):
```clojure
(use 'judgr.settings)

(def my-settings
  (update-settings settings
                   [:classifier :default :unbiased?] true))
```
### Providing Your Own Classifier
First, create a new type that extends the `Classifier` protocol:
```clojure
(ns your-ns
  (:use [judgr.classifier.base]))

(deftype CustomClassifier [settings db extractor]
  Classifier
  (train! [c item class]
    ;; ...
    )
  ;; Implement the other methods
  )
```
Then, define a new method for the `classifier-from` multimethod that knows how to create a new instance of `CustomClassifier`:
```clojure
(ns your-ns
  (:use [judgr.core]))

(defmethod classifier-from :custom [settings]
  (let [db (db-from settings)
        extractor (extractor-from settings)]
    (CustomClassifier. settings db extractor)))
```
To use the new classifier, just create a new settings map with the `[:classifier :type]` setting set to `:custom`, the same key used in `defmethod`:
```clojure
user=> (use 'judgr.settings)
nil
user=> (def my-settings
         (update-settings settings
                          [:classifier :type] :custom))
#'user/my-settings
user=> (classifier-from my-settings)
#<CustomClassifier ...>
```
If this project is useful for you, buy me a beer!
Bitcoin: bc1qtwyfcj7pssk0krn5wyfaca47caar6nk9yyc4mu
Copyright (C) Daniel Fernandes Martins
Distributed under the New BSD License. See COPYING for further details.