In this lesson, we will apply Annif on an automated classification task, where the goal is to classify each document into a single class or category from a classification with mutually exclusive classes; in machine learning terms, this is called multiclass classification. This is different from the kind of subject indexing performed in earlier exercises, where the goal was to assign a small number of representative subjects for a document (also called multi-label classification).
Classification can be seen as a tougher task, because there is only one correct answer that the algorithm must find and there are no partially correct answers. In the library world, this kind of setting is common with library classifications such as the Dewey Decimal Classification and the Universal Decimal Classification. The classifications are often used to determine the location of books on shelves and each book needs to be placed on exactly one shelf.
In this exercise, we will use a small toy classification called 20 Newsgroups instead of a large library classification, which could include tens of thousands of classes. The 20 Newsgroups data set is a set of messages posted to twenty Usenet discussion groups dedicated to different topics - similar to mailing lists or web forums - in the early days of the Internet. This data set is often used to benchmark text classification algorithms. For more information about the data set and how it was prepared for use as an Annif corpus, see the README file of the corpus files.
Not all Annif algorithms are well suited for multiclass classification. The lexical algorithms MLLM and STWFSA rely a lot on information from the vocabulary, such as term labels and semantic relations, but typical classifications either don't have this information or it cannot be effectively used by the algorithms.
We will introduce a new algorithm called SVC (Support Vector Classification), which is a supervised learning model for classification that is based on the idea of support-vector machines. This is a relatively lightweight associative algorithm that works very well even with limited amounts of training data. We will also use the Omikuji backend for comparison, another good choice for classification tasks.
For Omikuji, we will use a Bonsai configuration (see the Omikuji exercise for details), which is somewhat heavier to train than the basic Parabel configuration but usually provides better results.
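The core idea behind the SVC backend can be sketched with scikit-learn: a linear support vector classifier trained on TF-IDF features picks exactly one class per document. This is a simplified illustration with made-up toy data, not Annif's actual implementation or configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: two "newsgroups", two documents each
train_texts = [
    "the engine and wheels of my car",
    "my car needs new brakes and tires",
    "the rocket launch to orbit was delayed",
    "astronauts aboard the orbital station",
]
train_labels = ["rec.autos", "rec.autos", "sci.space", "sci.space"]

# TF-IDF vectorizer + linear SVM: a lightweight multiclass classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

# The classifier returns exactly one class for each input document
print(model.predict(["tires and brakes for an old car"])[0])  # likely "rec.autos"
```

In a multiclass setting like this, the model always commits to a single class, unlike the multi-label subject indexing of the earlier exercises.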
Use a text editor to add new project definitions to the end of the `projects.cfg` file:
```
[20news-svc-en]
name=20 Newsgroups SVC English
language=en
backend=svc
analyzer=snowball(english)
limit=100
vocab=20news

[20news-omikuji-bonsai-en]
name=20 Newsgroups Omikuji Bonsai English
language=en
backend=omikuji
analyzer=snowball(english)
vocab=20news
cluster_balanced=False
cluster_k=100
max_depth=3
```
Check that the configuration is valid:

```
annif list-projects
```
Run this command:

```
annif load-vocab --language en 20news data-sets/20news/20news-vocab.tsv
```
The vocabulary file is small and simple: it contains one line per newsgroup, with the newsgroup URI and name. Since this is a TSV file with no language information, we need to use the `--language` option to indicate that the subject labels (the newsgroup names) are in English.
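For illustration, a vocabulary file in this format might look like the following, with the URI and label separated by a tab (the URIs here are hypothetical; the actual values are in the corpus files and described in the README):

```
<http://example.org/20news/rec.autos>	rec.autos
<http://example.org/20news/sci.space>	sci.space
```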
Run this command:

```
annif train 20news-svc-en data-sets/20news/20news-train.tsv
```
Model training should take less than a minute.
First, we can take a look at the first document from the test set using this command:

```
head -n 1 data-sets/20news/20news-test.tsv
```
It is a message that looks like this:

```
I am a little confused on all of the models of the 88-89 bonnevilles. I have heard of the LE SE LSE SSE SSEI. Could someone tell me the differences are far as features or performance. I am also curious to know what the book value is for prefereably the 89 model. And how much less than book value can you usually get them for. In other words how much are they in demand this time of year. I have heard that the mid-spring early summer is the best time to buy. <news:rec.autos>
```
The message is about cars and was posted to the `rec.autos` newsgroup, as can be seen from the tag at the end. But this is a difficult document to classify: the topic may not be entirely obvious from the text, as it doesn't directly mention cars. We can check which newsgroups the SVC algorithm suggests for this text. We can pipe the text through the `cut` command to strip away the tag at the end, leaving just the text, and pipe it directly to the `annif suggest` command:

```
head -n 1 data-sets/20news/20news-test.tsv | cut -f 1 | annif suggest 20news-svc-en
```
Is the first suggestion `rec.autos` or something else? If `rec.autos` is not the top suggestion, how close is it to the top?
We can then evaluate the model on the whole test set. Run this command:

```
annif eval 20news-svc-en data-sets/20news/20news-test.tsv
```
Evaluation should take around a minute. Check the Precision@1 score, which indicates the proportion of the algorithm's first suggestions that are considered correct; in this kind of multiclass setting, this corresponds to the accuracy of the classifier. Write down this number so you can compare it with the results of further experiments.
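To make the connection between Precision@1 and accuracy concrete, here is a small sketch with made-up gold labels and top suggestions (not output from an actual Annif run):

```python
# Precision@1: the fraction of documents whose top-ranked suggestion
# matches the gold-standard class.
gold = ["rec.autos", "sci.space", "comp.graphics", "rec.autos"]
top1 = ["rec.autos", "sci.space", "rec.autos", "rec.autos"]  # top suggestion per doc

precision_at_1 = sum(g == p for g, p in zip(gold, top1)) / len(gold)
print(precision_at_1)  # 0.75

# With exactly one gold class and one top suggestion per document,
# this is the same calculation as plain classification accuracy.
```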
Run this command:

```
annif train 20news-omikuji-bonsai-en data-sets/20news/20news-train.tsv
```
Model training should take around one minute.
We can test the output of the Omikuji project just like we did for SVC above:

```
head -n 1 data-sets/20news/20news-test.tsv | cut -f 1 | annif suggest 20news-omikuji-bonsai-en
```
Was the top suggestion correct this time? If not, how far from the top was the correct answer `rec.autos`?
Run this command:

```
annif eval 20news-omikuji-bonsai-en data-sets/20news/20news-test.tsv
```
Evaluation should take around 1 minute. Again, check the Precision@1 score and compare it with the result you got from the SVC evaluation above. Which algorithm worked better?
Finally, let's look at how bigrams can be used to improve the results.
The projects defined above relied on the default value of the `ngram` setting, which is 1. This setting affects the vectorizer, i.e. the preprocessing step that turns words into numeric vectors. By changing the `ngram` setting to 2, we can instruct the vectorizer to use bigrams (pairs of consecutive words) as well as unigrams (single words). This extracts the maximum amount of information from the relatively short texts available and thus hopefully improves classification accuracy, at the cost of a larger and heavier model.
Add this setting to both the SVC and Omikuji projects you added above:

```
ngram=2
```
Then retrain and evaluate both projects. Did you get a better result? Did it take longer and/or consume more resources?
Including bigrams can increase the size of the model quite drastically, especially for larger vocabularies and training corpora. To keep resource usage under control, we can also use the `min_df` setting. This instructs the vectorizer to ignore tokens (unigrams or bigrams) that appear in only a small number of documents in the training set, which reduces the number of features and thus the size of the model and its resource consumption.
Add this setting to both the SVC and Omikuji projects so that tokens (unigrams or bigrams) must appear in at least two documents to be considered:

```
min_df=2
```
Then retrain and evaluate both projects. How did this affect the result?
Congratulations, you've completed the classification exercise! You have performed classification using two different algorithms and compared their results.
For more information, see the documentation in the Annif wiki: