Backend: TF IDF
Osma Suominen edited this page Nov 13, 2018 · 10 revisions
The TF-IDF backend is a baseline algorithm for automated subject indexing. The idea is to count the frequencies of terms (words) used in documents about each subject, weight those term frequencies with the TF-IDF algorithm so that rare words count for more than frequently occurring ones, and create an index for matching the term frequencies of new documents against those of specific subjects. The implementation is based on the topic modelling library Gensim.
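To make the idea concrete, here is a minimal sketch of the same steps in plain Python. It is an illustration only, not Annif's implementation, which relies on Gensim. The whitespace tokenizer, the `log(N/df) + 1` weighting and the names `build_index` and `suggest` are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def normalize(vec):
    # Scale a sparse vector to unit length so that dot products are cosines.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def build_index(subject_texts):
    # subject_texts maps each subject to the combined text of its training documents.
    counts = {s: Counter(tokenize(t)) for s, t in subject_texts.items()}
    n_subjects = len(counts)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    # Terms that occur under few subjects get a higher weight.
    idf = {t: math.log(n_subjects / df) + 1.0 for t, df in doc_freq.items()}
    index = {
        s: normalize({t: f * idf[t] for t, f in term_counts.items()})
        for s, term_counts in counts.items()
    }
    return idf, index


def suggest(text, idf, index, limit=3):
    # Weight the new document the same way, ignoring terms never seen in training.
    term_counts = Counter(tokenize(text))
    query = normalize({t: f * idf[t] for t, f in term_counts.items() if t in idf})
    scores = {
        s: sum(w * vec.get(t, 0.0) for t, w in query.items())
        for s, vec in index.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Toy training data (made up for illustration).
idf, index = build_index({
    "cats": "cats purr and cats meow",
    "dogs": "dogs bark and dogs fetch",
    "birds": "birds sing and birds fly",
})
results = suggest("my pet likes to bark and fetch", idf, index)
print(results[0][0])  # the subject "dogs" ranks first
```

The word "and" appears under every subject, so it receives the lowest weight and contributes little to any score, while "bark" and "fetch" appear under a single subject and dominate the match.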
Getting started with the TF-IDF backend is easy, since it doesn't require any algorithm-specific configuration. An example project configuration:
[tfidf-en]
name=TF-IDF English
language=en
backends=tfidf
analyzer=snowball(english)
limit=100
vocab=yso-en
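The configuration section above uses INI syntax, so its structure can be inspected with Python's standard `configparser` module. The following is a small sketch for checking the settings by hand. It assumes the section is available as a string and makes no claim about how Annif itself loads its configuration.

```python
import configparser

# The project section from above, embedded as a string for the example.
CONFIG_TEXT = """
[tfidf-en]
name=TF-IDF English
language=en
backends=tfidf
analyzer=snowball(english)
limit=100
vocab=yso-en
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# The section name is the project ID used on the command line.
project = parser["tfidf-en"]
limit = project.getint("limit")  # values are strings unless converted

print(project["backends"], project["analyzer"], limit)
```

The section name `tfidf-en` is the project identifier passed to the `annif` commands that follow.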
Load a vocabulary:
annif loadvoc tfidf-en /path/to/Annif-corpora/vocab/yso-en.tsv
Train the model:
annif train tfidf-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz
Test the model with a single document:
cat document.txt | annif analyze tfidf-en
Evaluate a directory full of files in fulltext document corpus format:
annif eval tfidf-en /path/to/documents/