Backend: PAV
The pav backend implements a trainable dynamic ensemble that combines results from multiple projects. Subject suggestion requests sent to the ensemble are routed to the source projects. The scores returned by the source projects are then re-weighted using isotonic regression, which attempts to convert raw scores into probabilities. The regression uses the PAV (pool adjacent violators) algorithm available in the scikit-learn library. It is performed separately for each concept, and the results are combined by taking the mean of the regressed scores (i.e. estimated probabilities) for each concept.
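To make the re-weighting step concrete, here is a minimal pure-Python sketch of pool-adjacent-violators regression for a single concept. It is an illustration only, not the scikit-learn implementation and not Annif's code: the function names pav_fit and pav_predict and the training data are hypothetical, and the sketch uses a plain step function without the tie handling or interpolation that a library implementation may provide.

```python
import bisect
from statistics import mean


def pav_fit(scores, labels):
    """Fit a non-decreasing step function from raw scores to probabilities."""
    blocks = []  # each block: [sum of labels, example count, highest score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Pool adjacent blocks while they violate the non-decreasing order
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    # Each entry: (highest raw score in the block, mean label of the block)
    return [(upper, total / count) for total, count, upper in blocks]


def pav_predict(model, score):
    """Look up the estimated probability for a raw score."""
    uppers = [upper for upper, _ in model]
    idx = min(bisect.bisect_left(uppers, score), len(model) - 1)
    return model[idx][1]


# Hypothetical training data for one concept: raw scores from two source
# projects, and whether the concept was a correct subject (1) or not (0).
labels = [0, 0, 1, 0, 1, 1]
model_a = pav_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], labels)
model_b = pav_fit([0.0, 0.3, 0.2, 0.1, 0.7, 0.9], labels)

# Combine the regressed scores for a new document by taking their mean
combined = mean([pav_predict(model_a, 0.35), pav_predict(model_b, 0.8)])
```

In this sketch each source project gets its own fitted model for the concept, and only the regressed values are averaged, which mirrors the description above at a very small scale.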
Note
See nn_ensemble for an alternative dynamic ensemble backend that can also be further trained during use, unlike PAV.
[pav-en]
name=PAV ensemble English
language=en
backend=pav
sources=tfidf-en,mllm-en
min-docs=3
limit=100
vocab=yso
The sources setting is a comma-separated list of projects whose results will be combined. Optional weights may be given like this:
sources=tfidf-en:1,mllm-en:2
This setting gives results from mllm-en twice the weight of results from tfidf-en.
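As a rough illustration of what such weights mean, the following sketch parses a sources value and applies the weights in a weighted mean. The helper names parse_sources and weighted_mean are hypothetical, the default weight of 1 for entries without an explicit weight is an assumption, and this page does not specify at which stage Annif itself applies the weights.

```python
def parse_sources(setting):
    """Parse 'proj-a:1,proj-b:2' into [(project_id, weight), ...]."""
    sources = []
    for item in setting.split(","):
        project_id, _, weight = item.strip().partition(":")
        # Assumption: a source without an explicit weight counts as 1
        sources.append((project_id, float(weight) if weight else 1.0))
    return sources


def weighted_mean(scores_by_project, sources):
    """Average one concept's scores, weighting each source project."""
    total_weight = sum(weight for _, weight in sources)
    weighted_sum = sum(scores_by_project[project_id] * weight
                       for project_id, weight in sources)
    return weighted_sum / total_weight


sources = parse_sources("tfidf-en:1,mllm-en:2")
# Hypothetical scores for a single concept from each source project
score = weighted_mean({"tfidf-en": 0.3, "mllm-en": 0.6}, sources)
```

With these made-up scores, the result lies closer to the mllm-en score than a plain average would, because mllm-en counts twice.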
The min-docs setting specifies how many positive examples of a concept are required in the training data in order to create a regression model for that concept. Recommended values are between 3 and 10. When not enough positive examples are available, raw scores are used instead, similar to the basic ensemble backend.
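The threshold can be pictured as a simple count over the training data. The sketch below is a hypothetical illustration, not Annif's implementation: concepts_with_models and the sample subject lists are made up, and it assumes that a document counts as one positive example for each distinct concept assigned to it.

```python
from collections import Counter


def concepts_with_models(training_subjects, min_docs=3):
    """Return the concepts that have at least min_docs positive examples."""
    counts = Counter(concept
                     for subjects in training_subjects
                     for concept in set(subjects))
    return {concept for concept, n in counts.items() if n >= min_docs}


# Hypothetical training data: the subjects assigned to four documents
training_subjects = [
    ["cats", "dogs"],
    ["cats"],
    ["cats", "birds"],
    ["dogs"],
]
modelled = concepts_with_models(training_subjects, min_docs=3)
# Concepts outside `modelled` would keep their raw scores
```

Here only "cats" reaches three positive examples, so "dogs" and "birds" would fall back to raw scores.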
Load a vocabulary:
annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
Train the ensemble:
annif train pav-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz
Test the model with a single document:
cat document.txt | annif suggest pav-en
Evaluate a directory full of files in fulltext document corpus format:
annif eval pav-en /path/to/documents/