Juho Inkinen edited this page Jun 30, 2025 · 14 revisions

The pav backend implements a trainable dynamic ensemble that combines results from multiple projects. Subject suggestion requests to the ensemble backend are re-routed to the source projects. The results from the source projects are then re-weighted using isotonic regression, which attempts to convert raw scores to probabilities. The regression is implemented using the PAV (pool adjacent violators) algorithm available in the scikit-learn library. The regression is performed separately for each concept, and the results are combined by calculating the mean of the regressed scores (i.e. estimated probabilities) for each concept.
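To make the re-weighting step more concrete, here is a minimal pure-Python sketch of pool-adjacent-violators regression for a single concept, followed by the averaging step. It is not Annif's code: Annif relies on scikit-learn, and the names pav_fit, pav_predict and combine are hypothetical. The sketch assumes there are no tied raw scores and returns a step function, whereas a library implementation may interpolate between points.

```python
def pav_fit(scores, labels):
    """Fit a non-decreasing step function to (raw score, 0/1 label) pairs."""
    blocks = []  # each block: [sum of labels, count, highest raw score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Pool adjacent blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def pav_predict(model, score):
    """Map a raw score to the mean label of the block it falls into."""
    for top, value in model:
        if score <= top:
            return value
    return model[-1][1]  # scores above the training range get the last value


def combine(regressed_scores):
    """Mean of the regressed scores that the sources gave to one concept."""
    return sum(regressed_scores) / len(regressed_scores)


# Hypothetical training data for one concept from one source project:
# raw scores and whether the concept was a correct subject (1) or not (0).
model = pav_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 0, 1, 1])

# 0.35 falls into the pooled block covering 0.2..0.4, whose mean label is 1/3.
estimate = pav_predict(model, 0.35)

# Average with a (made-up) regressed score of 0.5 from a second source.
ensemble_score = combine([estimate, 0.5])
```

The pooling step is what turns noisy raw scores into a monotonic mapping: a higher raw score never yields a lower estimated probability.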

Note

See nn_ensemble for an alternative dynamic ensemble backend that can also be further trained during use, unlike PAV.

Example configuration

[pav-en]
name=PAV ensemble English
language=en
backend=pav
sources=tfidf-en,mllm-en
min-docs=3
limit=100
vocab=yso

The sources setting is a comma-separated list of projects whose results will be combined. Optional weights may be given like this:

sources=tfidf-en:1,mllm-en:2

This setting would give twice as much weight to results from mllm-en as to results from tfidf-en.
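The effect of the weights can be illustrated with a small sketch of a weighted mean. The helper names parse_sources and weighted_mean are hypothetical and this is not Annif's own parsing code; in particular, treating a missing weight as 1 is an assumption made for the example.

```python
def parse_sources(value):
    """Split 'a:1,b:2' into [('a', 1.0), ('b', 2.0)]."""
    sources = []
    for item in value.split(","):
        name, _, weight = item.strip().partition(":")
        # Assumption for this sketch: a source without a weight counts as 1.
        sources.append((name, float(weight) if weight else 1.0))
    return sources


def weighted_mean(scores, sources):
    """Weighted mean of per-source scores for a single concept."""
    total_weight = sum(weight for _, weight in sources)
    return sum(scores[name] * weight for name, weight in sources) / total_weight


sources = parse_sources("tfidf-en:1,mllm-en:2")

# Made-up scores that the two source projects gave to the same concept.
combined = weighted_mean({"tfidf-en": 0.3, "mllm-en": 0.9}, sources)
# (0.3 * 1 + 0.9 * 2) / 3, so the result sits closer to the mllm-en score.
```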

The min-docs setting specifies how many positive examples of a concept are required in the training data in order to create a regression model for that concept. Recommended values are between 3 and 10. When not enough positive examples are available, raw scores are used instead, similar to the basic ensemble backend.
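The min-docs rule can be sketched as follows. This is only an illustration under stated assumptions: the function names, the shape of the training data (one list of subject identifiers per document) and the representation of a model as a plain callable are all hypothetical, not Annif internals.

```python
from collections import Counter


def concepts_with_enough_examples(training_subjects, min_docs):
    """Return the concepts that occur in at least min_docs training documents."""
    counts = Counter(
        concept for subjects in training_subjects for concept in set(subjects)
    )
    return {concept for concept, n in counts.items() if n >= min_docs}


def regress_or_passthrough(concept, raw_score, models):
    """Use the concept's regression model if one exists, else the raw score."""
    model = models.get(concept)
    return model(raw_score) if model is not None else raw_score


# Made-up training data: the gold-standard subjects of four documents.
training_subjects = [["a", "b"], ["a"], ["a", "c"], ["b"]]
eligible = concepts_with_enough_examples(training_subjects, min_docs=3)

# Stand-in model for concept "a"; a real one would come from the regression.
models = {"a": lambda score: score / 2}
with_model = regress_or_passthrough("a", 0.8, models)
without_model = regress_or_passthrough("b", 0.8, models)
```

With min-docs=3, only concept "a" has enough positive examples, so scores for "b" and "c" pass through unchanged.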

Usage

Load a vocabulary:

annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl

Train the ensemble:

annif train pav-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz

Test the model with a single document:

cat document.txt | annif suggest pav-en

Evaluate a directory full of files in fulltext document corpus format:

annif eval pav-en /path/to/documents/

← Ensemble | nn_ensemble →
