This repository contains an evaluation of the accuracy of open-source Finnish part-of-speech taggers and lemmatization algorithms.
- spaCy 3.3.0
- Experimental Finnish model for spaCy 0.10.0
- FinnPos git commit 81c1f735 (Oct 2019)
- Simplemma 0.6.0
- Stanza 1.4.0
- Trankit 1.1.1
- Turku neural parser pipeline git commit 8c9425dd (Jan 2022)
- UDPipe (through spacy-udpipe 1.0.0)
- UralicNLP 1.3.0
- libvoikko 4.3.1 and Python voikko module 0.5
- Raudikko git commit 572b8104 (Jan 2022)
- FinnTreeBank 1 v1: a randomly sampled subset of about 1000 sentences
- FinnTreeBank 2: news, Sofie and Wikipedia subsets
- UD_Finnish-TDT r2.9: the test set
Install the following dependencies:
- Python 3.9
- libvoikko with Finnish morphology data files
- clang (or other C++ compiler)
- Dependencies needed to compile FinnPos and cg3
- Java 11
Set up the git submodules, create a Python 3.9 virtual environment (it must be 3.9, because the Turku parser is incompatible with more recent Python versions), and download the test data and models by running the following commands:
git submodule init
git submodule update
python3.9 -m venv venv
source venv/bin/activate
pip install wheel
pip install -r requirements.txt
# Compile FinnPos
(cd models/FinnPos/src && make -j 4)
# Compile cg3 in models/cg3
# See https://visl.sdu.dk/cg3/chunked/installation.html
# Compile Raudikko
(cd models/raudikko && ./gradlew shadowJar)
./download_data.sh
./download_models.sh
./run.sh
The numerical results will be saved in results/evaluation.csv, the POS and lemma errors made by each model in results/errorcases, and the plots in results/images.
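Once results/evaluation.csv exists, it can be sliced with the standard library. Note that the header and values in the sample below are made-up placeholders, not the file's actual contents; adjust the column names to match the real CSV.

```python
import csv
import io

# Hypothetical sample mimicking results/evaluation.csv; the real column
# names and values may differ -- check the actual header before reusing.
sample = """\
model,dataset,pos_f1,lemma_f1,duration_s
voikko,UD_Finnish-TDT,0.85,0.90,3.2
trankit,UD_Finnish-TDT,0.97,0.95,120.5
simplemma,UD_Finnish-TDT,,0.88,1.1
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Rank the models by lemmatization F1, best first.
ranked = sorted(rows, key=lambda r: float(r["lemma_f1"]), reverse=True)
print([r["model"] for r in ranked])  # → ['trankit', 'voikko', 'simplemma']
```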
Execution duration as a function of the F1 score on the concatenated data. Larger values are better on both axes. Notice that the Y-axis is on a log scale.
The execution duration is measured as a batched evaluation (a batch contains all sentences from one dataset) on a 4-core CPU. Some methods can be run on a GPU, which would most likely improve their performance, but I haven't tested that.
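The batched measurement can be sketched as follows; `tag_batch` is a hypothetical stand-in for any of the benchmarked models, not the repository's actual timing code.

```python
import time

def tag_batch(sentences):
    # Hypothetical stand-in for a real tagger; a benchmarked model
    # would be invoked here instead of this dummy tagging.
    return [[(tok, "NOUN") for tok in s.split()] for s in sentences]

# One batch contains all sentences from one dataset.
sentences = ["tämä on esimerkki"] * 1000
start = time.perf_counter()
tagged = tag_batch(sentences)
duration = time.perf_counter() - start
print(f"tagged {len(tagged)} sentences in {duration:.4f} s")
```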
Lemmatization F1 scores for the benchmarked algorithms on the test datasets.
Execution duration as a function of the POS F1 score on the concatenated data.
Note that FinnPos and Voikko do not distinguish between auxiliary and main verbs, and their scores therefore suffer by 4-5% in this evaluation because they label all AUX tokens as VERB.
Part-of-speech F1 scores for the benchmarked algorithms.
Simplemma does not include a POS tagging feature.