SLTev is a tool for comprehensive evaluation of (simultaneous) spoken language translation.
- What is SLTev?
- Requirements
- File-Naming-Convention
- Package Overview
- Evaluating
- Evaluating on elitr-testset
- Evaluating with Your Custom Reference Files
- Parsing index files
- Terminology and Abbreviations
- CREDITS
SLTev is an open-source tool for assessing the quality of spoken language translation (SLT) in a comprehensive way. Based on a timestamped golden transcript and a reference translation into a target language, SLTev reports the quality, delay and stability of a given SLT candidate output.
SLTev can also evaluate the intermediate steps alone: the output of automatic speech recognition (ASR) and machine translation (MT).
You can see our short presentation at EACL 2021 (System Demonstrations) here: https://slideslive.com/38954658
Full details are in the paper (BibTeX below): https://www.aclweb.org/anthology/2021.eacl-demos.9
- python3.6 or higher
- some pip-installed modules:
  - sacrebleu, sacremoses
  - gitpython, gitdir, filelock
- mwerSegmenter
Depending on whether your system produces spoken language translation (SLT) or just speech recognition (ASR) output, use the following naming templates for your input and output files.
- <file-name> . <language> . <OSt/OStt>
  - e.g. kaccNlwi6lUCEM.en.OSt, kaccNlwi6lUCEM.cs.OStt
- <file-name> . <source-language> . <target-language> . <align>
  - e.g. kaccNlwi6lUCEM.en.de.align
- <file-name> . <source-language> . <target-language> . <slt/mt>
  - e.g. kaccNlwi6lUCEM.en.de.slt, kaccNlwi6lUCEM.cs.en.mt
- <file-name> . <source-language> . <source-language> . <asr/asrt>
  - e.g. kaccNlwi6lUCEM.en.en.asr
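The templates above can be checked mechanically before running an evaluation. The following is a minimal sketch and not part of SLTev: the function name `parse_sltev_name` and the keys of the returned dictionary are assumptions made for illustration. It reads the suffixes from the right, so a file name that itself contains dots is still handled.

```python
def parse_sltev_name(filename):
    """Split a file name that follows the naming templates above.

    Hypothetical helper for illustration; SLTev itself may parse names differently.
    """
    parts = filename.split(".")
    kind = parts[-1]

    # <file-name>.<language>.<OSt/OStt>
    if kind in ("OSt", "OStt"):
        if len(parts) < 3:
            raise ValueError(f"expected <file-name>.<language>.{kind}: {filename}")
        return {"name": ".".join(parts[:-2]), "language": parts[-2], "kind": kind}

    # <file-name>.<source-language>.<target-language>.<align/slt/mt/asr/asrt>
    if kind in ("align", "slt", "mt", "asr", "asrt"):
        if len(parts) < 4:
            raise ValueError(f"expected <file-name>.<src>.<tgt>.{kind}: {filename}")
        source, target = parts[-3], parts[-2]
        # ASR output stays in the source language, so both codes must match.
        if kind in ("asr", "asrt") and source != target:
            raise ValueError(f"{kind} files repeat the source language: {filename}")
        return {"name": ".".join(parts[:-3]), "source": source,
                "target": target, "kind": kind}

    raise ValueError(f"unknown suffix '{kind}' in {filename}")


print(parse_sltev_name("kaccNlwi6lUCEM.en.de.slt"))
print(parse_sltev_name("kaccNlwi6lUCEM.cs.OStt"))
```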
Install the Python module (Python 3 only)
pip3 install SLTev
You can also install from source:
python3 setup.py install
- SLTev: Contains scripts for running SLTev
- sample-data: Contains sample input and output files
- test: Test files
SLTev has four evaluation modules. Each of them supports several input and candidate file types and computes several score types.
The following table shows the input, candidate, and score types for each module.
| Module | Input: OStt | Input: OSt | Input: Ref | Input: Align | Candidate: SLT | Candidate: MT | Candidate: ASRT | Candidate: ASR | Score: Delay | Score: Quality | Score: Flicker | Score: WER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SLTeval | X | | X | Optional | X | | | | X | X | X | |
| MTeval | | | X | | | X | | | | X | | |
| ASReval | X | X | | | | | X | X | X | X | X | X |
| SLTev -e | X | X | X | Optional | X | X | X | X | X | X | X | X |
SLTev also works with elitr-testset and can automatically use its growing collection of input files. Each index in elitr-testset was created for a specific domain and purpose and lists all relevant files in the documents directory of elitr-testset. When called with the -g parameter, SLTev generates a simple (flat) directory with all files belonging to an index, so that you do not have to navigate the directory structure. The first time SLTev is called this way, it clones the elitr-testset repository to the local system and copies the desired files to the output directory; subsequent calls use the local clone. Once the input files are generated, the SLTev modules can evaluate your hypothesis files against them.
SLTev works best if you want to evaluate your system on files provided in elitr-testset (https://github.com/ELITR/elitr-testset).
The procedure is simple:
- Choose an "index", i.e. a subset of files that you want to test on, here: https://github.com/ELITR/elitr-testset/tree/master/indices We illustrate the rest with SLTev-sample as the index.
- Ask SLTev to provide you with the current version of input files:
SLTev -g SLTev-sample --outdir my-evaluation-run-1
# To use your existing checkout of elitr-testset, add -T /PATH/TO/YOUR/elitr-testset
# To populate elitr-testset links, add ELITR_CONFIDENTIAL_PASSWORD=<password> before SLTev,
# e.g.: ELITR_CONFIDENTIAL_PASSWORD=myPass SLTev -g SLTev-sample --outdir my-evaluation-run-1
- Run your models on the files in my-evaluation-run-1 and put the outputs into the same directory, with filename suffixes as described above.
- Run SLTev to get the scores:
SLTev -e my-evaluation-run-1/
# To aggregate scores instead of producing score files, add --aggregate
# To reduce the number of scores, add --simple
To evaluate a hypothesis with custom files, you can use the MTeval, SLTeval, and ASReval commands as follows.
Each of them takes a list of input file paths (-i or --input) and a list of the formats of those files, in the same order (-f or --file-formats). The file formats can be chosen from the following items:
- ost: original speech transcribed, i.e. the golden transcript
- ref: reference translation
- ostt: timestamped golden transcript
- slt: timestamped online MT hypothesis, with partial outputs
- mt: finalized MT hypothesis (i.e. one segment per line; segmentation can differ from the reference one)
- align: word alignment files (output of MGIZA)
- asrt: timestamped ASR hypothesis, with partial outputs
- asr: finalized ASR hypothesis (i.e. one segment per line; segmentation can differ from the golden one)
Please note that a candidate file must be listed either before or after all of its input files, never between them. In the following examples, A and B are correct and C is not.
A) SLTeval -i slt_path ostt_path ref_path -f slt ostt ref
B) SLTeval -i ostt_path ref_path slt_path -f ostt ref slt
C) SLTeval -i ostt_path slt_path ref_path -f ostt slt ref
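The ordering rule can be expressed as a small check on the list passed to -f. This is a sketch for illustration only, not SLTev's own validation: the name `candidate_position_ok` is hypothetical, and the sketch assumes the list contains exactly one candidate file.

```python
# Candidate formats as listed above; all other formats are input files.
CANDIDATE_FORMATS = {"slt", "mt", "asrt", "asr"}


def candidate_position_ok(formats):
    """Return True if the single candidate is the first or the last entry."""
    positions = [i for i, fmt in enumerate(formats) if fmt in CANDIDATE_FORMATS]
    if len(positions) != 1:
        raise ValueError("this sketch handles exactly one candidate file")
    return positions[0] in (0, len(formats) - 1)


print(candidate_position_ok(["slt", "ostt", "ref"]))  # example A: True
print(candidate_position_ok(["ostt", "ref", "slt"]))  # example B: True
print(candidate_position_ok(["ostt", "slt", "ref"]))  # example C: False
```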
To evaluate the output of a machine translation system without any timing information, use the following command.
Note that SLTev is not intended for the basic case where MT output segments correspond 1-1 to the reference segments; SLTev will always resegment in some way.
MTeval -i file1 file2 ... -f file1_format file2_format ...
# To reduce the number of scores, add --simple
Demo example:
git clone https://github.com/ELITR/SLTev.git
cd SLTev
MTeval -i sample-data/sample.en.cs.mt sample-data/sample.cs.OSt -f mt ref
Should give you output like this:
Evaluating the file sample-data/sample.en.cs.mt in terms of translation quality against sample-data/sample.cs.OSt
P ... considering Partial segments in delay and quality calculation (in addition to Complete segments)
T ... considering source Timestamps supplied with MT output
W ... segmenting by mWER segmenter (i.e. not segmenting by MT source timestamps)
A ... considering word alignment (by GIZA) to relax word delay (i.e. relaxing more than just linear delay calculation)
------------------------------------------------------------------------------------------------------------
-- TokenCount reference1 37
avg TokenCount reference* 37
-- SentenceCount reference1 4
avg SentenceCount reference* 4
tot sacreBLEU docAsAWhole 32.786
avg sacreBLEU mwerSegmenter 25.850
Spoken language translation evaluates "machine translation in time". So a time-stamped MT output (slt) is compared with the reference translation (non-timed, ref) and the timing of the golden transcript (ostt).
SLTeval -i file1 file2 ... -f file1_format file2_format ...
# To reduce the number of scores, add --simple
Demo example:
# get sample-data as in the MT example above
SLTeval -i sample-data/sample.en.cs.slt sample-data/sample.cs.OSt sample-data/sample.en.OStt -f slt ref ostt
Should give you:
Evaluating the file sample-data/sample.en.cs.slt in terms of translation quality against sample-data/sample.cs.OSt
...
tot Delay PW 336.845
...
tot Flicker count_changed_content 23
...
tot sacreBLEU docAsAWhole 32.786
...
In basic speech recognition evaluation, timing is ignored. For this type of evaluation, use the following command and provide ASR output (asr) and the golden transcript without timestamps (ost):
ASReval -i file1 file2 ... -f file1_format file2_format ...
# To reduce the number of scores, add --simple
Demo example:
# get sample-data as in the MT example above
ASReval -i sample-data/sample.en.en.asr sample-data/sample.en.OSt -f asr ost
Should give you:
Evaluating the file sample-data/sample.en.en.asr in terms of WER score against sample-data/sample.en.OSt
-------------------------------------------------------------
L ... lowercasing
P ... removing punctuation
C ... concatenating all sentences
W ... using mwersegmemter
M ... using Moses tokenizer
-------------------------------------------------------------
LPC 0.265
LPW 0.274
WM 0.323
Here we learn that the WER score (lower is better) for this sample file varies between 0.265 and 0.323 depending on the pre-processing technique. In ASR research, the most common pre-processing strategy is what we call LPW, i.e. lowercase, remove punctuation and use mWERsegmenter to mimic the segmentation of the reference transcript. If we consider casing and punctuation (labelled WM), the score naturally gets worse.
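To see why pre-processing changes the score, here is a minimal word error rate sketch. It is not SLTev's implementation: it uses neither mWERsegmenter nor the Moses tokenizer, it simply treats each text as one whitespace-separated word sequence (closest in spirit to the "C" setting), and the function names and example sentences are assumptions made for illustration.

```python
import string


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and strip ASCII punctuation (the 'L' and 'P' steps)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference, hypothesis, normalized=True):
    """Edit distance divided by the number of reference words."""
    if normalized:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "Hello, world. This is a test."
hypothesis = "hello world this is test"
print(round(wer(reference, hypothesis, normalized=True), 3))   # 0.167
print(round(wer(reference, hypothesis, normalized=False), 3))  # 0.833
```

With normalization only the missing word "a" counts as an error; without it, every word that differs in casing or attached punctuation counts as well.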
ASRT is like SLT but in the source language, i.e. evaluating the time-stamped output of an ASR system (asrt) against the golden transcript, which has to be provided twice: without timestamps (ost) and with timing and partial segments (ostt). All the files are in the same language and the ost file must have the exact same number of segments as there are "C"omplete segments in the ostt file.
ASReval -i file1 file2 ... -f file1_format file2_format ...
# To reduce the number of scores, add --simple
Demo example:
ASReval -i sample-data/sample.en.en.asrt sample-data/sample.en.OSt sample-data/sample.en.OStt -f asrt ost ostt
If you have a file with multiple documents, you can use the SLTev modules with the --docs parameter for evaluation. You need to add a separation token to the input and candidate files to separate documents (default is ###docSpliter###). Please note that all input and candidate files must contain the separation token and must contain the same number of documents. Also, all documents in a multi-document file should be in the same language.
SLTeval/ASReval/MTeval -i file1 file2 ... -f file1_format file2_format ... --docs
# To reduce the number of scores, add --simple
# To use multi-docs evaluation, add --docs
# To use your separation token, add --splitby "YOURTOKEN" (default is "###docSpliter###")
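Before running a multi-document evaluation, it can help to confirm that every file contains the same number of documents. The following is a sketch under stated assumptions, not SLTev's own logic: the function names are hypothetical, and dropping empty documents (for example, when a file starts or ends with the token) is an assumption of this sketch.

```python
DEFAULT_SEPARATOR = "###docSpliter###"  # default token named above


def split_documents(text, separator=DEFAULT_SEPARATOR):
    """Split file content into documents, dropping empty pieces."""
    return [doc.strip() for doc in text.split(separator) if doc.strip()]


def check_same_document_count(texts, separator=DEFAULT_SEPARATOR):
    """Return the shared document count, or raise if the files disagree."""
    if not texts:
        raise ValueError("no file contents given")
    counts = [len(split_documents(text, separator)) for text in texts]
    if len(set(counts)) != 1:
        raise ValueError(f"document counts differ: {counts}")
    return counts[0]


reference = "first reference\n###docSpliter###\nsecond reference\n"
candidate = "first hypothesis\n###docSpliter###\nsecond hypothesis\n"
print(check_same_document_count([reference, candidate]))  # 2
```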
- *.asrt and *.slt files have timestamps; *.mt and *.asr files do not.
- To use the MTeval, SLTeval, and ASReval commands, you do not need to follow the naming templates; it is the -f parameter that specifies the use of each file.
- You can evaluate several hypotheses at once. You can also give the format list in short form, stating the repeating pattern only once. For example, the following commands are equivalent:
MTeval -i file1 hypo1 file2 hypo2 -f ref mt ref mt
OR
MTeval -i file1 hypo1 file2 hypo2 -f ref mt
- You can pass the input paths through a pipe instead of the -i parameter. For example, the following commands are equivalent:
MTeval -i file1 hypo1 file2 hypo2 -f ref mt
OR
echo "file1 hypo1" | MTeval -f ref mt
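The short form of the format list shown above can be pictured as repeating the pattern until every input path has a format. The sketch below illustrates that reading of the example; how SLTev expands the list internally is not stated here, and the function name `expand_formats` is hypothetical.

```python
from itertools import cycle, islice


def expand_formats(input_paths, formats):
    """Repeat the format pattern so that each input path gets one format."""
    if not formats or len(input_paths) % len(formats) != 0:
        raise ValueError("the number of inputs must be a multiple of the pattern length")
    return list(islice(cycle(formats), len(input_paths)))


paths = ["file1", "hypo1", "file2", "hypo2"]
for path, fmt in zip(paths, expand_formats(paths, ["ref", "mt"])):
    print(path, fmt)
```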
See SLTev/index_parser.py for a detailed description. Structure of the index file:
# SRC -> *.<EXTENSION>
# REF -> *.<EXTENSION>
# ALIGN -> *.<EXTENSION>
PATH_TO_DIRECTORY
PATH_TO_ANOTHER_DIRECTORY_WITH_SAME_EXTENSIONS
# SRC -> *.<EXTENSION>
# REF -> *.<EXTENSION>
PATH_TO_DIRECTORY_WITH_DIFFERENT_EXTENSIONS
SRC and REF annotations are mandatory. Specifying a SRC annotation "clears" the rest of the annotations.
Usage:
SLTIndexParser path_to_index_file path_to_dataset
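The index structure described above can be read with a few lines of code. This is an illustrative sketch only; SLTev/index_parser.py remains the authoritative implementation. The function name `parse_index`, the returned data shape, the treatment of other `#` lines as comments, and the file patterns in the example are all assumptions.

```python
import re

# Matches annotation lines such as "# SRC -> *.en.OSt".
ANNOTATION = re.compile(r"^#\s*(\w+)\s*->\s*(\S+)\s*$")


def parse_index(lines):
    """Return a list of (directory, annotations) pairs from index file lines."""
    annotations = {}
    entries = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        match = ANNOTATION.match(line)
        if match:
            key, pattern = match.groups()
            if key == "SRC":
                annotations = {}  # a new SRC clears all earlier annotations
            annotations[key] = pattern
            continue
        if line.startswith("#"):
            continue  # assumption: any other '#' line is a comment
        if "SRC" not in annotations or "REF" not in annotations:
            raise ValueError(f"SRC and REF must be set before directory '{line}'")
        entries.append((line, dict(annotations)))
    return entries


example = """\
# SRC -> *.en.OSt
# REF -> *.cs.OSt
# ALIGN -> *.en.cs.align
dir_a
dir_b
# SRC -> *.en.OSt
# REF -> *.de.OSt
dir_c
"""
for directory, annotations in parse_index(example.splitlines()):
    print(directory, annotations)
```

In this example, dir_a and dir_b share the SRC, REF and ALIGN patterns, while dir_c only has SRC and REF because the second SRC line cleared the earlier ALIGN annotation.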
- OSt ... original speech manually transcribed (i.e. golden transcript)
- OStt ... original speech manually transcribed with word-level timestamps
- mt ... the unrevised output of text-based translation; the source of MT can be .asr (machine-transcribed OS) or .OSt (human-transcribed OS)
- slt ... timestamped online MT hypothesis, i.e. the output of an MT system run in online mode, with timestamps recorded
- asr ... the unrevised output of a speech recognition system
- asrt ... the unrevised output of a speech recognition system; timestamped at the word level
If you use SLTev, please cite the following:
@inproceedings{ansari-etal-2021-sltev,
title = "{SLTEV}: Comprehensive Evaluation of Spoken Language Translation",
author = "Ansari, Ebrahim and
Bojar, Ond{\v{r}}ej and
Haddow, Barry and
Mahmoudi, Mohammad",
booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.eacl-demos.9",
pages = "71--79",
}