Paper, Tags: #machine-translation
Separating punctuation and splitting text into words or subwords has proven helpful for reducing the vocabulary size and increasing the number of examples per word, which improves translation quality.
Tokenization is more challenging for languages with no separator between words. We experiment with 5 tokenizers on 10 language pairs to assess the impact of tokenization on the final translation quality in NMT.
- SentencePiece: can be used for any language, but its models need to be trained.
- MeCab: for Japanese, based on CRFs
- Stanford Word Segmenter: for Chinese, based on CRFs
- OpenNMT tokenizer: for any language. It normalizes characters and splits Chinese tokens into words.
- Moses tokenizer: for any language. It separates punctuation from words and normalizes characters (see the usage sketch after this list).
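A minimal usage sketch of three of these tokenizers, assuming the Python wrappers sentencepiece, sacremoses (a Python port of the Moses tokenizer), and mecab-python3 (with a dictionary such as unidic-lite) are installed. The corpus file `train.en`, the model prefix, and the vocabulary size are placeholder choices, not values from the paper:

```python
import sentencepiece as spm             # SentencePiece: language-agnostic, needs training
from sacremoses import MosesTokenizer   # Python port of the Moses tokenizer
import MeCab                            # mecab-python3 wrapper (Japanese segmentation)

# SentencePiece: train a subword model on a (hypothetical) corpus file,
# then segment a sentence into subword pieces.
spm.SentencePieceTrainer.train(
    input="train.en", model_prefix="spm_en", vocab_size=8000)
sp = spm.SentencePieceProcessor(model_file="spm_en.model")
print(sp.encode("The unbelievable results.", out_type=str))

# Moses tokenizer: rule-based, separates punctuation from words.
mt = MosesTokenizer(lang="en")
print(mt.tokenize("Hello, world! It's a test."))

# MeCab: segments Japanese, which has no spaces between words;
# "-Owakati" outputs the words separated by spaces.
wakati = MeCab.Tagger("-Owakati")
print(wakati.parse("私は猫が好きです。").strip())
```

The language pairs evaluated (in both directions) were: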
- Japanese - English
- Russian - English
- Chinese - English
- German - English
- Arabic - English
- Japanese: for Ja-En, the Moses tokenizer was the best; for En-Ja, MeCab. The segmentation of the target side has a greater impact on translation quality.
- Russian: Moses was the best, but OpenNMT and SentencePiece had similar results.
- Chinese: similar to Japanese; Moses and the Stanford Word Segmenter had the best results.
- German: OpenNMT and Moses had the best results.
- Arabic: similar to Russian; Moses was the best.
Specialized, morphologically oriented tokenizers achieve the best results on the language pair they were designed for. OpenNMT and SentencePiece had the worst results in all experiments.
- OpenNMT only separates punctuation symbols from words; its translations often contain many repetitions and lack sense.
- SentencePiece learns how to segment from the training corpus, so it needs a very large corpus; its translations sometimes contain repetitions and miss some words.
The systems using SentencePiece and OpenNMT produced wrong translations. The system using SentencePiece translated all the words from the source, but its output is not grammatically correct.
Tokenization has a great impact on translation quality, with gains of up to 12 points of BLEU and 15 points of TER. We also observe that there is no single best tokenizer: each one produced the best results for certain language pairs.
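As a reference for how such score differences can be computed (this is not the paper's exact evaluation setup), a small sketch using the sacrebleu library with placeholder hypotheses and references:

```python
from sacrebleu.metrics import BLEU, TER

# Placeholder system outputs and references (one string per sentence).
hypotheses = ["the cat sat on the mat .", "he reads books ."]
references = [["the cat is sitting on the mat .", "he is reading books ."]]

print(BLEU().corpus_score(hypotheses, references))  # BLEU: higher is better
print(TER().corpus_score(hypotheses, references))   # TER is an error rate: lower is better
```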
Future work: evaluate the relation between corpus size and the quality yielded by SentencePiece, which learns how to segment from each language's training corpus.
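A possible sketch for that experiment, assuming a hypothetical corpus file `train.en` and arbitrary subset sizes and vocabulary size: train SentencePiece on increasingly large subsets of the corpus and compare how the same sentence gets segmented.

```python
import io
import itertools
import sentencepiece as spm

CORPUS = "train.en"   # hypothetical corpus file
SAMPLE = "The unbelievable results were reported yesterday."

for n_lines in (10_000, 100_000, 1_000_000):   # arbitrary subset sizes
    model = io.BytesIO()
    with open(CORPUS, encoding="utf-8") as f:
        # Train only on the first n_lines sentences of the corpus.
        spm.SentencePieceTrainer.train(
            sentence_iterator=itertools.islice(f, n_lines),
            model_writer=model,
            vocab_size=8000)                    # arbitrary vocabulary size
    sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
    # Compare how segmentation granularity changes with training size.
    print(n_lines, sp.encode(SAMPLE, out_type=str))
```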