Paper, Tags: #machine-translation
Separating punctuation and splitting text into words or subwords has proven helpful for reducing the vocabulary size and increasing the number of examples per word, which improves translation quality.
Tokenization is more challenging for languages with no separator between words. We experiment with 5 tokenizers on 10 language pairs to assess the impact of tokenization on the final translation quality in NMT.
- SentencePiece: can be used for any language, but its models need to be trained.
- MeCab: for Japanese, based on CRFs
- Stanford Word Segmenter: for Chinese, based on CRFs
- OpenNMT tokenizer: for any language. It normalizes characters and splits Chinese tokens into words.
- Moses tokenizer: for any language. It separates punctuation from words and normalizes characters (see the usage sketch after this list).
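A minimal usage sketch of three of these tokenizers, assuming the Python wrappers sentencepiece, sacremoses (a Python port of the Moses tokenizer), and mecab-python3 (with a dictionary such as unidic-lite) are installed. The corpus file `train.en`, the model prefix, and the vocabulary size are placeholder choices, not values from the paper:

```python
import sentencepiece as spm             # SentencePiece: language-agnostic, needs training
from sacremoses import MosesTokenizer   # Python port of the Moses tokenizer
import MeCab                            # mecab-python3 wrapper (Japanese segmentation)

# SentencePiece: train a subword model on a (hypothetical) corpus file,
# then segment a sentence into subword pieces.
spm.SentencePieceTrainer.train(
    input="train.en", model_prefix="spm_en", vocab_size=8000)
sp = spm.SentencePieceProcessor(model_file="spm_en.model")
print(sp.encode("The unbelievable results.", out_type=str))

# Moses tokenizer: rule-based, separates punctuation from words.
mt = MosesTokenizer(lang="en")
print(mt.tokenize("Hello, world! It's a test."))

# MeCab: segments Japanese, which has no spaces between words;
# "-Owakati" outputs the words separated by spaces.
wakati = MeCab.Tagger("-Owakati")
print(wakati.parse("私は猫が好きです。").strip())
```

The language pairs evaluated (in both directions) were: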
- Japanese - English
- Russian - English
- Chinese - English
- German - English
- Arabic - English
- Japanese: for Ja-En, the Moses tokenizer was the best; for En-Ja, MeCab. The segmentation of the target side has a greater impact on translation quality.
- Russian: Moses was the best, but OpenNMT and SentencePiece had similar results.
- Chinese: similar to Japanese; Moses and the Stanford Word Segmenter had the best results.
- German: OpenNMT and Moses had the best results.
- Arabic: similar to Russian; Moses was the best.
Specialized, morphologically oriented tokenizers achieve the best results on the language pair they were designed for. OpenNMT and SentencePiece had the worst results in all experiments.
- OpenNMT only separates punctuation symbols from words; its translations often contain many repetitions and lack sense.
- SentencePiece learns how to segment from the training corpus, so it needs a very large corpus; its translations sometimes contain repetitions and miss some words.
The systems using SentencePiece and OpenNMT produced wrong translations. The system using SentencePiece translated all the words from the source, but its output is not grammatically correct.
Tokenization has a great impact on translation quality, with gains of up to 12 points of BLEU and 15 points of TER. We also observe that there is no single best tokenizer: each one produced the best results for certain language pairs.
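As a reference for how such score differences can be computed (this is not the paper's exact evaluation setup), a small sketch using the sacrebleu library with placeholder hypotheses and references:

```python
from sacrebleu.metrics import BLEU, TER

# Placeholder system outputs and references (one string per sentence).
hypotheses = ["the cat sat on the mat .", "he reads books ."]
references = [["the cat is sitting on the mat .", "he is reading books ."]]

print(BLEU().corpus_score(hypotheses, references))  # BLEU: higher is better
print(TER().corpus_score(hypotheses, references))   # TER is an error rate: lower is better
```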
Future work: evaluate the relation between corpus size and the quality yielded by SentencePiece, which learns how to segment from each language's training corpus.
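A possible sketch for that experiment, assuming a hypothetical corpus file `train.en` and arbitrary subset sizes and vocabulary size: train SentencePiece on increasingly large subsets of the corpus and compare how the same sentence gets segmented.

```python
import io
import itertools
import sentencepiece as spm

CORPUS = "train.en"   # hypothetical corpus file
SAMPLE = "The unbelievable results were reported yesterday."

for n_lines in (10_000, 100_000, 1_000_000):   # arbitrary subset sizes
    model = io.BytesIO()
    with open(CORPUS, encoding="utf-8") as f:
        # Train only on the first n_lines sentences of the corpus.
        spm.SentencePieceTrainer.train(
            sentence_iterator=itertools.islice(f, n_lines),
            model_writer=model,
            vocab_size=8000)                    # arbitrary vocabulary size
    sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
    # Compare how segmentation granularity changes with training size.
    print(n_lines, sp.encode(SAMPLE, out_type=str))
```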