SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings, Sabet et al., 2020

Paper, Tags: #nlp

Word alignments are widely used in tasks such as statistical and neural machine translation, but most approaches require parallel training data, and alignment quality degrades as that data shrinks. We propose word alignment methods that require no parallel data, leveraging multilingual word embeddings, both static and contextualized, to produce alignments. These embeddings are generated from monolingual data alone.
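The core step is building a similarity matrix between the contextualized embeddings of a source and a target sentence. Below is a minimal sketch of that step, assuming mBERT via the HuggingFace transformers library; the model name `bert-base-multilingual-cased` and the use of the final hidden layer are illustrative choices here, not necessarily the paper's exact configuration:

```python
# Sketch: embed each sentence independently with mBERT and compute a
# cosine similarity matrix between source and target subword tokens.
# Assumptions: bert-base-multilingual-cased, final hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """One contextualized vector per subword token, special tokens stripped."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return hidden[1:-1]  # drop [CLS] and [SEP]

src = embed("The house is small.")
tgt = embed("Das Haus ist klein.")

# sim[i, j] = cosine similarity between source token i and target token j
src_n = src / src.norm(dim=-1, keepdim=True)
tgt_n = tgt / tgt.norm(dim=-1, keepdim=True)
sim = (src_n @ tgt_n.T).numpy()
```

Note that no parallel data enters anywhere: the two sentences are embedded independently.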

For example, for a set of 100k parallel sentences, contextualized embeddings achieve a word alignment F1 for English-German that is more than 5% higher (absolute) than eflomal, a high-quality alignment model.

See the introduction for references on word alignment, including fast_align and eflomal, and Section 5 for the related work literature.

This paper proposes three methods to obtain alignments from the similarity matrix: ArgMax, IterMax, and Match (see the paper for details); a sketch of ArgMax follows.
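ArgMax, the simplest of the three, keeps a pair (i, j) only when each token is the other's best match. A minimal sketch over a NumPy similarity matrix (the function name is mine; the mutual-argmax rule is the one the paper describes):

```python
# ArgMax method: align (i, j) iff j is the best target for source token i
# AND i is the best source for target token j (mutual argmax).
import numpy as np

def argmax_align(sim: np.ndarray) -> set[tuple[int, int]]:
    row_best = sim.argmax(axis=1)  # best target index for each source token
    col_best = sim.argmax(axis=0)  # best source index for each target token
    return {(i, int(j)) for i, j in enumerate(row_best) if col_best[j] == i}
```

IterMax applies this idea iteratively to recover pairs the first pass misses, and Match treats alignment as a maximum-weight bipartite matching problem, which can be solved with, e.g., scipy.optimize.linear_sum_assignment.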

It also proposes two post-processing algorithms that handle null words and integrate positional information; a rough sketch of the positional idea follows.
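One plausible way to integrate positional information, shown below as an illustration of the idea rather than the paper's exact formula, is to damp the similarity of token pairs whose relative sentence positions diverge; null words can then be handled by dropping alignments whose similarity falls below a threshold:

```python
# Illustrative positional prior: penalize pairs whose relative positions
# differ. kappa is a hypothetical strength parameter, not from the paper.
import numpy as np

def positional_prior(sim: np.ndarray, kappa: float = 0.5) -> np.ndarray:
    n_src, n_tgt = sim.shape
    src_pos = np.arange(n_src)[:, None] / max(n_src - 1, 1)  # in [0, 1]
    tgt_pos = np.arange(n_tgt)[None, :] / max(n_tgt - 1, 1)
    penalty = (src_pos - tgt_pos) ** 2  # grows with positional mismatch
    return sim * (1.0 - kappa * penalty)
```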

We hypothesize that subword processing is beneficial for aligning rare words: on rare words, both eflomal and mBERT show severely decreased performance at the word level, but not at the subword level.