SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings, Sabet et al., 2020

Paper, Tags: #nlp

Word alignments are widely used in tasks such as statistical and neural machine translation, but most approaches require parallel training data, and alignment quality degrades as that data shrinks. We propose word alignment methods that require no parallel data, leveraging multilingual word embeddings, both static and contextualized, to produce alignments. These embeddings are generated from monolingual data alone.
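The core step is building a similarity matrix between the contextualized embeddings of a source and a target sentence. Below is a minimal sketch of that step, assuming mBERT via the HuggingFace transformers library; the model name `bert-base-multilingual-cased` and the use of the final hidden layer are illustrative choices here, not necessarily the paper's exact configuration:

```python
# Sketch: embed each sentence independently with mBERT and compute a
# cosine similarity matrix between source and target subword tokens.
# Assumptions: bert-base-multilingual-cased, final hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """One contextualized vector per subword token, special tokens stripped."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return hidden[1:-1]  # drop [CLS] and [SEP]

src = embed("The house is small.")
tgt = embed("Das Haus ist klein.")

# sim[i, j] = cosine similarity between source token i and target token j
src_n = src / src.norm(dim=-1, keepdim=True)
tgt_n = tgt / tgt.norm(dim=-1, keepdim=True)
sim = (src_n @ tgt_n.T).numpy()
```

Note that no parallel data enters anywhere: the two sentences are embedded independently.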

For example, for a set of 100k parallel sentences, contextualized embeddings achieve a word alignment F1 for English-German that is more than 5% higher (absolute) than eflomal, a high-quality alignment model.

See the introduction for references on word alignment, including fast_align and eflomal, and Section 5 for the related work literature.

This paper proposes three methods to obtain alignments from the similarity matrix: ArgMax, IterMax, and Match (see the paper for details); a sketch of ArgMax follows.
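ArgMax, the simplest of the three, keeps a pair (i, j) only when each token is the other's best match. A minimal sketch over a NumPy similarity matrix (the function name is mine; the mutual-argmax rule is the one the paper describes):

```python
# ArgMax method: align (i, j) iff j is the best target for source token i
# AND i is the best source for target token j (mutual argmax).
import numpy as np

def argmax_align(sim: np.ndarray) -> set[tuple[int, int]]:
    row_best = sim.argmax(axis=1)  # best target index for each source token
    col_best = sim.argmax(axis=0)  # best source index for each target token
    return {(i, int(j)) for i, j in enumerate(row_best) if col_best[j] == i}
```

IterMax applies this idea iteratively to recover pairs the first pass misses, and Match treats alignment as a maximum-weight bipartite matching problem, which can be solved with, e.g., scipy.optimize.linear_sum_assignment.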

It also proposes two post-processing algorithms that handle null words and integrate positional information; a rough sketch of the positional idea follows.
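One plausible way to integrate positional information, shown below as an illustration of the idea rather than the paper's exact formula, is to damp the similarity of token pairs whose relative sentence positions diverge; null words can then be handled by dropping alignments whose similarity falls below a threshold:

```python
# Illustrative positional prior: penalize pairs whose relative positions
# differ. kappa is a hypothetical strength parameter, not from the paper.
import numpy as np

def positional_prior(sim: np.ndarray, kappa: float = 0.5) -> np.ndarray:
    n_src, n_tgt = sim.shape
    src_pos = np.arange(n_src)[:, None] / max(n_src - 1, 1)  # in [0, 1]
    tgt_pos = np.arange(n_tgt)[None, :] / max(n_tgt - 1, 1)
    penalty = (src_pos - tgt_pos) ** 2  # grows with positional mismatch
    return sim * (1.0 - kappa * penalty)
```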

We hypothesize that subword processing is beneficial for aligning rare words: on rare words, both eflomal and mBERT show severely decreased performance at the word level, but not at the subword level.