Third version of LayoutLM, introduced by Huang et al. Compared to the second version, v3 simplifies the model architecture to a single Transformer, removing the need for a CNN-based visual encoder.
The Transformer receives text and image tokens. Text tokens are obtained by OCR and are accompanied by 1D positional embeddings and 2D layout embeddings; the layout embeddings encode the bounding box of the text segment in which the token was found. In other words, tokens from the same segment share the same 2D layout embedding.
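A minimal sketch of how the text-side input could be assembled, summing the three embedding types (class name, table sizes, and the exact set of coordinate embeddings are assumptions for illustration, not taken from the paper):

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Sum of word, 1D positional, and 2D layout (segment bounding-box) embeddings."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, max_coord=1000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_pos, hidden)   # 1D position in the sequence
        # 2D layout: one table per bbox coordinate axis (x used for x0/x1, y for y0/y1)
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq) long tensor
        # bboxes:    (batch, seq, 4) long tensor with normalized coords in [0, max_coord)
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        layout = (self.x_emb(bboxes[..., 0]) + self.y_emb(bboxes[..., 1])
                  + self.x_emb(bboxes[..., 2]) + self.y_emb(bboxes[..., 3]))
        return self.word_emb(token_ids) + self.pos_emb(positions) + layout
```

Because all tokens of a segment carry that segment's bounding box in `bboxes`, their layout term comes out identical, matching the segment-level 2D embedding described above.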
Image tokens are flattened patches of the resized image, projected by a linear layer. This follows the Vision Transformer (ViT) and ViLT. (TODO)
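Roughly, the image side could look like the ViT-style patch embedding below (image size, patch size, and dimensions are illustrative):

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a resized image into patches, flatten each, and project with a linear layer."""
    def __init__(self, patch_size=16, channels=3, hidden=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(channels * patch_size * patch_size, hidden)

    def forward(self, images):
        # images: (batch, channels, height, width), height and width divisible by patch_size
        b, c, h, w = images.shape
        p = self.patch_size
        patches = images.unfold(2, p, p).unfold(3, p, p)          # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(patches)                                  # (batch, num_patches, hidden)
```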
V3 reuses the same spatial-aware self-attention mechanism introduced in v2.
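The idea of spatial-aware self-attention is to add learnable relative-position biases (for the 1D token distance and the 2D bounding-box distances) on top of the usual scaled dot-product scores. A sketch of just the score computation, assuming the bias tensors have already been looked up from bucketized relative distances (the bucketing itself follows the papers and is omitted here):

```python
def spatial_aware_attention_scores(q, k, rel_bias_1d, rel_bias_x, rel_bias_y):
    """Scaled dot-product attention scores plus relative 1D and 2D position biases.

    q, k: (batch, heads, seq, head_dim)
    each bias: (batch, heads, seq, seq), looked up from bucketized relative
    token indices / bounding-box coordinates.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return scores + rel_bias_1d + rel_bias_x + rel_bias_y
```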
The model is pretrained using 3 losses: MLM, MIM, WPA.
The Masked Language Modelling (MLM) loss, adjusted for the layout setting, masks out the word embedding of selected text tokens while leaving their 1D positional and 2D layout embeddings unmasked. The goal is to predict the masked text tokens.
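The loss itself is a standard masked cross-entropy; a sketch (argument names are assumptions):

```python
import torch.nn.functional as F

def mlm_loss(logits, target_ids, masked_positions):
    """Cross-entropy computed only over masked text tokens.

    logits:           (batch, seq, vocab) predictions from the language-model head
    target_ids:       (batch, seq) original token ids
    masked_positions: (batch, seq) bool, True where the token was masked
    """
    # Unmasked positions are set to the ignore index and do not contribute to the loss.
    targets = target_ids.masked_fill(~masked_positions, -100)
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```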
Masked Image Modelling (MIM) is the image-side counterpart of MLM. The loss masks out some image tokens and trains the model to predict, for each masked patch, the discrete visual token assigned to it by a pretrained image tokenizer (as in BEiT, which reuses the discrete VAE tokenizer from DALL-E).
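Structurally this mirrors the MLM loss, only the targets are visual token ids from the image tokenizer's codebook (again, names and shapes are illustrative):

```python
import torch.nn.functional as F

def mim_loss(patch_logits, visual_token_ids, masked_patches):
    """Cross-entropy computed only over masked image patches.

    patch_logits:     (batch, num_patches, codebook_size) predictions per patch
    visual_token_ids: (batch, num_patches) targets from the pretrained image tokenizer
    masked_patches:   (batch, num_patches) bool, True where the patch was masked
    """
    targets = visual_token_ids.masked_fill(~masked_patches, -100)
    return F.cross_entropy(patch_logits.transpose(1, 2), targets, ignore_index=-100)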
The Word-Patch Alignment (WPA) loss forces the model to align the two modalities: for each unmasked text token, the model predicts whether the image patch covering that token was masked by MIM.
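This amounts to a binary classification per text token, restricted to tokens that MLM left unmasked; a sketch under those assumptions:

```python
import torch.nn.functional as F

def wpa_loss(alignment_logits, patch_is_masked, text_is_masked):
    """Binary prediction, for each unmasked text token, of whether the image patch
    covering that token was masked by MIM.

    alignment_logits: (batch, seq) one logit per text token ("patch masked" vs not)
    patch_is_masked:  (batch, seq) float 0/1 target per text token
    text_is_masked:   (batch, seq) bool, tokens masked by MLM are excluded from the loss
    """
    per_token = F.binary_cross_entropy_with_logits(
        alignment_logits, patch_is_masked, reduction="none")
    keep = (~text_is_masked).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```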