
SOTA Transformers


This list corresponds to a filtered literature search in Transformer-based Approaches.

| Title | Date | Description | Usages |
| --- | --- | --- | --- |
| BERT | 11 Oct 2018 | Breakout paper showing that a clever pretraining task can significantly improve results over other approaches on multiple downstream tasks. The unsupervised task of denoising corrupted sentences (masking out portions of a sentence and having the model predict the masked-out portions) taught the model a rich representation of words in different contexts, including sentence-level semantics. | Useful for learning representations of sentences for downstream tasks like question answering, classification, named entity recognition, etc. (sketch below) |
| GPT-2 | 14 Feb 2019 | Showed that an autoregressive generative pretraining task, instead of the denoising one BERT uses, can achieve good results without even fine-tuning the model for a downstream task such as translation. Simply training the model to predict the next word given the preceding words allowed it to learn complex relations between words, to the point of translating from one language to another without being explicitly trained to do so. The paper also showed that a big enough Transformer trained on this autoregressive objective can generate extremely coherent and interesting articles. | Useful for generating realistic text; can also learn a rich representation of sentences for other downstream tasks, though usually not as well as BERT or ELECTRA. (sketch below) |
| RoBERTa | 26 Jul 2019 | Many papers after BERT's rise to fame tried to dethrone it with newer pretraining techniques, such as XLNet, to improve performance. RoBERTa showed that most of these papers compared the models incorrectly, and that simply optimizing BERT, training it with more data and for longer, still achieved better or comparable performance to the models that claimed to have bested BERT. | A more optimized BERT, so useful for learning representations of sentences for downstream tasks like question answering, classification, named entity recognition, etc. |
| CTRL | 11 Sep 2019 | Showed that you can have more fine-grained control over a generative model like GPT-2 by introducing control codes that signal the type of output you want the model to generate. Examples include a control code that makes the model generate more formal text, such as Wikipedia or news articles, versus less formal text, such as movie reviews. They also showed that control codes can be used to teach the model to translate from one language to another. | More of a proof of concept, not extremely practical. (sketch below) |
| DistilBERT | 2 Oct 2019 | Showed that you can shrink a very large BERT model down to a much smaller one by using the large BERT model as a teacher and the smaller model as a student. This demonstrated that these models can be shrunk significantly without a huge impact on performance. | Ability to use a smaller but still very powerful model in resource-constrained contexts. (distillation-loss sketch below) |
| T5 | 23 Oct 2019 | Showed the power of control codes by using them to build a fully functional multi-task model in a supervised setting, versus the unsupervised setting GPT-2 used. A single model, with no architectural modification, achieved state-of-the-art results on multiple NLP tasks (question answering, named entity recognition, translation, document summarization, etc.) simply by using these control codes to tell it what to generate. | Use a single model for multiple tasks without any modification, using different control codes (task prefixes) to signal which task to complete. (sketch below) |
| BART | 29 Oct 2019 | Essentially combined the power of BERT in sentence understanding with the power of GPT-2 in generating extremely coherent and realistic text. Showed that this type of architecture performs extremely well on tasks such as document summarization, question answering, and translation. | Whenever the task at hand is a sequence-to-sequence task, such as translation or summarization, this model has proven very useful. (sketch below) |
| Reformer | 13 Jan 2020 | One issue with the Transformer architecture that all of the recent models use is that the attention mechanism is O(n^2) in memory, where n is the sequence length. This makes it impractical to train these Transformers on extremely long sequences such as books. The Reformer addresses this by applying locality-sensitive hashing (LSH) to the hidden representations and only performing the attention operation within groups of semantically similar embeddings, drastically reducing the work the attention mechanism has to do. This allows extremely long sequences, around 500,000 tokens (compared to the usual 512 of other models), to be used. | Can be used to learn rich semantic representations of extremely long documents for downstream tasks such as answering questions about books or categorizing very long articles. (LSH sketch below) |
| ELECTRA | 23 Mar 2020 | A substitute for BERT's pretraining scheme that is more efficient, allowing the model to learn more quickly from fewer examples. The main idea is to use a small BERT-style generator that replaces some words of a sentence with plausible but less likely alternatives; for example, "the chef cooked the meal" is transformed into "the chef ate the meal." Both are plausible, but it makes more sense for the chef to have cooked the meal than to have eaten it. A separate discriminator model is then trained to detect which tokens were replaced. | Essentially a drop-in replacement for BERT, allowing better representations of sentences for downstream tasks like question answering, classification, named entity recognition, etc. It can be trained much faster, on a single GPU in about four days, compared to BERT, which would take multiple GPUs for about a week. (sketch below) |
| Longformer | 10 Apr 2020 | Similar goal to the Reformer paper, but instead of locality-sensitive hashing, the authors introduce a new attention mechanism that is O(n) instead of O(n^2). | Can be used to learn rich semantic representations of extremely long documents for downstream tasks such as answering questions about books or categorizing very long articles. (sketch below) |
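
BERT's masked-token pretraining maps directly onto the "fill-mask" task. A minimal sketch, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint are available (both are assumptions, not part of the paper itself):

```python
from transformers import pipeline

# Predict the word hidden behind the [MASK] token, the same denoising
# objective BERT is pretrained on.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))
```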
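
GPT-2's autoregressive objective (predict the next word given the preceding ones) is exposed as plain text generation. A minimal sketch, again assuming the `transformers` library and the `gpt2` checkpoint:

```python
from transformers import pipeline

# Autoregressive generation: the model repeatedly predicts the next token.
generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture has", max_length=50, num_return_sequences=1))
```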
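
CTRL steers generation by prepending a control code to the prompt. A minimal sketch assuming the `transformers` library and the `ctrl` checkpoint (a very large model); `Wikipedia` and `Reviews` are among the control codes published with the model:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="ctrl")

# The leading token acts as a control code that sets the style of the output.
formal = generator("Wikipedia The history of machine translation", max_length=60)
informal = generator("Reviews I watched this movie last night and", max_length=60)
print(formal, informal)
```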
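
DistilBERT's teacher-student setup hinges on a distillation loss that pushes the student's output distribution toward the teacher's. A simplified PyTorch sketch of that loss (the actual DistilBERT recipe also combines it with masked-LM and cosine-embedding losses):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: the student is trained to match the (frozen) teacher's predictions.
teacher_logits = torch.randn(8, 30522)                        # BERT-base vocabulary size
student_logits = torch.randn(8, 30522, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```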
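
T5's "control codes" are plain-text task prefixes prepended to the input, so one checkpoint serves several tasks. A minimal sketch assuming the `transformers` library and the `t5-small` checkpoint; the prefixes shown are the ones used in the paper's pretraining mixture:

```python
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")

# The task prefix tells the single model which task to perform.
print(t5("translate English to German: The house is wonderful."))
print(t5("summarize: The Transformer uses self-attention to relate every "
         "token in a sequence to every other token, which removes the need "
         "for recurrence and allows far more parallel computation."))
```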
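
BART's encoder-decoder design makes it a natural fit for sequence-to-sequence tasks such as summarization. A minimal sketch assuming the `transformers` library and the `facebook/bart-large-cnn` checkpoint (a BART model fine-tuned on CNN/DailyMail summarization):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = ("The Transformer architecture replaced recurrence with self-attention, "
           "which lets models be trained in parallel on much more data. Follow-up "
           "work such as BERT, GPT-2 and BART showed that large pretrained "
           "Transformers transfer well to many downstream NLP tasks.")
print(summarizer(article, max_length=40, min_length=10))
```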
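
The core Reformer idea, attending only within buckets of similar vectors produced by locality-sensitive hashing, can be illustrated in a few lines of PyTorch. This is a deliberately simplified sketch (single hash round, shared queries/keys, no chunking), not the paper's full algorithm:

```python
import torch
import torch.nn.functional as F

def lsh_attention(x: torch.Tensor, n_buckets: int = 8) -> torch.Tensor:
    """x: (seq_len, dim). Attend only among positions hashed to the same bucket."""
    seq_len, dim = x.shape
    # Random-projection LSH: similar vectors tend to land in the same bucket.
    projections = x @ torch.randn(dim, n_buckets // 2)
    buckets = torch.argmax(torch.cat([projections, -projections], dim=-1), dim=-1)

    out = torch.zeros_like(x)
    for b in buckets.unique():
        idx = (buckets == b).nonzero(as_tuple=True)[0]
        q = k = v = x[idx]                       # shared query/key as in Reformer
        scores = (q @ k.T) / dim ** 0.5          # attention restricted to one bucket
        out[idx] = F.softmax(scores, dim=-1) @ v
    return out

out = lsh_attention(torch.randn(1024, 64))
print(out.shape)  # torch.Size([1024, 64])
```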
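
ELECTRA's discriminator is trained on replaced-token detection: given a sentence where a small generator has swapped some words, predict which tokens are original and which were replaced. A minimal sketch assuming the `transformers` library and the `google/electra-small-discriminator` checkpoint, reusing the paper's chef example:

```python
from transformers import ElectraTokenizerFast, ElectraForPreTraining

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# "ate" replaces the original "cooked"; the discriminator should flag it.
inputs = tokenizer("the chef ate the meal", return_tensors="pt")
logits = model(**inputs).logits[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, logits):
    print(f"{token:>8s}  replaced={score.item() > 0}")
```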
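
Longformer's O(n) attention combines a sliding local window with a few globally attended positions, so whole documents fit in one forward pass. A minimal sketch assuming the `transformers` library and the `allenai/longformer-base-4096` checkpoint:

```python
import torch
from transformers import AutoTokenizer, LongformerModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_document = "Attention is all you need. " * 500   # stand-in for a long article
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=4096)

# Local sliding-window attention everywhere, global attention on the first token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)   # (1, seq_len, hidden_size)
```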