1️⃣Pre-training tasks(Referenced from)
-
😀Language modeling (LM)
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length $m$, it assigns a probability $P(w_1, \ldots, w_m)$ to the whole sequence.
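In the standard autoregressive (left-to-right) formulation, this joint probability factorizes by the chain rule, and the model is trained to maximize the log-likelihood of each conditional:

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}).$$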
-
😂Masked language modeling (MLM)
MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the remaining tokens. (It can be viewed as a special case of denoising autoencoding, DAE.)
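As a concrete illustration, here is a minimal sketch of BERT-style input corruption for MLM. The 15% selection rate and the 80/10/10 mask/random/keep split follow the original BERT recipe; the token ids and constants below are placeholders.

```python
import random

MASK_ID = 103        # placeholder id for the [MASK] token (model-specific)
VOCAB_SIZE = 30522   # placeholder vocabulary size
IGNORE = -100        # label value ignored by the loss

def mask_tokens(token_ids, mlm_prob=0.15):
    """Return (corrupted_input, labels) for one sequence of token ids."""
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:      # select ~15% of positions
            labels[i] = tok                 # only selected positions contribute to the loss
            r = random.random()
            if r < 0.8:                     # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                   # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```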
-
🏄♂️Enhanced masked language modeling (E-MLM):
There are multiple lines of research proposing enhanced versions of MLM to further improve on BERT. UniLM extends mask prediction to three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. XLM performs MLM on a concatenation of parallel bilingual sentence pairs, called Translation Language Modeling (TLM). SpanBERT replaces MLM with Random Contiguous Words Masking and a Span Boundary Objective (SBO) to integrate structural information into pre-training, which requires the model to predict masked spans based on the span boundaries. Besides, StructBERT introduces a Span Order Recovery task to further incorporate language structures. Another way to enrich MLM is to incorporate external knowledge.
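To make TLM concrete, the sketch below shows one plausible way a bilingual training example could be assembled: the two parallel sentences are concatenated so that masked tokens in one language can be recovered from context in the other. The special-token layout, masking rate, and label convention are illustrative assumptions, not XLM's exact preprocessing.

```python
import random

def build_tlm_example(src_ids, tgt_ids, sep_id=102, mask_id=103, mlm_prob=0.15):
    """Concatenate a parallel sentence pair and mask tokens on both sides, so the
    model can use context from the other language to recover them."""
    pair = src_ids + [sep_id] + tgt_ids          # e.g. English ids + [SEP] + French ids
    labels = [-100] * len(pair)                  # -100 = position not predicted
    for i, tok in enumerate(pair):
        if tok != sep_id and random.random() < mlm_prob:
            labels[i] = tok
            pair[i] = mask_id                    # replace with [MASK]
    return pair, labels
```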
-
😊Permuted language modeling (PLM)
PLM is a language modeling task on a random permutation of the input sequence. A permutation is randomly sampled from all possible permutations. Then some of the tokens in the permuted sequence are chosen as targets, and the model is trained to predict these targets conditioned on the rest of the tokens and the natural positions of the targets.
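A data-side sketch of this objective is given below; it only shows how a permutation and its prediction targets might be sampled. XLNet realizes this with attention masks and two-stream attention rather than by literally reordering the input.

```python
import random

def sample_plm_targets(token_ids, num_targets=2):
    """Sample a factorization order and take the last `num_targets` positions in
    that order as prediction targets (the rest serve as visible context)."""
    order = list(range(len(token_ids)))
    random.shuffle(order)                  # a random permutation of positions
    context, targets = order[:-num_targets], order[-num_targets:]
    # The model is trained to predict token_ids[t] for each t in `targets`,
    # given the tokens at `context` positions and the natural positions of the targets.
    return context, targets
```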
-
😆Denoising autoencoder (DAE)
Denoising autoencoder (DAE) takes a partially corrupted input and aims to recover the original undistorted input.
There are several ways to corrupt text (a minimal sketch of some of these corruptions follows the list):
- 🏄♂️Token masking: Randomly sampling tokens from the input and replacing them with [MASK] elements.
- 🤾♂️Token deletion: Randomly deleting tokens from the input. Different from token masking, the model needs to decide the positions of missing inputs.
- 🏋️♂️Text infilling: Like SpanBERT, a number of text spans are sampled and replaced with a single [MASK] token. Each span length is drawn from a Poisson distribution (λ = 3). The model needs to predict how many tokens are missing from a span.
- 🚴♀️Sentence permutation: Dividing a document into sentences based on full stops and shuffling these sentences in random order.
- 🚣♀️Document rotation: Selecting a token uniformly at random and rotating the document so that it begins with that token. The model needs to identify the real start of the document.
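Below is a minimal sketch of three of these corruptions, operating on token lists or raw text. It is simplified relative to BART's actual preprocessing; text infilling with Poisson-sampled span lengths is omitted.

```python
import random

def token_deletion(tokens, p=0.15):
    """Randomly delete tokens; the model must decide where content is missing."""
    return [t for t in tokens if random.random() >= p]

def sentence_permutation(document):
    """Split a document into sentences on full stops and shuffle them."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def document_rotation(tokens):
    """Rotate the document so it starts at a uniformly chosen token; the model
    must identify the real start."""
    i = random.randrange(len(tokens))
    return tokens[i:] + tokens[:i]
```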
-
🙂Contrastive learning (CTL)
Contrastive learning assumes that some observed pairs of text are more semantically similar than randomly sampled texts. The idea behind CTL is "learning by comparison".
There are some recently proposed CTL tasks (a sketch of how NSP/SOP training pairs can be constructed follows the list):
- 🏄♂️Deep InfoMax (DIM): Deep InfoMax was originally proposed for images; it improves the quality of a representation by maximizing the mutual information between an image representation and local regions of the image.
- 🤾♂️Replaced token detection (RTD): Replaced token detection predicts whether a token is replaced given its surrounding context.
- 🏋️♂️Next sentence prediction (NSP): NSP trains the model to distinguish whether two input sentences are continuous segments from the training corpus.
- 🚴♀️Sentence order prediction (SOP): SOP uses two consecutive segments from the same document as positive examples, and the same two consecutive segments but with their order swapped as negative examples.
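The sketch below shows one straightforward way NSP and SOP training pairs could be constructed from documents split into consecutive segments; the 50/50 positive/negative sampling is an illustrative assumption.

```python
import random

def make_nsp_example(doc_segments, all_segments):
    """NSP: positive = two consecutive segments from one document;
    negative = second segment drawn from a random document."""
    i = random.randrange(len(doc_segments) - 1)
    if random.random() < 0.5:
        return (doc_segments[i], doc_segments[i + 1]), 1      # IsNext
    return (doc_segments[i], random.choice(all_segments)), 0  # NotNext

def make_sop_example(doc_segments):
    """SOP: positive = consecutive segments in order; negative = the same two
    segments with their order swapped."""
    i = random.randrange(len(doc_segments) - 1)
    a, b = doc_segments[i], doc_segments[i + 1]
    return ((a, b), 1) if random.random() < 0.5 else ((b, a), 0)
```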
-
😵Others
Apart from the above tasks, there are many other auxiliary pre-training tasks designed to incorporate factual knowledge, improve cross-lingual or multi-modal applications, or target other specific downstream tasks.
- 🏄♂️Knowledge-enriched PTMs
- 🤾♂️Multilingual and language-specific PTMs
- 🏋️♂️Multi-modal PTMs
- 🚴♀️Domain-specific and task-specific PTMs
Model | Parameters | Language | Architecture | Training objective | Training corpus | Link | Notes |
---|---|---|---|---|---|---|---|
BERT-base | 110M | EN/ZH/multilingual | Transformer-encoder | MLM+NSP | BooksCorpus+Wikipedia (16 GB) | github | |
BERT-large | 340M | EN | Transformer-encoder | MLM+NSP | BooksCorpus+Wikipedia | github | |
StructBERT-base | 110M | / | Transformer-encoder | MLM+NSP+SOP | BooksCorpus+Wikipedia | github | |
StructBERT-large | 340M, 330M | EN/ZH | Transformer-encoder | MLM+NSP+SOP | BooksCorpus+Wikipedia | github | |
SpanBERT | / | / | Transformer-encoder | MLM (SBO) | BooksCorpus+Wikipedia | github | |
ALBERT-base, large, xlarge, xxlarge | 12M, 18M, 59M, 233M | ZH/EN | Transformer-encoder | MLM+SOP | BooksCorpus+Wikipedia | github | |
BART-base, large, m | 140M, 400M, 610M | EN/multilingual | Transformer | DAE | BooksCorpus+Wikipedia | github | |
RoBERTa-base, large | 125M, 355M | EN | Transformer-encoder | MLM (dynamic masking) | BooksCorpus+Wikipedia+CC-NEWS (76 GB)+OpenWebText (38 GB)+Stories (31 GB) | github | |
XLM | / | multilingual | Transformer | MLM, TLM | Wikipedia+MultiUN+IIT Bombay+OPUS (EUbookshop, OpenSubtitles2018, Tanzil, GlobalVoices) | github | |
ELECTRA-small, base, large | 14M, 110M, 335M | EN | Generator+Discriminator | RTD | BooksCorpus+Wikipedia; BooksCorpus+Wikipedia+ClueWeb+CommonCrawl+Gigaword | github | Chinese version available |
ERNIE-THU | / | EN | Transformer-encoder+KG | MLM+NSP+KG fusion | BooksCorpus+Wikipedia+Wikidata | github | |
ERNIE 3.0 | 10B | ZH/EN | Transformer-encoder+KG | MLM (Knowledge Masking) | Chinese text corpora (4 TB), 11 categories | github | not released |
MASS | 120M | translation | Transformer | Seq2Seq-MLM | WMT16+WMT News Crawl dataset | github | |
Wu Dao 2.0 | 1.75T (covers multiple models) | ZH/EN bilingual | \ | \ | WuDaoCorpus (4.9 TB) | Official website | downloadable |
CPM-2, MoE | 11B, 198B | ZH/EN | Transformer-encoder+KG | Span MLM | WuDaoCorpus (zh: 2.3 TB; en: 300 GB) | Official website | |
UniLM v2 | 110M | EN | Transformer-encoder | MLM+NSP | BooksCorpus+Wikipedia+CC-NEWS+OpenWebText+Stories | github | |
M6 | 100B | ZH, multimodal | \ | \ | images (1.9 TB), texts (292 GB) | github | |
T5-small, base, large, 3B, 11B | 60M, 220M, 770M, 3B, 11B | EN | Transformer | Span MLM | Common Crawl (750 GB) | github | |
CODEX | 12B | code | Transformer-decoder | fine-tuned from GPT-3 | GitHub Python files (159 GB) | copilot | |
XLNet-base, large | similar to BERT | EN | Transformer-encoder | PLM | BooksCorpus+Wikipedia+Giga5+ClueWeb 2012-B+Common Crawl | github | Chinese version available |
GPT | 117M | EN | Transformer-decoder | LM | BooksCorpus | paper | |
GPT-2 | 1.5B | EN | Transformer-decoder | LM | WebText (40 GB) | github | |
GPT-3 | 175B | EN | Transformer-decoder | LM | Common Crawl+WebText dataset+two internet-based books corpora+English Wikipedia (570 GB filtered from 45 TB raw) | Official website | paid API |
- BERT: 110M-340M parameters (BERT-family models and other small PLMs generally have open-source weights that can be downloaded and used locally)
- T5: 11B parameters, model size about 15 GB, downloadable for local use
- GPT-2: 1.5B parameters, weights publicly released and downloadable
- GPT-3: 175B parameters, paid API at roughly 0.01-0.7 RMB per 1K tokens (China is not among the supported regions)
- Huawei PanGu: 200B parameters, not released; under inquiry
- Baidu ERNIE 3.0: 10B parameters; under inquiry
- RoBERTa: 125M-355M parameters, downloadable
- ALBERT: 12M-233M parameters
- Wu Dao 2.0 GLM (General Language Model): 10B parameters, downloadable upon application
- Wu Dao 2.0 CPM (Chinese Pretrained Models): 2.6B, 11B, and 198B parameters, downloadable upon application
- BART: 400M parameters, downloadable
4️⃣Summary of tasks supported by Transformer pre-trained models (Referenced from PaddleNLP)
Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice |
---|---|---|---|---|---|
ALBERT | ✅ | ✅ | ✅ | ❌ | ✅ |
BART | ✅ | ✅ | ✅ | ✅ | ❌ |
BERT | ✅ | ✅ | ✅ | ❌ | ✅ |
BigBird | ✅ | ✅ | ✅ | ❌ | ✅ |
Blenderbot | ❌ | ❌ | ❌ | ✅ | ❌ |
Blenderbot-Small | ❌ | ❌ | ❌ | ✅ | ❌ |
ConvBert | ✅ | ✅ | ✅ | ✅ | ✅ |
CTRL | ✅ | ❌ | ❌ | ❌ | ❌ |
DistilBert | ✅ | ✅ | ✅ | ❌ | ❌ |
ELECTRA | ✅ | ✅ | ❌ | ❌ | ✅ |
ERNIE | ✅ | ✅ | ✅ | ❌ | ❌ |
ERNIE-DOC | ✅ | ✅ | ✅ | ❌ | ❌ |
ERNIE-GEN | ❌ | ❌ | ❌ | ✅ | ❌ |
ERNIE-GRAM | ✅ | ✅ | ✅ | ❌ | ❌ |
GPT | ✅ | ✅ | ❌ | ✅ | ❌ |
LayoutLM | ✅ | ✅ | ❌ | ❌ | ❌ |
LayoutLMV2 | ❌ | ✅ | ❌ | ❌ | ❌ |
LayoutXLM | ❌ | ✅ | ❌ | ❌ | ❌ |
Mbart | ✅ | ❌ | ✅ | ❌ | ✅ |
MobileBert | ✅ | ❌ | ✅ | ❌ | ❌ |
MPNet | ✅ | ✅ | ✅ | ❌ | ✅ |
NeZha | ✅ | ✅ | ✅ | ❌ | ✅ |
ReFormer | ✅ | ❌ | ✅ | ❌ | ❌ |
RoBERTa | ✅ | ✅ | ✅ | ❌ | ❌ |
RoFormer | ✅ | ✅ | ✅ | ❌ | ❌ |
SKEP | ✅ | ✅ | ❌ | ❌ | ❌ |
SqueezeBert | ✅ | ✅ | ✅ | ❌ | ❌ |
T5 | ❌ | ❌ | ❌ | ✅ | ❌ |
TinyBert | ✅ | ❌ | ❌ | ❌ | ❌ |
UnifiedTransformer | ❌ | ❌ | ❌ | ✅ | ❌ |
XLNet | ✅ | ✅ | ❌ | ❌ | ❌ |
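Each ✅ above corresponds to a ready-made task-specific head class in PaddleNLP. As an analogous illustration (using the Hugging Face transformers API rather than PaddleNLP's, since the pattern of pairing a backbone with a task head is the same), loading a sequence-classification head looks roughly like this; the checkpoint name and label count are placeholders.

```python
# Illustration with the Hugging Face `transformers` API, whose task-specific head
# classes mirror the PaddleNLP ones in the table above.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Pre-trained models are useful.", return_tensors="pt")
logits = model(**inputs).logits   # shape: (1, num_labels); fine-tune before relying on these
```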