1️⃣Pre-training tasks(Referenced from)
-
😀Language modeling (LM)
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length $m$, it assigns a probability $P(w_1, \ldots, w_m)$ to the whole sequence.
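In the standard autoregressive (left-to-right) formulation, this joint probability factorizes by the chain rule, and the model is trained to maximize the log-likelihood of each conditional:

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}).$$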
-
😂Masked language modeling (MLM)
MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the remaining tokens. (It can be viewed as a special case of denoising autoencoding, DAE.)
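As a concrete illustration, here is a minimal sketch of BERT-style input corruption for MLM. The 15% selection rate and the 80/10/10 mask/random/keep split follow the original BERT recipe; the token ids and constants below are placeholders.

```python
import random

MASK_ID = 103        # placeholder id for the [MASK] token (model-specific)
VOCAB_SIZE = 30522   # placeholder vocabulary size
IGNORE = -100        # label value ignored by the loss

def mask_tokens(token_ids, mlm_prob=0.15):
    """Return (corrupted_input, labels) for one sequence of token ids."""
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:      # select ~15% of positions
            labels[i] = tok                 # only selected positions contribute to the loss
            r = random.random()
            if r < 0.8:                     # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                   # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```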
-
🏄♂️Enhanced masked language modeling (E-MLM):
There are multiple lines of research proposing enhanced versions of MLM to further improve on BERT. UniLM extends mask prediction to three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. XLM performs MLM on a concatenation of parallel bilingual sentence pairs, called Translation Language Modeling (TLM). SpanBERT replaces MLM with Random Contiguous Words Masking and a Span Boundary Objective (SBO) to integrate structural information into pre-training, which requires the model to predict masked spans based on the span boundaries. Besides, StructBERT introduces a Span Order Recovery task to further incorporate language structures. Another way to enrich MLM is to incorporate external knowledge.
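To make TLM concrete, the sketch below shows one plausible way a bilingual training example could be assembled: the two parallel sentences are concatenated so that masked tokens in one language can be recovered from context in the other. The special-token layout, masking rate, and label convention are illustrative assumptions, not XLM's exact preprocessing.

```python
import random

def build_tlm_example(src_ids, tgt_ids, sep_id=102, mask_id=103, mlm_prob=0.15):
    """Concatenate a parallel sentence pair and mask tokens on both sides, so the
    model can use context from the other language to recover them."""
    pair = src_ids + [sep_id] + tgt_ids          # e.g. English ids + [SEP] + French ids
    labels = [-100] * len(pair)                  # -100 = position not predicted
    for i, tok in enumerate(pair):
        if tok != sep_id and random.random() < mlm_prob:
            labels[i] = tok
            pair[i] = mask_id                    # replace with [MASK]
    return pair, labels
```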
-
😊Permuted language modeling (PLM)
PLM is a language modeling task on a random permutation of the input sequence. A permutation is randomly sampled from all possible permutations. Then some of the tokens in the permuted sequence are chosen as targets, and the model is trained to predict these targets conditioned on the rest of the tokens and the natural positions of the targets.
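A data-side sketch of this objective is given below; it only shows how a permutation and its prediction targets might be sampled. XLNet realizes this with attention masks and two-stream attention rather than by literally reordering the input.

```python
import random

def sample_plm_targets(token_ids, num_targets=2):
    """Sample a factorization order and take the last `num_targets` positions in
    that order as prediction targets (the rest serve as visible context)."""
    order = list(range(len(token_ids)))
    random.shuffle(order)                  # a random permutation of positions
    context, targets = order[:-num_targets], order[-num_targets:]
    # The model is trained to predict token_ids[t] for each t in `targets`,
    # given the tokens at `context` positions and the natural positions of the targets.
    return context, targets
```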
-
😆Denoising autoencoder (DAE)
Denoising autoencoder (DAE) takes a partially corrupted input and aims to recover the original undistorted input.
There are several ways to corrupt text (a minimal sketch of some of these corruptions follows the list):
- 🏄♂️Token masking: Randomly sampling tokens from the input and replacing them with [MASK] elements.
- 🤾♂️Token deletion: Randomly deleting tokens from the input. Different from token masking, the model needs to decide the positions of missing inputs.
- 🏋️♂️Text infilling: Like SpanBERT, a number of text spans are sampled and replaced with a single [MASK] token. Each span length is drawn from a Poisson distribution (λ = 3). The model needs to predict how many tokens are missing from a span.
- 🚴♀️Sentence permutation: Dividing a document into sentences based on full stops and shuffling these sentences in random order.
- 🚣♀️Document rotation: Selecting a token uniformly at random and rotating the document so that it begins with that token. The model needs to identify the real start of the document.
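Below is a minimal sketch of three of these corruptions, operating on token lists or raw text. It is simplified relative to BART's actual preprocessing; text infilling with Poisson-sampled span lengths is omitted.

```python
import random

def token_deletion(tokens, p=0.15):
    """Randomly delete tokens; the model must decide where content is missing."""
    return [t for t in tokens if random.random() >= p]

def sentence_permutation(document):
    """Split a document into sentences on full stops and shuffle them."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def document_rotation(tokens):
    """Rotate the document so it starts at a uniformly chosen token; the model
    must identify the real start."""
    i = random.randrange(len(tokens))
    return tokens[i:] + tokens[:i]
```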
-
🙂Contrastive learning (CTL)
Contrastive learning assumes that some observed pairs of text are more semantically similar than randomly sampled texts. The idea behind CTL is "learning by comparison".
There are some recently proposed CTL tasks (a sketch of how NSP/SOP training pairs can be constructed follows the list):
- 🏄♂️Deep InfoMax (DIM): Deep InfoMax was originally proposed for images; it improves the quality of a representation by maximizing the mutual information between an image representation and local regions of the image.
- 🤾♂️Replaced token detection (RTD): Replaced token detection predicts whether a token is replaced given its surrounding context.
- 🏋️♂️Next sentence prediction (NSP): NSP trains the model to distinguish whether two input sentences are continuous segments from the training corpus.
- 🚴♀️Sentence order prediction (SOP): SOP uses two consecutive segments from the same document as positive examples, and the same two consecutive segments but with their order swapped as negative examples.
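The sketch below shows one straightforward way NSP and SOP training pairs could be constructed from documents split into consecutive segments; the 50/50 positive/negative sampling is an illustrative assumption.

```python
import random

def make_nsp_example(doc_segments, all_segments):
    """NSP: positive = two consecutive segments from one document;
    negative = second segment drawn from a random document."""
    i = random.randrange(len(doc_segments) - 1)
    if random.random() < 0.5:
        return (doc_segments[i], doc_segments[i + 1]), 1      # IsNext
    return (doc_segments[i], random.choice(all_segments)), 0  # NotNext

def make_sop_example(doc_segments):
    """SOP: positive = consecutive segments in order; negative = the same two
    segments with their order swapped."""
    i = random.randrange(len(doc_segments) - 1)
    a, b = doc_segments[i], doc_segments[i + 1]
    return ((a, b), 1) if random.random() < 0.5 else ((b, a), 0)
```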
-
😵Others
Apart from the above tasks, there are many other auxiliary pre-training tasks designed to incorporate factual knowledge, improve cross-lingual or multi-modal applications, or target other specific downstream tasks.
- 🏄♂️Knowledge-enriched PTMs
- 🤾♂️Multilingual and language-specific PTMs
- 🏋️♂️Multi-modal PTMs
- 🚴♀️Domain-specific and task-specific PTMs
Model | Parameters | Language | Architecture | Training objective | Training corpus | Link | Notes |
---|---|---|---|---|---|---|---|
BERT-base | 110M | EN/ZH/multilingual | Transformer-encoder | MLM+NSP | BooksCorpus+Wikipedia (16 GB) | github | |
BERT-large | 340M | EN | Transformer-encoder | MLM+NSP | BooksCorpus+Wikipedia | github | |
StructBERT-base | 110M | / | Transformer-encoder | MLM+NSP+SOP | BooksCorpus+Wikipedia | github | |
StructBERT-large | 340M, 330M | EN/ZH | Transformer-encoder | MLM+NSP+SOP | BooksCorpus+Wikipedia | github | |
SpanBERT | / | / | Transformer-encoder | MLM (SBO) | BooksCorpus+Wikipedia | github | |
ALBERT-base, large, xlarge, xxlarge | 12M, 18M, 59M, 233M | ZH/EN | Transformer-encoder | MLM+SOP | BooksCorpus+Wikipedia | github | |
BART-base, large, m | 140M, 400M, 610M | EN/multilingual | Transformer | DAE | BooksCorpus+Wikipedia | github | |
RoBERTa-base, large | 125M, 355M | EN | Transformer-encoder | MLM (dynamic masking) | BooksCorpus+Wikipedia+CC-NEWS (76 GB)+OpenWebText (38 GB)+Stories (31 GB) | github | |
XLM | / | multilingual | Transformer | MLM, TLM | Wikipedia+MultiUN+IIT Bombay+OPUS (EUbookshop, OpenSubtitles2018, Tanzil, GlobalVoices) | github | |
ELECTRA-small, base, large | 14M, 110M, 335M | EN | Generator+Discriminator | RTD | BooksCorpus+Wikipedia; BooksCorpus+Wikipedia+ClueWeb+CommonCrawl+Gigaword | github | Chinese version available |
ERNIE-THU | / | EN | Transformer-encoder+KG | MLM+NSP+KG fusion | BooksCorpus+Wikipedia+Wikidata | github | |
ERNIE 3.0 | 10B | ZH/EN | Transformer-encoder+KG | MLM (Knowledge Masking) | Chinese text corpora (4 TB), 11 categories | github | not released |
MASS | 120M | translation | Transformer | Seq2Seq-MLM | WMT16+WMT News Crawl dataset | github | |
Wu Dao 2.0 | 1.75T (covers multiple models) | ZH/EN bilingual | \ | \ | WuDaoCorpus (4.9 TB) | Official website | downloadable |
CPM-2, MoE | 11B, 198B | ZH/EN | Transformer-encoder+KG | Span MLM | WuDaoCorpus (zh: 2.3 TB; en: 300 GB) | Official website | |
UniLM v2 | 110M | EN | Transformer-encoder | MLM+NSP | BooksCorpus+Wikipedia+CC-NEWS+OpenWebText+Stories | github | |
M6 | 100B | ZH, multimodal | \ | \ | images (1.9 TB), texts (292 GB) | github | |
T5-small, base, large, 3B, 11B | 60M, 220M, 770M, 3B, 11B | EN | Transformer | Span MLM | Common Crawl (750 GB) | github | |
CODEX | 12B | code | Transformer-decoder | fine-tuned from GPT-3 | GitHub Python files (159 GB) | copilot | |
XLNet-base, large | similar to BERT | EN | Transformer-encoder | PLM | BooksCorpus+Wikipedia+Giga5+ClueWeb 2012-B+Common Crawl | github | Chinese version available |
GPT | 117M | EN | Transformer-decoder | LM | BooksCorpus | paper | |
GPT-2 | 1.5B | EN | Transformer-decoder | LM | WebText (40 GB) | github | |
GPT-3 | 175B | EN | Transformer-decoder | LM | Common Crawl+WebText dataset+two internet-based books corpora+English Wikipedia (570 GB filtered from 45 TB raw) | Official website | paid API |
- BERT: 110M-340M parameters (BERT-family models and other small PLMs generally have open-source weights that can be downloaded and used locally)
- T5: 11B parameters, model size about 15 GB, downloadable for local use
- GPT-2: 1.5B parameters, weights publicly released and downloadable
- GPT-3: 175B parameters, paid API at roughly 0.01-0.7 RMB per 1K tokens (China is not among the supported regions)
- Huawei PanGu: 200B parameters, not released; under inquiry
- Baidu ERNIE 3.0: 10B parameters; under inquiry
- RoBERTa: 125M-355M parameters, downloadable
- ALBERT: 12M-233M parameters
- Wu Dao 2.0 GLM (General Language Model): 10B parameters, downloadable upon application
- Wu Dao 2.0 CPM (Chinese Pretrained Models): 2.6B, 11B, and 198B parameters, downloadable upon application
- BART: 400M parameters, downloadable
4️⃣Summary of tasks supported by Transformer pre-trained models (Referenced from PaddleNLP)
Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice |
---|---|---|---|---|---|
ALBERT | ✅ | ✅ | ✅ | ❌ | ✅ |
BART | ✅ | ✅ | ✅ | ✅ | ❌ |
BERT | ✅ | ✅ | ✅ | ❌ | ✅ |
BigBird | ✅ | ✅ | ✅ | ❌ | ✅ |
Blenderbot | ❌ | ❌ | ❌ | ✅ | ❌ |
Blenderbot-Small | ❌ | ❌ | ❌ | ✅ | ❌ |
ConvBert | ✅ | ✅ | ✅ | ✅ | ✅ |
CTRL | ✅ | ❌ | ❌ | ❌ | ❌ |
DistilBert | ✅ | ✅ | ✅ | ❌ | ❌ |
ELECTRA | ✅ | ✅ | ❌ | ❌ | ✅ |
ERNIE | ✅ | ✅ | ✅ | ❌ | ❌ |
ERNIE-DOC | ✅ | ✅ | ✅ | ❌ | ❌ |
ERNIE-GEN | ❌ | ❌ | ❌ | ✅ | ❌ |
ERNIE-GRAM | ✅ | ✅ | ✅ | ❌ | ❌ |
GPT | ✅ | ✅ | ❌ | ✅ | ❌ |
LayoutLM | ✅ | ✅ | ❌ | ❌ | ❌ |
LayoutLMV2 | ❌ | ✅ | ❌ | ❌ | ❌ |
LayoutXLM | ❌ | ✅ | ❌ | ❌ | ❌ |
Mbart | ✅ | ❌ | ✅ | ❌ | ✅ |
MobileBert | ✅ | ❌ | ✅ | ❌ | ❌ |
MPNet | ✅ | ✅ | ✅ | ❌ | ✅ |
NeZha | ✅ | ✅ | ✅ | ❌ | ✅ |
ReFormer | ✅ | ❌ | ✅ | ❌ | ❌ |
RoBERTa | ✅ | ✅ | ✅ | ❌ | ❌ |
RoFormer | ✅ | ✅ | ✅ | ❌ | ❌ |
SKEP | ✅ | ✅ | ❌ | ❌ | ❌ |
SqueezeBert | ✅ | ✅ | ✅ | ❌ | ❌ |
T5 | ❌ | ❌ | ❌ | ✅ | ❌ |
TinyBert | ✅ | ❌ | ❌ | ❌ | ❌ |
UnifiedTransformer | ❌ | ❌ | ❌ | ✅ | ❌ |
XLNet | ✅ | ✅ | ❌ | ❌ | ❌ |
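Each ✅ above corresponds to a ready-made task-specific head class in PaddleNLP. As an analogous illustration (using the Hugging Face transformers API rather than PaddleNLP's, since the pattern of pairing a backbone with a task head is the same), loading a sequence-classification head looks roughly like this; the checkpoint name and label count are placeholders.

```python
# Illustration with the Hugging Face `transformers` API, whose task-specific head
# classes mirror the PaddleNLP ones in the table above.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Pre-trained models are useful.", return_tensors="pt")
logits = model(**inputs).logits   # shape: (1, num_labels); fine-tune before relying on these
```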