BIT-Xu/Pre-trained-Language-Models

1️⃣ Pre-training tasks (Referenced from)

  1. 😀Language modeling (LM)

    A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability $P(w_1, \ldots, w_m)$ to the whole sequence.
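    As a concrete illustration, the sketch below scores a sentence with an off-the-shelf autoregressive LM through Hugging Face transformers (the GPT-2 checkpoint is only an assumption for the example):

    ```python
    # Minimal sketch: score a sentence with a causal LM. With labels equal to the
    # input ids, the returned loss is the mean cross-entropy over
    # P(w_t | w_1, ..., w_{t-1}), i.e. the chain-rule factorization of P(w_1, ..., w_m).
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    inputs = tokenizer("Pre-trained language models transfer well.", return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss

    log_prob = -loss.item() * inputs["input_ids"].size(1)  # approximate total log-probability
    print(f"log P(sentence) ~= {log_prob:.2f}")
    ```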

  2. 😂Masked language modeling (MLM)

    MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the rest of the tokens (it can be viewed as a form of DAE); see the sketch at the end of this item.

    • 🏄‍♂️Enhanced masked language modeling (E-MLM):

      There are multiple research works proposing enhanced versions of MLM to further improve on BERT. UniLM extends the task of mask prediction to three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. XLM performs MLM on a concatenation of parallel bilingual sentence pairs, called Translation Language Modeling (TLM). SpanBERT replaces MLM with Random Contiguous Words Masking and a Span Boundary Objective (SBO) to integrate structure information into pre-training, which requires the model to predict masked spans based on span boundaries. Besides, StructBERT introduces the Span Order Recovery task to further incorporate language structures. Another way to enrich MLM is to incorporate external knowledge.
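    A minimal MLM sketch, using the Hugging Face fill-mask pipeline (the bert-base-uncased checkpoint is just an illustrative choice):

    ```python
    # Minimal sketch of MLM at inference time: mask one token and let a
    # pre-trained BERT recover it from the surrounding context.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # "[MASK]" is the mask token of bert-base-uncased.
    for pred in fill_mask("Pre-trained language models are [MASK] on large corpora."):
        print(f'{pred["token_str"]:>12}  score={pred["score"]:.3f}')
    ```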

  3. 😊Permuted language modeling (PLM)

    PLM is a language modeling task on a random permutation of the input sequence. A permutation is randomly sampled from all possible permutations. Then some of the tokens in the permuted sequence are chosen as targets, and the model is trained to predict these targets conditioned on the rest of the tokens and the natural positions of the targets.
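    A toy sketch of how PLM selects its targets (only the permutation and target bookkeeping; the two-stream attention used by XLNet to condition on target positions is omitted):

    ```python
    # Minimal sketch of permuted language modeling target selection:
    # sample a factorization order, keep the last K positions as prediction
    # targets, and treat the rest (plus the targets' original positions) as context.
    import random

    tokens = ["pre", "-", "trained", "models", "are", "useful"]
    num_targets = 2

    order = list(range(len(tokens)))
    random.shuffle(order)                     # random factorization order

    context_positions = order[:-num_targets]  # visible to the model
    target_positions = order[-num_targets:]   # to be predicted

    print("factorization order:", order)
    print("context tokens:", [(i, tokens[i]) for i in sorted(context_positions)])
    print("target tokens :", [(i, tokens[i]) for i in target_positions])
    ```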

  4. 😆Denoising autoencoder (DAE)

    Denoising autoencoder (DAE) takes a partially corrupted input and aims to recover the original undistorted input.

    There are several ways to corrupt text (a sketch of these corruptions follows the list):

    • 🏄‍♂️Token masking: Randomly sampling tokens from the input and replacing them with [MASK] elements.
    • 🤾‍♂️Token deletion: Randomly deleting tokens from the input. Different from token masking, the model needs to decide the positions of missing inputs.
    • 🏋️‍♂️Text infilling: Like SpanBERT, a number of text spans are sampled and replaced with a single [MASK] token. Each span length is drawn from a Poisson distribution (λ = 3). The model needs to predict how many tokens are missing from a span.
    • 🚴‍♀️Sentence permutation: Dividing a document into sentences based on full stops and shuffling these sentences in random order.
    • 🚣‍♀️Document rotation: Selecting a token uniformly at random and rotating the document so that it begins with that token. The model needs to identify the real start of the document.
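    A minimal sketch of these corruptions (the mask token string, corruption rates, and the Poisson parameter are illustrative assumptions):

    ```python
    # Minimal sketch of DAE-style text corruptions: token masking, token deletion,
    # text infilling, sentence permutation, and document rotation.
    import random
    import numpy as np

    MASK = "[MASK]"

    def token_masking(tokens, p=0.15):
        """Randomly replace tokens with the mask symbol."""
        return [MASK if random.random() < p else t for t in tokens]

    def token_deletion(tokens, p=0.15):
        """Randomly delete tokens; the model must also infer where text is missing."""
        return [t for t in tokens if random.random() >= p]

    def text_infilling(tokens, lam=3):
        """Replace one span of Poisson(lam)-distributed length with a single mask token."""
        span_len = min(int(np.random.poisson(lam)), len(tokens))
        start = random.randint(0, len(tokens) - span_len)
        return tokens[:start] + [MASK] + tokens[start + span_len:]

    def sentence_permutation(sentences):
        """Shuffle sentences (split on full stops upstream) into a random order."""
        return random.sample(sentences, len(sentences))

    def document_rotation(tokens):
        """Rotate the document so it starts at a uniformly chosen token."""
        pivot = random.randrange(len(tokens))
        return tokens[pivot:] + tokens[:pivot]

    tokens = "pre trained models learn transferable representations .".split()
    print(token_masking(tokens))
    print(text_infilling(tokens))
    print(document_rotation(tokens))
    ```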
  5. 🙂Contrastive learning (CTL)

    Contrastive learning assumes that some observed pairs of text are more semantically similar than randomly sampled text. The idea behind CTL is "learning by comparison".

    There are some recently proposed CTL tasks (a sketch of NSP and SOP pair construction follows the list):

    • 🏄‍♂️Deep InfoMax (DIM): Deep InfoMax is originally proposed for images, which improves the quality of the representation by maximizing the mutual information between an image representation and local regions of the image.
    • 🤾‍♂️Replaced token detection (RTD): Replaced token detection predicts whether a token is replaced given its surrounding context.
    • 🏋️‍♂️Next sentence prediction (NSP): NSP trains the model to distinguish whether two input sentences are continuous segments from the training corpus.
    • 🚴‍♀️Sentence order prediction (SOP): SOP uses two consecutive segments from the same document as positive examples, and the same two consecutive segments but with their order swapped as negative examples.
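    A minimal sketch of how NSP and SOP training pairs are built (segments are whole sentences here for simplicity; real implementations work on token segments):

    ```python
    # Minimal sketch of NSP and SOP pair construction from consecutive segments.
    import random

    def nsp_pairs(doc_sentences, corpus_sentences):
        """NSP: positive = consecutive segments; negative = second segment from a random document."""
        pairs = []
        for a, b in zip(doc_sentences, doc_sentences[1:]):
            if random.random() < 0.5:
                pairs.append((a, b, 1))                                # IsNext
            else:
                pairs.append((a, random.choice(corpus_sentences), 0))  # NotNext
        return pairs

    def sop_pairs(doc_sentences):
        """SOP: positive = consecutive segments in order; negative = the same segments swapped."""
        pairs = []
        for a, b in zip(doc_sentences, doc_sentences[1:]):
            pairs.append((a, b, 1) if random.random() < 0.5 else (b, a, 0))
        return pairs

    doc = ["Sentence one.", "Sentence two.", "Sentence three."]
    print(sop_pairs(doc))
    ```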
  6. 😵Others

    Apart from the above tasks, there are many other auxiliary pre-training tasks designed to incorporate factual knowledge, improve cross-lingual or multi-modal applications, or adapt to other specific downstream tasks.

    • 🏄‍♂️Knowledge-enriched PTMs
    • 🤾‍♂️Multilingual and language-specific PTMs
    • 🏋️‍♂️Multi-modal PTMs
    • 🚴‍♀️Domain-specific and task-specific PTMs

2️⃣ Pre-trained LMs

| Model | #Params | Language | Architecture | Training objective | Pre-training corpus | Link | Notes |
|---|---|---|---|---|---|---|---|
| BERT-base | 110M | EN / ZH / multilingual | Transformer encoder | MLM + NSP | BooksCorpus + Wikipedia (16 GB) | github | |
| BERT-large | 340M | | Transformer encoder | MLM + NSP | BooksCorpus + Wikipedia | github | |
| StructBERT-base | 110M | / | Transformer encoder | MLM + NSP + SOP | BooksCorpus + Wikipedia | github | |
| StructBERT-large | 340M, 330M | EN / ZH | Transformer encoder | MLM + NSP + SOP | BooksCorpus + Wikipedia | github | |
| SpanBERT | / | / | Transformer encoder | MLM (SBO) | BooksCorpus + Wikipedia | github | |
| ALBERT-base / large / xlarge / xxlarge | 12M, 18M, 59M, 233M | ZH / EN | Transformer encoder | MLM + SOP | BooksCorpus + Wikipedia | github | |
| BART-base / large / m | 140M, 400M, 610M | EN / multilingual | Transformer | DAE | BooksCorpus + Wikipedia | github | |
| RoBERTa-base / large | 125M, 355M | | Transformer encoder | MLM (dynamic masking) | BooksCorpus + Wikipedia + CC-NEWS (76 GB) + OpenWebText (38 GB) + Stories (31 GB) | github | |
| XLM | / | multilingual | Transformer | MLM, TLM | Wikipedia + MultiUN + IIT Bombay + OPUS (EUbookshop, OpenSubtitles2018, Tanzil, GlobalVoices) | github | |
| ELECTRA-small / base / large | 14M, 110M, 335M | | Generator + discriminator | RTD | BooksCorpus + Wikipedia; BooksCorpus + Wikipedia + ClueWeb + Common Crawl + Gigaword | github | Chinese version available |
| ERNIE-THU | / | | Transformer encoder + KG | MLM + NSP + KG fusion | BooksCorpus + Wikipedia + Wikidata | github | |
| ERNIE 3.0 | 10B | ZH / EN | Transformer encoder + KG | MLM (knowledge masking) | Chinese text corpora (4 TB, 11 categories) | github | not released |
| MASS | 120M | translation | Transformer | Seq2Seq MLM | WMT16 + WMT News Crawl | github | |
| Wu Dao 2.0 | 1.75T (a family of many models) | ZH / EN / bilingual | \ | \ | WuDaoCorpus (4.9 TB) | official website | downloadable |
| CPM-2, MoE | 11B, 198B | ZH / EN | Transformer encoder + KG | Span MLM | WuDaoCorpus (zh: 2.3 TB; en: 300 GB) | official website | |
| UniLM v2 | 110M | | Transformer encoder | MLM + NSP | BooksCorpus + Wikipedia + CC-NEWS + OpenWebText + Stories | github | |
| M6 | 100B | ZH, multimodal | \ | \ | images (1.9 TB), texts (292 GB) | github | |
| T5-small / base / large / 3B / 11B | 60M, 220M, 770M, 3B, 11B | | Transformer | Span MLM | Common Crawl (750 GB) | github | |
| CODEX | 12B | code | Transformer decoder | fine-tuned from GPT-3 | GitHub Python files (159 GB) | copilot | |
| XLNet-base / large | similar to BERT | | Transformer encoder | PLM | BooksCorpus + Wikipedia + Giga5 + ClueWeb 2012-B + Common Crawl | github | Chinese version available |
| GPT | 117M | | Transformer decoder | LM | BooksCorpus | paper | |
| GPT-2 | 1.5B | | Transformer decoder | LM | Common Crawl (40 GB) | github | |
| GPT-3 | 175B | | Transformer decoder | LM | Common Crawl + WebText dataset + two internet-based books corpora + English-language Wikipedia (570 GB, from 45 TB raw) | official website | paid |
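Many of the open checkpoints above (BERT, RoBERTa, ALBERT, BART, T5, ELECTRA, XLNet, GPT-2, ...) can be loaded through Hugging Face transformers. A minimal sketch, where the hub identifiers are assumptions and the official weights live in each project's own repository:

```python
# Minimal sketch: load a few open checkpoints and count their parameters.
from transformers import AutoModel, AutoTokenizer

for checkpoint in ["bert-base-uncased", "roberta-base", "t5-small"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {n_params / 1e6:.0f}M parameters")
```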

3️⃣ Availability of selected large models

  1. BERT: 110M-340M parameters (the BERT family and other small PLMs generally have open-sourced weights that can be downloaded for local use)
  2. T5: 11B parameters, roughly 15 GB on disk, downloadable for local use
  3. GPT-2: 1.5B parameters, weights openly released, downloadable for local use
  4. GPT-3: 175B parameters, paid API (about 0.01-0.7 RMB per 1K tokens); China is not among the regions eligible to apply
  5. Huawei PanGu: 200B parameters, not released, inquiry in progress
  6. Baidu ERNIE 3.0: 10B parameters, inquiry in progress
  7. RoBERTa: 125M-355M parameters, downloadable for local use
  8. ALBERT: 125M parameters
  9. Wu Dao 2.0 GLM (General Language Model): 10B parameters, downloadable upon application
  10. Wu Dao 2.0 CPM (Chinese Pretrained Models): 2.6B, 11B, and 198B parameters, downloadable upon application
  11. BART: 400M parameters, downloadable for local use

4️⃣ Supported tasks of Transformer pre-trained models (Referenced from PaddleNLP)

The PaddleNLP summary maps each of the models below to the tasks it supports: Sequence Classification, Token Classification, Question Answering, Text Generation, and Multiple Choice.

• ALBERT
• BART
• BERT
• BigBird
• Blenderbot
• Blenderbot-Small
• ConvBert
• CTRL
• DistilBert
• ELECTRA
• ERNIE
• ERNIE-DOC
• ERNIE-GEN
• ERNIE-GRAM
• GPT
• LayoutLM
• LayoutLMV2
• LayoutXLM
• Mbart
• MobileBert
• MPNet
• NeZha
• ReFormer
• RoBERTa
• RoFormer
• SKEP
• SqueezeBert
• T5
• TinyBert
• UnifiedTransformer
• XLNet
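The usual pattern behind this summary is one pre-trained backbone re-used with different task-specific heads. A minimal sketch with Hugging Face transformers Auto classes (PaddleNLP provides analogous per-task classes such as BertForSequenceClassification; the checkpoint is an illustrative assumption):

```python
# Minimal sketch: the same backbone with different task-specific heads.
# (Heads are newly initialized here and would need fine-tuning before use.)
from transformers import (
    AutoModelForMultipleChoice,
    AutoModelForQuestionAnswering,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

checkpoint = "bert-base-uncased"
heads = {
    "Sequence Classification": AutoModelForSequenceClassification,
    "Token Classification": AutoModelForTokenClassification,
    "Question Answering": AutoModelForQuestionAnswering,
    "Multiple Choice": AutoModelForMultipleChoice,
}

# Text Generation would instead use AutoModelForCausalLM or AutoModelForSeq2SeqLM
# with a decoder or encoder-decoder backbone (e.g. GPT, BART, T5).
for task, head_cls in heads.items():
    model = head_cls.from_pretrained(checkpoint)
    print(f"{task}: {type(model).__name__}")
```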
