Presenting a Corpus of Pretrained Models. Links to pretrained models in NLP and voice, with training scripts.
With rapid progress in NLP, it is becoming easier to bootstrap a machine learning project involving text. Instead of starting from a code base, one can now start from a pretrained base model and reach SOTA performance within a few iterations. This repository is built on the view that pretrained models minimize collective human effort and resource cost, thus accelerating development in the field.
Listed models are limited to PyTorch and TensorFlow because of their wide usage.
Note: pytorch-transformers is an awesome library that can be used to quickly run inference with, or fine-tune, many pretrained NLP models. The pretrained models it provides are not included here.
Name | Link | Trained On | Training script |
---|---|---|---|
XLNet | https://github.com/zihangdai/xlnet/#released-models | BooksCorpus + English Wikipedia + Giga5 + ClueWeb 2012-B + Common Crawl | https://github.com/zihangdai/xlnet/ |
Name | Link | Trained On | Training script |
---|---|---|---|
OpenNMT | http://opennmt.net/Models-py/ (pytorch) http://opennmt.net/Models-tf/ (tensorflow) | English-German | https://github.com/OpenNMT/OpenNMT-py |
Fairseq (multiple models) | https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md#pre-trained-models | WMT14 English-French, WMT16 English-German | https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md |
Name | Link | Trained On | Training script |
---|---|---|---|
Nvidia sentiment-discovery | https://github.com/NVIDIA/sentiment-discovery#pretrained-models | SST, imdb, Semeval-2018-tweet-emotion | https://github.com/NVIDIA/sentiment-discovery |
MT-DNN Sentiment | https://drive.google.com/open?id=1-ld8_WpdQVDjPeYhb3AK8XYLGlZEbs-l | SST | https://github.com/namisan/mt-dnn |
Rank | Name | Link | Training script |
---|---|---|---|
49 | BiDaf | https://s3-us-west-2.amazonaws.com/allennlp/models/bidaf-model-2017.09.15-charpad.tar.gz | https://github.com/allenai/allennlp |
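
Several links in these tables point directly at a `.tar.gz` archive, as the BiDaf entry above does. The following is a minimal sketch of fetching and unpacking such an archive using only the Python standard library. The function name `fetch_model_archive` and the `extracted` subdirectory are illustrative choices, not part of any listed project, and each project's own README remains the authority on how its model should be loaded.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_model_archive(url: str, dest_dir: str) -> Path:
    """Download a .tar.gz model archive and unpack it under dest_dir.

    Hypothetical helper for illustration; only use it with archives
    from sources you trust, since extraction writes files to disk.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local copy after the last path component of the URL.
    archive_path = dest / url.rstrip("/").split("/")[-1]

    # Skip the download if the archive is already present locally.
    if not archive_path.exists():
        urllib.request.urlretrieve(url, archive_path)

    # Unpack into a separate subdirectory so the archive stays untouched.
    extract_dir = dest / "extracted"
    extract_dir.mkdir(exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(extract_dir)
    return extract_dir
```

Calling `fetch_model_archive(url, "models/bidaf")` with the BiDaf link from the table would, assuming the link is still live, leave the unpacked files under `models/bidaf/extracted`. Keeping the downloaded archive next to the extracted files makes repeated runs skip the download.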
Model for English summarization
Name | Link | Trained On | Training script |
---|---|---|---|
OpenNMT | http://opennmt.net/Models-py/ | Gigaword standard | https://github.com/OpenNMT/OpenNMT-py |
Datasets referenced in this document
Wikipedia data dump (Large text compression benchmark) http://mattmahoney.net/dc/textdata.html
Wikipedia cleaned text (Large text compression benchmark) http://mattmahoney.net/dc/textdata.html
1 Billion Word Language Model Benchmark https://www.statmt.org/lm-benchmark/
Wikitext 103 https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
Original dataset not released by the authors. An open source collection is available at https://skylion007.github.io/OpenWebTextCorpus/
https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
https://yknzhu.wixsite.com/mbweb https://github.com/soskek/bookcorpus
Stanford Sentiment Treebank https://nlp.stanford.edu/sentiment/index.html One of the GLUE tasks.
IMDB movie review dataset used for sentiment classification http://ai.stanford.edu/~amaas/data/sentiment
Semeval 2018 tweet emotion dataset https://competitions.codalab.org/competitions/17751
GLUE is a collection of resources for benchmarking natural language systems. https://gluebenchmark.com/ Contains datasets on natural language inference, sentiment classification, paraphrase detection, similarity matching and linguistic acceptability.
https://pdfs.semanticscholar.org/a723/97679079439b075de815553c7b687ccfa886.pdf
www.danielpovey.com/files/2015_icassp_librispeech.pdf
https://ieeexplore.ieee.org/document/225858/
https://github.com/mozilla/voice-web
https://datashare.is.ed.ac.uk/handle/10283/2651
High quality research which doesn't include pretrained models and/or code for public use.
- KERMIT https://arxiv.org/abs/1906.01604 Generative Insertion-Based Modeling for Sequences. No code.
Built on PyTorch, AllenNLP has produced SOTA models and open-sourced them. https://github.com/allenai/allennlp/blob/master/MODELS.md
They have a neat interactive demo of various tasks at https://demo.allennlp.org/
Based on MXNet, this library has an extensive list of pretrained models for various NLP tasks. http://gluon-nlp.mxnet.io/master/index.html#model-zoo