diff --git a/english/language_modeling.md b/english/language_modeling.md
index c95530c4..05414fdd 100644
--- a/english/language_modeling.md
+++ b/english/language_modeling.md
@@ -20,12 +20,12 @@ per-word log-probability (lower is better).
 | Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
 | ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
-| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 44.9 | 44.8 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 44.9 | 44.8 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 46.63 | 46.01 | 22M | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) | [Official](https://github.com/ChengyueGongR/advsoft) |
 | FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 47.38 | 46.54 | 22M | [FRAGE: Frequency-Agnostic Word Representation](https://arxiv.org/abs/1809.06858) | [Official](https://github.com/ChengyueGongR/Frequency-Agnostic) |
 | AWD-LSTM-DOC x5 (Takase et al., 2018) | 48.63 | 47.17 | 185M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
 | AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | 22M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
-| Mogrifier LSTM (Melis et al., 2019) | 51.4 | 50.1 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM (Melis et al., 2019) | 51.4 | 50.1 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | 24M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
 | AWD-LSTM-DOC + Partial Shuffle (Press, 2019) ***preprint*** | 53.79 | 52.00 | 23M | [Partially Shuffling the Training Data to Improve Language Models](https://arxiv.org/abs/1903.04167) | [Official](https://github.com/ofirpress/PartialShuffle) |
 | AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | 23M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
@@ -46,14 +46,14 @@ consists of around 2 million words extracted from Wikipedia articles.
 | Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
 | ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
-| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 40.2 | 38.6 | 35M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 40.2 | 38.6 | 35M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 40.27 | 38.65 | 35M | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) | [Official](https://github.com/ChengyueGongR/advsoft) |
 | FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 40.85 | 39.14 | 35M | [FRAGE: Frequency-Agnostic Word Representation](https://arxiv.org/abs/1809.06858) | [Official](https://github.com/ChengyueGongR/Frequency-Agnostic) |
 | AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | 35M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
 | AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | 33M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
 | AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | 33M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |
 | AWD-LSTM-DOC x5 (Takase et al., 2018) | 54.19 | 53.09 | 185M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
-| Mogrifier LSTM (Melis et al., 2019) | 57.3 | 55.1 | 35M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM (Melis et al., 2019) | 57.3 | 55.1 | 35M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | AWD-LSTM-DOC + Partial Shuffle (Press, 2019) ***preprint*** | 60.16 | 57.85 | 37M | [Partially Shuffling the Training Data to Improve Language Models](https://arxiv.org/abs/1903.04167) | [Official](https://github.com/ofirpress/PartialShuffle) |
 | AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | 37M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
 | AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | 35M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
@@ -115,14 +115,14 @@ Within these 100 million bytes are 205 unique tokens.
 | ---------------- | :-----: | :-----: | -------------- | ---- |
 | Transformer-XL + RMS dynamic eval (Krause et al., 2019)* ***arxiv preprint*** | 0.94 | 277M | [Dynamic Evaluation of Transformer Language Models](https://arxiv.org/pdf/1904.08378.pdf) | [Official](https://github.com/benkrause/dynamiceval-transformer) |
 | Compressive Transformer (Rae et al., 2019) ***arxiv preprint*** | 0.97 | - | [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/pdf/1911.05507.pdf) | - |
-| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 0.988 | 96M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 0.988 | 96M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | 24-layer Transformer-XL (Dai et al., 2018) ***under review*** | 0.99 | 277M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
 | 18-layer Transformer-XL (Dai et al., 2018) ***under review*** | 1.03 | 88M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
 | 12-layer Transformer-XL (Dai et al., 2018) ***under review*** | 1.06 | 41M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
 | 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.06 | 235M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
 | mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
 | 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.11 | 44M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
-| Mogrifier LSTM (Melis et al., 2019) | 1.122 | 96M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM (Melis et al., 2019) | 1.122 | 96M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | 3-layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) | [Official](https://github.com/salesforce/awd-lstm-lm) |
 | Large mLSTM +emb +WN +VD (Krause et al., 2017) | 1.24 | 46M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) | [Official](https://github.com/benkrause/mLSTM) |
 | Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) | [Official](https://github.com/amujika/Fast-Slow-LSTM) |
@@ -150,8 +150,8 @@ The vocabulary of the words in the character-level dataset is limited to 10 000
 | Model | Bit per Character (BPC) | Number of params | Paper / Source | Code |
 | ---------------- | :-----: | :-----: | -------------- | ---- |
-| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.083 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
-| Mogrifier LSTM (Melis et al., 2019) | 1.120 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.083 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
+| Mogrifier LSTM (Melis et al., 2019) | 1.120 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | Trellis Network (Bai et al., 2019) | 1.159 | 13.4M | [Trellis Networks for Sequence Modeling](https://openreview.net/pdf?id=HyeVtoRqtQ) | [Official](https://github.com/locuslab/trellisnet) |
 | 3-layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) | [Official](https://github.com/salesforce/awd-lstm-lm) |
 | 6-layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) | [Official](https://github.com/salesforce/awd-lstm-lm) |
@@ -168,8 +168,8 @@ The character-based [MWC](http://k-kawakami.com/research/mwc) dataset is a colle
 | Model | Validation BPC | Test BPC | Number of params | Paper / Source | Code |
 | ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
-| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.200 | 1.187 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
-| Mogrifier LSTM (Melis et al., 2019) | 1.312 | 1.298 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.200 | 1.187 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
+| Mogrifier LSTM (Melis et al., 2019) | 1.312 | 1.298 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | HCLM with Cache (Kawakami et al. 2017) | 1.591 | 1.538 | 8M | [Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling](https://arxiv.org/abs/1704.06986) | |
 | LSTM (Kawakami et al. 2017) | 1.793 | 1.736 | 8M | [Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling](https://arxiv.org/abs/1704.06986) | |
@@ -177,8 +177,8 @@ The character-based [MWC](http://k-kawakami.com/research/mwc) dataset is a colle
 | Model | Validation BPC | Test BPC | Number of params | Paper / Source | Code |
 | ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
-| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.202 | 1.191 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
-| Mogrifier LSTM (Melis et al., 2019) | 1.327 | 1.313 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | |
+| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.202 | 1.191 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
+| Mogrifier LSTM (Melis et al., 2019) | 1.327 | 1.313 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
 | HCLM with Cache (Kawakami et al. 2017) | 1.754 | 1.711 | 8M | [Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling](https://arxiv.org/abs/1704.06986) | |
 | LSTM (Kawakami et al. 2017) | 1.943 | 1.913 | 8M | [Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling](https://arxiv.org/abs/1704.06986) | |
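
Note: the tables touched by this patch report two different metrics, word-level perplexity and character-level bits per character (BPC). Both are monotone transforms of the model's average negative log-likelihood, so lower is better in every column. A minimal Python sketch of the conversions (illustrative only, not part of the patched file; the example NLL values are assumptions chosen to land near the Mogrifier rows above):

```python
import math

def perplexity(mean_nll_nats: float) -> float:
    """Word-level perplexity: exp of the mean negative log-likelihood (nats per word)."""
    return math.exp(mean_nll_nats)

def bits_per_character(mean_nll_nats: float) -> float:
    """Character-level BPC: mean negative log-likelihood (nats per character) in base 2."""
    return mean_nll_nats / math.log(2)

# Assumed example values, picked to land near rows in the tables above:
print(round(perplexity(3.91), 1))          # ~49.9, comparable to the word-level Mogrifier rows
print(round(bits_per_character(0.75), 3))  # ~1.082, comparable to the char-level Mogrifier rows
```

Because the two metrics are taken over different token units, a word-level perplexity and a character-level BPC are only comparable through the dataset's characters-per-word ratio; within a single table, rows remain directly comparable.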