This repository contains the PG-19 language modeling benchmark. It includes a set of books, extracted from the Project Gutenberg books library [1], that were published before 1919. It also contains metadata of book titles and publication dates.
PG-19 is over double the size of the Billion Word benchmark [2] and contains documents that are 20X longer, on average, than the WikiText long-range language modelling benchmark [3].
Books are partitioned into a `train`, `validation`, and `test` set. Book metadata is stored in `metadata.csv`, which contains `(book_id, short_book_title, publication_date)`.
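As a rough illustration of working with the metadata, the sketch below parses rows in the `(book_id, short_book_title, publication_date)` order given above. The function name `read_metadata`, the assumption that the file has no header row, and the sample rows are all hypothetical; check the actual file before relying on any of them.

```python
import csv
import io
from typing import Dict, Iterable, List

# Column order as described in this README. Treating the file as header-less
# is an assumption, not something the README states.
METADATA_FIELDS = ("book_id", "short_book_title", "publication_date")


def read_metadata(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse metadata rows into dictionaries keyed by METADATA_FIELDS."""
    records = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # Keep only the three documented columns; ignore any extras.
        records.append(dict(zip(METADATA_FIELDS, row[:3])))
    return records


# Made-up rows for demonstration only; these are not real PG-19 entries.
sample = io.StringIO('101,Example Title,1900\n102,"Another, With Comma",1850\n')
records = read_metadata(sample)
```

To read the real file, one could pass an open file handle instead, e.g. `read_metadata(open("metadata.csv", newline="", encoding="utf-8"))`, assuming the file sits in the working directory.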
Unlike prior benchmarks, we do not constrain the vocabulary size --- i.e. mapping rare words to an UNK token --- but instead release the data as an open-vocabulary benchmark. The only processing of the text that has been applied is the removal of boilerplate license text, and the mapping of offensive discriminatory words as specified by Ofcom [4] to placeholder tokens. Users are free to model the data at the character-level, subword-level, or via any mechanism that can model an arbitrary string of text.
To compare models, we propose to continue measuring word-level perplexity: compute the total negative log-likelihood of the dataset (via any chosen subword vocabulary or character-based scheme), divide it by the number of word tokens specified below in the dataset statistics table, and exponentiate the result.
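A minimal sketch of that calculation follows. It assumes the model reports a total negative log-likelihood in nats, summed over whatever subword or character units it uses; the function names are illustrative and not part of this repository. The token counts are taken from the dataset statistics table.

```python
import math

# Word-level token counts from the dataset statistics table in this README.
NUM_TOKENS = {
    "train": 1_973_136_207,
    "validation": 3_007_061,
    "test": 6_966_499,
}


def word_level_perplexity(total_nll_nats: float, num_word_tokens: int) -> float:
    """exp(total negative log-likelihood / number of word tokens).

    total_nll_nats is summed over the model's own units (characters,
    subwords, ...), but the normaliser is always the word-level count,
    which keeps models with different vocabularies comparable.
    """
    if num_word_tokens <= 0:
        raise ValueError("num_word_tokens must be positive")
    return math.exp(total_nll_nats / num_word_tokens)


def bits_to_nats(total_bits: float) -> float:
    """Convert a total loss measured in bits (log base 2) to nats."""
    return total_bits * math.log(2)
```

For example, a model whose summed test-set loss is `total_nll` nats would report `word_level_perplexity(total_nll, NUM_TOKENS["test"])`. A loss accumulated in bits would first go through `bits_to_nats`.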
One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks which require long-range reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.
| | Train | Validation | Test |
|---|---|---|---|
| Books | 28,602 | 50 | 100 |
| Num. Tokens | 1,973,136,207 | 3,007,061 | 6,966,499 |
@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
property | value
---|---
name | The PG-19 Language Modeling Benchmark
alternateName | PG-19
url | https://github.com/deepmind/pg19
sameAs | https://github.com/deepmind/pg19
description | This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org), that were published before 1919. It also contains metadata of book titles and publication dates.
provider |
license |
citation | https://identifiers.org/arxiv:1911.05507
If you have any questions, please contact Jack Rae.
- [1] https://www.gutenberg.org
- [2] Chelba et al. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
- [3] Merity et al. "Pointer Sentinel Mixture Models" (2016)
- [4] Ofcom offensive language guide
- [5] Paperno et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context" (2016)
- [6] Kočiský et al. "The NarrativeQA Reading Comprehension Challenge" (2018)