We provide the following three pretraining data files, extracted from Wikipedia and the OpenWebText Corpus:
openwebtext_questions.txt
contains questions extracted from a subset of the OpenWebText Corpus downloaded here.
wiki_long.txt
contains long Wikipedia sequences (between 20 and 70 words) extracted from the 1M Wikipedia sentences downloaded with this script.
wiki_short.txt
contains short Wikipedia sequences (between 5 and 30 words) extracted from the 1M Wikipedia sentences downloaded with this script.
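
The word-count ranges above can be illustrated with a minimal sketch. This is not the script used to build the files: the helper names `word_count` and `filter_by_length` are hypothetical, and it assumes that words are counted by splitting on whitespace and that both bounds are inclusive, neither of which is stated here.

```python
def word_count(text):
    # Assumption: a "word" is any whitespace-delimited token.
    return len(text.split())


def filter_by_length(sequences, min_words, max_words):
    # Keep sequences whose word count lies in [min_words, max_words].
    # Inclusive bounds are an assumption, not confirmed by the description.
    return [s for s in sequences if min_words <= word_count(s) <= max_words]


# Synthetic sequences with 3, 5, 25, 30, 50 and 71 words.
sample = [" ".join(["w"] * n) for n in (3, 5, 25, 30, 50, 71)]

short_sequences = filter_by_length(sample, 5, 30)   # wiki_short.txt range
long_sequences = filter_by_length(sample, 20, 70)   # wiki_long.txt range

print([word_count(s) for s in short_sequences])
print([word_count(s) for s in long_sequences])
```

Because the two ranges overlap between 20 and 30 words, a sequence of that length satisfies both filters under these assumptions.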