We provide the following three pretraining data files, extracted from Wikipedia and the OpenWebText Corpus:

openwebtext_questions.txt contains questions extracted from a subset of the OpenWebText Corpus downloaded here.
wiki_long.txt contains long Wikipedia sequences (between 20 and 70 words) extracted from the 1M Wikipedia sentences downloaded with this script.
wiki_short.txt contains short Wikipedia sequences (between 5 and 30 words) extracted from the 1M Wikipedia sentences downloaded with this script.
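
The word-count ranges above can be illustrated with a minimal sketch. It is not the script used to build these files: the whitespace tokenization, the inclusive bounds, and the function names `word_count`, `in_range`, and `split_by_length` are all assumptions made for illustration.

```python
# Hypothetical sketch of length-based filtering; not the original extraction script.
# Assumptions: words are whitespace-separated tokens, and both bounds are inclusive.

SHORT_RANGE = (5, 30)   # word range stated for wiki_short.txt
LONG_RANGE = (20, 70)   # word range stated for wiki_long.txt


def word_count(sentence: str) -> int:
    """Count whitespace-separated tokens in a sentence."""
    return len(sentence.split())


def in_range(sentence: str, low: int, high: int) -> bool:
    """Return True if the sentence has between low and high words, inclusive."""
    return low <= word_count(sentence) <= high


def split_by_length(sentences):
    """Return (short, long) lists; a sentence may qualify for both."""
    short = [s for s in sentences if in_range(s, *SHORT_RANGE)]
    long = [s for s in sentences if in_range(s, *LONG_RANGE)]
    return short, long
```

Because the two ranges overlap, a sequence of 20 to 30 words satisfies both filters under these assumptions.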
