We provide the following three pretraining data files, extracted from Wikipedia and the OpenWebText Corpus:

  • openwebtext_questions.txt contains questions extracted from a subset of the OpenWebText Corpus downloaded here.
  • wiki_long.txt contains long Wikipedia sequences (between 20 and 70 words) extracted from the 1M Wikipedia sentences downloaded with this script.
  • wiki_short.txt contains short Wikipedia sequences (between 5 and 30 words) extracted from the 1M Wikipedia sentences downloaded with this script.
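
The length buckets described above can be illustrated with a small sketch. This is not the script used to build the files; it assumes that words are whitespace-separated tokens and that the stated bounds are inclusive, and the function name `split_by_length` is hypothetical. Since the two ranges overlap (20 to 30 words), a sentence in that band lands in both buckets here; whether the released files share such sentences is not stated.

```python
from typing import Iterable, List, Tuple

# Word-count ranges taken from the file descriptions above.
# Treating both bounds as inclusive is an assumption.
SHORT_RANGE = (5, 30)
LONG_RANGE = (20, 70)


def word_count(sentence: str) -> int:
    # Assumption: a "word" is a whitespace-separated token.
    return len(sentence.split())


def split_by_length(sentences: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (short, long) lists of sentences filtered by word count."""
    short: List[str] = []
    long_: List[str] = []
    for raw in sentences:
        sentence = raw.strip()
        n = word_count(sentence)
        if SHORT_RANGE[0] <= n <= SHORT_RANGE[1]:
            short.append(sentence)
        if LONG_RANGE[0] <= n <= LONG_RANGE[1]:
            long_.append(sentence)
    return short, long_
```

Applied to an iterable of sentences (for example, the lines of a downloaded sentence file), the two returned lists could then be written out one sequence per line.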