PyTorch/LanguageModeling/BERT/README.md
20 additions & 6 deletions
@@ -254,10 +254,10 @@ If you want to use a pre-trained checkpoint, visit [NGC](https://ngc.nvidia.com/
 
    Resultant logs and checkpoints of pretraining and fine-tuning routines are stored in the `results/` folder.
 
-   `data` and `vocab.txt` are downloaded in the `data/` directory by default. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.
-
+   `data` and `vocab.txt` are downloaded in the `data/` directory by default. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.
+
 5. Download and preprocess the dataset.
-
+
    This repository provides scripts to download, verify, and extract the following datasets:
 
 -   [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (fine-tuning for question answering)
@@ -266,9 +266,23 @@ This repository provides scripts to download, verify, and extract the following
 
    To download, verify, extract the datasets, and create the shards in `.hdf5` format, run:
 
-   Note: For fine-tuning only, the Wikipedia and BookCorpus dataset download can be skipped by commenting it out. The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server can sometimes get overloaded and also contain broken links, resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry downloading at a later time. Expired dataset links are ignored during data download.
+   Note: For fine-tuning only, the Wikipedia and BookCorpus dataset download and preprocessing can be skipped by commenting it out.
+
+   - Download Wikipedia only for pretraining
+
+   The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server often gets overloaded and also contains broken links, resulting in HTTP 403 and 503 errors. Hence, it is recommended to skip downloading BookCorpus data by running:
+
+   Users are welcome to download BookCorpus from other sources to match our accuracy, or to repeatedly try our script until the required number of files are downloaded, by running the following:
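As a companion to the new instructions above, here is a minimal sketch of the two invocations the added text refers to. The diff itself does not show the commands, so the script path `data/create_datasets_from_start.sh` and its `wiki_only`/`wiki_books` arguments are assumptions based on the repository's data tooling; verify the exact invocation against the full README before running.

```bash
# Hedged sketch: script path and arguments are assumptions not shown in this diff.

# Download and preprocess Wikipedia only (skips BookCorpus entirely):
bash data/create_datasets_from_start.sh wiki_only

# Retry the full Wikipedia + BookCorpus download until the required files succeed:
bash data/create_datasets_from_start.sh wiki_books
```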