
Commit 3337f72

[BERT/PyT] Update DataPrep (#595)
* Update DataPrep
* Update create_datasets_from_start.sh
* Update README.md
* Update README.md
* Update README.md
1 parent 038e7f1 commit 3337f72

2 files changed: +42, -21 lines

PyTorch/LanguageModeling/BERT/README.md

Lines changed: 20 additions & 6 deletions
@@ -254,10 +254,10 @@ If you want to use a pre-trained checkpoint, visit [NGC](https://ngc.nvidia.com/
 
 Resultant logs and checkpoints of pretraining and fine-tuning routines are stored in the `results/` folder.
 
-`data` and `vocab.txt` are downloaded in the `data/` directory by default. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.
-
+`data` and `vocab.txt` are downloaded in the `data/` directory by default. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.
+
 5. Download and preprocess the dataset.
-
+
 This repository provides scripts to download, verify, and extract the following datasets:
 
 - [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (fine-tuning for question answering)
@@ -266,9 +266,23 @@ This repository provides scripts to download, verify, and extract the following
 
 To download, verify, extract the datasets, and create the shards in `.hdf5` format, run:
 `/workspace/bert/data/create_datasets_from_start.sh`
-
-Note: For fine tuning only, Wikipedia and Bookscorpus dataset download can be skipped by commenting it out. The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server could sometimes get overloaded and also contain broken links resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry downloading at a later time. Expired dataset links are ignored during data download.
-
+
+Note: For fine-tuning only, the Wikipedia and BookCorpus download and preprocessing steps can be skipped by commenting them out.
+
+- Download Wikipedia only for pretraining
+
+The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server often gets overloaded and contains broken links, resulting in HTTP 403 and 503 errors. Hence, it is recommended to skip downloading the BookCorpus data by running:
+
+`/workspace/bert/data/create_datasets_from_start.sh wiki_only`
+
+- Download Wikipedia and BookCorpus
+
+Users are welcome to download BookCorpus from other sources to match our accuracy, or to rerun our script until the required number of files is downloaded, by running the following:
+
+`/workspace/bert/data/create_datasets_from_start.sh wiki_books`
+
+Note: Not using BookCorpus can potentially change the final accuracy on a few downstream tasks.
+
 6. Start pretraining.
 
 To run on a single node 8 x V100 32G cards, from within the container, you can use the following script to run pre-training.
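
For the fine-tuning-only case called out in the note above, a minimal sketch of the reduced data prep would keep only the two bertPrep.py download calls that already appear in create_datasets_from_start.sh (the pretrained weights, which include `vocab.txt`, and SQuAD) and comment out the Wikipedia/BookCorpus download, formatting, sharding, and HDF5 steps:

    # Fine-tuning only (sketch): download the pretrained weights and SQuAD,
    # skipping the Wikipedia/BookCorpus pretraining data entirely.
    python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
    python3 /workspace/bert/data/bertPrep.py --action download --dataset squad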

PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh

Lines changed: 22 additions & 15 deletions
@@ -13,30 +13,37 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# Download
-python3 /workspace/bert/data/bertPrep.py --action download --dataset bookscorpus
-python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
+to_download=${1:-"wiki_only"}
 
-python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
+#Download
+if [ "$to_download" = "wiki_books" ] ; then
+    python3 /workspace/bert/data/bertPrep.py --action download --dataset bookscorpus
+fi
 
+python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
+python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
 python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
-#python3 /workspace/bert/data/bertPrep.py --action download --dataset mrpc
-
 
 # Properly format the text files
-python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset bookscorpus
+if [ "$to_download" = "wiki_books" ] ; then
+    python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset bookscorpus
+fi
 python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset wikicorpus_en
 
+if [ "$to_download" = "wiki_books" ] ; then
+    DATASET="books_wiki_en_corpus"
+else
+    DATASET="wikicorpus_en"
+    # Shard the text files
+fi
 
-# Shard the text files (group wiki+books then shard)
-python3 /workspace/bert/data/bertPrep.py --action sharding --dataset books_wiki_en_corpus
-
+# Shard the text files
+python3 /workspace/bert/data/bertPrep.py --action sharding --dataset $DATASET
 
 # Create HDF5 files Phase 1
-python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset books_wiki_en_corpus --max_seq_length 128 \
- --max_predictions_per_seq 20 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
-
+python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset $DATASET --max_seq_length 128 \
+ --max_predictions_per_seq 20 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
 
 # Create HDF5 files Phase 2
-python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset books_wiki_en_corpus --max_seq_length 512 \
- --max_predictions_per_seq 80 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
+python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset $DATASET --max_seq_length 512 \
+ --max_predictions_per_seq 80 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
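
As a usage note for the new optional argument: `to_download` defaults to `wiki_only` via `${1:-"wiki_only"}`, so calling the script with no argument now skips BookCorpus. A short sketch of the supported invocations, assuming the script is run from within the container and that `BERT_PREP_WORKING_DIR` is set (the `--vocab_file` paths above reference it):

    # Wikipedia only (also the default when no argument is given)
    /workspace/bert/data/create_datasets_from_start.sh wiki_only

    # Wikipedia + BookCorpus
    /workspace/bert/data/create_datasets_from_start.sh wiki_books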
