
Commit 3337f72

[BERT/PyT] Update DataPrep (#595)
* Update DataPrep
* Update create_datasets_from_start.sh
* Update README.md
* Update README.md
* Update README.md
1 parent 038e7f1 commit 3337f72

2 files changed: +42, -21 lines

PyTorch/LanguageModeling/BERT/README.md

Lines changed: 20 additions & 6 deletions
@@ -254,10 +254,10 @@ If you want to use a pre-trained checkpoint, visit [NGC](https://ngc.nvidia.com/
 
 Resultant logs and checkpoints of pretraining and fine-tuning routines are stored in the `results/` folder.
 
-`data` and `vocab.txt` are downloaded in the `data/` directory by default. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.
-
+`data` and `vocab.txt` are downloaded in the `data/` directory by default. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.
+
 5. Download and preprocess the dataset.
-
+
 This repository provides scripts to download, verify, and extract the following datasets:
 
 - [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (fine-tuning for question answering)
@@ -266,9 +266,23 @@ This repository provides scripts to download, verify, and extract the following
 
 To download, verify, extract the datasets, and create the shards in `.hdf5` format, run:
 `/workspace/bert/data/create_datasets_from_start.sh`
-
-Note: For fine tuning only, Wikipedia and Bookscorpus dataset download can be skipped by commenting it out. The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server could sometimes get overloaded and also contain broken links resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry downloading at a later time. Expired dataset links are ignored during data download.
-
+
+Note: For fine-tuning only, the Wikipedia and BookCorpus download and preprocessing steps can be skipped by commenting them out.
+
+- Download Wikipedia only for pretraining
+
+The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server often gets overloaded and contains broken links, resulting in HTTP 403 and 503 errors. Hence, it is recommended to skip downloading the BookCorpus data by running:
+
+`/workspace/bert/data/create_datasets_from_start.sh wiki_only`
+
+- Download Wikipedia and BookCorpus
+
+Users are welcome to download BookCorpus from other sources to match our accuracy, or to rerun our script until the required number of files is downloaded, by running the following:
+
+`/workspace/bert/data/create_datasets_from_start.sh wiki_books`
+
+Note: Not using BookCorpus can potentially change the final accuracy on a few downstream tasks.
+
 6. Start pretraining.
 
 To run on a single node 8 x V100 32G cards, from within the container, you can use the following script to run pre-training.
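
For the fine-tuning-only case called out in the note above, a minimal sketch of the reduced data prep would keep only the two bertPrep.py download calls that already appear in create_datasets_from_start.sh (the pretrained weights, which include `vocab.txt`, and SQuAD) and comment out the Wikipedia/BookCorpus download, formatting, sharding, and HDF5 steps:

    # Fine-tuning only (sketch): download the pretrained weights and SQuAD,
    # skipping the Wikipedia/BookCorpus pretraining data entirely.
    python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
    python3 /workspace/bert/data/bertPrep.py --action download --dataset squad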

PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh

Lines changed: 22 additions & 15 deletions
@@ -13,30 +13,37 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# Download
-python3 /workspace/bert/data/bertPrep.py --action download --dataset bookscorpus
-python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
+to_download=${1:-"wiki_only"}
 
-python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
+#Download
+if [ "$to_download" = "wiki_books" ] ; then
+    python3 /workspace/bert/data/bertPrep.py --action download --dataset bookscorpus
+fi
 
+python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
+python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
 python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
-#python3 /workspace/bert/data/bertPrep.py --action download --dataset mrpc
-
 
 # Properly format the text files
-python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset bookscorpus
+if [ "$to_download" = "wiki_books" ] ; then
+    python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset bookscorpus
+fi
 python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset wikicorpus_en
 
+if [ "$to_download" = "wiki_books" ] ; then
+    DATASET="books_wiki_en_corpus"
+else
+    DATASET="wikicorpus_en"
+    # Shard the text files
+fi
 
-# Shard the text files (group wiki+books then shard)
-python3 /workspace/bert/data/bertPrep.py --action sharding --dataset books_wiki_en_corpus
-
+# Shard the text files
+python3 /workspace/bert/data/bertPrep.py --action sharding --dataset $DATASET
 
 # Create HDF5 files Phase 1
-python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset books_wiki_en_corpus --max_seq_length 128 \
- --max_predictions_per_seq 20 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
-
+python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset $DATASET --max_seq_length 128 \
+ --max_predictions_per_seq 20 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
 
 # Create HDF5 files Phase 2
-python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset books_wiki_en_corpus --max_seq_length 512 \
- --max_predictions_per_seq 80 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
+python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset $DATASET --max_seq_length 512 \
+ --max_predictions_per_seq 80 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
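
As a usage note for the new optional argument: `to_download` defaults to `wiki_only` via `${1:-"wiki_only"}`, so calling the script with no argument now skips BookCorpus. A short sketch of the supported invocations, assuming the script is run from within the container and that `BERT_PREP_WORKING_DIR` is set (the `--vocab_file` paths above reference it):

    # Wikipedia only (also the default when no argument is given)
    /workspace/bert/data/create_datasets_from_start.sh wiki_only

    # Wikipedia + BookCorpus
    /workspace/bert/data/create_datasets_from_start.sh wiki_books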
