Update DataPrep #595

Merged
merged 5 commits on Jul 9, 2020
26 changes: 20 additions & 6 deletions PyTorch/LanguageModeling/BERT/README.md
@@ -254,10 +254,10 @@ If you want to use a pre-trained checkpoint, visit [NGC](https://ngc.nvidia.com/

Resultant logs and checkpoints of pretraining and fine-tuning routines are stored in the `results/` folder.

`data` and `vocab.txt` are downloaded in the `data/` directory by default. Refer to the [Getting the data](#getting-the-data) section for more details on how to process a custom corpus as required for BERT pretraining.

5. Download and preprocess the dataset.

This repository provides scripts to download, verify, and extract the following datasets:

- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (fine-tuning for question answering)
@@ -266,9 +266,23 @@ This repository provides scripts to download, verify, and extract the following

To download, verify, extract the datasets, and create the shards in `.hdf5` format, run:
`/workspace/bert/data/create_datasets_from_start.sh`
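
As a minimal sketch, a typical invocation from inside the container might look like the following; the `BERT_PREP_WORKING_DIR` value is an assumption (the container may already set it), so adjust it to wherever you want the data written:

```bash
# Hypothetical example: prepare the datasets with the script's default settings.
# BERT_PREP_WORKING_DIR is assumed to be the desired output directory for downloads and shards.
export BERT_PREP_WORKING_DIR=/workspace/bert/data
bash /workspace/bert/data/create_datasets_from_start.sh
```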

Note: For fine-tuning only, the Wikipedia and BookCorpus dataset downloads can be skipped by commenting them out. The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server can sometimes get overloaded and also contains broken links, resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry downloading at a later time. Expired dataset links are ignored during data download.


Note: For fine-tuning only, the Wikipedia and BookCorpus dataset download and preprocessing can be skipped by commenting them out; a minimal sketch of what remains is shown after these notes.

- Download Wikipedia only for pretraining

The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server frequently gets overloaded and also contains broken links, resulting in HTTP 403 and 503 errors. Hence, it is recommended to skip downloading the BookCorpus data by running:

`/workspace/bert/data/create_datasets_from_start.sh wiki_only`

- Download Wikipedia and BookCorpus

Users are welcome to download BookCorpus from other sources to match our accuracy, or to retry our script until the required number of files is downloaded, by running the following:

`/workspace/bert/data/create_datasets_from_start.sh wiki_books`

Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
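
For a fine-tuning-only setup, a minimal sketch of what remains after commenting out the Wikipedia/BookCorpus steps in `create_datasets_from_start.sh` could look like this (which steps suffice for fine-tuning is an assumption based on the script changed in this PR):

```bash
# Fine-tuning only (sketch): fetch just the pre-trained weights and SQuAD,
# skipping the Wikipedia/BookCorpus download, formatting, sharding, and HDF5 steps.
python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights  # Includes vocab
python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
```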

6. Start pretraining.

To run on a single node with 8 x V100 32G cards, from within the container, you can use the following script to run pre-training.
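
The exact pre-training command is collapsed in this diff view. Purely as a hypothetical illustration, a launch from inside the container might look like the following; the script path is an assumption, not taken from this diff:

```bash
# Hypothetical launch (sketch); refer to the repository for the actual
# pre-training launcher and its arguments.
bash scripts/run_pretraining.sh
```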
37 changes: 22 additions & 15 deletions PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh
@@ -13,30 +13,37 @@
# See the License for the specific language governing permissions and
# limitations under the License.

# Download
python3 /workspace/bert/data/bertPrep.py --action download --dataset bookscorpus
python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
to_download=${1:-"wiki_only"}
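# Default to "wiki_only" when no argument is passed; pass "wiki_books" to also download and preprocess BookCorpus.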

python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
# Download
if [ "$to_download" = "wiki_books" ] ; then
python3 /workspace/bert/data/bertPrep.py --action download --dataset bookscorpus
fi

python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
#python3 /workspace/bert/data/bertPrep.py --action download --dataset mrpc


# Properly format the text files
python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset bookscorpus
if [ "$to_download" = "wiki_books" ] ; then
python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset bookscorpus
fi
python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset wikicorpus_en

if [ "$to_download" = "wiki_books" ] ; then
DATASET="books_wiki_en_corpus"
else
DATASET="wikicorpus_en"
# Shard the text files
fi

# Shard the text files (group wiki+books then shard)
python3 /workspace/bert/data/bertPrep.py --action sharding --dataset books_wiki_en_corpus

# Shard the text files
python3 /workspace/bert/data/bertPrep.py --action sharding --dataset $DATASET

# Create HDF5 files Phase 1
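# (Phase 1 uses max sequence length 128 with 20 masked predictions per sequence; Phase 2 below uses 512 with 80.)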
python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset books_wiki_en_corpus --max_seq_length 128 \
--max_predictions_per_seq 20 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1

python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset $DATASET --max_seq_length 128 \
--max_predictions_per_seq 20 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1

# Create HDF5 files Phase 2
python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset books_wiki_en_corpus --max_seq_length 512 \
--max_predictions_per_seq 80 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset $DATASET --max_seq_length 512 \
--max_predictions_per_seq 80 --vocab_file $BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1