Pretrain BERT Model on Azure Machine Learning service

To pretrain BERT language representation models on AzureML, the following artifacts are required:

  • An Azure Machine Learning Workspace with an AzureML Compute cluster of 64 V100 GPUs (either 16 NC24s_v3 or 8 ND40_v2 VMs). Note that by default your subscription may not have enough quota, and you will likely need to submit a support ticket to request more by following the guide here. A provisioning sketch is shown after this list.
  • Preprocessed data: The BERT paper uses the Wikipedia and BookCorpus datasets for pretraining. The notebook in this pretrain recipe is configured to use the Wikipedia dataset only, but it works with other datasets as well, including custom ones. The preprocessed data should be available in a Datastore registered to the AzureML Workspace that will be used for BERT pretraining. A preprocessed Wikipedia corpus is made available for use with the pretraining recipe in this repo; refer to the instructions to access it. You can copy the Wikipedia dataset from that location to another Azure blob container and register the container as a datastore in your workspace before using it in the pretraining job (see the datastore-registration sketch after this list). Alternatively, you can preprocess the data from scratch (refer to the instructions on this), upload it to an Azure blob container, and use that as the datastore for the pretraining job.
  • Job configuration to define the parameters for the pretraining job. Refer to the configs directory for the available configuration settings (BERT-base vs. BERT-large, as well as single-node configurations for debugging vs. multi-node configurations for production-ready pretraining).
  • Code to pretrain the BERT model in AzureML. The notebook that submits a pretraining job to AzureML is BERT_Pretrain.ipynb.
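
For reference, here is a minimal sketch of provisioning such a compute cluster with the AzureML Python SDK (v1). The cluster name is a placeholder, and the 16 x NC24s_v3 layout below is one of the two VM options listed above; the notebook performs equivalent setup.

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# Load the workspace from a config.json downloaded from the Azure portal.
ws = Workspace.from_config()

# Hypothetical cluster name; 16 NC24s_v3 nodes give 64 V100 GPUs in total.
cluster_name = "bert-pretrain-cluster"

if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
else:
    config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC24s_v3",  # 4 x V100 per node
        min_nodes=0,                  # scale down to zero when idle
        max_nodes=16,                 # 16 nodes x 4 GPUs = 64 V100s
    )
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```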

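Similarly, a blob container holding the preprocessed corpus can be registered as a datastore with a single SDK call. All names below (datastore, container, storage account) are placeholders for your own resources:

```python
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Hypothetical names: substitute the container and storage account that
# hold your copy of the preprocessed Wikipedia corpus.
ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="bert_preprocessed_data",
    container_name="bert-data",
    account_name="mystorageaccount",
    account_key="<storage-account-key>",
)
```
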
Submit the Pretraining Job

The BERT_Pretrain.ipynb notebook contains the recipe to submit a BERT-large pretraining job to the AzureML service and to monitor its metrics in TensorBoard.
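
For orientation, the following is a rough sketch of what such a submission looks like with the AzureML SDK (v1): a distributed MPI run on the cluster, with TensorBoard attached to stream its logs. The environment, script, and config-file names here are illustrative assumptions; the actual values are set inside BERT_Pretrain.ipynb and the configs directory.

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import MpiConfiguration
from azureml.tensorboard import Tensorboard

ws = Workspace.from_config()

# Hypothetical environment built from a pip requirements file.
env = Environment.from_pip_requirements("bert-pretrain-env", "requirements.txt")

# One MPI process per GPU: 16 nodes x 4 V100s = 64 workers.
distributed = MpiConfiguration(process_count_per_node=4, node_count=16)

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",  # placeholder for the repo's training entry point
    arguments=["--config_file", "configs/<bert-large-config>.json"],
    compute_target="bert-pretrain-cluster",
    environment=env,
    distributed_job_config=distributed,
)

run = Experiment(ws, "bert-pretraining").submit(src)

# Stream the run's TensorBoard logs to a local TensorBoard server.
tb = Tensorboard([run])
print(tb.start())  # prints the local URL to open in a browser
# ... when finished monitoring:
# tb.stop()
```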