Implementation for the paper "MC-BERT: Efficient Language Pre-Training via a Meta Controller", submitted to NeurIPS 2020.
This repo contains the experimental code for our paper, including all model implementations, data preprocessing, and parameter settings. It is built on top of fairseq, and we thank its authors; for further details on fairseq usage, please see the original repo.
- Add metrics for downstream tasks. In addition to the original accuracy metric, we add pearson_spearman, F1, and MCC for the various downstream tasks in GLUE; see `criterions/sentence_prediction.py` for sentence prediction.
- Checkpoint-saving related settings; see the checkpoint utils file.
- Support transformer-v2 and fix the v1 setting, based on the original repo; see `modules/transformer_sentence_encoder.py`.
- Add the generator and related logic in the model definition files; see the new folder `electra` in `models`.
- Add a new dataset (shared with MC-BERT), named `mask_tokens_dataset2.py`.
- Define a new loss in criterions (see `electra.py` in `criterions`) and a new task (see `electra.py` in `tasks`).
- Add the meta controller and related logic in the model definition files; see the new folder `mcbert` in `models`.
- Define a new loss in criterions (see `mcbert.py` in `criterions`) and a new task (see `mcbert.py` in `tasks`).
More details can be found in the fairseq repo. Briefly, the requirements are:
- PyTorch version >= 1.2.0
- Python version >= 3.5
- For training new models, you'll also need an NVIDIA GPU and NCCL
- For faster training, install NVIDIA's apex library with the `--cuda_ext` option (an example install command is sketched below)
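A minimal sketch of building apex from source with its compiled extensions enabled; the exact options may vary with your apex and CUDA versions:

```bash
# Build NVIDIA apex from source with the C++/CUDA extensions
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```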
Installing from source
To install MC-BERT from source and develop locally:
```bash
git clone https://github.com/MC-BERT/MC-BERT
cd MC-BERT
pip install --editable .
```
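If the install succeeded, the package should be importable; a quick check (assuming the fork keeps fairseq's package name, as the file paths above suggest):

```bash
python -c "import fairseq; print(fairseq.__version__)"
```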
The full documentation of fairseq contains instructions for getting started, training new models and extending fairseq with new model types and tasks.
We apply several consecutive pre-processing steps: segmenting documents into sentences with Spacy; normalizing, lower-casing, and tokenizing the texts with the Moses decoder; and finally applying byte pair encoding (BPE) with the vocabulary size |V| set to 32,768. The pre-processing code is in `preprocess/pretrain/process.sh`.
Following the same procedure, we process the GLUE data with `preprocess/glue/process.sh`.
When reproducing our results, please modify the relevant file paths in these scripts; a sketch of invoking them is shown below.
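Assuming the paths inside the scripts have been adjusted, the two steps might be run as follows (the output location is whatever the scripts are configured to write, e.g. the `../data-bin/wiki_book_32768` directory used for pre-training below):

```bash
# Pre-training corpus: Spacy sentence segmentation, Moses normalization,
# lower-casing and tokenization, then BPE with a 32,768-token vocabulary.
bash preprocess/pretrain/process.sh

# GLUE benchmark data: the same pipeline applied per task.
bash preprocess/glue/process.sh
```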
For pre-training the ELECTRA model, you can refer to the following script:
```bash
#!/usr/bin/env bash
EXEC_ID=electra-50L
DATA_DIR=../data-bin/wiki_book_32768
TOTAL_UPDATES=1000000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0001 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=8 # Number of sequences per batch (batch size)
UPDATE_FREQ=4 # Accumulate gradients over 4 steps (4x larger effective batch)
SEED=100
python train.py ${DATA_DIR} --fp16 --num-workers 4 --ddp-backend=no_c10d \
--task electra --criterion electra \
--arch electra --sample-break-mode complete --tokens-per-sample ${TOKENS_PER_SAMPLE} \
--optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr ${PEAK_LR} --warmup-updates ${WARMUP_UPDATES} --total-num-update ${TOTAL_UPDATES} \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences ${MAX_SENTENCES} --update-freq ${UPDATE_FREQ} --seed ${SEED} \
--loss-lamda 50.0 --mask-prob 0.15 \
--embedding-normalize --generator-size-divider 3 \
--max-update ${TOTAL_UPDATES} --log-format simple --log-interval 100 --tensorboard-logdir ../tsb_log/electra-${EXEC_ID} \
--distributed-world-size 8 --distributed-rank 0 --distributed-init-method "tcp://xxx.xxx.xxx.xxx:8080" \
--keep-updates-list 20000 50000 100000 200000 \
--save-interval-updates 10000 --keep-interval-updates 5 --no-epoch-checkpoints --skip-invalid-size-inputs-valid-test \
--save-dir ../saved_cp/electra-${EXEC_ID}
```
For pre-training the MC-BERT model, you can refer to the following script:
```bash
#!/usr/bin/env bash
EXEC_ID=mcbert-10C-50L
DATA_DIR=../data-bin/wiki_book_32768
TOTAL_UPDATES=1000000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0001 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=8 # Number of sequences per batch (batch size)
UPDATE_FREQ=4 # Accumulate gradients over 4 steps (4x larger effective batch)
SEED=100
python train.py ${DATA_DIR} --fp16 --num-workers 4 --ddp-backend=no_c10d \
--task mcbert --criterion mcbert \
--arch mcbert_base --sample-break-mode complete --tokens-per-sample ${TOKENS_PER_SAMPLE} \
--optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr ${PEAK_LR} --warmup-updates ${WARMUP_UPDATES} --total-num-update ${TOTAL_UPDATES} \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences ${MAX_SENTENCES} --update-freq ${UPDATE_FREQ} --seed ${SEED} \
--loss-lamda 50.0 --mask-prob 0.15 --class-num 10 \
--embedding-normalize --mc-size-divider 3 \
--max-update ${TOTAL_UPDATES} --log-format simple --log-interval 100 --tensorboard-logdir ../tsb_log/mcbert-${EXEC_ID} \
--distributed-world-size 8 --distributed-rank 0 --distributed-init-method "tcp://xxx.xxx.xxx.xxx:8080" \
--keep-updates-list 20000 50000 100000 200000 \
--save-interval-updates 10000 --keep-interval-updates 5 --no-epoch-checkpoints --skip-invalid-size-inputs-valid-test \
--save-dir ../saved_cp/mcbert-${EXEC_ID}
```
After setting the hyperparameters, you can fine-tune the model with the command below (an illustrative set of shell-variable values follows it):
```bash
python train.py $DATA_PATH/${PROBLEM}-bin \
--restore-file $BERT_MODEL_PATH \
--max-positions 512 \
--max-sentences $SENT_PER_GPU \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch $ARCH \
--criterion sentence_prediction \
--num-classes $N_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay $WEIGHT_DECAY --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $N_UPDATES --warmup-updates $WARMUP_UPDATES \
--max-epoch $N_EPOCH --seed $SEED --save-dir $OUTPUT_PATH --no-progress-bar --log-interval 100 --no-epoch-checkpoints --no-last-checkpoints --no-best-checkpoints \
--find-unused-parameters --skip-invalid-size-inputs-valid-test --truncate-sequence --embedding-normalize \
--tensorboard-logdir $TENSORBOARD_LOG/${PROBLEM}/${N_EPOCH}-${BATCH_SZ}-${LR}-${WEIGHT_DECAY}-$SEED \
--best-checkpoint-metric $METRIC --maximize-best-checkpoint-metric
```
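The command above assumes these shell variables are already set. The values below are only an illustrative sketch: the task, paths, and hyperparameters are placeholders, and the actual per-task settings are given in the supplementary files.

```bash
# Illustrative placeholder values -- adjust per GLUE task and per your own paths.
PROBLEM=MNLI                          # GLUE task name matching the preprocessed data dir
DATA_PATH=../data-bin/glue            # root of the binarized GLUE data
BERT_MODEL_PATH=../saved_cp/mcbert-mcbert-10C-50L/checkpoint_last.pt  # pre-trained checkpoint
ARCH=mcbert_base                      # the architecture used during pre-training
N_CLASSES=3                           # 3 for MNLI; 2 for most other GLUE tasks
METRIC=accuracy                       # or pearson_spearman / F1 / MCC, depending on the task
SENT_PER_GPU=16
BATCH_SZ=32
LR=1e-5
WEIGHT_DECAY=0.01
N_EPOCH=10
N_UPDATES=20000                       # total updates for the polynomial_decay scheduler
WARMUP_UPDATES=1200
SEED=100
OUTPUT_PATH=../saved_cp/finetune/${PROBLEM}
TENSORBOARD_LOG=../tsb_log/finetune
```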
More detailed settings and results are given in the supplementary files. Thanks for visiting; if you have any questions, please open an issue.