We present an efficient BERT-based multi-task (MT) framework that is particularly suitable for iterative and incremental development of tasks. Unlike conventional multi-task learning, where the tasks are coupled through joint training, in our framework the tasks are independent of each other and can be updated on a per-task basis. One key advantage of our framework is that the owner of a task does not need to coordinate with other task owners in order to update the model for that task, and any modification made to that task does not interfere with the rest of the tasks.
We recommend installing all relevant packages in a virtual environment:
conda create -n centra-bert --file requirements.txt python=3.6
conda activate centra-bert
The proposed framework is based on the idea of partial fine-tuning, i.e. only the top layers of BERT are fine-tuned while the remaining layers are kept frozen; a minimal sketch of this idea follows the list below. The framework features a pipeline that consists of three steps:
- Single-task partial fine-tuning
- Single-task knowledge-distillation
- Model merging
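To make the idea of partial fine-tuning concrete, here is a minimal sketch (our own illustration, not the library's code) of how only the top L encoder layers plus the task head could be selected for training in a TF 1.x BERT graph. The scope names bert/encoder/layer_N follow the standard BERT checkpoint layout; the task-head scope name is an assumption.
# Minimal sketch of partial fine-tuning (illustration only, not the library's code).
# Only the variables of the top `num_fine_tuned` encoder layers and the task head
# are handed to the optimizer; everything below stays frozen.
import tensorflow as tf  # TF 1.x style API

def partially_trainable_vars(num_total_layers=12, num_fine_tuned=4, task_scope="rte"):
    """Return the variables of the top `num_fine_tuned` BERT layers plus the task head."""
    tuned_prefixes = tuple(
        "bert/encoder/layer_%d/" % i
        for i in range(num_total_layers - num_fine_tuned, num_total_layers)
    )
    selected = []
    for var in tf.trainable_variables():
        if var.op.name.startswith(task_scope) or any(p in var.op.name for p in tuned_prefixes):
            selected.append(var)
    return selected

# train_op = tf.train.AdamOptimizer(2e-5).minimize(loss, var_list=partially_trainable_vars())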
In what follows, we demonstrate the functionality of this library on two GLUE tasks, MRPC and RTE.
Before proceeding, we need to download the task corpora and convert their format. This can be achieved with the following script:
python convert_data_format.py --task=rte --input_dir=data/glue --output_dir=data/glue
python convert_data_format.py --task=mrpc --input_dir=data/glue --output_dir=data/glue
Next, we need to create a task config file conf/glue_task_config.cfg to specify the meta information for each task, including the task name, task type, corpus path, etc.
[rte_conf]
task_name = rte
task_type = classification
input_file = data/glue/rte/train_json_format.txt,data/glue/rte/dev_json_format.txt,data/glue/rte/test_json_format.txt
max_seq_length = 128
output_method = cls
is_eng = True
[mrpc_conf]
task_name = mrpc
task_type = classification
input_file = data/glue/mrpc/train_json_format.txt,data/glue/mrpc/dev_json_format.txt,data/glue/mrpc/test_json_format.txt
max_seq_length = 128
output_method = cls
is_eng = True
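Since the task config uses standard INI syntax, it can be quickly inspected with Python's configparser; this is purely for illustration and is not how the library itself loads the file.
# Quick sanity check of the task config (illustration only).
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("conf/glue_task_config.cfg")
for section in cfg.sections():          # rte_conf, mrpc_conf
    task = cfg[section]
    print(task["task_name"], task["task_type"], task.getint("max_seq_length"))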
In the first step, we partially fine-tune an independent copy of BERT for each task. The exact number of layers L to fine-tune may vary across tasks. We propose experimenting with different values of L for each task and selecting the best one according to some predefined criterion. The following code snippet (see also shell/fine_tuning.sh) trains a number of models for the RTE task with different hyper-parameters.
#!/usr/bin/env bash
# General param
bert_config_file=conf/uncased_bert_base/bert_config.json
vocab_file=conf/uncased_bert_base/vocab.txt
output_dir=model/glue/teacher
init_checkpoint=model/uncased_bert_base/bert_model.ckpt
task_config=conf/glue_task_config.cfg
num_train_epoch=10
train_batch_size=64
gpu_id=2
# Current task
task=rte
# Hyper param, separated by commas
learning_rate=2e-5,5e-5
fine_tuning_layers=4,5,6,7,8,9,10
# Number of repetitions for each hyper parameter
exam_num=3
for lr in ${learning_rate//,/ }
do
for layers in ${fine_tuning_layers//,/ }
do
for i in $(seq 1 ${exam_num})
do
python fine_tuning.py \
--bert_config_file=${bert_config_file} \
--vocab_file=${vocab_file} \
--output_dir=${output_dir} \
--init_checkpoint=${init_checkpoint} \
--task_config=${task_config} \
--available_tasks=${task} \
--current_task=${task} \
--ex_idx=${i} \
--num_train_epoch=${num_train_epoch} \
--train_batch_size=${train_batch_size} \
--learning_rate=${lr} \
--fine_tuning_layers=${layers} \
--gpu_id=${gpu_id}
done
done
done
# Result summary
python result_summary.py \
--output_dir=${output_dir} \
--task=${task} \
--learning_rate=${learning_rate} \
--fine_tuning_layers=${fine_tuning_layers} \
--exam_num=${exam_num} \
--dev=True \
--version=teacher
When training is complete, the log file model/glue/teacher/rte/summary.txt contains information on the model with the best dev result:
Best metrics: 89.77, best checkpoint: model/glue/teacher/mrpc/Lr-2e-05-Layers-8/ex-3/best_checkpoint/1623838640/model.ckpt-570
This is the teacher model that will be compressed in the next step.
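If the best checkpoint path needs to be picked up programmatically (for example, to paste into the distillation script below), a small helper of our own can extract it from the summary line shown above; the line format is taken from that example.
# Helper of our own (not part of the library) to extract the best checkpoint path
# from a summary.txt line such as the one shown above.
import re

def best_checkpoint(summary_path):
    with open(summary_path) as f:
        for line in f:
            match = re.search(r"best checkpoint:\s*(\S+)", line)
            if match:
                return match.group(1)
    return None

print(best_checkpoint("model/glue/teacher/rte/summary.txt"))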
In this step, we compress the L fine-tuned layers of the teacher model into a smaller module of l layers. The following snippet (see also shell/distill.sh) trains three student models for each l in {1, 2, 3}. The training process is essentially the same as in the previous step; the only difference is that we need to specify the path to the teacher model that is going to be distilled. A generic sketch of a distillation loss is shown after the script and its result summary below.
#!/usr/bin/env bash
# General param
bert_config_file=conf/uncased_bert_base/bert_config.json
vocab_file=conf/uncased_bert_base/vocab.txt
output_dir=model/glue/student
task_config=conf/glue_task_config.cfg
num_train_epoch=10
train_batch_size=64
gpu_id=6
# Teacher info
teacher_fine_tuning_layers=9
best_teacher_checkpoint=model/glue/teacher/rte/Lr-2e-05-Layers-9/ex-3/best_checkpoint/1623902794/model.ckpt-380
# Current task
task=rte
# Hyper param, separated by commas
learning_rate=2e-5
fine_tuning_layers=1,2,3
# Number of repetitions for each hyper parameter
exam_num=3
for lr in ${learning_rate//,/ }
do
for layers in ${fine_tuning_layers//,/ }
do
for i in $(seq 1 ${exam_num})
do
python distill.py \
--bert_config_file=${bert_config_file} \
--vocab_file=${vocab_file} \
--output_dir=${output_dir} \
--best_teacher_checkpoint=${best_teacher_checkpoint} \
--teacher_fine_tuning_layers=${teacher_fine_tuning_layers} \
--task_config=${task_config} \
--available_tasks=${task} \
--current_task=${task} \
--ex_idx=${i} \
--num_train_epoch=${num_train_epoch} \
--train_batch_size=${train_batch_size} \
--learning_rate=${lr} \
--student_fine_tuning_layers=${layers} \
--gpu_id=${gpu_id}
done
done
done
# Result summary
python result_summary.py \
--output_dir=${output_dir} \
--task=${task} \
--learning_rate=${learning_rate} \
--fine_tuning_layers=${fine_tuning_layers} \
--exam_num=${exam_num} \
--dev=True \
--keep_layers=$((12-teacher_fine_tuning_layers)) \
--version=student
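For intuition, a generic soft-target distillation objective is sketched below; whether distill.py uses exactly this loss (rather than, say, an additional hidden-state matching term) is an assumption on our part.
# Generic Hinton-style distillation loss (sketch only; the exact objective used
# by distill.py may differ).
import tensorflow as tf  # TF 1.x style API

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a soft-target term (teacher vs. student) with the usual hard-label loss."""
    soft_targets = tf.stop_gradient(tf.nn.softmax(teacher_logits / temperature))
    soft_loss = tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=soft_targets, logits=student_logits / temperature) * temperature ** 2
    hard_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=student_logits)
    return tf.reduce_mean(alpha * soft_loss + (1.0 - alpha) * hard_loss)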
In the final step, we merge the single-task models into one multi-task model. To do this, we need to specify in the config file conf/branch.cfg which checkpoint to load and which layers are fine-tuned for each task:
[ckpt_conf]
mrpc = model/glue/student/mrpc/Lr-2e-05-Layers-4-2/ex-2/best_checkpoint/1623900741/model.ckpt-572
rte = model/glue/student/rte/Lr-2e-05-Layers-3-2/ex-3/best_checkpoint/1623910751/model.ckpt-382
[layer_conf]
mrpc = 5,6
rte = 4,5
Then we run the script shell/merge.sh to merge all task branches:
#!/usr/bin/env bash
# General param
bert_config_file=conf/uncased_bert_base/bert_config.json
vocab_file=conf/uncased_bert_base/vocab.txt
output_dir=model/glue/merge
init_checkpoint=model/uncased_bert_base/bert_model.ckpt
task_config=conf/glue_task_config.cfg
branch_config=conf/branch.cfg
gather_from_student=True
gpu_id=3
input_file=data/glue/tmp_input_file.txt
# Current tasks
available_tasks=mrpc,rte
python merge_branch.py \
--bert_config_file=${bert_config_file} \
--vocab_file=${vocab_file} \
--output_dir=${output_dir} \
--init_checkpoint=${init_checkpoint} \
--task_config=${task_config} \
--branch_config=${branch_config} \
--available_tasks=${available_tasks} \
--gather_from_student=${gather_from_student} \
--gpu_id=${gpu_id} \
--input_file=${input_file}
Basically, this script iteratively adds task branches to a frozen backbone model and saves a checkpoint file at each intermediate step. The checkpoint in the last task's directory (in our example, model/glue/merge/rte) contains the final merged multi-task model.
Assume that we have a merged multi-task model containing three task branches: mrpc, rte and mnli. To remove a branch, e.g. rte, we run the following script:
#!/usr/bin/env bash
# General param
bert_config_file=conf/uncased_bert_base/bert_config.json
vocab_file=conf/uncased_bert_base/vocab.txt
output_dir=model/glue/merge
init_checkpoint=somewhere/merged_model/model.ckpt
task_config=conf/glue_task_config.cfg
branch_config=conf/branch.cfg
gather_from_student=True
gpu_id=3
input_file=data/glue/tmp_input_file.txt
# after removing rte, only mrpc and mnli remain
available_tasks=mrpc,mnli
python update.py \
--bert_config_file=${bert_config_file} \
--vocab_file=${vocab_file} \
--output_dir=${output_dir} \
--init_checkpoint=${init_checkpoint} \
--task_config=${task_config} \
--branch_config=${branch_config} \
--available_tasks=${available_tasks} \
--gather_from_student=${gather_from_student} \
--gpu_id=${gpu_id} \
--input_file=${input_file}
Note that in the script above we did not explicitly specify which task to remove. Instead, the available_tasks argument specifies which tasks to keep.
Assume that we have a merged multi-task model containing three task branches: mrpc, rte and mnli. The following snippet adds a new task (qnli) to this model:
#!/usr/bin/env bash
# General param
bert_config_file=conf/uncased_bert_base/bert_config.json
vocab_file=conf/uncased_bert_base/vocab.txt
output_dir=model/glue/merge
init_checkpoint=somewhere/merged_model/model.ckpt
task_config=conf/glue_task_config.cfg
branch_config=conf/branch.cfg # A new branch config which contains qnli's fine-tuning layers
gather_from_student=True
gpu_id=3
input_file=data/glue/tmp_input_file.txt
# Current tasks, including the new task qnli
available_tasks=mrpc,rte,mnli,qnli
# new task info
update_checkpoint=somewhere/qnli/model.ckpt
update_scope=qnli
python update.py \
--bert_config_file=${bert_config_file} \
--vocab_file=${vocab_file} \
--output_dir=${output_dir} \
--init_checkpoint=${init_checkpoint} \
--update_checkpoint=${update_checkpoint} \
--update_scope=${update_scope} \
--task_config=${task_config} \
--branch_config=${branch_config} \
--available_tasks=${available_tasks} \
--gather_from_student=${gather_from_student} \
--gpu_id=${gpu_id} \
--input_file=${input_file}
The following snippet updates the task branch for task rte:
#!/usr/bin/env bash
# General param
bert_config_file=conf/uncased_bert_base/bert_config.json
vocab_file=conf/uncased_bert_base/vocab.txt
output_dir=model/glue/merge
init_checkpoint=somewhere/merged_model/model.ckpt
task_config=conf/glue_task_config.cfg
branch_config=conf/branch.cfg # A new branch config if you change rte's fine tuning layers
gather_from_student=True
gpu_id=3
input_file=data/glue/tmp_input_file.txt
# Current tasks
available_tasks=mrpc,rte
# Updated task info
update_checkpoint=somewhere/new-rte/model.ckpt
update_scope=rte
python update.py \
--bert_config_file=${bert_config_file} \
--vocab_file=${vocab_file} \
--output_dir=${output_dir} \
--init_checkpoint=${init_checkpoint} \
--update_checkpoint=${update_checkpoint} \
--update_scope=${update_scope} \
--task_config=${task_config} \
--branch_config=${branch_config} \
--available_tasks=${available_tasks} \
--gather_from_student=${gather_from_student} \
--gpu_id=${gpu_id} \
--input_file=${input_file}
The merged model is generated as follows:
- Build the graph, which contains a frozen part and several fine-tuned parts, according to the layer_conf section in branch.cfg
- Load parameters from init_checkpoint to initialize the frozen 'backbone' model
- Load parameters within the scope update_scope from update_checkpoint to initialize the new task branch or reinitialize an existing task branch
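As a rough sketch of the scoped loading described above (our illustration, assuming TF 1.x graph-mode code and matching variable names between the graph and the checkpoints):
# Rough sketch of scope-based initialization (illustration only).
import tensorflow as tf  # TF 1.x style API

def init_merged_model(init_checkpoint, update_checkpoint, update_scope):
    """Backbone variables come from init_checkpoint; variables under
    `update_scope` come from update_checkpoint. Names are assumed to match."""
    all_vars = tf.global_variables()
    backbone = [v.op.name for v in all_vars if not v.op.name.startswith(update_scope)]
    branch = [v.op.name for v in all_vars if v.op.name.startswith(update_scope)]
    tf.train.init_from_checkpoint(init_checkpoint, {name: name for name in backbone})
    tf.train.init_from_checkpoint(update_checkpoint, {name: name for name in branch})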
One can use the following script to plot the task performance under different hyper-parameter settings:
output_dir=model/glue/teacher
task=mrpc
learning_rate=2e-5,1e-4
fine_tuning_layers=6,8
exam_num=3
# key_param=fine_tuning_layers
key_param=learning_rate
python result_summary.py \
--job=plot \
--key_param=${key_param} \
--output_dir=${output_dir} \
--task=${task} \
--learning_rate=${learning_rate} \
--fine_tuning_layers=${fine_tuning_layers} \
--exam_num=${exam_num} \
--dev=True \
--version=teacher