English · 简体中文
We are team BERloomers, ranked 9th in the preliminary round and 15th in the semifinal.
Members: KennyWu (kennywu96@163.com), null_li (lizx3845@163.com).
This open source solution was redesigned after the competition, and the offline validation set F1 is about 0.7887.
We welcome your comments and corrections.
Contestants must correctly determine whether two texts match. The data is split into two files, A and B, each with its own matching criterion, and both files cover three sub-tasks: short-short, short-long, and long-long text matching.
File A uses a broader criterion: two passages match if they discuss the same topic. File B is stricter: two passages match only if they describe the same event.
```
# A short-short sample
{
    "source": "小艺的故事让爱回家2021年2月16日大年初五19:30带上你最亲爱的人与团团君相约《小艺的故事》直播间!",
    "target": " 香港代购了不起啊,宋点卷竟然在直播间“炫富”起来",
    "labelA": "0"
}

# B short-short sample
{
    "source": "让很多网友好奇的是,张柏芝在一小时后也在社交平台发文:“给大家拜年啦。”还有网友猜测:谢霆锋的经纪人发文,张柏芝也发文,并且配图,似乎都在证实,谢霆锋依旧和王菲在一起,而张柏芝也有了新的恋人,并且生了孩子,两人也找到了各自的归宿,有了自己的幸福生活,让传言不攻自破。",
    "target": " 陈晓东谈旧爱张柏芝,一个口误暴露她的秘密,难怪谢霆锋会离开她",
    "labelB": "0"
}
```
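For reference, the samples above can be read line by line as JSON objects. The sketch below assumes each data file stores one such object per line; the file names and paths are hypothetical placeholders, not the competition's actual layout.

```python
import json

def load_pairs(path, label_key):
    """Read one data file where each line is a JSON object with
    "source", "target", and either "labelA" or "labelB"."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            pairs.append((obj["source"], obj["target"], int(obj[label_key])))
    return pairs

# Hypothetical file names; the real paths depend on the competition data release.
pairs_a = load_pairs("data/short_short_a.jsonl", "labelA")
pairs_b = load_pairs("data/short_short_b.jsonl", "labelB")
```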
In order to learn as much as possible from the data while accounting for criteria A and B as well as the three sub-task splits, our solution is built on a multi-task learning framework: some parameters are shared for representation learning, and task-specific classifiers are then designed for label prediction.
The framework is an interaction-based model built on BERT, which encodes each source-target pair into vector representations. The overall structure of the solution is shown in the figure below:
In this solution, the hidden states from the last 3 BERT layers are used for learning the downstream tasks. Moreover, since the competition is divided into 6 sub-tasks, we introduce the concept of Special Tokens.
- Six Type Tokens are proposed to guide the representation learning of the text:

Token | Task type |
---|---|
SSA | short-short A |
SSB | short-short B |
SLA | short-long A |
SLB | short-long B |
LLA | long-long A |
LLB | long-long B |

- `[<S>]` and `[</S>]` mark the boundaries of the source, and `[<T>]` and `[</T>]` mark the boundaries of the target. (The corresponding special tokens have been added to the vocab.txt file under models/*.)
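As an illustration of how such an input sequence might be assembled, here is a minimal sketch using the HuggingFace tokenizer. The bracketed form of the Type Tokens, the input template, and the model path are assumptions for illustration, not the verbatim implementation in this repository.

```python
from transformers import BertTokenizer

# Type Tokens and boundary markers; the bracketed spelling of the Type Tokens
# is an assumption, the markers follow the README. In this repo the tokens are
# already present in vocab.txt under models/*, so adding them here is a no-op there.
TYPE_TOKENS = ["[SSA]", "[SSB]", "[SLA]", "[SLB]", "[LLA]", "[LLB]"]
MARKERS = ["[<S>]", "[</S>]", "[<T>]", "[</T>]"]

tokenizer = BertTokenizer.from_pretrained("models/bert-wwm-ext")  # hypothetical local path
tokenizer.add_special_tokens({"additional_special_tokens": TYPE_TOKENS + MARKERS})

def build_input(source, target, type_token, max_length=512):
    """Prepend the task Type Token and wrap source/target with their markers."""
    text = f"{type_token} [<S>] {source} [</S>] [<T>] {target} [</T>]"
    return tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")

encoded = build_input("小艺的故事让爱回家...", "香港代购了不起啊...", "[SSA]")
```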
In order to better learn the representation of the Type Token and to assist the learning of the Task-Attentive Classifier, we propose an auxiliary data-type prediction task: based on the representation of the Type Token, the model predicts which task type the current input belongs to.

Adhering to the philosophy of "harmony in diversity," tasks A and B each employ an independent Task-Attentive Classifier. Meanwhile, the representation of the Type Token is passed into the classifier as additional conditional information for attention computation, yielding type-specific features for label prediction.
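The following is a simplified PyTorch sketch of what a classifier conditioned on the Type Token representation could look like: the Type Token representation acts as the attention query over the token states (for example, an average of the last 3 BERT layers). Layer sizes, names, and details are illustrative assumptions rather than the exact module used in this solution.

```python
import torch
import torch.nn as nn

class TaskAttentiveClassifier(nn.Module):
    """Sketch of an attention-based classifier conditioned on the Type Token."""

    def __init__(self, hidden_size, num_labels=2):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)  # projects the Type Token repr. into a query
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states, type_repr, attention_mask):
        # token_states: (batch, seq_len, hidden), e.g. an average of the last 3 BERT layers
        # type_repr:    (batch, hidden), the hidden state at the Type Token position
        q = self.query(type_repr).unsqueeze(1)                    # (batch, 1, hidden)
        k = self.key(token_states)                                # (batch, seq_len, hidden)
        v = self.value(token_states)
        scores = torch.matmul(q, k.transpose(-1, -2)) / k.size(-1) ** 0.5
        scores = scores.masked_fill(attention_mask.unsqueeze(1) == 0, -1e9)
        attn = torch.softmax(scores, dim=-1)                      # (batch, 1, seq_len)
        context = torch.matmul(attn, v).squeeze(1)                # (batch, hidden)
        return self.out(context)                                  # task-specific logits
```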
The training data used in this plan includes both the training set provided for the semifinals and all the data provided in the preliminary round. The validation set provided for the semifinals is used to evaluate the performance of the model.
Based on the F1 scores of the three models on the offline validation set, different weights were assigned to them, and the optimal weight combination was found through an automatic search to achieve the best offline performance.
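As a rough illustration of such an automatic search, the sketch below brute-forces weight combinations (summing to 1) over the three models' predicted probabilities and keeps the combination with the best validation F1. The function name, step size, and thresholding at 0.5 are assumptions, not the exact search used here.

```python
import itertools
import numpy as np
from sklearn.metrics import f1_score

def search_weights(probs_list, labels, step=0.05):
    """Grid-search ensemble weights for three models' positive-class probabilities,
    maximizing F1 on the offline validation set."""
    best_w, best_f1 = None, -1.0
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for w1, w2 in itertools.product(grid, grid):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:          # skip combinations that exceed a total weight of 1
            continue
        w3 = max(w3, 0.0)
        blended = w1 * probs_list[0] + w2 * probs_list[1] + w3 * probs_list[2]
        f1 = f1_score(labels, (blended >= 0.5).astype(int))
        if f1 > best_f1:
            best_w, best_f1 = (w1, w2, w3), f1
    return best_w, best_f1
```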
In this solution, WoBERT is loaded via torchKbert, a customized PyTorch-based framework for loading pre-trained models created by team member KennyWu. We also thank Zhuiyi Technology for open-sourcing WoBERT.
Model | Link |
---|---|
BERT-wwm-ext | https://github.com/ymcui/Chinese-BERT-wwm |
RoBERTa-wwm-ext | https://github.com/ymcui/Chinese-BERT-wwm |
WoBERT | https://github.com/ZhuiyiTechnology/WoBERT |
The open-source solution improves on our competition submission in three aspects: data split, model architecture, and model ensemble.
- Data split: the training set is expanded from the semifinal training set to the semifinal training set plus all the data from the preliminary round.
- Model architecture: the network structure has been redesigned to improve the task-specific encoding.
- Model ensemble: the plan submitted in the semifinal fused `BERT-wwm-ext` and `ERNIE-1.0`; this solution instead fuses `BERT-wwm-ext`, `RoBERTa-wwm-ext`, and `WoBERT`.
Our preparation for the semifinal was somewhat hasty, and the plan we submitted had many shortcomings, so after the competition we reviewed the entire process and improved the solution. Although we did not make it into the Top 10 in the semifinal, this competition was still a valuable experience for us.