This repository contains data and code for the WSDM 2022 paper Knowledge Enhanced Sports Game Summarization.
In this work, we propose K-SportsSum dataset as well as the KES model.
- K-SportsSum: a dataset with 7,854 sports game summarization samples, together with a large-scale knowledge corpus covering 523 sports teams and 14k+ sports players.
- KES: a novel sports game summarization model based on mT5.
The K-SportsSum dataset is available here. You can obtain the following five files from the shared link:
- `train.json`, `val.json` and `test.json` are the core data files of K-SportsSum; each contains the live commentaries and news reports of sports games.
- `player_knowledge.json` contains background knowledge of 14,724 sports players.
- `team_knowledge.json` contains background information of 523 sports teams.
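For a quick look at the data, you can load the files with the standard library. This is a minimal sketch that only assumes each file is a single JSON document; inspect a sample before relying on any field names:

```python
import json

def load_json(path):
    """Load one of the K-SportsSum files (assumes a single JSON document;
    if the files turn out to be JSON-lines, read them line by line instead)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

train = load_json("train.json")
players = load_json("player_knowledge.json")
print(f"training samples: {len(train)}")
print(f"players with background knowledge: {len(players)}")  # expect 14,724
```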
In this Section, we introduce how to build a two-step baseline system for Sports Game Summarization. As shown in the above Figure, the baseline framework first selects important commentary sentences from the original live commentary document through a text classification model, where `C_k` denotes each commentary sentence and `C_{i_k}` denotes each selected important sentence. Then, a generative model converts each selected sentence into a news-style sentence, where `R_{i_k}` is the generated sentence corresponding to `C_{i_k}`.
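The whole pipeline fits in a few lines of pseudocode. This is a sketch only; `select` and `rewrite` are placeholders for the trained selector and rewriter described in the following subsections:

```python
def summarize(commentary_sentences, select, rewrite):
    """Two-step baseline: select important commentaries, then rewrite them.

    `select` and `rewrite` stand for the trained selector (a text
    classifier) and rewriter (a seq2seq model) described below.
    """
    # Step 1: keep the commentary sentences C_{i_k} that the selector
    # classifies as important.
    selected = [c for c in commentary_sentences if select(c)]
    # Step 2: convert each selected sentence into its news-style
    # counterpart R_{i_k}.
    news_sentences = [rewrite(c) for c in selected]
    return "".join(news_sentences)  # Chinese text needs no separating spaces
```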
The selector is a text classification model. In our work, we rely on the following toolkits:
- Chinese-Text-Classification-Pytorch: This toolkit includes training and inference code for pre-BERT text classification models, such as TextCNN.
- Bert-Chinese-Text-Classification-Pytorch: This toolkit covers BERT-era models, e.g., BERT and ERNIE.
These two toolkits are very useful for building a Chinese text classification system.
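As a concrete illustration, a BERT-based selector is simply a binary sentence classifier. The sketch below uses the Huggingface Transformers library instead of the toolkits above; `path/to/selector-bert` is a placeholder for your own bert-base-chinese checkpoint fine-tuned on the pseudo labels:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# "path/to/selector-bert" is a placeholder for a bert-base-chinese
# checkpoint fine-tuned as a binary classifier (important vs. not).
tokenizer = BertTokenizer.from_pretrained("path/to/selector-bert")
model = BertForSequenceClassification.from_pretrained("path/to/selector-bert", num_labels=2)
model.eval()

def select(commentary: str) -> bool:
    """Return True if the selector predicts the commentary is important."""
    inputs = tokenizer(commentary, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1
```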
The rewriter is a generative model. Existing works typically employ PTGen (See et al., ACL 2017), mBART, mT5, etc.
For PTGen, the pointer_summarizer toolkit is widely used. I also recommend the implementation released by Xiachong Feng in his dialogue summarization work. Both implementations are convenient. Please note that if you choose PTGen as the rewriter, you should use a pre-trained word embedding to help the model achieve good performance (Chinese-Word-Vectors is helpful).
For mBART, mT5, etc., we use the implementations of the Huggingface Transformers library. I release the corresponding training and inference code for public use (see `rewriter.py`; mBART-50 is used as the rewriter in this code).
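For reference, generating a news-style sentence with an mBART-50 checkpoint via Transformers looks roughly like this. It is a sketch, not the code in `rewriter.py`; `path/to/finetuned-mbart50`, the beam size and the length limits are all assumptions:

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# "path/to/finetuned-mbart50" is a placeholder for an mBART-50 checkpoint
# fine-tuned on (commentary sentence, news sentence) pairs; both the
# source and the target are Chinese here.
model_path = "path/to/finetuned-mbart50"
tokenizer = MBart50TokenizerFast.from_pretrained(model_path, src_lang="zh_CN", tgt_lang="zh_CN")
model = MBartForConditionalGeneration.from_pretrained(model_path)
model.eval()

def rewrite(commentary: str) -> str:
    """Convert one selected commentary sentence into a news-style sentence."""
    inputs = tokenizer(commentary, truncation=True, max_length=256, return_tensors="pt")
    generated = model.generate(
        **inputs,
        num_beams=4,
        max_length=128,
        # mBART-50 decoders must start with the target language token.
        forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"],
    )
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```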
Requirements for rewriter.py: pytorch-lightning 0.8.5; transformers >= 4.4; torch >= 1.7
(rewriter.py is based on the Longformer code from AI2.)
For training, you can run commands like this:
python rewriter.py --device_id 0
For evaluation, the command may look like this:
python rewriter.py --device_id 0 --test
Note that if you want to run inference with a trained model, remember to initialize the model with the corresponding `.ckpt` file.
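In PyTorch Lightning, that typically means restoring the module from the checkpoint instead of building it from scratch. A two-line sketch, where `Rewriter` is a placeholder name for whatever LightningModule class `rewriter.py` actually defines:

```python
from rewriter import Rewriter  # placeholder: the LightningModule defined in rewriter.py

# load_from_checkpoint restores hyperparameters and weights saved during training.
model = Rewriter.load_from_checkpoint("checkpoints/rewriter.ckpt")
model.eval()
```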
In order to construct training samples for the selector and rewriter, we should map each news sentence to its corresponding commentary sentence, if possible. (If you do not understand this, please see more details in Section 3.1 of SportsSum2.0.)
Thus, the core of this process is calculating ROUGE scores and BERTScore between two Chinese sentences.
- ROUGE: You can use the multilingual_rouge_scoring toolkit to calculate Chinese ROUGE scores. Note that the py-rouge and rouge toolkits are not suitable for Chinese.
- BERTScore: Please find more details in bert_score; a minimal mapping sketch follows this list.
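To make the mapping concrete, here is a BERTScore-based sketch. The toy sentences and the 0.6 threshold are assumptions for illustration; the actual pseudo-label construction also uses ROUGE, whose computation depends on the multilingual_rouge_scoring API:

```python
from bert_score import score

news_sentence = "武磊在第30分钟推射破门。"  # toy example
commentary_sentences = ["第30分钟，武磊接队友直塞推射入网！", "裁判向犯规球员出示黄牌。"]

# Score the news sentence against every commentary sentence and map it to
# the best-scoring one if the score clears a threshold.
cands = commentary_sentences
refs = [news_sentence] * len(commentary_sentences)
P, R, F1 = score(cands, refs, lang="zh", verbose=False)

best = F1.argmax().item()
if F1[best].item() > 0.6:  # the threshold is an assumption; tune it on your data
    print(f"mapped to commentary {best}: {commentary_sentences[best]}")
```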
Code will be published once the author of this repo has time.
To facilitate researchers to efficiently comprehend and follow the Sports Game Summarization task, we write a Chinese survey post: 《体育赛事摘要任务概览》 (An Overview of the Sports Game Summarization Task), where we also discuss some future directions and share our thoughts.
We list and classify existing works of Sports Game Summarization:
| Paper | Conference/Journal | Data/Code | Category |
| --- | --- | --- | --- |
| Towards Constructing Sports News from Live Text Commentary | ACL 2016 | - | Dataset, Ext. |
| Overview of the NLPCC-ICCPOL 2016 Shared Task: Sports News Generation from Live Webcast Scripts | NLPCC 2016 | NLPCC 2016 shared task | Dataset |
| Research on Summary Sentences Extraction Oriented to Live Sports Text | NLPCC 2016 | - | Ext. |
| Sports News Generation from Live Webcast Scripts Based on Rules and Templates | NLPCC 2016 | - | Ext.+Temp. |
| Content Selection for Real-time Sports News Construction from Commentary Texts | INLG 2017 | - | Ext. |
| Generate Football News from Live Webcast Scripts Based on Character-CNN with Five Strokes | 2020 | - | Ext.+Temp. |
| Generating Sports News from Live Commentary: A Chinese Dataset for Sports Game Summarization | AACL 2020 | SportsSum | Dataset, Ext.+Abs. |
| SportsSum2.0: Generating High-Quality Sports News from Live Text Commentary | CIKM 2021 | SportsSum2.0 | Dataset, Ext.+Abs. |
| Knowledge Enhanced Sports Game Summarization | WSDM 2022 | K-SportsSum | Dataset, Ext.+Abs. |
The concepts used in Category are explained as follows:
- `Dataset`: The work contributes a dataset for sports game summarization.
- `Ext.`: Extractive sports game summarization method.
- `Ext.+Temp.`: The method first extracts important commentary sentences and then uses human-labeled templates to convert each commentary sentence into a news sentence.
- `Ext.+Abs.`: The method first extracts important commentary sentences and then uses a seq2seq model to convert each commentary sentence into a news sentence.
Q1: What are the differences among SportsSum, SportsSum2.0, SGSum and K-SportsSum?
A1: SportsSum (Huang et al., AACL 2020) is the first large-scale sports game summarization dataset, with 5,428 samples. Despite its wonderful contribution, about 15% of its samples are noisy. Thus, SportsSum2.0 (Wang et al., CIKM 2021) cleans the original SportsSum and obtains 5,402 samples (26 bad samples in SportsSum are removed). Following previous works, SGSum (a non-archival paper, not formally published) collects and cleans a large amount of data from massive games, resulting in 7,854 samples. K-SportsSum (Wang et al., WSDM 2022) shuffles and randomly re-splits SGSum. Furthermore, K-SportsSum provides a large-scale knowledge corpus about sports teams and players, which can help alleviate the knowledge gap issue (see the K-SportsSum paper).
Q2: There is little public code for sports game summarization.
A2: Yeah, I know. All existing works follow the pipeline paradigm to build sports game summarization systems; they may have two or three steps together with a pseudo-label construction process, so the code tends to be messy. As a remedy, we 1) release a tutorial for building a two-step baseline for sports game summarization (see Section 2 on this page); 2) are building an end-to-end model for public use (work in progress; it may be published in 2022, but there is no guarantee).
Q3: About position embedding in mT5.
A3: The position embedding of mT5 is a zero vector because mT5 uses relative position embeddings inside self-attention rather than absolute position embeddings.
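You can check this directly with the Transformers library. A small sketch that inspects the model's structure; the attribute paths follow the T5/mT5 implementation in current Transformers versions:

```python
from transformers import MT5EncoderModel

model = MT5EncoderModel.from_pretrained("google/mt5-small")

# The only input embedding is the shared token embedding; there is no
# absolute position-embedding table.
print(type(model.shared))  # <class 'torch.nn.modules.sparse.Embedding'>

# Relative position information lives inside self-attention: a learned
# bias table attached to the first encoder block.
attn = model.encoder.block[0].layer[0].SelfAttention
print(attn.relative_attention_bias)  # Embedding(num_buckets, num_heads)
```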
Q4: Any questions and suggestions?
A4: Please feel free to contact me (jawang1[at]suda.edu.cn).
Jiaan Wang would like to thank KW Lab, Fudan Univ. and iFLYTEK AI Research, Suzhou for their helpful discussions and GPU device support.
If you find this project useful or use the data in your work, please consider citing our paper:
@article{Wang2022KnowledgeES,
title={Knowledge Enhanced Sports Game Summarization},
author={Jiaan Wang and Zhixu Li and Tingyi Zhang and Duo Zheng and Jianfeng Qu and An Liu and Lei Zhao and Zhigang Chen},
journal={Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining},
year={2022}
}