Paper | Task | Dataset | Run Code | Citation | License | Contact
This is the source code of the paper CREAD: Combined Resolution of Ellipses and Anaphora in Dialogues. In this work, we propose a novel joint learning framework that models coreference resolution and query rewriting for complex, multi-turn dialogue understanding. The MuDoCo coreference resolution dataset, augmented with our query rewrite annotations, is released as well.
Given an ongoing dialogue between a user and a dialogue assistant, the model is required to predict, for the user query, both the coreference links between the query and the dialogue context and a self-contained rewritten user query that is independent of the dialogue context.
The MuDoCo dataset is a public dataset that contains 7.5k task-oriented multi-turn dialogues across 6 domains (calling, messaging, music, news, reminders, weather). Each dialogue turn is annotated with coreference links (links field). Please refer to MuDoCo for more details.
In the MuDoCo-QR-dataset used in this work, we annotate the query rewrite for each utterance, including both user and system turns. On top of the MuDoCo data format, we add three fields: graded, rewrite_required and rewritten_utterance. Most of the turns are annotated with a query rewrite (graded is true). Only the 1.4% of dialogue turns in MuDoCo that have incomplete dialogue context (e.g., missing turns) are filtered out (graded is false). rewrite_required records whether the input utterance should be rewritten or not. rewritten_utterance is the rewritten query, which is the same as the utterance if rewrite_required is false.
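The relationship between the three added fields can be sketched as a small consistency check. This is only an illustration under stated assumptions: the helper name rewrite_field_problems is hypothetical and not part of this repository, it assumes each turn is a dict with the field names described above, and it compares utterances after stripping surrounding whitespace, which may be stricter or looser than the actual annotation.

```python
def rewrite_field_problems(turn):
    """Return a list of inconsistencies in the query rewrite fields of one turn.

    Hypothetical helper for illustration; not part of the CREAD code base.
    """
    # Turns with graded == false were filtered out of the rewrite annotation,
    # so this sketch does not check their rewrite fields.
    if not turn.get("graded"):
        return []

    problems = []
    for key in ("rewrite_required", "rewritten_utterance"):
        if key not in turn:
            problems.append("missing field: " + key)
    if problems:
        return problems

    # When no rewrite is required, the rewritten query should equal the utterance.
    same = turn["rewritten_utterance"].strip() == turn["utterance"].strip()
    if not turn["rewrite_required"] and not same:
        problems.append("rewrite_required is false but rewritten_utterance differs")
    return problems


# Made-up turns used only to exercise the check.
ok_turn = {
    "utterance": "Play some jazz",
    "graded": True,
    "rewrite_required": False,
    "rewritten_utterance": "Play some jazz",
}
bad_turn = dict(ok_turn, rewritten_utterance="Play some rock")

print(rewrite_field_problems(ok_turn))
print(rewrite_field_problems(bad_turn))
```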
The resulting dataset is provided in the folder MuDoCo-QR-dataset.
{
"number": 3,
"utterance": "Show me a live version that he moonwalks on .",
"links": [
[
{
"turn_id": 1,
"text": "Michael Jackson",
"span": {
"start": 5,
"end": 20
}
},
{
"turn_id": 3,
"text": "he",
"span": {
"start": 28,
"end": 30
}
}
]
],
"graded": true,
"rewritten_utterance": "Show me a live version that Michael Jackson moonwalks on",
"rewrite_required": true
}
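To make the annotation format concrete, the following sketch checks the span offsets of the example turn and performs a naive substitution of the mention by its antecedent. It rests on assumptions read off this single example: spans are character offsets with an exclusive end, and the first element of each link cluster is treated as the antecedent. The function names are hypothetical, and the substitution is only an illustration of how the links relate to the rewrite, not the CREAD model, which generates the rewritten utterance.

```python
turn = {
    "number": 3,
    "utterance": "Show me a live version that he moonwalks on .",
    "links": [[
        {"turn_id": 1, "text": "Michael Jackson", "span": {"start": 5, "end": 20}},
        {"turn_id": 3, "text": "he", "span": {"start": 28, "end": 30}},
    ]],
    "graded": True,
    "rewritten_utterance": "Show me a live version that Michael Jackson moonwalks on",
    "rewrite_required": True,
}


def mentions_in_turn(turn):
    """Collect (antecedent, mention) pairs whose mention lies in this turn.

    Assumption: the first element of a cluster is the antecedent.
    """
    pairs = []
    for cluster in turn["links"]:
        antecedent = cluster[0]
        for mention in cluster[1:]:
            if mention["turn_id"] == turn["number"]:
                pairs.append((antecedent, mention))
    return pairs


def span_matches(utterance, mention):
    """Check that the span, read as [start, end), selects the mention text."""
    span = mention["span"]
    return utterance[span["start"]:span["end"]] == mention["text"]


def naive_substitute(turn):
    """Replace each in-turn mention with its antecedent text."""
    text = turn["utterance"]
    # Substitute from right to left so earlier offsets stay valid.
    pairs = sorted(mentions_in_turn(turn),
                   key=lambda pair: pair[1]["span"]["start"], reverse=True)
    for antecedent, mention in pairs:
        span = mention["span"]
        text = text[:span["start"]] + antecedent["text"] + text[span["end"]:]
    return text


for _, mention in mentions_in_turn(turn):
    print(mention["text"], span_matches(turn["utterance"], mention))
print(naive_substitute(turn))
```

On this example the substituted string matches rewritten_utterance except for the trailing tokenized period, so a comparison against the annotation would need some normalization.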
Python 3.6 and the packages in requirements.txt are required. Install the packages by running:
>>> pip install -r requirements.txt
Enter the modeling folder and follow the instructions below.
>>> cd modeling
First run the following command to prepare the data for training.
The processed data will be stored in the proc_data/ directory.
>>> python utils/process_data.py
Run train.sh to train the model. The script calls main.py with default hyper-parameters.
>>> bash train.sh [job_name]
The model checkpoint will be stored at checkpoint/$job_name, and the training log file at log/$job_name.log.
A reference training log (log/trained-cread.log) is provided.
Run decode.sh to decode using a trained model. job_name is the same as specified in training.
>>> bash decode.sh [job_name]
The evaluation result, with both the generated rewritten utterances and the model performance, is recorded in decode/$job_name.json.
A reference decoding file (decode/trained-cread.json) is provided.
- task: which task to perform. The default value qr-coref specifies our complete joint learning model. Set to qr for the qr-only model variant or coref for the coref-only model variant.
- coref_layer_idx: which gpt2 layers to use for coreference resolution, e.g., "1,5,11" uses three layers. Each index n is between 0 and 11 if the default gpt2-small is used.
- n_coref_head: how many attention heads to use in each layer for coreference resolution. n is between 1 and 12.
- use_coref_attn: whether to use the coref2qr attention mechanism.
- use_binary_cls: whether to use the binary rewriting classifier.
More detailed explanations of the other arguments can be found in utils/utils.py.
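The value ranges given for coref_layer_idx and n_coref_head can be sketched as a small validation step. This is an illustration only: the helper names and the defaults n_layers=12 and n_heads=12 are assumptions matching the gpt2-small ranges stated above, and the actual argument handling in utils/utils.py may differ.

```python
def parse_layer_indices(value, n_layers=12):
    """Parse a string such as "1,5,11" into a list of layer indices.

    Hypothetical helper; n_layers=12 assumes the default gpt2-small.
    """
    indices = [int(token) for token in value.split(",") if token.strip()]
    for index in indices:
        if not 0 <= index < n_layers:
            raise ValueError(
                "layer index %d is outside 0..%d" % (index, n_layers - 1))
    return indices


def check_n_coref_head(n, n_heads=12):
    """Check that the number of coreference heads lies between 1 and n_heads."""
    if not 1 <= n <= n_heads:
        raise ValueError("n_coref_head must be between 1 and %d" % n_heads)
    return n


print(parse_layer_indices("1,5,11"))
print(check_n_coref_head(4))
```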
@inproceedings{tseng-etal-2021-cread,
title = "{CREAD}: Combined Resolution of Ellipses and Anaphora in Dialogues",
author = "Tseng, Bo-Hsiang and
Bhargava, Shruti and
Lu, Jiarui and
Moniz, Joel Ruben Antony and
Piraviperumal, Dhivya and
Li, Lin and
Yu, Hong",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.naacl-main.265",
pages = "3390--3406",
}
The code in this repository is licensed according to the LICENSE file.
Please contact bht26@cam.ac.uk or hong_yu@apple.com, or raise an issue in this repository.