This repository contains the data and code for the following paper:
EmailSum: Abstractive Email Thread Summarization
@inproceedings{zhang2021emailsum,
title={EmailSum: Abstractive Email Thread Summarization},
author={Zhang, Shiyue and Celikyilmaz, Asli and Gao, Jianfeng and Bansal, Mohit},
booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
year={2021}
}
We only release the summaries we collected and provide scripts to extract email threads from the flat email corpora (Avocado or W3C), because the Avocado corpus is copyright-protected by the Linguistic Data Consortium (LDC).
- Python 3
- Python packages listed in requirements.txt
- Download the Avocado Research Email Collection (LDC2015T03) from LDC
We collected summaries for 2,549 Avocado email threads (see Avocado/summaries/EmailSum_data.json). After submission, we collected one more reference for each of the 500 email threads in the test set (see Avocado/summaries/one_more_reference.json).
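For a quick sanity check of the released files, a minimal sketch like the one below can be used; it only assumes the two files are valid JSON and makes no assumption about the per-entry schema.

```python
import json

# Load the released summary annotations (paths as described above).
with open("Avocado/summaries/EmailSum_data.json") as f:
    emailsum_data = json.load(f)
with open("Avocado/summaries/one_more_reference.json") as f:
    extra_refs = json.load(f)

# Counts should roughly match the 2,549 annotated threads and the 500 extra
# test references mentioned above (assuming one entry per thread).
print(len(emailsum_data), "entries in EmailSum_data.json")
print(len(extra_refs), "entries in one_more_reference.json")

# Peek at one top-level item/key without assuming a particular structure.
print("first top-level item:", str(next(iter(emailsum_data)))[:80])
```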
- First, cd Avocado/
- Download "emails.json" from here and put it under Avocado/
- Extract threads, assuming $ROOT_DIR contains LDC2015T03 (i.e., $ROOT_DIR/LDC2015T03/Data/avocado-1.0.2)
python extract_threads.py --root_dir $ROOT_DIR
You will get "Avocado.json" which contains all extracted threads.
- Anonymize & Filter
python anonymize.py
After this step, you can find the cleaned threads under "Avocado_threads/".
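The actual cleaning and anonymization rules live in anonymize.py; purely as an illustration of the kind of regex-based replacement such a step performs (these are not the repository's actual rules or placeholder tokens), a simplified sketch is:

```python
import re

# Illustrative only: replace email addresses and phone-number-like strings
# with placeholder tokens. anonymize.py applies its own, more extensive set
# of cleaning and anonymization rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b(?:\+?\d[\s\-.]?){6,13}\d\b")

def anonymize_text(text: str) -> str:
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

print(anonymize_text("Contact jane.doe@example.com or call 555-123-4567."))
# -> Contact <EMAIL> or call <PHONE>.
```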
- Prepare Train/Dev/Test files
python standardize.py
After this step, you can find the experimental files under "exp_data/". There are two sub-directories, "data_email_short" and "data_email_long", for the short and long summaries, respectively. Each line of a *.source file is one email thread, in which the individual emails are separated by "|||".
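Since each line of a *.source file is one thread with emails joined by "|||", a thread can be split back into its individual emails as follows (the file name "train.source" inside the sub-directory is an assumption for illustration):

```python
# Minimal sketch: read one source file of the short-summary setting and split
# each thread back into its individual emails on the "|||" separator.
path = "Avocado/exp_data/data_email_short/train.source"  # assumed file name

with open(path, encoding="utf-8") as f:
    for line in f:
        emails = [e.strip() for e in line.rstrip("\n").split("|||")]
        print(f"thread with {len(emails)} emails; first email starts with:",
              emails[0][:60])
        break  # only inspect the first thread
```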
We also provide the code for extracting threads from the W3C email corpus for semi-supervised learning.
- First, cd "W3C/"
- Download raw data files from here and put them under "W3C/raw_data/"
- Extract threads
python extract_threads.py
You will get "W3C.json" which contains all extracted threads.
- Anonymize & Filter
python anonymize.py
After this step, you can find all the cleaned threads under "W3C_threads/".
- Python 3
- PyTorch 1.7, transformers==2.11.0
- Download pre-trained models from here, decompress, and put them under "train/".
Note that we conduct model selection for each metric, so there are multiple best checkpoints; e.g., "checkpoint-rouge1" is the checkpoint with the best ROUGE-1 score on the development set. "best_ckpt.json" contains the best scores on the development set.
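To see the selected dev scores at a glance, "best_ckpt.json" can be inspected with a small sketch like the one below (the exact location after decompression and the key names inside the file are assumptions; we only rely on it being a flat JSON mapping):

```python
import json

# Print the best development-set scores recorded during model selection.
# Adjust the path to wherever best_ckpt.json sits after decompressing the
# downloaded models; key names are whatever the training code wrote.
with open("train/best_ckpt.json") as f:  # path is an assumption
    best_scores = json.load(f)

for metric, score in best_scores.items():
    print(f"{metric}: {score}")
```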
- Prepare data
After you get "Avocado/exp_data/data_email_short" and "Avocado/exp_data/data_email_long", run
python3 data.py --data_dir Avocado/exp_data/data_email_long --cache_dir train/cache --max_output_length 128
python3 data.py --data_dir Avocado/exp_data/data_email_short --cache_dir train/cache --max_output_length 56
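Both preprocessing runs can also be driven from a tiny wrapper that simply shells out to the two commands above, e.g.:

```python
import subprocess

# Build the tokenized caches for both settings: long summaries are truncated
# to 128 output tokens, short summaries to 56 (mirroring the commands above).
settings = [
    ("Avocado/exp_data/data_email_long", "128"),
    ("Avocado/exp_data/data_email_short", "56"),
]
for data_dir, max_len in settings:
    subprocess.run(
        ["python3", "data.py",
         "--data_dir", data_dir,
         "--cache_dir", "train/cache",
         "--max_output_length", max_len],
        check=True,
    )
```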
- Test
T5 baselines
python3 run.py --task email_long --data_dir Avocado/exp_data/data_email_long/ --test_only --max_output_length 128
python3 run.py --task email_short --data_dir Avocado/exp_data/data_email_short/ --test_only --max_output_length 56
Hierarchical T5
python3 run.py --task email_long --memory_type ht5 --data_dir Avocado/exp_data/data_email_long/ --test_only --max_output_length 128
python3 run.py --task email_short --memory_type ht5 --data_dir Avocado/exp_data/data_email_short/ --test_only --max_output_length 56
Semi-supervised models
python3 run.py --task email_long_w3c --data_dir Avocado/exp_data/data_email_long/ --test_only --max_output_length 128
python3 run.py --task email_short_together --data_dir Avocado/exp_data/data_email_short/ --test_only --max_output_length 56
The testing scores will be saved in "best_ckpt_test.json". We provide "best_ckpt_test_verification.json" for verifying the results; you should obtain nearly identical numbers.
We also provide "best_ckpt_test_old.json", which contains our previously tested scores (reported in the paper). You are likely to get slightly different numbers from "best_ckpt_test_old.json" because we added a few more data cleaning and anonymization rules, so the pre-processed *.source files are slightly different from the ones we used before.
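To compare your freshly produced scores against the verification file, a sketch along these lines can be used (it assumes both files are flat JSON mappings from metric name to score, which may not hold exactly):

```python
import json

# Compare a new test run against the provided verification scores.
with open("best_ckpt_test.json") as f:
    mine = json.load(f)
with open("best_ckpt_test_verification.json") as f:
    reference = json.load(f)

# A small tolerance accounts for run-to-run / environment noise.
for metric in sorted(set(mine) & set(reference)):
    diff = abs(float(mine[metric]) - float(reference[metric]))
    status = "OK" if diff < 0.1 else "CHECK"
    print(f"{metric}: {mine[metric]} vs {reference[metric]} -> {status}")
```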
- Test with two references
Just add "--two_ref", e.g.,
python3 run.py --task email_long --data_dir Avocado/exp_data/data_email_long/ --test_only --two_ref --max_output_length 128
The testing scores will be saved in "best_ckpt_test_2ref.json". We provide "best_ckpt_test_2ref_verification.json" for verifying the results; you should obtain nearly identical numbers.
One-reference results:
EmailSum Short | rouge1 | rouge2 | rougeL | rougeLsum | BERTScore |
---|---|---|---|---|---|
T5 base | 36.61 | 10.58 | 28.29 | 32.77 | 33.92 |
HT5 | 36.30 | 10.74 | 28.52 | 33.33 | 33.49 |
Semi-sup. (together) | 36.99 | 11.22 | 28.71 | 33.70 | 33.91 |
EmailSum Long | rouge1 | rouge2 | rougeL | rougeLsum | BERTScore |
---|---|---|---|---|---|
T5 base | 43.87 | 14.10 | 30.50 | 39.91 | 32.07 |
HT5 | 44.44 | 14.51 | 30.86 | 40.24 | 32.31 |
Semi-sup. (w3c) | 44.58 | 14.64 | 31.40 | 40.73 | 32.80 |
Two-reference results (scores averaged over the two references):
EmailSum Short | rouge1 | rouge2 | rougeL | rougeLsum | BERTScore |
---|---|---|---|---|---|
T5 base | 35.22 | 9.60 | 27.08 | 31.22 | 32.45 |
HT5 | 34.81 | 9.82 | 27.28 | 31.74 | 32.42 |
Semi-sup. (together) | 35.52 | 10.35 | 27.29 | 33.11 | 32.24 |
EmailSum Long | rouge1 | rouge2 | rougeL | rougeLsum | BERTScore |
---|---|---|---|---|---|
T5 base | 43.41 | 13.81 | 29.97 | 39.32 | 31.58 |
HT5 | 43.86 | 14.06 | 30.17 | 39.64 | 31.84 |
Semi-sup. (w3c) | 43.99 | 14.18 | 30.56 | 40.12 | 32.04 |
Interestingly, we always get lower scores when comparing to the 2nd reference, which was collected after paper submission; that is why the two-reference results are always worse than the one-reference ones. This may be caused by a different set of Turkers being involved in that round of summary annotation, which introduces a domain shift.
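For clarity, "averaged over the two references" means scoring each prediction against each reference separately and averaging the scores; a standalone sketch using the rouge_score package (not necessarily the exact evaluation code used in run.py) is:

```python
from rouge_score import rouge_scorer

# Score a prediction against each of the two references and average the F1
# values. The evaluation inside run.py may differ in details such as
# stemming or aggregation.
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True)

def two_ref_rouge(prediction: str, ref1: str, ref2: str) -> dict:
    s1 = scorer.score(ref1, prediction)  # score(target, prediction)
    s2 = scorer.score(ref2, prediction)
    return {m: (s1[m].fmeasure + s2[m].fmeasure) / 2 for m in s1}

print(two_ref_rouge(
    "alice asks bob to reschedule the meeting",
    "alice wants bob to move the meeting",
    "the meeting is rescheduled by alice and bob"))
```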
- Train
Just drop "--test_only", e.g.,
python3 run.py --task email_long --data_dir Avocado/exp_data/data_email_long/ --max_output_length 128