This repository contains the split-and-rephrase benchmark and scripts from our EMNLP 2017 paper.
If you use our datasets, please cite the following paper:
Split and Rephrase, Shashi Narayan, Claire Gardent, Shay B. Cohen and Anastasia Shimorina, In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark.
If you have any issues using this repository, please contact me at shashi.narayan@ed.ac.uk.
We have extracted this dataset from the complete version of the WebNLG data. It consists of the following files:
- final-complexsimple-meanpreserve-intreeorder-full.txt: Complex and simple sentences with their semantic identifiers.
- webnlg-corpus-release: RDF triples related to each semantic identifier.
- Split-train-dev-test.benchmark-v1.0.json: Train, development and test splits.
The benchmark comes in two versions:

- benchmark-v0.1: extracted from an incomplete version of the WebNLG corpus with 8 DBPedia categories (Airport, Astronaut, Building, Food, Monument, SportsTeam, University, WrittenWork).
- benchmark-v1.0: extracted from the final version of the WebNLG corpus with 15 DBPedia categories (Airport, Building, Food, SportsTeam, Artist, CelestialBody, MeanOfTransportation, University, Astronaut, City, Monument, WrittenWork, Athlete, ComicsCharacter, Politician).
| Version | # distinct complex sentences | # complex-simple pairs with partitions | # complex-simple pairs without partitions |
|---|---|---|---|
| benchmark-v0.1 | 5,546 | 1,098,221 | 1,945 |
| benchmark-v1.0 | 18,830 | 1,445,159 | 6,951 |
- benchmark-v0.1: In this version, we followed standard practice in the simplification literature. We ensured that complex sentences in the validation and test sets are not seen during training by splitting the 5,546 distinct complex sentences into three subsets: training set (4,438, 80%), validation set (554, 10%) and test set (554, 10%). This way of splitting does not guarantee that the RDF triples seen in the validation and test sets do not occur in the training set. As a result, it leads to a large n-gram overlap between the training, validation and test sets.
- benchmark-v1.0: In this version, we ensured that if an RDF triple t: (e1, r, e2) is seen in the validation or test set, it does not occur in the training set. However, e1, r or e2 may have occurred in the training set with some other RDF triples. This automatically guarantees that the complex sentences in the validation and test sets are not seen in the training set (a minimal sketch of this constraint follows the next table).
Number of distinct complex sentences in each subset:

| Version | Training | Validation | Test |
|---|---|---|---|
| benchmark-v0.1 | 4,438 | 554 | 554 |
| benchmark-v1.0 | 16,946 | 954 | 930 |
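As an illustration of the benchmark-v1.0 splitting constraint, here is a minimal sketch (not the released scripts; the data layout of sentence/triple-set pairs is an assumption). It folds any held-out sentence that shares an RDF triple with training back into the training set until a fixed point is reached:

```python
import random

def rdf_disjoint_split(sentences, train_frac=0.9, seed=0):
    """sentences: list of (complex_sentence, frozenset_of_rdf_triples) pairs.
    Returns (train, heldout) such that no RDF triple of a held-out sentence
    occurs in any training sentence."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    train, heldout = shuffled[:n_train], shuffled[n_train:]
    train_triples = set()
    for _, triples in train:
        train_triples |= triples
    # Iterate to a fixed point: moving a violating sentence into training
    # adds triples that may expose new overlaps among the held-out sentences.
    changed = True
    while changed:
        changed = False
        still_held = []
        for sent, triples in heldout:
            if triples.isdisjoint(train_triples):
                still_held.append((sent, triples))
            else:
                train.append((sent, triples))
                train_triples |= triples
                changed = True
        heldout = still_held
    return train, heldout  # heldout can then be split into validation/test
```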
Overlap between subsets. Each cell shows the number of shared items, with the sizes of the two compared sets in parentheses:

| Shared items | benchmark-v0.1 | benchmark-v1.0 |
|---|---|---|
| RDFs (Train vs Test) | 672 (1025 vs 676) | 0 (3162 vs 352) |
| RDFs (Train vs Valid) | 671 (1025 vs 675) | 0 (3162 vs 356) |
| RDFs (Test vs Valid) | 501 (676 vs 675) | 338 (352 vs 356) |
| Entities (Train vs Test) | 642 (908 vs 644) | 56 (2665 vs 357) |
| Entities (Train vs Valid) | 634 (908 vs 636) | 56 (2665 vs 360) |
| Entities (Test vs Valid) | 505 (644 vs 636) | 345 (357 vs 360) |
| Properties (Train vs Test) | 139 (168 vs 140) | 122 (346 vs 138) |
| Properties (Train vs Valid) | 132 (168 vs 133) | 119 (346 vs 137) |
| Properties (Test vs Valid) | 120 (140 vs 133) | 134 (138 vs 137) |
Number of distinct simple sentences in each subset, and the number shared between subsets:

| Simple sentences | benchmark-v0.1 | benchmark-v1.0 |
|---|---|---|
| Total | 9552 | 31159 |
| Train | 8840 | 28150 |
| Validation | 3765 | 2464 |
| Test | 4015 | 2466 |
| Train vs Test | 3606 | 3 |
| Train vs Val | 3425 | 2 |
| Test vs Val | 2210 | 1918 |
| Train vs Test vs Val | 2173 | 2 |
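The simple-sentence counts above can be reproduced with plain set operations. A minimal sketch, assuming each subset is available as a list of simple sentences:

```python
def overlap_report(train, valid, test):
    """train, valid, test: iterables of simple sentences (strings)."""
    train, valid, test = set(train), set(valid), set(test)
    return {
        "Total": len(train | valid | test),
        "Train": len(train),
        "Validation": len(valid),
        "Test": len(test),
        "Train vs Test": len(train & test),
        "Train vs Val": len(train & valid),
        "Test vs Val": len(test & valid),
        "Train vs Test vs Val": len(train & test & valid),
    }
```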
This is the version of the dataset reported in our EMNLP paper. It was extracted from the incomplete version of the WebNLG data available at the time.
In this version, we followed standard practice in the simplification literature and split our dataset into training, validation and test subsets such that complex sentences in the validation and test sets were not seen during training.
In February 2018, Jan Botha and Jason Baldridge (Google) informed us that this way of splitting led to a large n-gram overlap between the training, development and test sets. We found that this overlap arose from RDF triples shared across the subsets. As a result, we have deprecated the split used in the paper and encourage others to use the improved version (benchmark-v1.0) of this dataset instead.
If you would still like to work with this version, we suggest using the split of Aharoni and Goldberg.
benchmark-v0.1 consists of the following files:

- final-complexsimple-meanpreserve-intreeorder-full.txt: Complex and simple sentences with their semantic identifiers.
- benchmark_verified_simplifcation: RDF triples related to each semantic identifier.
- Split-train-dev-test.DONT-CHANGE.json: Train, development and test splits. (Removed)

and two additional directories:

- "complex-sents": Train, development and test complex sentences used as input during testing. (Removed)
- "modtripleset-linealization": Semantic identifier associated with its linearized RDF representation.
Our models use code from the Multiple Source NMT Toolkit (Zoph_RNN) and from our Hybrid Sentence Simplification System. To replicate all the models discussed in the paper, please make sure you have this code available.
```
python prepare-baseline-data.py
```
This parses the "final-complexsimple-meanpreserve-intreeorder-full.txt" and "Split-train-dev-test.DONT-CHANGE.json" files and prepares data for three baseline models: baseline-seq2seq, baseline-seq2seq-multisrc and baseline-symbolic.
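The released script defines the exact file formats. As an illustration only, here is a minimal sketch of the preparation step; the JSON layout (subset name mapped to a list of semantic identifiers) and the pairs structure are assumptions, not the actual format:

```python
import json

def write_parallel_files(pairs, split_json, out_dir):
    """pairs: dict mapping semantic identifier -> (complex sentence, list of simple sentences).
    Assumed layout, for illustration only."""
    with open(split_json) as f:
        # Assumed: {"train": [...ids...], "validation": [...], "test": [...]}
        splits = json.load(f)
    for subset, ids in splits.items():
        with open(f"{out_dir}/{subset}.complex", "w") as fc, \
             open(f"{out_dir}/{subset}.simple", "w") as fs:
            for sid in ids:
                complex_sent, simple_sents = pairs[sid]
                fc.write(complex_sent + "\n")
                # The seq2seq baselines decode all simple sentences as one
                # target sequence (C ==> S1, S2, S3).
                fs.write(" ".join(simple_sents) + "\n")
```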
- baseline-seq2seq (SEQ2SEQ, C ==> S1, S2, S3): training and decoding with ZOPH_RNN:

```
./ZOPH_RNN -t baseline-seq2seq/train.complex baseline-seq2seq/train.simple model.nn -N 3 -H 500 -m 64 -d 0.8 -l 0.5 --attention-model true --feed-input true -a baseline-seq2seq/validation.complex baseline-seq2seq/validation.simple -A 0.5 --tmp-dir-location baseline-seq2seq/fullvocab/ --logfile baseline-seq2seq/fullvocab/logfile.txt -B best.nn -M 1 1 1 1
./ZOPH_RNN -k 1 baseline-seq2seq/fullvocab/best.nn baseline-seq2seq/fullvocab/test.1best.txt --decode-main-data-files benchmark/complex-sents/test.complex
```
- baseline-seq2seq-multisrc (MULTISEQ2SEQ, C T_C ==> S1, S2, S3): training and decoding with ZOPH_RNN:

```
./ZOPH_RNN -n 30 -t baseline-seq2seq-multisrc/train.complex baseline-seq2seq-multisrc/train.simple baseline-seq2seq-multisrc/fullvocab/model.nn -N 3 -H 500 -m 64 -d 0.8 -l 0.5 --multi-source baseline-seq2seq-multisrc/train.complex-semantics.linearized baseline-seq2seq-multisrc/fullvocab/src.nn --attention-model 1 --feed-input 1 --multi-attention 1 -a baseline-seq2seq-multisrc/validation.complex baseline-seq2seq-multisrc/validation.simple baseline-seq2seq-multisrc/validation.complex-semantics.linearized -A 0.5 --tmp-dir-location baseline-seq2seq-multisrc/fullvocab/ --logfile baseline-seq2seq-multisrc/fullvocab/logfile.txt -B baseline-seq2seq-multisrc/fullvocab/best.nn -M 1 1 1 1
./ZOPH_RNN -k 1 baseline-seq2seq-multisrc/fullvocab/best.nn baseline-seq2seq-multisrc/fullvocab/test.1best.txt --decode-main-data-files benchmark/complex-sents/test.complex --decode-multi-source-data-files complex-sents/test.semantics.linearized --decode-multi-source-vocab-mappings baseline-seq2seq-multisrc/fullvocab/src.nn
```
Please use "extract-modtriple-linearized-tokenized-forafile.py" to generate ".linearized" file.
- baseline-symbolic (HYBRIDSIMPL, C ==> S1, S2, S3 using Boxer and SMT): Please follow the instructions from our Hybrid Sentence Simplification System, and contact me at shashi.narayan@ed.ac.uk if you have any issues.
```
python prepare-learn-to-partition.py
```
It generates a directory called "mymodel/partition-module." Please have a look at our paper to use this data to learn a probabilistic model to learn to partition.
```
python prepare-learn-to-generation.py
```
It generates a directory called "mymodel/generation-module." Please use ZOPH_RNN codes (as in baseline models) to implement MULTISEQ2SEQ or SEQ2SEQ followed by the SPLIT step.
```
python prepare-evaluation-directories.py
```
This parses "final-complexsimple-meanpreserve-intreeorder-full.txt" and builds evaluation directories for train, test and validation using "Split-train-dev-test.DONT-CHANGE.json".
Evaluation follows multi-bleu.perl from Moses: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

```
usage: multi-bleu.perl [-lc] reference < hypothesis
```

The script reads the references from "reference", or from "reference0", "reference1", ... when there are multiple references per sentence. If a complex sentence has more than one reference, multiple reference files are generated in "evaluation-directories". Finally, use multi-bleu.perl to estimate BLEU scores.
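As an illustration of the multi-reference layout (not prepare-evaluation-directories.py itself), here is a minimal sketch that writes line-aligned reference0, reference1, ... files; repeating the first reference to pad short reference lists is an assumption made here purely to keep the files line-aligned:

```python
def write_reference_files(references, out_dir):
    """references: list (one entry per test sentence) of lists of reference strings."""
    max_refs = max(len(refs) for refs in references)
    for i in range(max_refs):
        with open(f"{out_dir}/reference{i}", "w") as f:
            for refs in references:
                # Pad with the first reference when a sentence has fewer
                # references, so every file stays aligned line by line.
                f.write((refs[i] if i < len(refs) else refs[0]) + "\n")
```

BLEU can then be estimated with, e.g., `perl multi-bleu.perl evaluation-directories/test/reference < test.1best.txt`, with paths adjusted to the layout the script actually generates.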