This repo contains the code, mined corpora, and model checkpoints for the paper "An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages".
| Language | # Samples | Link |
| --- | --- | --- |
| English | 816,058 | Download |
| French | 621,937 | Download |
| Spanish | 487,862 | Download |
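As a rough sketch of how a mined corpus can be consumed once downloaded, assuming it unpacks into two aligned plain-text files with one complex and one simple sentence per line (the file names below are hypothetical; check the actual archive for the real layout):

```python
# Minimal sketch for iterating over a mined corpus.
# ASSUMPTION: two aligned plain-text files, one sentence per line.
# The file names are placeholders, not files shipped with this repo.
from itertools import islice

def read_pairs(complex_path, simple_path):
    with open(complex_path, encoding="utf-8") as fc, \
         open(simple_path, encoding="utf-8") as fs:
        for complex_sent, simple_sent in zip(fc, fs):
            yield complex_sent.strip(), simple_sent.strip()

# Preview the first three pairs
for src, tgt in islice(read_pairs("trans.complex", "trans.simple"), 3):
    print(src, "=>", tgt)
```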
| Model Architecture | Language | Link |
| --- | --- | --- |
| Transformer | English | Download |
| ConvS2S | English | Download |
| BART | English | Download Download |
The model output files can be found in the `./sys_outputs/` directory.
This project is built on the standard sentence simplification evaluation suite EASSE and the sequence modeling toolkit fairseq. Since both repositories are still under active development, we strongly recommend using the same package versions that we used in order to reproduce our work.
We provide two methods to install the required dependencies:
- python==3.7
- torch==1.7.1

```bash
pip install -r requirements.txt
```
Alternatively, download the source code of the dependencies from our OSS bin:

```bash
wget -O "easse-master.zip" https://lxylab.oss-cn-shanghai.aliyuncs.com/Trans-SS/dependencies/easse-master.zip
wget -O "fairseq.tar.gz" https://lxylab.oss-cn-shanghai.aliyuncs.com/Trans-SS/dependencies/fairseq.tar.gz
```
Then build from the source code:

```bash
tar -xzvf fairseq.tar.gz
cd fairseq/
pip install -e ./
```

```bash
unzip easse-master.zip
cd easse-master/
pip install -e ./
```
Additionally, the C++ implementation of fastBPE is needed for subword segmentation:

```bash
cd fastBPE/
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
```
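After compiling, the resulting `fast` binary provides fastBPE's standard subcommands. As a quick reference (the file names here are placeholders, not files from this repo):

```bash
# Learn 40,000 BPE merge operations from a tokenized training file
./fast learnbpe 40000 train.en > codes.en
# Apply the learned codes to produce a BPE-segmented file
./fast applybpe train.bpe.en train.en codes.en
```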
You can also visualize the training process with TensorBoard:

```bash
pip install tensorboard
```
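Once training has started, point TensorBoard at the log directory used by the training scripts (`./logs/tensorboard`, see below):

```bash
tensorboard --logdir ./logs/tensorboard
```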
Before running the mining/training scripts, please download the models and corpora that this repo depends on:

```bash
python ./prepare_resources.py
```
This repo may still contain bugs, and we are working on improving reproducibility. Feel free to open an issue or submit a pull request to report or fix them.
Run the following scripts to back-translate the three multilingual NMT corpora:

```bash
./translate_de-en.sh
./translate_en-es.sh
./translate_en-fr.sh
```
Note that this process is very time-consuming: back translation took us several days on a single NVIDIA RTX 3090. The intermediate results can be downloaded by running the resource preparation script above.
Run the following Python scripts to obtain the sentence simplification corpora:

```bash
python ./extract.py
python ./extract_fr.py
python ./extract_es.py
```
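As described in the paper, the extraction step keeps only aligned pairs whose text-complexity difference is large enough. The snippet below is a minimal sketch of that filtering idea, using an average word-frequency-rank proxy for complexity; the scoring function and threshold are illustrative assumptions, not the actual logic of `extract.py`:

```python
# Illustrative filter: keep pairs where the source side is markedly more
# complex than the target side. The complexity proxy (average frequency
# rank) and threshold are assumptions, not extract.py's exact logic.
from collections import Counter

def build_rank(corpus_sentences):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(w for s in corpus_sentences for w in s.split())
    return {w: r for r, (w, _) in enumerate(counts.most_common(), start=1)}

def complexity(sentence, rank, default_rank=100_000):
    """Average frequency rank of the words; rarer words => higher score."""
    words = sentence.split()
    if not words:
        return 0.0
    return sum(rank.get(w, default_rank) for w in words) / len(words)

def filter_pairs(pairs, rank, threshold=500):
    for src, tgt in pairs:
        if complexity(src, rank) - complexity(tgt, rank) > threshold:
            yield src, tgt
```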
The training scripts function as their names suggest. For example, you can train a Transformer model on WikiLarge by running:

```bash
./train_transformer_wikilarge.sh
```

and train a Transformer on our mined corpora via:

```bash
./train_transformer_trans-1M.sh
```
The checkpoints are stored in the `./checkpoints/` directory, and the TensorBoard logs can be found in `./logs/tensorboard`.
To evaluate the trained models, run `test.py`. The testing logs will be written to the model's checkpoint directory.
```
usage: test.py [-h] --model-name MODEL_NAME --dataset-name DATASET_NAME --task-name TASK_NAME --bpe BPE [--source-lang SOURCE_LANG] [--target-lang TARGET_LANG]
               [--test-dataset TEST_DATASET] [--do-lower-case] [--eval-batch-size EVAL_BATCH_SIZE] [--num-train-epochs NUM_TRAIN_EPOCHS] [--no-cuda] [--overwrite-output-dir]
               [--show-eval-detail] [--eval-all-ckpt] [--fp16] [--tokenizer TOKENIZER] [--gpt2-encoder-json GPT2_ENCODER_JSON] [--gpt2-vocab-bpe GPT2_VOCAB_BPE]
               [--fairseq-task FAIRSEQ_TASK] [--sentencepiece-model SENTENCEPIECE_MODEL] [--bpe-codes BPE_CODES]
```
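For example, a hypothetical evaluation run could look like this; the argument values below are placeholders, so replace them with the model, dataset, and BPE scheme from your own setup:

```bash
# All values below are placeholders; adjust them to your checkpoints.
python test.py \
    --model-name transformer_wikilarge \
    --dataset-name wikilarge \
    --task-name translation \
    --bpe fastbpe
```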
If you find our corpora or paper useful, please consider citing:
```bibtex
@inproceedings{lu-etal-2021-unsupervised-method,
    title = "An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages",
    author = "Lu, Xinyu and
      Qiang, Jipeng and
      Li, Yun and
      Yuan, Yunhao and
      Zhu, Yi",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.22",
    pages = "227--237",
    abstract = "The availability of parallel sentence simplification (SS) is scarce for neural SS modelings. We propose an unsupervised method to build SS corpora from large-scale bilingual translation corpora, alleviating the need for SS supervised corpora. Our method is motivated by the following two findings: neural machine translation model usually tends to generate more high-frequency tokens and the difference of text complexity levels exists between the source and target language of a translation corpus. By taking the pair of the source sentences of translation corpus and the translations of their references in a bridge language, we can construct large-scale pseudo parallel SS data. Then, we keep these sentence pairs with a higher complexity difference as SS sentence pairs. The building SS corpora with an unsupervised approach can satisfy the expectations that the aligned sentences preserve the same meanings and have difference in text complexity levels. Experimental results show that SS methods trained by our corpora achieve the state-of-the-art results and significantly outperform the results on English benchmark WikiLarge.",
}
```
Some code in this repo is based on access. Thanks for their wonderful work.