Official implementation for the EMNLP-2022 paper "SynGEC: Syntax-Enhanced Grammatical Error Correction with a Tailored GEC-Oriented Parser". [Paper Link]
If you think our work is helpful, please cite the following paper:
@inproceedings{zhang2022syngec,
author = {Zhang, Yue and Zhang, Bo and Li, Zhenghua and Bao, Zuyi and Li, Chen and Zhang, Min},
title = {SynGEC: Syntax-Enhanced Grammatical Error Correction with a Tailored GEC-Oriented Parser},
booktitle = {Proceedings of EMNLP},
year = {2022}
}
We propose an approach named SynGEC to incorporate adapted dependency syntax knowledge into GEC models. The key idea is adjusting vanilla dependency parsers to accommodate ungrammatical sentences. To achieve this goal, we first extend the standard syntax representation scheme to use a unified tree structure to encode both grammatical errors and syntactic structure. Then we obtain high-quality parse trees of ungrammatical sentences by projecting target-side trees into source-side ones in parallel GEC training data, which are ultimately used for training a tailored parser named GOPar.
To encode parse trees produced by GOPar, we present a DepGCN-based GEC model built on Transformer. Experiments on mainstream datasets in two languages (English & Chinese) show that SynGEC is effective and achieves competitive results.
You can use the following commands to install the environment for SynGEC:
conda create -n syngec python==3.8
conda activate syngec
pip install -r requirements.txt
python -m spacy download en
cd src/src_syngec/fairseq-0.10.2
pip install --editable ./
The SynGEC model for GEC is based on fairseq-0.10.2.
The GOPar model for parsing ungrammatical inputs is based on SuPar.
Please turn to their repos for more instructions ~
|-- bash # Some scripts to reproduce our results
| |-- chinese_exp
| `-- english_exp
|-- data # Data files (mainly parallel sentence files)
| |-- bea19_dev
| |-- bea19_test
| |-- clang8_train
| |-- conll14_test
| |-- dicts
| |-- error_coded_train
| |-- hsk+lang8_train
| |-- hsk_train
| |-- mucgec_dev
| |-- mucgec_test
| `-- wi_locness_train
|-- model # Model checkpoints for GOPar and SynGEC
| |-- gopar
| `-- syngec
|-- pics # Pictures
|-- preprocess # Preprocessed binary files for fairseq training
|-- pretrained_weights # Pretrained language models, e.g., BART
|-- src # Main codes of our GOPar and SynGEC models
| |-- src_gopar
| `-- src_syngec
| |-- fairseq-0.10.2 # Our modified Fairseq. Specifically, we modify their trainer, modules, datasets, etc.
| `-- syngec_model # Main model files for SynGEC
`-- utils # Some important tools, including tree projection codes
You can download the following converged checkpoints and change the model path in bash files like ./bash/*_exp/generate_*
to generate GEC results.
English Models:
Name | Download Link | Description |
---|---|---|
Transformer-en | Link | Transformer-based GEC Model |
SynGEC-en | Link | SynGEC Model based on Transformer |
BART-en | Link | BART-based GEC Model |
SynGEC-BART-en | Link | SynGEC Model Enhanced with BART |
Chinese Models:
Name | Download Link | Description |
---|---|---|
Transformer-zh | Link | Transformer-based GEC Model |
SynGEC-zh | Link | SynGEC Model based on Transformer |
BART-zh | Link | BART-based GEC Model |
SynGEC-BART-zh | Link | SynGEC Model Enhanced with BART |
For example, you can download the Transformer-en
model and rename and put it at ./model/syngec/english_transformer_baseline.pt
. Then you can run the following script to generate results for CoNLL-14 dataset:
PROCESSED_DIR=./preprocess/english_clang8_with_syntax_transformer
BEAM=10
N_BEST=1
OUTPUT_DIR=./results
FAIRSEQ_DIR=./src_syngec/fairseq-0.10.2/fairseq_cli
MODEL_PATH=./model/syngec/english_transformer_baseline.pt
echo "Generating CoNLL14..."
SECONDS=0
CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python -u ${FAIRSEQ_DIR}/interactive.py $PROCESSED_DIR/bin \
--user-dir ./src/src_syngec/syngec_model \
--task syntax-enhanced-translation \
--path ${MODEL_PATH} \
--beam ${BEAM} \
--nbest ${N_BEST} \
-s src \
-t tgt \
--buffer-size 5000 \
--batch-size 32 \
--num-workers 12 \
--log-format tqdm \
--remove-bpe \
--fp16 \
< $OUTPUT_DIR/CoNLL14.src.bpe
echo "Generating Finish!"
duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
cat $OUTPUT_DIR/CoNLL14.out.nbest | grep "^D-" | python -c "import sys; x = sys.stdin.readlines(); x = ''.join([ x[i] for i in range(len(x)) if (i % ${N_BEST} == 0) ]); print(x)" | cut -f 3 > $OUTPUT_DIR/CoNLL14.out
sed -i '$d' $OUTPUT_DIR/CoNLL14.out
python ./utils/post_process_english.py $OUTPUT_DIR/CoNLL14.src $OUTPUT_DIR/CoNLL14.out $OUTPUT_DIR/CoNLL14.out.post_processed
For SynGEC models, you need to first preprocess the syntactic information of test-sets as described in ./bash/*_exp/preprocess_*.sh
, and then generate results like this:
PROCESSED_DIR=./preprocess/english_clang8_with_syntax_transformer
BEAM=10
N_BEST=1
OUTPUT_DIR=./results
FAIRSEQ_DIR=./src_syngec/fairseq-0.10.2/fairseq_cli
MODEL_PATH=./model/syngec/english_transformer_syngec.pt
echo "Generating CoNLL14..."
SECONDS=0
CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python -u ${FAIRSEQ_DIR}/interactive.py $PROCESSED_DIR/bin \
--user-dir ./src/src_syngec/syngec_model \
--task syntax-enhanced-translation \
--path ${MODEL_DIR}/checkpoint_best.pt \
--beam ${BEAM} \
--nbest ${N_BEST} \
-s src \
-t tgt \
--buffer-size 5000 \
--batch-size 32 \
--num-workers 12 \
--log-format tqdm \
--remove-bpe \
--fp16 \
--conll_file $CoNLL14_TEST_BIN_DIR/test.conll.src-tgt.src\
--dpd_file $CoNLL14_TEST_BIN_DIR/test.dpd.src-tgt.src \
--probs_file $CoNLL14_TEST_BIN_DIR/test.probs.src-tgt.src \
--output_file $OUTPUT_DIR/CoNLL14.out.nbest \
< $OUTPUT_DIR/CoNLL14.src.bpe
echo "Generating Finish!"
duration=$SECONDS
echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds elapsed."
We also provide our fine-tuned GEC-oriented parser (GOPar), which can jointly parse the trees of ungrammatical sentences and identitfy underlying grammatical errors.
Name | Download Link | Description |
---|---|---|
biaffine-dep-electra-en-gopar | Link | GOPar for English |
biaffine-dep-electra-zh-gopar | Link | GOPar for Chinese |
biaffine-dep-electra-zh-char | Link | Off-the-shelf char-level parser for Chinese (CTB) |
You can use ./src_gopar/parse.py
to parse with downloaded GOPar checkpoints.
If you want to train new models using your own dataset, please follow the instructions in ./bash/*_exp
:
-
pipeline_gopar.sh
: train GOPar; -
preprocess_syngec_*.sh
: preprocess data for training GEC models; -
train_syngec_*.sh
: train baseline & SynGEC models; -
generate_syngec_*.sh
: generate results (CoNLL14 and BEA19 for English, NLPCC18 and MuCGEC for Chinese);
You can also download the preprocessed data files from this Link. Note that you must get their licenses first!
Each data folder should contain the following files:
- src.txt
- tgt.txt (except test-sets)
Each line of these files should contain a sample (a line of source/target sentence).
If you have any questions, feel free to drop me an issue or contact me at hillzhang1999@qq.com.