This is the official GitHub repository for the paper: Towards Standardizing Korean Grammatical Error Correction: Datasets and Annotation
Code maintained by: Soyoung Yoon
2022.11 Update: We are planning to add a demo page where you can simply run inference with our pretrained model. Please stay tuned!
2023.5.3 Update: Our paper is accepted to ACL 2023 (main)!!
2023.6.6 Update: Demo is launched!! 🚀 link
Dataset request form: link
Demo: link
Colab demo: link
Full list of model checkpoint files: link
Sample inference code:

```python
import torch
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

# Load the pretrained Korean GEC tokenizer and model
tokenizer = PreTrainedTokenizerFast.from_pretrained('Soyoung97/gec_kr')
model = BartForConditionalGeneration.from_pretrained('Soyoung97/gec_kr')

text = '한국어는어렵다.'  # "Korean is difficult.", written with a word-spacing error

# Encode the input and wrap it with BOS/EOS token ids
raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

corrected_ids = model.generate(torch.tensor([input_ids]),
                               max_length=128,
                               eos_token_id=1, num_beams=4,
                               early_stopping=True, repetition_penalty=2.0)
output_text = tokenizer.decode(corrected_ids.squeeze().tolist(), skip_special_tokens=True)
output_text
# >>> '한국어는 어렵다.'
```
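If you need to correct several sentences, a small helper (not part of the repository; just a convenience wrapper over the exact calls above) might look like this:

```python
# Convenience wrapper around the snippet above; loops over sentences
# and reuses the same generate() settings.
def correct(sentences, max_length=128):
    outputs = []
    for text in sentences:
        ids = [tokenizer.bos_token_id] + tokenizer.encode(text) + [tokenizer.eos_token_id]
        generated = model.generate(torch.tensor([ids]),
                                   max_length=max_length,
                                   eos_token_id=1, num_beams=4,
                                   early_stopping=True, repetition_penalty=2.0)
        outputs.append(tokenizer.decode(generated.squeeze().tolist(),
                                        skip_special_tokens=True))
    return outputs

correct(['한국어는어렵다.'])  # ['한국어는 어렵다.']
```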
Special thanks to the KoBART-summarization repository, which we referenced.
```
pip3 install -r requirements.txt
```

Short summary:
- You first need to run the above command. Alternatively, we recommend using the provided Docker image (see the example below).
- To get the datasets, follow the instructions below.
- To run KAGAS, go to `KAGAS` and run the following sample code to see the result:

```
python3 parallel_to_m2_korean.py -orig sample_test_data/orig.txt -cor sample_test_data/corrected.txt -out sample_test_data/output.m2 -hunspell ./aff-dic
```

Docker image used: `msyoon8/default:gec`
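If you go the Docker route, a typical way to enter the image would be something like the following (any extra flags such as volume mounts or GPU options are environment-specific assumptions, not from the repository docs):

```
docker pull msyoon8/default:gec
docker run -it msyoon8/default:gec /bin/bash
```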
- Download the raw lang8 file
  - Fill out this form, which is from the download page of the lang-8 learner corpora (raw format).
  - From your email, download and unzip `lang-8-20111007-2.0.zip` (for Mac, it is just `unzip lang-8-20111007-2.0.zip`).
  - Set the environment variable `PATH_TO_EXTRACTED_DATA` as the path to `lang-8-20111007-L1-v2.dat` (e.g. `PATH_TO_EXTRACTED_DATA=./lang-8-20111007-L1-v2.dat`).
- Get raw Korean sentences and filter them by using our code:

```
cd get_data
git clone https://github.com/tomo-wb/Lang8-NAIST-extractor.git
cd Lang8-NAIST-extractor/scripts
python extract_err-cor-pair.py -d $PATH_TO_EXTRACTED_DATA -l2 Korean > ../../lang8_raw.txt
cd ../../
git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install .
cd ..
python filter.py
```

After running this code, you will get `kor_lang8.txt` in the current directory (`./get_data`); a quick way to inspect it is sketched after the log below.
Full logs after running `filter.py` are as follows:

```
1. Load data
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 406274/406274 [00:00<00:00, 1217655.24it/s]
dataset length 406271
2. Remove white spaces, misfunctioned cases, and other languages
dataset length 323531
3. Drop rows of other languages
4. Calculate variables and apply conditions
using cached model. /home/GEC_Korean_Public/get_data/.cache/kobert_news_wiki_ko_cased-1087f8699e.spiece
dataset length 203604
5. Save cleaned up data
6187
lang8 counter: 192183
unique lang8 data: 192183
Length before levenshtein: 122641
Length after levenshtein: 13080
Done!
```
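To sanity-check the filtered file, here is a short sketch. It assumes `kor_lang8.txt` stores one tab-separated original/corrected pair per line, which may not match the actual layout exactly:

```python
# Peek at the first few filtered pairs; the tab-separated layout of
# kor_lang8.txt is an assumption and may differ from the actual file.
with open('kor_lang8.txt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(line.rstrip('\n').split('\t'))
        if i == 4:  # show only the first five lines
            break
```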
IMPORTANT: The Kor-Learner corpus is made from the NIKL corpus, and it may only be used and distributed for non-commercial purposes. Please refer to the original agreement form.
The Kor-Native corpus is made from (1) the Center for Teaching and Learning for Korean and (2) the National Institute of Korean Language, and it may likewise only be used for non-commercial purposes.
You can get the download link for the dataset after filling in this form (you need to log in to Google).
We recommend that you download the data and put it under the `get_data` directory.
For detailed code about building Kor-Learner, please refer to the `get_data/korean_learner` directory (originally a private repository).
Disclaimer: The original model code was implemented years ago, so it may not work well in current CUDA environments. We recommend building your own model using our dataset, or trying it out on our demo page.
To reproduce the original experiments, please refer to this spreadsheet. It contains all the information about model checkpoints (links to Google Drive), command logs, and so on.
(Deprecated)
If you want to reproduce the original code and train the models, please look at the code under `src/KoBART_GEC`. Please use the Docker image `msyoon8/default:gec` to run the original (deprecated) code. Please note that the original code is no longer maintained. Tip: you don't need to install savvihub and g2pk to run the training (you can just delete the import part and related code).
KoBART transformers are referenced from this repository.
- It would be helpful to run the `get_data/process_for_train.py` code.
- Please use the same code to make union files and split them into train, test, and val:

```
>>> python3 process_for_training.py
Select actions from the following: (punct_split, merge_wiki, split_pairs, train_split, q, merge_files, reorder_m2, make_sample, make_union) : train_split
Please specify the whole pair data path: /home/GEC_Korean_Public/get_data/native.txt
Length = train: 12292, val: 2634, test: 2634, write complete
Select actions from the following: (punct_split, merge_wiki, split_pairs, train_split, q, merge_files, reorder_m2, make_sample, make_union) : q
Bye!
```
Outputs will be saved in the `outputs/` directory. Sample code is listed below.

```
make native
# Or, type full commands:
# python3 run.py --data native --default_root_dir ../../output --max_epochs 10 --lr 3e-05 --SEED 0 --train_file_path ../../get_data/native_train.txt --valid_file_path ../../get_data/native_val.txt --test_file_path ../../get_data/native_test.txt
```
In order to evaluate on an m2 file, you need to have the m2 file in the format `{data}_{mode}.m2` (e.g. `native_val.m2`) at `get_data`.
You should run KAGAS on the data first to make the corresponding m2 file.
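For reference, an M2 file pairs each tokenized source sentence (`S` line) with one or more edit annotations (`A` lines). The entry below is only schematic; the exact token spans and type labels are produced by KAGAS:

```
S 한국어는어렵다 .
A 0 1|||WS|||한국어는 어렵다|||REQUIRED|||-NONE-|||0
```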
For the GLEU module, we modified the official code so that it can be imported and works on Python 3. It is at `src/eval`.
For the m2scorer, we used the Python 3 compatible version of m2scorer.
Please note that running the m2 scorer may take a really long time (hours).
Evaluation is automatically done for each epoch once you run the training code.
Based on the original ERRANT repository, we modified some parts of the code to work for Korean.
```
|__ alignment_output/        Output examples from:
    |__ ALL/                 --> WS & PUNCT & SPELL edits
    |__ allMerge/            --> AllMerge version of alignment
    |__ allSplit/            --> AllSplit version of alignment
    |__ pos/                 --> POS-split sample logs by Kkma
|__ scripts/
    |__ extractPos/          --> The version originally implemented by @JunheeCho - modules which extract POS information from sentences
    |__ spellchecker/        --> Contains an example script that runs the Korean hunspell, and its outputs
    |__ align_text.py        --> ERRANT version of align_text.py that has the English version of the linguistic cost
    |__ align_text_korean.py --> [IMPORTANT] Korean version of align_text.py. Main module that classifies/aligns/merges edits
    |__ rdlextra.py          --> Implementation of the Damerau-Levenshtein distance (adapted from the initial version of ERRANT)
|__ Makefile                 --> Simple script that helps automatically generate output into alignment_output. Most of the time, type "make debug"
|__ parallel_to_m2.py        --> Original alignment code from ERRANT
|__ parallel_to_m2_korean.py --> [IMPORTANT] Korean version of parallel_to_m2.py. Entry point: processes files and outputs the m2 file
|__ sample_test_data/        --> Simple test files that you can give as input to parallel_to_m2.py or parallel_to_m2_korean.py. Contains original and corrected sample files
|__ ws_error_dist.py         --> Code used to generate the distribution of word-space errors by Kkma
|__ ws_error/                --> Sample outputs from ws_error_dist.py
```
All scripts run properly when you run them from this directory (`KAGAS/`).
The main entry point is `parallel_to_m2_korean.py`. For Korean, we mainly modified `parallel_to_m2_korean.py` and `align_text_korean.py`.
```
python3 parallel_to_m2_korean.py -orig sample_test_data/orig.txt -cor sample_test_data/corrected.txt -out sample_test_data/output.m2 -hunspell ./aff-dic
```
Additional explanations:
- If you add `-save`, the m2 output files will be saved at the `-out` path. Otherwise, they will just be printed out.
- If you add `-logset`, only the specified edit types are logged (printed). You can put multiple logset types, e.g. `-logset WS SPELL PUNCT`. If you omit `-logset`, it returns all possible cases.
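For example, combining both options with the sample invocation above saves the m2 file and prints only WS and SPELL edits:

```
python3 parallel_to_m2_korean.py -orig sample_test_data/orig.txt -cor sample_test_data/corrected.txt -out sample_test_data/output.m2 -hunspell ./aff-dic -save -logset WS SPELL
```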
From this GitHub release, the Korean dictionaries ko-aff and ko-dic are downloadable.
You should put them into `src/errant/edit-extraction/aff-dic/` and save them as `ko.aff` and `ko.dic`.
In our repository, we have already downloaded the hunspell dictionaries and put them inside the `KAGAS/aff-dic/` directory, so you don't need to download them additionally.
When you run KAGAS, give the path as in the following example:

```
python3 parallel_to_m2_korean.py ..... -hunspell $DIRECTORY_WHERE_HUNSPELL_LIBRARIES_ARE_SAVED ...
```

In our case, the command would be:

```
python3 parallel_to_m2_korean.py ..... -hunspell ./aff-dic ...
```
- For macOS, follow the directions from this blog post.
- For Linux/others, it is highly recommended to use the Cython wrapper around the hunspell dictionary, since installation is not easy. For the wrapper version, you just need to install the following:

```
pip3 install cyhunspell
```

It will already be installed if you previously installed from `requirements.txt`.
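As a quick sanity check that the wrapper and dictionaries work together, a minimal sketch (assuming `ko.aff`/`ko.dic` live in `./aff-dic` as described above):

```python
# Minimal sanity check for the cyhunspell wrapper; assumes ko.aff and
# ko.dic are stored in ./aff-dic as set up in this repository.
from hunspell import Hunspell

h = Hunspell('ko', hunspell_data_dir='./aff-dic')
print(h.spell('한국어'))             # True if the word is in the dictionary
print(h.suggest('한국어는어렵다'))   # candidate corrections, if any
```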
You may need to install Java (if it is not already installed) to run Kkma. You can try:

```
apt-get install openjdk-8-jdk
```
To modify alignment costs, you should look at `getAutoAlignedEdits` and `token_substitution_korean` in `scripts/align_text_korean.py`.
If you want to classify edits currently typed as "NA", you should edit `classify_output()` in `class ErrorAnalyzer()`, located in `scripts/align_text_korean.py`.
Currently, all error types except "WS" (word space) are classified there. Since word-space edits need merging, they are specially handled beforehand in `merge_ws` in `scripts/align_text_korean.py`.
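If it helps to picture where such a change would go, here is a purely hypothetical sketch; the actual signature and logic of `classify_output()` in `scripts/align_text_korean.py` will differ:

```python
import string

# Purely hypothetical sketch: the real classify_output() in
# scripts/align_text_korean.py has its own signature and rules.
class ErrorAnalyzer:
    def classify_output(self, orig_tokens, cor_tokens):
        edit_type = "NA"  # default when no rule matches
        # Hypothetical rule: if every changed token consists only of
        # punctuation, classify the edit as PUNCT instead of NA.
        changed = set(orig_tokens) ^ set(cor_tokens)
        if changed and all(all(ch in string.punctuation for ch in tok)
                           for tok in changed):
            edit_type = "PUNCT"
        return edit_type
```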
If you have any questions, suggestions, or bug reports, you can contact the authors at: soyoungyoon at kaist.ac.kr
Our code follows the modified MIT License, adopted from KoBART and ERRANT.
Modified MIT License
Software Copyright (c) 2022 Soyoung Yoon
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
associated documentation files (the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
The above copyright notice and this permission notice need not be included
with content created by the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
OR OTHER DEALINGS IN THE SOFTWARE.
If you find this useful, please consider citing our paper:
```bibtex
@article{yoon2022towards,
  title={Towards Standardizing Korean Grammatical Error Correction: Datasets and Annotation},
  author={Yoon, Soyoung and Park, Sungjoon and Kim, Gyuwan and Cho, Junhee and Park, Kihyo and Kim, Gyu Tae and Seo, Minjoon and Oh, Alice},
  journal={arXiv preprint arXiv:2210.14389},
  year={2022}
}
```