JGEC implements the approach described in the paper GECToR – Grammatical Error Correction: Tag, Not Rewrite, adapted for Japanese. This project's code is based on the official implementation, gector.
The model consists of a `bert-base-japanese` encoder and two linear classification heads, one for `labels` and one for `detect`. `labels` predicts a specific edit transformation (`$KEEP`, `$DELETE`, `$APPEND_x`, etc.), and `detect` predicts whether the token is `CORRECT` or `INCORRECT`. The results from the two heads are combined to make a prediction. The predicted transformations are then applied to the errorful input sentence to obtain a corrected sentence.
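For illustration, here is a minimal sketch of this two-headed architecture in TensorFlow/Keras (the checkpoint filenames below suggest a TensorFlow model). The label vocabulary size, layer names, and pretrained checkpoint name are assumptions, not the repository's actual code:

```python
# Minimal sketch of a BERT encoder with two per-token classification heads.
# NUM_LABELS and the checkpoint name are assumptions, not the repo's values.
import tensorflow as tf
from transformers import TFBertModel

NUM_LABELS = 5000   # size of the edit-transformation vocabulary (assumed)
NUM_DETECT = 2      # CORRECT / INCORRECT

encoder = TFBertModel.from_pretrained("cl-tohoku/bert-base-japanese")

input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# Contextual representation for every token position.
hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

# One linear head per task, applied token-wise.
labels_logits = tf.keras.layers.Dense(NUM_LABELS, name="labels")(hidden)
detect_logits = tf.keras.layers.Dense(NUM_DETECT, name="detect")(hidden)

model = tf.keras.Model(
    inputs=[input_ids, attention_mask],
    outputs=[labels_logits, detect_logits],
)
```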
Furthermore, in some cases one pass of predicted transformations is not sufficient to transform the errorful sentence into the target sentence. Therefore, the process is repeated on the result of the previous pass until the model predicts that the sentence no longer contains incorrect tokens.
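A hedged sketch of that iterative loop; `predict_edits` and `apply_edits` are hypothetical stand-ins for the repository's tagging and edit-application steps:

```python
# Illustrative iterative-refinement loop; function names are hypothetical.
def correct(sentence, predict_edits, apply_edits, max_iters=5):
    for _ in range(max_iters):
        edits = predict_edits(sentence)        # per-token transformations
        if all(e == "$KEEP" for e in edits):   # model sees no remaining errors
            break
        sentence = apply_edits(sentence, edits)
    return sentence
```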
Inference uses iterative sequence tagging (https://www.grammarly.com/blog/engineering/gec-tag-not-rewrite/).

The model was trained on the following datasets:

- JaWiki: Japanese Wikipedia dump, extracted with WikiExtractor, with synthetic errors generated using the preprocessing scripts
  - 19,841,767 training sentences
- NAIST Lang8 Learner Corpora
  - 6,066,306 training sentences (generated from 3,084,0376 original sentences)
- PheMT, extracted from this paper
  - 1,409 training sentences
- BSD, extracted from this paper
  - 47,814 training sentences
- jpn-eng
  - 98,507 training sentences
- jpn-address
  - 116,494 training sentences
The JaWiki, Lang8, BSD, PheMT, jpn-eng, and jpn-address datasets are used to synthetically generate errorful sentences, using a method similar to Awasthi et al., 2019, but with adjustments for Japanese. The details of the implementation can be found in the preprocessing code in this repository.
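As a rough illustration of the idea only (not the repository's preprocessing code), errors might be injected by randomly deleting, swapping, or inserting tokens, with Japanese-specific confusion sets such as particles. The `PARTICLES` set and probabilities below are assumptions:

```python
# Illustrative synthetic-error injection in the spirit of Awasthi et al., 2019.
# The confusion set and probabilities are assumptions for demonstration.
import random

PARTICLES = ["は", "が", "を", "に", "で", "と"]  # assumed confusion set

def corrupt(tokens, p=0.15):
    out = []
    for tok in tokens:
        r = random.random()
        if r < p / 3:
            continue                                  # delete a token
        if r < 2 * p / 3 and tok in PARTICLES:
            out.append(random.choice(PARTICLES))      # swap a particle
            continue
        out.append(tok)
        if r > 1 - p / 3:
            out.append(random.choice(PARTICLES))      # insert a spurious particle
    return out
```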
Install the requirements:

```bash
pip install -r requirements.txt
```
The model was trained in Colab with GPUs on each corpus, with default hyperparameters used where unspecified. To prepare the data and train the model:

```bash
python ./utils/combine.py
python ./utils/preprocess.py
bash train.sh
```
```python
from module import JGEC

obj = JGEC()
source_sents = ["そして10時くらいに、喫茶店でレーシャルとジョノサンとベルに会いました",
                "一緒にコーヒーを飲みながら、話しました。"]
res = obj(source_sents)
print("Results:", res)
# Results: ['そして10時くらいに、喫茶店でレーシャルとジョノサンとベルに会いました',
#           '一緒にコーヒーを飲みながら、話しました。']
```
Trained weights, trained on all of the datasets mentioned above, can be downloaded here.
Extract `model.zip` to the `./utils/data/model` directory. You should have the following folder structure:

```
JGEC/
  utils/
    data/
      model/
        checkpoint
        model_checkpoint.data-00000-of-00001
        model_checkpoint.index
```
After downloading and extracting the weights, the demo app can be run with:

```bash
python main.py
```

You may need to `pip install flask` if Flask is not already installed.
The model can be evaluated with `evaluate.py` on a corpus of parallel sentences. The evaluation corpus used was the TMU Evaluation Corpus for Japanese Learners (TEC_JL), and the metric is GLEU score.
| Method | GLEU |
|---|---|
| Chollampatt and Ng, 2018 | 0.739 |
| JGEC | 0.860 |
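For reference, a minimal sketch of how a GLEU score can be computed with NLTK; `evaluate.py` may tokenize and score differently, and the sentences below are just the demo examples reused as dummy data:

```python
# Minimal GLEU scoring sketch using NLTK; character-level tokens are an
# assumption for Japanese, not necessarily what evaluate.py does.
from nltk.translate.gleu_score import corpus_gleu

references = [[list("一緒にコーヒーを飲みながら、話しました。")]]  # gold corrections
hypotheses = [list("一緒にコーヒーを飲みながら、話しました。")]    # model outputs

print(corpus_gleu(references, hypotheses))  # 1.0 for an exact match
```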