JGEC implements the approach described in the paper GECToR – Grammatical Error Correction: Tag, Not Rewrite, adapted for Japanese. This project's code is based on the official implementation, gector.
The model consists of a `bert-base-japanese` encoder and two linear classification heads, one for `labels` and one for `detect`. `labels` predicts a specific edit transformation (`$KEEP`, `$DELETE`, `$APPEND_x`, etc.), and `detect` predicts whether the token is `CORRECT` or `INCORRECT`. The results from the two heads are combined to make a prediction. The predicted transformations are then applied to the errorful input sentence to obtain a corrected sentence.
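For illustration, here is a minimal sketch of this two-headed architecture in TensorFlow/Keras (the checkpoint filenames below suggest a TensorFlow model). The label vocabulary size, layer names, and pretrained checkpoint name are assumptions, not the repository's actual code:

```python
# Minimal sketch of a BERT encoder with two per-token classification heads.
# NUM_LABELS and the checkpoint name are assumptions, not the repo's values.
import tensorflow as tf
from transformers import TFBertModel

NUM_LABELS = 5000   # size of the edit-transformation vocabulary (assumed)
NUM_DETECT = 2      # CORRECT / INCORRECT

encoder = TFBertModel.from_pretrained("cl-tohoku/bert-base-japanese")

input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# Contextual representation for every token position.
hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

# One linear head per task, applied token-wise.
labels_logits = tf.keras.layers.Dense(NUM_LABELS, name="labels")(hidden)
detect_logits = tf.keras.layers.Dense(NUM_DETECT, name="detect")(hidden)

model = tf.keras.Model(
    inputs=[input_ids, attention_mask],
    outputs=[labels_logits, detect_logits],
)
```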
Furthermore, in some cases one pass of predicted transformations is not sufficient to transform the errorful sentence into the target sentence. Therefore, the process is repeated on the result of the previous pass until the model predicts that the sentence no longer contains incorrect tokens.
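A hedged sketch of that iterative loop; `predict_edits` and `apply_edits` are hypothetical stand-ins for the repository's tagging and edit-application steps:

```python
# Illustrative iterative-refinement loop; function names are hypothetical.
def correct(sentence, predict_edits, apply_edits, max_iters=5):
    for _ in range(max_iters):
        edits = predict_edits(sentence)        # per-token transformations
        if all(e == "$KEEP" for e in edits):   # model sees no remaining errors
            break
        sentence = apply_edits(sentence, edits)
    return sentence
```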
Inference uses iterative sequence tagging (https://www.grammarly.com/blog/engineering/gec-tag-not-rewrite/).

The model was trained on the following datasets:

- JaWiki: Japanese Wikipedia dump, extracted with WikiExtractor, with synthetic errors generated using the preprocessing scripts
  - 19,841,767 training sentences
- NAIST Lang8 Learner Corpora
  - 6,066,306 training sentences (generated from 3,084,0376 original sentences)
- PheMT, extracted from this paper
  - 1,409 training sentences
- BSD, extracted from this paper
  - 47,814 training sentences
- jpn-eng
  - 98,507 training sentences
- jpn-address
  - 116,494 training sentences
The JaWiki, Lang8, BSD, PheMT, jpn-eng, and jpn-address datasets are used to synthetically generate errorful sentences, using a method similar to Awasthi et al., 2019, but with adjustments for Japanese. The details of the implementation can be found in the preprocessing code in this repository.
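As a rough illustration of the idea only (not the repository's preprocessing code), errors might be injected by randomly deleting, swapping, or inserting tokens, with Japanese-specific confusion sets such as particles. The `PARTICLES` set and probabilities below are assumptions:

```python
# Illustrative synthetic-error injection in the spirit of Awasthi et al., 2019.
# The confusion set and probabilities are assumptions for demonstration.
import random

PARTICLES = ["は", "が", "を", "に", "で", "と"]  # assumed confusion set

def corrupt(tokens, p=0.15):
    out = []
    for tok in tokens:
        r = random.random()
        if r < p / 3:
            continue                                  # delete a token
        if r < 2 * p / 3 and tok in PARTICLES:
            out.append(random.choice(PARTICLES))      # swap a particle
            continue
        out.append(tok)
        if r > 1 - p / 3:
            out.append(random.choice(PARTICLES))      # insert a spurious particle
    return out
```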
Install the requirements:

```bash
pip install -r requirements.txt
```
The model was trained in Colab with GPUs on each corpus, with default hyperparameters used where unspecified. To prepare the data and train the model:

```bash
python ./utils/combine.py
python ./utils/preprocess.py
bash train.sh
```
```python
from module import JGEC

obj = JGEC()
source_sents = ["そして10時くらいに、喫茶店でレーシャルとジョノサンとベルに会いました",
                "一緒にコーヒーを飲みながら、話しました。"]
res = obj(source_sents)
print("Results:", res)
# Results: ['そして10時くらいに、喫茶店でレーシャルとジョノサンとベルに会いました',
#           '一緒にコーヒーを飲みながら、話しました。']
```
Trained weights, trained on all of the datasets mentioned above, can be downloaded here.
Extract `model.zip` to the `./utils/data/model` directory. You should have the following folder structure:

```
JGEC/
  utils/
    data/
      model/
        checkpoint
        model_checkpoint.data-00000-of-00001
        model_checkpoint.index
```
After downloading and extracting the weights, the demo app can be run with:

```bash
python main.py
```

You may need to `pip install flask` if Flask is not already installed.
The model can be evaluated with `evaluate.py` on a corpus of parallel sentences. The evaluation corpus used was the TMU Evaluation Corpus for Japanese Learners (TEC_JL), and the metric is GLEU score.
| Method | GLEU |
|---|---|
| Chollampatt and Ng, 2018 | 0.739 |
| JGEC | 0.860 |
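For reference, a minimal sketch of how a GLEU score can be computed with NLTK; `evaluate.py` may tokenize and score differently, and the sentences below are just the demo examples reused as dummy data:

```python
# Minimal GLEU scoring sketch using NLTK; character-level tokens are an
# assumption for Japanese, not necessarily what evaluate.py does.
from nltk.translate.gleu_score import corpus_gleu

references = [[list("一緒にコーヒーを飲みながら、話しました。")]]  # gold corrections
hypotheses = [list("一緒にコーヒーを飲みながら、話しました。")]    # model outputs

print(corpus_gleu(references, hypotheses))  # 1.0 for an exact match
```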