C-LLM

This is the source code for the paper "C-LLM: Learn to Check Chinese Spelling Errors Character by Character" (https://arxiv.org/pdf/2406.16536).

[2024.9.20] Our paper has been accepted to the EMNLP 2024 main conference!

Environment

Python: 3.8
CUDA: 12.0 (NVIDIA A100-SXM4-40GB)
Packages: pip install -r requirements.txt

Data

Train

Data for Continued Pre-training: Tiger-pretrain-zh (https://huggingface.co/datasets/TigerResearch/pretrain_zh)
Data for Supervised Fine-tuning (see /dataset/train_date/):
- Train: Wang271K, CSCD-NS (train)
- Dev: CSCD-NS (dev)

Test

See /dataset/test_date/:

General Dataset: CSCD-NS (test)
Multi-Domain Dataset: LEMON (https://github.com/gingasan/lemon/tree/main/lemon_v2)

Character-Level Tokenization

First, run tokenizer_prune_qwen.py to trim the vocabulary for BPE-based tokenization. Next, run pruner.py to update the model embeddings with the new vocabulary.

python tokenizer_prune_qwen.py 
python pruner.py
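
The two steps above can be illustrated with a toy sketch. It is not the code in tokenizer_prune_qwen.py or pruner.py; it assumes that pruning means dropping every token that contains more than one Chinese character and then keeping only the embedding rows of the surviving tokens. The names is_cjk, prune_vocab, and prune_embeddings are hypothetical, and a real byte-level BPE tokenizer would also need its merge rules filtered.

```python
def is_cjk(ch):
    """Return True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def prune_vocab(vocab):
    """Drop tokens that contain more than one Chinese character.

    vocab maps token string -> old id. Returns (new_vocab, kept_old_ids),
    with new ids assigned in the order of the old ids.
    """
    kept = [
        (tok, old_id)
        for tok, old_id in sorted(vocab.items(), key=lambda kv: kv[1])
        if sum(is_cjk(ch) for ch in tok) <= 1
    ]
    new_vocab = {tok: new_id for new_id, (tok, _) in enumerate(kept)}
    kept_old_ids = [old_id for _, old_id in kept]
    return new_vocab, kept_old_ids


def prune_embeddings(embeddings, kept_old_ids):
    """Select the embedding rows of the surviving tokens, in new-id order."""
    return [embeddings[i] for i in kept_old_ids]


# Toy vocabulary (hypothetical): "你好" is the only multi-character token.
toy_vocab = {"a": 0, "你": 1, "好": 2, "你好": 3, "ok": 4}
toy_embeddings = [[0.0], [0.1], [0.2], [0.3], [0.4]]

new_vocab, kept_ids = prune_vocab(toy_vocab)
new_embeddings = prune_embeddings(toy_embeddings, kept_ids)
```

After pruning, "你好" can only be encoded as the two single-character tokens "你" and "好", which is the character-level behaviour this section aims for.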

Continued Pre-training

The full training corpus contains approximately 19B tokens; we trained for 30,000 steps, which covers about 2B of them. The backbone model is Qwen1.5.
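
As a quick sanity check on the figures above, the following arithmetic derives the implied number of tokens per step and the share of the corpus actually seen. Only the three input numbers come from this README; the derived values are approximations, since the quoted token counts are themselves rounded.

```python
# Figures quoted in the paragraph above (all approximate).
corpus_tokens = 19e9   # size of the continued pre-training corpus
trained_tokens = 2e9   # tokens covered during training
steps = 30_000         # optimizer steps

tokens_per_step = trained_tokens / steps
fraction_seen = trained_tokens / corpus_tokens

print(f"tokens per step: {tokens_per_step:,.0f}")          # about 66,667
print(f"fraction of corpus seen: {fraction_seen:.1%}")     # about 10.5%
```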

Supervised Fine-tuning

After the above steps are completed, run train.sh for fine-tuning.

sh train.sh

Inference

After fine-tuning, run test.sh for inference. Before running it, update the parameter path in the script to match the location where you saved the fine-tuned parameters.

bash test.sh

Evaluation

We designed two methods for handling source and predicted sentences of unequal length: one based on CheRRANT and the other on difflib. In the paper, we adopted the CheRRANT-based method. Before evaluation, CheRRANT must be downloaded to the specified directory.

Run evaluate_result.py for evaluation:

python evaluate_result.py

The script for calculating metrics is adapted from CSCD-NS.
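
For readers curious about the difflib-based alternative mentioned in this section, here is a minimal sketch of how character-level edits could be extracted when the prediction and the source differ in length. It is an illustration only, not the code in evaluate_result.py, and the function name extract_edits is hypothetical. It relies on difflib.SequenceMatcher from the Python standard library.

```python
import difflib


def extract_edits(source, prediction):
    """Return (tag, source_span, predicted_span, source_start) for each change.

    tag is one of "replace", "insert", or "delete", as reported by difflib.
    """
    # autojunk=False disables difflib's heuristic that treats frequent
    # characters as junk in long sequences.
    matcher = difflib.SequenceMatcher(None, source, prediction, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, source[i1:i2], prediction[j1:j2], i1))
    return edits


# Equal length: a single character substitution.
same_length = extract_edits("我喜欢吃平果", "我喜欢吃苹果")

# Unequal length: the prediction inserted an extra character.
longer = extract_edits("我去学校", "我要去学校")
```

Here same_length contains one "replace" edit at source position 4, and longer contains one "insert" edit at source position 1, so an over-long prediction can still be aligned to the source before metrics are computed.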

Citation

If you find this work useful for your research, please cite the following paper: C-LLM: Learn to Check Chinese Spelling Errors Character by Character (https://arxiv.org/pdf/2406.16536)

@inproceedings{li2024c,
  title={C-LLM: Learn to Check Chinese Spelling Errors Character by Character},
  author={Li, Kunting and Hu, Yong and He, Liang and Meng, Fandong and Zhou, Jie},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={5944--5957},
  year={2024}
}
