# PyTorch BERT Trainer using HuggingFace Transformers

## Requirements

- Python 3.6
- PyTorch 1.12
- CUDA 10.0
- TensorFlow 1.14 (for TensorBoard)
- pytorch_transformers
- gluonnlp >= 0.6.0
- apex (for mixed precision training)
- flask (for serving the API)
## Pretrained Korean BERT Model (ETRI or SKT)

Create a `pretrained_model` directory with the following sub-directories:

```
pretrained_model
├── etri
│   ├── bert_config.json
│   ├── pytorch_model.bin
│   ├── tokenization.py
│   └── vocab.korean.rawtext.list
└── skt
    ├── bert_config.json
    ├── pytorch_model.bin
    ├── tokenizer.model
    └── vocab.json
```
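A small helper (hypothetical, not part of the repo) can verify this layout before training, assuming exactly the file names shown in the tree above:

```python
from pathlib import Path

# Files each backbone directory must contain, per the tree above.
REQUIRED = {
    "etri": ["bert_config.json", "pytorch_model.bin",
             "tokenization.py", "vocab.korean.rawtext.list"],
    "skt": ["bert_config.json", "pytorch_model.bin",
            "tokenizer.model", "vocab.json"],
}

def check_pretrained_dir(root="pretrained_model"):
    """Return the list of missing files under root; empty if the layout is complete."""
    root = Path(root)
    missing = []
    for backbone, files in REQUIRED.items():
        for name in files:
            path = root / backbone / name
            if not path.is_file():
                missing.append(str(path))
    return missing
```

Running `check_pretrained_dir()` before kicking off a long training job catches a misplaced vocab or config file early.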
## Datasets

- Korean single-turn dialogue dataset (한국어 단발성 대화 데이터셋), labeled with seven emotions: fear (공포), surprise (놀람), anger (분노), sadness (슬픔), neutral (중립), happiness (행복), disgust (혐오)
- Any dataset with a binary label: positive (긍정) / negative (부정)

Datasets should be CSV files with two columns named 'Sentence' and 'Emotion'. Alternatively, modify the code below in `datasets.py` to fit your own dataset:
```python
# datasets.py, lines 50-58
...
def get_data(self, file_path):
    data = pd.read_csv(file_path)
    corpus = data['Sentence']
    label = None
    try:
        # Map emotion strings to integer indices.
        label = [self.label2idx[l] for l in data['Emotion']]
    except KeyError:
        # The 'Emotion' column is missing or contains unknown labels
        # (e.g. unlabeled test data); return label=None in that case.
        pass
    return corpus, label
...
```
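To illustrate the expected CSV format, here is a minimal sketch; the column names come from the snippet above, while the `label2idx` index order is an assumption for the seven-emotion dataset:

```python
import pandas as pd

# Toy dataset with the two required columns.
df = pd.DataFrame({
    "Sentence": ["정말 행복해요", "너무 무서워요"],  # example sentences
    "Emotion": ["행복", "공포"],                     # labels from the 7-emotion set
})
df.to_csv("toy_train.csv", index=False)

# Label-to-index mapping as used by get_data() via self.label2idx
# (the exact index assignment here is an assumption).
label2idx = {"공포": 0, "놀람": 1, "분노": 2, "슬픔": 3,
             "중립": 4, "행복": 5, "혐오": 6}

data = pd.read_csv("toy_train.csv")
corpus = data["Sentence"]
label = [label2idx[l] for l in data["Emotion"]]
print(label)  # [5, 0]
```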
## Usage

### Masked language model pretraining

```shell
$ python train_mlm.py \
    --pretrained_type="etri"
```
### Text classification

```shell
$ python train_classification.py \
    --pretrained_type="etri"
```
### Classification after further MLM pretraining

```shell
$ python train_classification.py \
    --pretrained_model_path=".../best_model.bin"
```
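How `train_classification.py` consumes `best_model.bin` is repo-specific; as a sketch, a checkpoint saved with `torch.save(model.state_dict(), path)` is restored like this (the `nn.Linear` here is a stand-in for the actual BERT classifier):

```python
import torch
import torch.nn as nn

# Stand-in model; the real one is BERT plus a classification head.
model = nn.Linear(4, 7)

# Further pretraining typically saves the best weights like this...
torch.save(model.state_dict(), "best_model.bin")

# ...and classification fine-tuning reloads them before training.
state_dict = torch.load("best_model.bin", map_location="cpu")
model.load_state_dict(state_dict)
```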
### Mixed precision training

Pass the `--fp16` flag for mixed precision training (requires apex):

```shell
$ python train_classification.py \
    --fp16 \
    --fp16_opt_level="O1"
```
### Inference

```shell
$ python test.py \
    --pretrained_model_path="./data/korean_single_test.csv"
```
After inference, the results are saved to the `/result` folder:

- `/result/test_result.csv`: predicted labels for the test data
- `/result/test_result.png`: confusion matrix for the test data
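The confusion-matrix image can be reproduced along these lines; this is a scikit-learn/matplotlib sketch with toy predictions, not the repo's exact plotting code, and the label order is an assumption:

```python
import os
import matplotlib
matplotlib.use("Agg")  # no display needed; render straight to file
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = ["공포", "놀람", "분노", "슬픔", "중립", "행복", "혐오"]

# Toy predictions; in practice these come from test.py.
y_true = [0, 1, 2, 3, 4, 5, 6, 5]
y_pred = [0, 1, 2, 3, 4, 5, 2, 5]

cm = confusion_matrix(y_true, y_pred, labels=range(7))

fig, ax = plt.subplots()
ax.imshow(cm)
ax.set_xticks(range(7)); ax.set_xticklabels(labels)
ax.set_yticks(range(7)); ax.set_yticklabels(labels)
ax.set_xlabel("Predicted"); ax.set_ylabel("True")

os.makedirs("result", exist_ok=True)
fig.savefig("result/test_result.png")  # mirrors the repo's output path
```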
## Results

### Overall

| Metric | Test Set (3,859) |
|---|---|
| Accuracy | 57.69% |
| Macro F1 | 56.84% |
### F1 score per emotion

| Emotion | F1 |
|---|---|
| 공포 (fear) | 60.00% |
| 놀람 (surprise) | 57.49% |
| 분노 (anger) | 54.60% |
| 슬픔 (sadness) | 62.64% |
| 중립 (neutral) | 44.21% |
| 행복 (happiness) | 81.88% |
| 혐오 (disgust) | 37.04% |
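Macro F1 is the unweighted mean of the per-emotion F1 scores, which is consistent with the table above landing at 56.84%:

```python
# Per-emotion F1 scores from the table, in percent.
per_class_f1 = [60.00, 57.49, 54.60, 62.64, 44.21, 81.88, 37.04]

# Macro F1 weights every class equally, regardless of class frequency.
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(round(macro_f1, 2))  # 56.84
```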
## Demo

Run the Flask API:

```shell
$ python app.py
```

| Sad case | Happy case |
|---|---|
| ![]() | ![]() |
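The routes in `app.py` are repo-specific; as a minimal Flask sketch, a hypothetical `/predict` endpoint could look like this, where `classify()` is a stub standing in for the trained model:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

IDX2LABEL = ["공포", "놀람", "분노", "슬픔", "중립", "행복", "혐오"]

def classify(sentence):
    """Stub for the trained classifier; always returns one fixed index."""
    return 5  # 행복 (happiness)

@app.route("/predict", methods=["POST"])
def predict():
    sentence = request.get_json()["sentence"]
    idx = classify(sentence)
    return jsonify({"sentence": sentence, "emotion": IDX2LABEL[idx]})
```

During development the endpoint can be exercised without a server via `app.test_client().post("/predict", json={"sentence": "..."})`.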