
PyTorch BERT Pretraining / Finetuning

A PyTorch BERT trainer using HuggingFace's pytorch_transformers library.

Requirements

  • python 3.6
  • pytorch 1.12
  • cuda 10.0
  • tensorflow 1.14 (for tensorboard)
  • pytorch_transformers
  • gluonnlp >= 0.6.0
  • apex (for mixed precision training)
  • flask (for serving the API)
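
One possible environment setup (a sketch; apex is built from NVIDIA's repository since it is not distributed on PyPI):

$ pip install torch tensorflow==1.14 pytorch_transformers "gluonnlp>=0.6.0" flask
$ git clone https://github.com/NVIDIA/apex
$ cd apex && pip install -v --no-cache-dir .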

Pretrained Korean BERT Model (ETRI or SKT)
Create a pretrained_model directory and arrange the subdirectories as below:

pretrained_model
├── etri
│   ├── bert_config.json
│   ├── pytorch_model.bin
│   ├── tokenization.py
│   └── vocab.korean.rawtext.list
└── skt
    ├── bert_config.json
    ├── pytorch_model.bin
    ├── tokenizer.model
    └── vocab.json
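
For illustration, a checkpoint laid out like the etri directory above can be loaded roughly as follows with pytorch_transformers (a minimal sketch, not the repository's exact loading code):

# Minimal loading sketch; paths follow the layout above
import torch
from pytorch_transformers import BertConfig, BertModel

config = BertConfig.from_json_file("pretrained_model/etri/bert_config.json")
model = BertModel(config)
state_dict = torch.load("pretrained_model/etri/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False so extra or missing keys (e.g. head weights) don't raise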

Datasets

Datasets should be CSV files with two columns named 'Sentence' and 'Emotion'.
Alternatively, you can modify the code below in datasets.py to fit your own dataset.

...
# line 50 - 58
def get_data(self, file_path):
    """Read a CSV with 'Sentence' and 'Emotion' columns; labels are optional."""
    data = pd.read_csv(file_path)
    corpus = data['Sentence']
    label = None
    try:
        label = [self.label2idx[l] for l in data['Emotion']]
    except KeyError:
        # unlabeled data, e.g. a test set without an 'Emotion' column
        pass
    return corpus, label
...
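
For reference, a minimal CSV in the expected format could be created like this (the file name is illustrative, not one the repository requires):

# Tiny example of the expected two-column CSV layout
import pandas as pd

pd.DataFrame({
    "Sentence": ["오늘 정말 행복해요", "너무 무서웠어요"],
    "Emotion": ["행복", "공포"],
}).to_csv("data/sample_train.csv", index=False)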

Usage

For masked language model pretraining:

$ python train_mlm.py \
        --pretrained_type="etri"

For text classification:

$ python train_classification.py \
        --pretrained_type="etri"

Classification after further MLM pretraining:

$ python train_classification.py \
        --pretrained_model_path=".../best_model.bin"

Use the --fp16 argument for mixed precision training:

$ python train_classification.py \
        --fp16 \
        --fp16_opt_level="O1"
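
For context, these flags correspond to the standard apex amp pattern sketched below; this is an illustrative snippet, not the repository's actual training loop:

# Generic apex mixed-precision pattern behind --fp16 / --fp16_opt_level
import torch
from apex import amp

model = torch.nn.Linear(768, 7).cuda()                    # stand-in for the BERT classifier
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

features = torch.randn(8, 768).cuda()
loss = model(features).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:      # scale the loss to avoid fp16 underflow
    scaled_loss.backward()
optimizer.step()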

Inference

$ python test.py \
    --pretrained_model_path="./data/korean_single_test.csv"

After inference, the result files are saved to the /result folder.

  • /result/test_result.csv : predicted labels for the test data
  • /result/test_result.png : confusion matrix for the test data
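
The prediction file can be inspected quickly with pandas, for example:

# Peek at the predictions written by test.py
import pandas as pd

preds = pd.read_csv("result/test_result.csv")
print(preds.head())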

Result

Overall

Test Set (3,859 samples)
Accuracy  57.69%
Macro F1  56.84%

F1 score for each emotion

Emotion            F1
공포 (fear)        60.00%
놀람 (surprise)    57.49%
분노 (anger)       54.60%
슬픔 (sadness)     62.64%
중립 (neutral)     44.21%
행복 (happiness)   81.88%
혐오 (disgust)     37.04%

Confusion matrix (see /result/test_result.png)

Simple Web Application with Flask

$ python app.py
Example screenshots: sad case / happy case
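
Once the app is running, a request could be sent along these lines; the route, payload key, and port below are assumptions for illustration and are defined by app.py:

# Hypothetical client call; "/predict", the "sentence" key, and port 5000 are assumed
import requests

response = requests.post("http://localhost:5000/predict", data={"sentence": "오늘 정말 행복해요"})
print(response.text)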
