PyTorch BERT Trainer using HuggingFace transformers
- python 3.6
- pytorch 1.12
- cuda 10.0
- tensorflow 1.14 (for tensorboard)
- pytorch_transformers
- gluonnlp >= 0.6.0
- apex (for mixed precision training)
- flask (for serving the API)
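
A minimal environment setup sketch (package names follow the list above; exact version pins depend on your CUDA setup, and apex is built from source per NVIDIA's instructions):

```
$ pip install torch pytorch_transformers "gluonnlp>=0.6.0" tensorflow==1.14 flask
$ git clone https://github.com/NVIDIA/apex
$ cd apex && pip install -v --no-cache-dir ./
```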
Pretrained Korean BERT Model (ETRI or SKT)
Create a pretrained_model directory with subdirectories laid out as below:
```
pretrained_model
├── etri
│   ├── bert_config.json
│   ├── pytorch_model.bin
│   ├── tokenization.py
│   └── vocab.korean.rawtext.list
└── skt
    ├── bert_config.json
    ├── pytorch_model.bin
    ├── tokenizer.model
    └── vocab.json
```
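
For reference, weights laid out this way can be loaded manually with pytorch_transformers; a minimal sketch for the ETRI checkpoint (the training scripts handle this for you):

```python
import torch
from pytorch_transformers import BertConfig, BertModel

config = BertConfig.from_json_file('pretrained_model/etri/bert_config.json')
model = BertModel(config)

# strict=False tolerates missing/extra task-head weights in the checkpoint
state_dict = torch.load('pretrained_model/etri/pytorch_model.bin', map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```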
- Korean single-turn dialogue dataset (seven emotions: 공포/fear, 놀람/surprise, 분노/anger, 슬픔/sadness, 중립/neutral, 행복/happiness, 혐오/disgust)
- Any dataset containing binary labels (긍정/positive, 부정/negative)
Datasets should be in CSV format with two columns named 'Sentence' and 'Emotion'.
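A conforming file might look like this (rows are made up for illustration):

```
Sentence,Emotion
오늘 정말 행복했어,행복
혼자 있는 밤이 너무 무서워,공포
```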
Alternatively, you can modify the code below in datasets.py to fit your own dataset:
```python
...
# line 50 - 58
def get_data(self, file_path):
    # Read the CSV; expects 'Sentence' and (optionally) 'Emotion' columns
    data = pd.read_csv(file_path)
    corpus = data['Sentence']
    label = None
    try:
        # Map emotion strings to integer ids
        label = [self.label2idx[l] for l in data['Emotion']]
    except KeyError:
        # Unlabeled (test) files have no 'Emotion' column; leave label as None
        pass
    return corpus, label
...
```
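The snippet above relies on a self.label2idx mapping from emotion strings to integer ids. For the seven-emotion dataset it would look roughly like this (the ordering is an assumption; check datasets.py for the mapping actually used):

```python
# Hypothetical ordering of the seven emotion labels
labels = ['공포', '놀람', '분노', '슬픔', '중립', '행복', '혐오']
label2idx = {label: idx for idx, label in enumerate(labels)}
# e.g. label2idx['행복'] == 5
```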
For masked language model pretraining:

```
$ python train_mlm.py \
    --pretrained_type="etri"
```
For text classification:

```
$ python train_classification.py \
    --pretrained_type="etri"
```
Classification after further MLM pretraining:

```
$ python train_classification.py \
    --pretrained_model_path=".../best_model.bin"
```
Use the --fp16 flag for mixed precision training:

```
$ python train_classification.py \
    --fp16 \
    --fp16_opt_level="O1"
```
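
Under the hood, --fp16 corresponds to NVIDIA apex AMP roughly as follows (a minimal sketch with a stand-in model, not the exact code in train_classification.py):

```python
import torch
from apex import amp  # requires apex and a CUDA device

model = torch.nn.Linear(10, 2).cuda()  # stand-in for the BERT classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# opt_level "O1" patches common ops to run in mixed precision (--fp16_opt_level)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

loss = model(torch.randn(4, 10).cuda()).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # backprop through the AMP-scaled loss
optimizer.step()
```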
Inference (test.py evaluates a test csv such as ./data/korean_single_test.csv; see test.py for the exact data path argument):

```
$ python test.py \
    --pretrained_model_path=".../best_model.bin"
```
After inference, result files are saved to the /result folder:
- /result/test_result.csv: predicted labels for the test data
- /result/test_result.png: confusion matrix for the test data
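
For reference, a confusion-matrix image like test_result.png can be rebuilt from the csv with scikit-learn and matplotlib (the 'label' and 'prediction' column names below are assumptions, not the repo's actual output schema):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

df = pd.read_csv('result/test_result.csv')
# Column names are assumed for illustration
cm = confusion_matrix(df['label'], df['prediction'])

plt.imshow(cm, cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.colorbar()
plt.savefig('result/test_result.png')
```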
Overall

| Test Set (3,859 examples) | Score |
| --- | --- |
| Accuracy | 57.69% |
| Macro F1 | 56.84% |
F1 score for each emotion

| Emotion | F1 |
| --- | --- |
| 공포 (fear) | 60.00% |
| 놀람 (surprise) | 57.49% |
| 분노 (anger) | 54.60% |
| 슬픔 (sadness) | 62.64% |
| 중립 (neutral) | 44.21% |
| 행복 (happiness) | 81.88% |
| 혐오 (disgust) | 37.04% |
```
$ python app.py
```
[Demo screenshots: a sad case and a happy case]
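
Once the server is running, it could be queried like this; the port, /predict route, and JSON field names are hypothetical placeholders, so check app.py for the actual interface:

```python
import requests

# Route and payload keys are hypothetical; see app.py for the real ones
resp = requests.post('http://localhost:5000/predict',
                     json={'sentence': '오늘 정말 행복했어'})
print(resp.json())  # expected to contain the predicted emotion label
```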