PyTorch implementation of Google AI's 2018 BERT, with simple annotation

## Introduction

Google AI's BERT paper shows amazing results on various NLP tasks (new state-of-the-art on 11 NLP tasks),
including outperforming the human F1 score on the SQuAD v1.1 QA task.
The paper showed that a Transformer (self-attention) based encoder can serve as a powerful
alternative to previous language models, given a proper language-model training method.
More importantly, it showed that this pre-trained language model can be transferred
to any NLP task without building a task-specific model architecture.

This amazing result will likely be recorded in NLP history,
and I expect many further papers about BERT to be published very soon.

This repo is an implementation of BERT. The code is simple and easy to understand quickly.
Some of the code is based on [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html).

## Language Model Pre-training

In the paper, the authors present two new language model training methods:
"masked language model" and "predict next sentence".

### Masked Language Model

> Original Paper : 3.3.1 Task #1: Masked LM

```
Input Sequence  : The man went to [MASK] store with [MASK] dog
Target Sequence :                 the               his
```

#### Rules:
15% of the input tokens are randomly selected and changed according to the sub-rules below
(a code sketch follows this list):

1. 80% of the selected tokens are replaced with the `[MASK]` token.
2. 10% of the selected tokens are replaced with a random token (another word).
3. 10% of the selected tokens are left unchanged, but still need to be predicted.

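As a rough illustration of these sub-rules, here is a minimal sketch in plain Python; the `mask_tokens` helper, the `MASK_TOKEN` constant, and the list-based `vocab` argument are illustrative assumptions, not this repo's actual API.

```python
# Minimal sketch of the 15% masking rule described above (illustrative, not the repo's code).
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """tokens: list of str; vocab: list of candidate replacement words.
    Returns (masked_tokens, targets); targets holds None for unselected positions."""
    masked, targets = [], []
    for token in tokens:
        if random.random() < mask_prob:
            targets.append(token)              # every selected token must be predicted
            dice = random.random()
            if dice < 0.8:                     # 80%: replace with the [MASK] token
                masked.append(MASK_TOKEN)
            elif dice < 0.9:                   # 10%: replace with a random word
                masked.append(random.choice(vocab))
            else:                              # 10%: keep the original token unchanged
                masked.append(token)
        else:
            masked.append(token)
            targets.append(None)
    return masked, targets
```

With `mask_prob=0.15`, a 10-token input yields one or two prediction targets on average.
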
### Predict Next Sentence

> Original Paper : 3.3.2 Task #2: Next Sentence Prediction

```
Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : IsNext

Input : [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label : NotNext
```

"Can the second sentence be a natural continuation of the first?"

The goal is to learn the relationship between two text sentences, which is
not directly captured by language modeling.

#### Rules:

1. 50% of the time, the next sentence is the actual continuing sentence (a code sketch follows this list).
2. 50% of the time, the next sentence is a randomly chosen, unrelated sentence.

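A minimal sketch of how such sentence pairs could be sampled; the `make_sentence_pair` helper and the list-of-sentences `corpus` are assumptions for illustration, not the repo's actual data loader.

```python
# Illustrative only: sample a sentence pair with an IsNext / NotNext label.
import random

def make_sentence_pair(corpus, index):
    """corpus: list of sentences in document order. Returns (sent_a, sent_b, is_next)."""
    sent_a = corpus[index]
    if random.random() < 0.5 and index + 1 < len(corpus):
        return sent_a, corpus[index + 1], 1           # IsNext: the real following sentence
    random_index = random.randrange(len(corpus))      # NotNext: a randomly chosen sentence
    return sent_a, corpus[random_index], 0
```
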
## Usage

### 1. Building vocab based on your corpus
```shell
python build_vocab.py -c data/corpus.small -o data/corpus.small.vocab
```
```shell
usage: build_vocab.py [-h] -c CORPUS_PATH -o OUTPUT_PATH [-s VOCAB_SIZE]
                      [-e ENCODING] [-m MIN_FREQ]

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS_PATH, --corpus_path CORPUS_PATH
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
  -e ENCODING, --encoding ENCODING
  -m MIN_FREQ, --min_freq MIN_FREQ
```

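For intuition about what this step produces, here is a rough frequency-based sketch; the whitespace tokenization, plain-dict vocabulary, and the particular special tokens are assumptions for illustration, and the real `build_vocab.py` may differ (e.g. in its serialization format).

```python
# Conceptual sketch of frequency-based vocab building (illustrative, not build_vocab.py itself).
from collections import Counter

def build_vocab(corpus_path, vocab_size=None, min_freq=1, encoding="utf-8"):
    counter = Counter()
    with open(corpus_path, encoding=encoding) as f:
        for line in f:
            counter.update(line.split())       # whitespace tokenization, for illustration
    words = [w for w, c in counter.most_common(vocab_size) if c >= min_freq]
    # Reserve the first indices for the special tokens BERT pre-training needs.
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    return {token: idx for idx, token in enumerate(specials + words)}
```
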
### 2. Building BERT train dataset with your corpus
```shell
python build_dataset.py -c data/corpus.small -v data/corpus.small.vocab -o data/dataset.small
```

```shell
usage: build_dataset.py [-h] -v VOCAB_PATH -c CORPUS_PATH [-e ENCODING] -o
                        OUTPUT_PATH

optional arguments:
  -h, --help            show this help message and exit
  -v VOCAB_PATH, --vocab_path VOCAB_PATH
  -c CORPUS_PATH, --corpus_path CORPUS_PATH
  -e ENCODING, --encoding ENCODING
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
```

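For a feel of what one pre-training example might contain after this step, here is a hedged sketch of turning a sentence pair into fixed-length model inputs; the `[CLS]`/`[SEP]` framing follows the paper, while the `encode_pair` name, field names, and padding scheme are illustrative rather than the repo's on-disk dataset format (the dict `vocab` is assumed to contain `[PAD]` and `[UNK]`, as in the step-1 sketch).

```python
# Illustrative only: one next-sentence pair turned into fixed-length model inputs.
def encode_pair(sent_a, sent_b, is_next, vocab, seq_len=64):
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    # Segment ids: 0 for the first sentence (and its delimiters), 1 for the second.
    segment_ids = [0] * (len(sent_a.split()) + 2) + [1] * (len(sent_b.split()) + 1)
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # Pad (or truncate) everything to the fixed sequence length.
    ids = (ids + [vocab["[PAD]"]] * seq_len)[:seq_len]
    segment_ids = (segment_ids + [0] * seq_len)[:seq_len]
    return {"input_ids": ids, "segment_ids": segment_ids, "is_next": is_next}
```
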
### 3. Train your own BERT model
```shell
python train.py -d data/dataset.small -v data/corpus.small.vocab -o output/
```
```shell
usage: train.py [-h] -d TRAIN_DATASET [-t TEST_DATASET] -v VOCAB_PATH -o
                OUTPUT_DIR [-hs HIDDEN] [-n LAYERS] [-a ATTN_HEADS]
                [-s SEQ_LEN] [-b BATCH_SIZE] [-e EPOCHS]

optional arguments:
  -h, --help            show this help message and exit
  -d TRAIN_DATASET, --train_dataset TRAIN_DATASET
  -t TEST_DATASET, --test_dataset TEST_DATASET
  -v VOCAB_PATH, --vocab_path VOCAB_PATH
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
  -hs HIDDEN, --hidden HIDDEN
  -n LAYERS, --layers LAYERS
  -a ATTN_HEADS, --attn_heads ATTN_HEADS
  -s SEQ_LEN, --seq_len SEQ_LEN
  -b BATCH_SIZE, --batch_size BATCH_SIZE
  -e EPOCHS, --epochs EPOCHS
```
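
To give a feel for what the `-hs`, `-n`, and `-a` flags control, here is a rough sketch of a BERT-style encoder built from stock PyTorch modules; this is not the model code used by `train.py`, only an illustration of how hidden size, layer count, and attention-head count fit together (the default values here are arbitrary).

```python
# Illustrative only: how hidden size (-hs), layer count (-n) and attention heads (-a)
# relate in a BERT-style encoder assembled from stock PyTorch modules.
import torch
import torch.nn as nn

class ToyBertEncoder(nn.Module):
    def __init__(self, vocab_size, hidden=256, layers=8, attn_heads=8, seq_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.position_emb = nn.Embedding(seq_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=attn_heads,
                                           dim_feedforward=hidden * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len) ints
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.position_emb(positions)
        return self.encoder(x)                           # (batch, seq_len, hidden)
```
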
## Author